Apache Storm vs. Spark Streaming: Two Stream Processing Platforms compared
Transcription
1 Apache Storm vs. Spark Streaming: Two Stream Processing Platforms compared
DBTA Workshop on Stream Processing, Berne, 3rd December 2014
Guido Schmutz
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
2 Guido Schmutz
- Working for Trivadis for more than 18 years
- Oracle ACE Director for Fusion Middleware and SOA
- Co-author of different books
- Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
- Member of Trivadis Architecture Board
- Technology Manager at Trivadis
- More than 25 years of software development experience
- Contact: [email protected]
- Blog:
- Twitter: gschmutz
3 Our company
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields: OPERATION
Trivadis Services takes over the interacting operation of your IT systems.
4 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
5 What is Stream Processing?
- Infrastructure for continuous data processing
- Computational model can be as general as MapReduce but with the ability to produce low-latency results
- Data collected continuously is naturally processed continuously
- Also known as Event Processing / Complex Event Processing (CEP)
Footer: Uniform Handling of Event Streams - Unified Log Architecture
6 Why Stream Processing?
Response latency by processing style:
- RPC: synchronous
- Stream processing: milliseconds to minutes
- Batch: later, possibly much later
7 How to design a Stream Processing System?
[Diagram: three variants. (1) An event stream feeds a combined collecting/processing step that emits results. (2) Collecting and processing are separate steps. (3) A persistent queue sits between collecting and processing.]
8 How to scale a Stream Processing System?
[Diagram: collecting threads 1 to n write events from the event stream to a persistent queue; processing threads 1 to n read from the queue and emit results.]
9 How to scale a Stream Processing System?
[Diagram: several collecting processes, each with its own threads, write events to persistent queues 1 to n; several processing processes read from the queues and emit results.]
10 How to scale a Stream Processing System?
[Diagram: collecting processes write events to queues Q1 to Qn; for each queue, processing stage A feeds processing stage B through a further queue, each stage running as its own process.]
11 How to make (stateful) Stream Processing System reliable?
- Faults and stragglers are inevitable in large clusters running big data applications
- Streaming applications must recover from them quickly
[Diagram: two copies of the pipeline event stream -> collecting processes -> queue -> stage A -> queue -> stage B.]
12 How to make (stateful) Stream Processing System reliable?
Solution 1: using active/passive system (hot replication)
- Both systems process the full load
- In case of a failure, automatically switch and use the passive system
- Stragglers slow down both active and passive system
[Diagram: an active and a passive copy of the pipeline, each holding its own state. Legend: State = state in-memory and/or on-disk.]
13 How to make (stateful) Stream Processing System reliable?
Solution 2: Upstream backup
- Nodes buffer sent messages and replay them to a new node in case of failure
- Stragglers are treated as failures
[Diagram: the pipeline with a buffer in front of each stateful stage. Legend: buffer = buffer for replay, in-memory and/or on-disk; State = state in-memory and/or on-disk.]
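The upstream backup idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the sequence-number acknowledgement scheme and the `replay` method are hypothetical and not taken from the slides or from any particular framework.

```python
from collections import deque


class UpstreamNode:
    """Keeps every sent message until the downstream node acknowledges it."""

    def __init__(self):
        self.buffer = deque()  # unacknowledged (sequence number, message) pairs
        self.next_seq = 0

    def send(self, message, downstream):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer.append((seq, message))
        downstream.receive(seq, message)

    def ack(self, seq):
        # Drop everything up to and including the acknowledged sequence number.
        while self.buffer and self.buffer[0][0] <= seq:
            self.buffer.popleft()

    def replay(self, new_downstream):
        # After a failure, resend all unacknowledged messages to the new node.
        for seq, message in self.buffer:
            new_downstream.receive(seq, message)


class DownstreamNode:
    """Stateful consumer; its in-memory state is lost when the node fails."""

    def __init__(self):
        self.state = []

    def receive(self, seq, message):
        self.state.append(message)


upstream = UpstreamNode()
primary = DownstreamNode()
for word in ["a", "b", "c"]:
    upstream.send(word, primary)
upstream.ack(0)  # assume only "a" was safely processed before the failure

replacement = DownstreamNode()  # the primary is considered failed
upstream.replay(replacement)
```

The replacement node receives only the messages that were never acknowledged, which is why the buffer size (and therefore recovery time) depends on how often acknowledgements arrive.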
14 Processing Models
Batch Processing
- Familiar concept of processing data en masse
- Generally incurs a high latency
(Event-) Stream Processing
- A one-at-a-time processing model
- A datum is processed as it arrives
- Sub-second latency
- Difficult to process state data efficiently
Micro-Batching
- A special case of batch processing with very small (tiny) batch sizes
- A nice mix between batching and streaming
- At cost of latency
- Gives stateful computation, making windowing an easy task
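As a rough illustration of the difference between one-at-a-time processing and micro-batching, the following Python sketch groups a stream into tiny batches. Batching by element count is an assumption made to keep the example deterministic; the micro-batching engines discussed later cut batches by time interval. The function name `micro_batches` is hypothetical.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event stream into small lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [3, 1, 4, 1, 5, 9, 2]

# Event-stream model: each datum is handled on its own as it arrives.
one_at_a_time = [event * 2 for event in events]

# Micro-batch model: a small batch is collected first, then processed as a unit,
# which makes per-batch aggregates (and windows built from batches) simple.
batch_sums = [sum(batch) for batch in micro_batches(events, 3)]
```

The per-batch aggregate comes almost for free, but no result is available until the whole batch has been collected, which is the latency cost named on the slide.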
15 Message Delivery Semantics
At most once [0, 1]
- Messages may be lost
- Messages are never redelivered
At least once [1 .. n]
- Messages will never be lost
- But messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
- Messages are never lost
- Messages are never redelivered
- Perfect message delivery
- Incurs higher latency for transactional semantics
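A small Python sketch shows why at-least-once delivery "might be OK if the consumer can handle it". The delivery loop, the simulated lost acknowledgement and the handler names are all hypothetical; the point is only that a consumer which remembers message ids turns redelivery into a no-op.

```python
def deliver_at_least_once(messages, handler, lose_first_ack_for):
    """Redeliver each message until it is acknowledged.

    For ids in lose_first_ack_for, the first acknowledgement is treated as
    lost after the handler has already run, which forces a redelivery.
    """
    for msg_id, payload in messages:
        attempts = 0
        acked = False
        while not acked:
            attempts += 1
            handler(msg_id, payload)
            acked = not (msg_id in lose_first_ack_for and attempts == 1)


totals = {"naive": 0, "idempotent": 0}
seen_ids = set()


def naive_handler(msg_id, amount):
    # Counts every delivery, so a redelivered message is counted twice.
    totals["naive"] += amount


def idempotent_handler(msg_id, amount):
    # Skips ids it has already processed, so redelivery has no effect.
    if msg_id in seen_ids:
        return
    seen_ids.add(msg_id)
    totals["idempotent"] += amount


messages = [(1, 10), (2, 20), (3, 30)]
deliver_at_least_once(messages, naive_handler, lose_first_ack_for={2})
deliver_at_least_once(messages, idempotent_handler, lose_first_ack_for={2})
```

With message 2 delivered twice, the naive total overcounts while the idempotent total matches the sum of the distinct messages.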
16 Requirements dictate the choice
- Latency: is performance of the streaming application paramount?
- Development cost: is it desired to have similar code bases for batch and stream processing (=> lambda architecture)?
- Message delivery guarantees: is there high importance on processing every single record, or is some nominal amount of data loss acceptable?
- Process fault tolerance: is high availability of primary concern?
17 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
18 Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
- A highly distributed real-time computation system
- Provides general primitives to do real-time computation
- Simplifies working with queues and workers
- Scalable and fault-tolerant
- Complementary to Hadoop
- Written in Clojure, supports Java and Clojure
- Originated at BackType, acquired by Twitter in 2011
- Open sourced late 2011
- Part of Apache Incubator since September 2013
19 Apache Storm Core concepts
Tuple
- Core data structure in Storm
- Immutable set of key/value pairs
- You can think of Storm tuples as events
- Values must be serializable
Stream
- Key abstraction of Storm
- An unbounded sequence of tuples that can be processed in parallel by Storm
- Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
- Each stream also has an associated schema for the tuples that will flow through it
20 Apache Storm Core concepts
Topology
- Wires data and functions via a DAG (directed acyclic graph)
- Executes on many machines, similar to a MapReduce job in Hadoop
Spout
- Source of data streams (tuples)
- Can be run in reliable and unreliable mode
Bolt
- Consumes 1+ streams and potentially produces new streams
- Complex operations often require multiple steps and thus multiple bolts
- Calculate, filter, aggregate, join, talk to database
[Diagram: example topology with two spouts (one labelled "Source of Stream B") and four bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing.]
21 Storm: How does it work?
Example tweet: "NFL: Peyton Manning and Denver's elite offense fall flat in #Superbowl XLVIII ow.ly/tdqzn #seahawks #broncos #Superbowl"
[Diagram: a Twitter Spout emits tweets matching #Superbowl to two Split Sentence bolts, which emit the words Peyton, Superbowl, Superbowl, NFL, Manning, ... to two Word Count bolts.]
Footer: CAS Big Data - FH Bern, Stream- and Event-Processing, Event Streams - Apache Storm
22 Storm: How does it work?
[Diagram, continued: for each incoming word the Word Count bolts issue an increment (INCR Peyton, INCR Superbowl, INCR Superbowl, INCR NFL, INCR Manning) and hold the running counts Peyton = 1, Superbowl = 1, NFL = 1, Manning = 1.]
23 Storm: How does it work?
[Diagram, continued: the Word Count bolts forward their counts to a Report bolt, which shows the combined result Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1.]
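The word-count flow on the three slides above can be mimicked with plain Python generators. This is a single-process sketch, not Storm code: the function names stand in for the spout and bolts, and the two input sentences are a simplified stand-in for the tweet stream.

```python
from collections import Counter


def sentence_spout():
    # Stand-in for the Twitter Spout: emits the word groups shown on the slides.
    yield from ["Peyton Superbowl", "Superbowl NFL Manning"]


def split_sentence(sentences):
    # Stand-in for the Split Sentence bolt: one output tuple per word.
    for sentence in sentences:
        for word in sentence.split():
            yield word


def word_count(words):
    # Stand-in for the Word Count bolt: one INCR per incoming word.
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


report = word_count(split_sentence(sentence_spout()))
```

In a real topology each stage would run as many parallel tasks on different machines, and the grouping between stages decides which task sees which word.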
24 Storm - Topology
Twitter Spout -> (Shuffle) -> Split Sentence -> (Fields) -> Word Count -> (Global) -> Report
Each spout or bolt runs N instances in parallel.
Stream groupings:
- Shuffle grouping: random grouping
- Fields grouping: grouped by value, such that equal values result in the same task
- All grouping: replicates to all tasks
- Global grouping: makes all tuples go to one task
- None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
- Direct grouping: the producer (task that emits) controls which consumer will receive
- Local or Shuffle grouping: similar to the shuffle grouping, but will shuffle tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
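How a grouping picks the receiving task can be sketched as follows. The functions below are illustrative only and are not Storm's implementation; in particular, the CRC32 hash is an assumed stand-in for whatever hash Storm applies to the grouping fields.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    # Random task: spreads load evenly, no guarantee about where a value lands.
    return rng.randrange(num_tasks)


def fields_grouping(value, num_tasks):
    # Hash of the field value: equal values always reach the same task.
    return zlib.crc32(value.encode("utf-8")) % num_tasks


def all_grouping(num_tasks):
    # The tuple is replicated to every task.
    return list(range(num_tasks))


def global_grouping():
    # Every tuple goes to a single task (here: the one with the lowest id).
    return 0


rng = random.Random(42)
num_word_count_tasks = 4
targets = [fields_grouping(word, num_word_count_tasks)
           for word in ["Superbowl", "NFL", "Superbowl"]]
```

Fields grouping is what makes the word count correct: both occurrences of "Superbowl" are routed to the same Word Count task, so a single counter sees them.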
25 Storm - Creating Topology
26 Using a NoSQL database for storing results (keeping state with counter type columns)
Example tweet: "NFL: Peyton Manning and Denver's elite offense fall flat in #SuperBowl XLVIII ow.ly/tdqzn #seahawks #broncos #Superbowl"
[Diagram: Twitter Stream -> Twitter Spout (#Superbowl) -> Hashtag Splitter bolts emit superbowl, superbowl, seahawks, broncos -> Hashtag Counter bolts issue INCR superbowl, INCR superbowl, INCR seahawks, INCR broncos against the database, which holds the counters for superbowl, seahawks and broncos.]
27 Storm Trident
- High-level abstraction on top of Storm
- Simplifies building topologies
- Core data model is the stream
  - Processed as a series of batches (micro-batches)
  - Stream is partitioned among nodes in the cluster
- 5 kinds of operations in Trident
  - Operations that apply locally to each partition and cause no network transfer
  - Repartitioning operations that don't change the contents
  - Aggregation operations that do network transfer
  - Operations on grouped streams
  - Merges and joins
28 Storm Trident - Creating Topology
[Diagram: Twitter Stream -> (tweet) -> Twitter Spout -> (tweet) -> Hashtag Splitter -> (hashtag, local) -> Hashtag Normalizer -> (hashtag, groupBy) -> Persistent Aggregate.]
29 Trident Concepts - Function
- Takes in a set of input fields and emits zero or more tuples as output
- Fields of the output tuple are appended to the original input tuple in the stream
- If a function emits no tuples, the original input tuple is filtered out
- Otherwise the input tuple is duplicated for each output tuple
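The function semantics above (append, filter, duplicate) can be expressed compactly in Python. This is a sketch of the behaviour only, not Trident's Java API; `apply_function` and `extract_hashtags` are hypothetical names, and tuples are modelled as dictionaries.

```python
def apply_function(stream, input_field, function, output_field):
    """Apply a Trident-style function to every tuple of a stream."""
    for tup in stream:
        # One output tuple per emitted value; the new field is appended to a
        # copy of the input tuple. Zero emitted values drops the input tuple.
        for value in function(tup[input_field]):
            yield {**tup, output_field: value}


def extract_hashtags(text):
    # Emits zero or more normalized hashtags for one tweet text.
    return [word[1:].lower() for word in text.split()
            if word.startswith("#") and len(word) > 1]


tweets = [
    {"tweet": "fall flat in #SuperBowl #seahawks"},
    {"tweet": "no tags here"},
]
result = list(apply_function(tweets, "tweet", extract_hashtags, "hashtag"))
```

The first tweet is duplicated once per hashtag, and the second tweet disappears from the stream because the function emitted nothing for it.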
30 Storm Core vs. Storm Trident
Criterion | Core Storm | Storm Trident
Community | > 100 contributors | > 100 contributors
Adoption | *** | *
Language options | Java, Clojure, Scala, Python, Ruby | Java, Clojure, Scala
Processing models | Event-Streaming | Micro-Batching
DSL | No | Yes
Stateful ops | No | Yes
Distributed RPC | Yes | Yes
Delivery guarantees | At most once / At least once | Exactly once
Latency | sub-second | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN
31 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
32 Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
- The hot trend in Big Data!
- Based on the 2007 Microsoft Dryad paper
- Written in Scala, supports Java, Python, SQL and R
- Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
- One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
- Originally developed in 2009 in UC Berkeley's AMPLab
- Open sourced in 2010, since 2014 part of the Apache Software Foundation
33 Apache Spark
Spark Core
- General execution engine for the Spark platform
- In-memory computing capabilities deliver speed
- General execution model supports a wide variety of use cases
- DAG-based
- Ease of development: native APIs in Java, Scala and Python
Spark Streaming
- Runs a streaming computation as a series of very small, deterministic batch jobs
- Batch size as low as ½ sec, latency of about 1 sec
- Exactly-once semantics
- Potential for combining batch and streaming processing in the same system
- Started in 2012, first alpha release in 2013
34 Apache Spark - Generality
- Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
- Core runtime: Spark Core API and Execution Model
- Cluster resource managers: Spark Standalone, Mesos, YARN
- Data stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly
35 Apache Spark Core concepts
Resilient Distributed Dataset (RDD)
- Core Spark abstraction
- Collections of objects (partitions) spread across the cluster
- Partitions can be stored in-memory or on-disk (local)
- Enables parallel processing on data sets
- Built through parallel transformations
- Immutable, recomputable, fault tolerant
- Contains transformation history ("lineage") for the whole data set
Operations
- Stateless transformations (map, filter, groupBy)
- Actions (count, collect, save)
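The interplay of immutability, lazy transformations, lineage and actions can be imitated in a few lines of plain Python. `MiniRDD` is a hypothetical toy class for illustration, not the Spark API: it is neither distributed nor partitioned, it only records transformations and replays them when an action is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazily evaluated, recomputable."""

    def __init__(self, source, lineage=()):
        self._source = source          # callable that re-reads the base data
        self.lineage = tuple(lineage)  # recorded (operation, function) steps

    def map(self, function):
        # Transformation: returns a new MiniRDD, nothing is computed yet.
        return MiniRDD(self._source, self.lineage + (("map", function),))

    def filter(self, predicate):
        return MiniRDD(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the whole lineage from the source. Because the
        # lineage is kept, a lost result can always be recomputed this way.
        data = list(self._source())
        for operation, function in self.lineage:
            if operation == "map":
                data = [function(x) for x in data]
            else:
                data = [x for x in data if function(x)]
        return data

    def count(self):
        return len(self.collect())


base = MiniRDD(lambda: range(1, 7))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Calling `evens_squared.collect()` triggers the computation; `base` itself is never modified, and calling the action a second time simply recomputes the same result from the lineage.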
36 RDD Lineage Example
[Diagram: HDFS File Input 1 -> SparkContext.hadoopFile() -> HadoopRDD -> filter() -> FilteredRDD -> map() -> MappedRDD. HDFS File Input 2 -> SparkContext.hadoopFile() -> HadoopRDD -> map() -> MappedRDD. Both MappedRDDs -> join() -> ShuffledRDD. These transformations are lazy; the action saveAsHadoopFile() executes them and writes the HDFS File Output.]
Adapted from Chris Fregly
37 RDD Execution Example
[Diagram: FileRDDs split into partitions (Partition 1 ... Partition 5) pass through groupByKey(), filter(), map() and join() steps, producing ShuffledRDD and MappedRDD partitions.]
38 Apache Spark Streaming Core concepts
Discretized Stream (DStream)
- Core Spark Streaming abstraction
- Micro-batches of RDDs
- Operations similar to RDD
Input DStreams
- Represent the stream of raw data received from streaming sources
- Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc.
- Custom receivers can easily be written for custom data sources
Operations
- Same as Spark Core
- Additional stateful transformations (window, reduceByWindow)
39 Discretized Stream (DStream)
[Diagram: over increasing time (time 1 ... time n), each batch of input messages (message 1 ... message n) forms one RDD of the DStream. The transformation lineage DStream -> map() -> MappedDStream yields f(message 1) ... f(message n) for each batch. Actions such as saveAsHadoopFiles() trigger Spark jobs that produce result 1 ... result n per batch.]
Adapted from Chris Fregly
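A windowed reduction over micro-batches, in the spirit of the reduceByWindow transformation mentioned on the previous slide, might look like the sketch below. It is plain Python, not the Spark Streaming API, and it expresses window length and slide interval as numbers of batches, which is an assumption made to keep the example small (Spark expresses both as durations).

```python
from functools import reduce


def reduce_by_window(batches, reduce_fn, window_length, slide_interval):
    """Reduce all elements of the last window_length batches,
    emitting one result every slide_interval batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = [x for batch in batches[end - window_length:end] for x in batch]
        if window:  # an empty window produces no result
            results.append(reduce(reduce_fn, window))
    return results


# Each inner list is one micro-batch (one RDD of the DStream).
batches = [[1, 2], [3], [4, 5], [6]]
window_sums = reduce_by_window(batches, lambda a, b: a + b,
                               window_length=2, slide_interval=1)
```

Because every window is just a union of already-formed batches, windowing reduces to list slicing, which is the "easy task" the micro-batching model promises.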
40 Spark Streaming Example
41 Storm Core vs. Storm Trident vs. Spark Streaming
Criterion | Core Storm | Storm Trident | Spark Streaming
Community | > 100 contributors | > 100 contributors | > 80 contributors
Adoption | *** | * | *
Language options | Java, Clojure, Scala, Python, Ruby | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
DSL | No | Yes | Yes
Stateful ops | No | Yes | Yes
Distributed RPC | Yes | Yes | No
Delivery guarantees | At most once / At least once | Exactly once | Exactly once
Latency | sub-second | seconds | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
42 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
43 Unified Log
That's what most people think about logs:
[0/Jul/01:13::6-0800] "GET /wp-admin/images/date-button.gif HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver= HTTP/1.1"
[0/Jul/01:13::8-0800] "POST /wp-admin/admin-ajax.php HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/post.php HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/admin-ajax.php HTTP/1.1"
But this is what we mean here by log:
- A structured log (records are numbered beginning with 0, based on the order they are written)
- Also known as commit log or journal
[Diagram: a numbered sequence of records, from the 1st record to the next record written.]
44 Central Unified Log for (real-time) subscription
Take all the organization's data and put it into a central log for subscription.
Properties of the Unified Log:
- Unified: enterprise-wide, single deployment
- Append-only: events are appended, no update in place => immutable
- Ordered: each event has an offset, which is unique within a shard
- Fast: should be able to handle thousands of messages / sec
- Distributed: lives on a cluster of machines
[Diagram: a Collector writes to the log; Consumer System A reads at position time = 6; Consumer System B reads at position time = 10.]
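The append-only and ordered properties, and the fact that each subscriber reads at its own position, can be sketched with a tiny in-memory log. The classes `Log` and `Consumer` are hypothetical illustrations; a real unified log would also be sharded, persisted and distributed.

```python
class Log:
    """Append-only sequence of records; the offset is the record's index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Records are only ever added at the end, never updated in place.
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer keeps its own read position into the shared log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records):
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = Log()
offsets = [log.append(f"event-{i}") for i in range(12)]

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(6)   # System A has read up to position 6
system_b.poll(10)  # System B is further ahead, at position 10
```

Because the log never changes what it has already written, a slow consumer does not hold back a fast one, and a new subscriber can start from offset 0 at any time.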
45 Apache Kafka - Overview
- A distributed publish-subscribe messaging system
- Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, ...)
- Initially developed at LinkedIn, now part of Apache
- Does not follow the JMS standard and does not use the JMS API
- Kafka maintains feeds of messages in topics
[Diagram: producers write to the Kafka cluster and consumers read from it. Anatomy of a topic: each partition (Partition 0, ...) is an ordered sequence of messages from old to new, and writes are appended at the new end.]
46 Apache Kafka - Motivation
LinkedIn's motivation for Kafka was: "A unified platform for handling all the real-time data feeds a large company might have."
Must haves:
- High throughput to support high-volume event feeds
- Support real-time processing of these feeds to create new, derived feeds
- Support large data backlogs to handle periodic ingestion from offline systems
- Support low-latency delivery to handle more traditional messaging use cases
- Guarantee fault tolerance in the presence of machine failures
47 Apache Kafka - Performance
Kafka at LinkedIn
- 10+ billion writes per day
- 172k messages per second (average)
- 55+ billion messages per day to real-time consumers
Up to 2 million writes/sec on 3 cheap machines
- Using 3 producers on 3 different machines
48 Apache Kafka - Partition offsets
- Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
- Consumers track their pointers via (offset, partition, topic) tuples
[Diagram: consumer group C1 reading from the partitions of a topic.]
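Per-partition offsets and consumer-side position tracking can be sketched as follows. This is an in-memory illustration and not the Kafka client API: the `Topic` and `Consumer` classes, the topic name `meter-readings` and the CRC32-based partition choice are all assumptions made for the example.

```python
import zlib


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Assumed partitioning scheme: hash of the key modulo partition count.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        # The offset is sequential and unique only within this partition.
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    def __init__(self):
        self.positions = {}  # (topic name, partition) -> next offset to read

    def fetch(self, topic, partition):
        key = (topic.name, partition)
        offset = self.positions.get(key, 0)
        records = topic.partitions[partition][offset:]
        self.positions[key] = offset + len(records)
        return records


topic = Topic("meter-readings", num_partitions=3)
p1, o1 = topic.send("meter-1", "10 kWh")
p2, o2 = topic.send("meter-1", "12 kWh")

consumer = Consumer()
first = consumer.fetch(topic, p1)
topic.send("meter-1", "15 kWh")
second = consumer.fetch(topic, p1)
```

Since the position lives with the consumer, a consumer can rewind simply by resetting its stored offset for a partition, which is what makes replay from the log possible.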
49 Apache Kafka - Two Options for Log Cleanup
Retaining a window of data
- Ideal for event data
- Window can be defined in time (days) or space (GBs); defaults to 1 week
Retaining a complete log (log compaction)
- Ideal for keyed data
- Keeps a space-efficient complete log of changes
- Log compaction runs in the background
- Ensures that at least the last known value for each message key within the log is always retained
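The effect of log compaction on keyed data can be shown with a short Python function. It is a sketch of the outcome only (keep the last value per key, preserve offsets and order); the function name `compact` and the sample keys are hypothetical, and Kafka's actual background cleaner works on log segments rather than on a list.

```python
def compact(log):
    """Keep, for every key, only the most recent record.

    log is a list of (key, value) records; the list index is the offset.
    Returns the surviving records as (offset, (key, value)) in offset order.
    """
    last_offset_for_key = {}
    for offset, (key, _value) in enumerate(log):
        last_offset_for_key[key] = offset  # later records overwrite earlier ones
    return [(offset, log[offset]) for offset in sorted(last_offset_for_key.values())]


changes = [
    ("k1", "a"),  # offset 0, superseded by offset 2
    ("k2", "b"),  # offset 1, superseded by offset 4
    ("k1", "c"),  # offset 2
    ("k3", "d"),  # offset 3
    ("k2", "e"),  # offset 4
]
compacted = compact(changes)
```

Surviving records keep their original offsets, so a consumer that replays the compacted log still ends up with the latest value for every key.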
50 Data Flow Graphs using Unified Log
- Stream processing allows for computing feeds off of other feeds
- Derived feeds are no different from the original feeds they are computed off
- Single deployment of the Unified Log, but logically different feeds
[Diagram: Meter Readings Collector writes the feed Raw Meter Readings. Enrich / Transform joins it with Customer data into the feed Meter with Customer. Aggregate by Minute steps derive the feeds Meter by Customer by Minute and Meter by Minute. Persist steps store the feeds.]
51 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
6. Summary
52 Architectural Pattern: Standalone Event Stream Processing
[Diagram: Social Media Streams and Internet of Things events form the Event Cloud and enter through the Enterprise Event Bus (Ingress). Event Processing (ESP / CEP) uses rules from a Business Rule Management System and a State Store / Event Store, and writes to a Result Store. Results reach Analytical Applications, DB, the Enterprise Event Bus and the Enterprise Service Bus.]
53 Architectural Pattern: Event Stream Processing as part of Lambda Architecture
[Diagram: the same event path as in the standalone pattern (Event Cloud -> Enterprise Event Bus (Ingress) -> Event Processing (ESP / CEP) with State Store / Event Store -> Result Store), complemented by a Hadoop Big Data Infrastructure (HDFS, Map/Reduce) with its own Result Store. Both result stores serve Analytical Applications, DB, the Enterprise Event Bus and the Enterprise Service Bus.]
54 Architectural Pattern: Event Stream Processing as part of Kappa Architecture
[Diagram: Event Cloud -> Enterprise Event Bus (Ingress) -> Event Processing (ESP / CEP) with State Store / Event Store -> Result Store -> Enterprise Service Bus, Analytical Applications and DB. The Hadoop Big Data Infrastructure (HDFS) is used to replay events into the event processing path.]
55 Questions and answers...
Guido Schmutz, Technology Manager
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
Real-time Big Data Analytics with Storm
Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap
Beyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Streaming items through a cluster with Spark Streaming
Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble!
SPARK USE CASE IN TELCO Apache Spark Night 9-2-2014! Chance Coble! Use Case Profile Telecommunications company Shared business problems/pain Scalable analytics infrastructure is a problem Pushing infrastructure
Architectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure
Real Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
Spark: Making Big Data Interactive & Real-Time
Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island
Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
Kafka & Redis for Big Data Solutions
Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015
Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours
Openbus Documentation
Openbus Documentation Release 1 Produban February 17, 2014 Contents i ii An open source architecture able to process the massive amount of events that occur in a banking IT Infraestructure. Contents:
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets. Andrew Psaltis
Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets Andrew Psaltis About Me Recently started working at Ensighten on Agile Maketing Platform Prior 4.5 years
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS
. 3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS Deliver fast actionable business insights for data scientists, rapid application creation for developers and enterprise-grade
WELCOME. Where and When should I use the Oracle Service Bus (OSB) Guido Schmutz. UKOUG Conference 2012 04.12.2012
WELCOME Where and When should I use the Oracle Service Bus (OSB) Guido Schmutz UKOUG Conference 2012 04.12.2012 BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 1
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
NOT IN KANSAS ANY MORE
NOT IN KANSAS ANY MORE How we moved into Big Data Dan Taylor - JDSU Dan Taylor Dan Taylor: An Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I'm currently lucky
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
Next-Gen Big Data Analytics using the Spark stack
Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our
Big Data and Fast Data combined is it possible?
Big Data and Fast Data combined is it possible? Ulises Fasoli DBTA Workshop 2014 - Bern BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 1 Ulises
Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Conquering Big Data with BDAS (Berkeley Data Analytics)
UC BERKELEY Conquering Big Data with BDAS (Berkeley Data Analytics) Ion Stoica UC Berkeley / Databricks / Conviva Extracting Value from Big Data Insights, diagnosis, e.g.,» Why is user engagement dropping?»
How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
Big Data Analysis: Apache Storm Perspective
Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract: The boom in technology has resulted in the emergence of new concepts
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Introducing Storm 1 Core Storm concepts Topology design
Storm Applied brief contents 1 Introducing Storm 2 Core Storm concepts 3 Topology design 4 Creating robust topologies 5 Moving from local to remote topologies 6 Tuning in Storm 7 Resource
BIG DATA ANALYTICS For REAL TIME SYSTEM
BIG DATA ANALYTICS For REAL TIME SYSTEM Where does big data come from? Big Data is often boiled down to three main varieties: Transactional data these include data from invoices, payment orders, storage
Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
Hadoop Evolution In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
Moving From Hadoop to Spark
Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) Hi, Featured in Hadoop Weekly #109 About Me : Sujee
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Predictive Analytics with Storm, Hadoop, R on AWS
Douglas Moore Principal Consultant & Architect February 2013 Predictive Analytics with Storm, Hadoop, R on AWS Leading Provider Data Science and Engineering Services Accelerating Your Time to Value using
Rakam: Distributed Analytics API
Rakam: Distributed Analytics API Burak Emre Kabakcı May 30, 2014 Abstract Today, most big data applications need to compute data in real time, since the Internet develops quite fast and the users
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Introduction to Big Data with Apache Spark UC BERKELEY
Introduction to Big Data with Apache Spark UC BERKELEY This Lecture The Big Data Problem Hardware for Big Data Distributing Work Handling Failures and Slow Machines Map Reduce and Complex Jobs
Wisdom from Crowds of Machines
Wisdom from Crowds of Machines Analytics and Big Data Summit September 19, 2013 Chetan Conikee Irfan Ahmad About Us CloudPhysics' mission is to discover the underlying principles that govern systems behavior
Dominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Apache Kafka Your Event Stream Processing Solution
Apache Kafka Your Event Stream Processing Solution White Paper www.htcinc.com Contents 1. Introduction 1.1 What are Business Events? 1.2 What is a Business Data Feed?
Using Kafka to Optimize Data Movement and System Integration. Alex Holmes @
Using Kafka to Optimize Data Movement and System Integration Alex Holmes @ https://www.flickr.com/photos/tom_bennett/7095600611 THIS SUCKS E T (circa 2560 B.C.E.) L a few years later... 2,014 C.E. i need
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations
Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
CAPTURING & PROCESSING REAL-TIME DATA ON AWS
CAPTURING & PROCESSING REAL-TIME DATA ON AWS © 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent
FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara
CS535 Big Data - Fall 2015 W1.B.1 CS535 Big Data - Fall 2015 W1.B.2 CS535 BIG DATA FAQs Wait list Term project topics PART 0. INTRODUCTION 2. A PARADIGM FOR BIG DATA Sangmi Lee Pallickara Computer Science,
Big Data Processing. Patrick Wendell Databricks
Big Data Processing Patrick Wendell Databricks About me Committer and PMC member of Apache Spark Former PhD student at Berkeley Left Berkeley to help found Databricks Now managing open source work at Databricks
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Introduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala-lang.org Python Anaconda: https://
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Putting Apache Kafka to Use!
Putting Apache Kafka to Use! Building a Real-time Data Platform for Event Streams JAY KREPS, CONFLUENT A Couple of Themes Theme 1: Rise of Events Theme 2: Immutability Everywhere Level Example Immutable
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Big Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
Future Internet Technologies
Future Internet Technologies Big (?) Processing Dr. Dennis Pfisterer Institut für Telematik, Universität zu Lübeck http://www.itm.uni-luebeck.de/people/pfisterer FIT Until Now Architectures -Server SPDY
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera
Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu-dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu-dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI-SWS Research Intern Microsoft
Big Data With Hadoop
Big Data With Hadoop Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane
BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements
WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka
WHITE PAPER Reference Guide for Deploying and Configuring Apache Kafka Revised: 02/2015 Table of Contents 1. Introduction 2. Apache Kafka Technology Overview 3. Common Use Cases for Kafka 4. Deploying
From Spark to Ignition:
From Spark to Ignition: Fueling Your Business on Real-Time Analytics Eric Frenkiel, MemSQL CEO June 29, 2015 San Francisco, CA What's in Store For This Presentation? 1. MemSQL: A real-time database for
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
Big Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your data processing pipeline Implements
Spark and Shark. High-Speed In-Memory Analytics over Hadoop and Hive Data
Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:
Cloud (data) management Ahmed Ali-Eldin First part: ZooKeeper (Yahoo!) Agenda A highly available, scalable, distributed, configuration, consensus, group membership, leader election, naming, and coordination
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
SOUG-SIG Data Replication With Oracle GoldenGate Looking Behind The Scenes Robert Bialek Principal Consultant Partner
SOUG-SIG Data Replication With Oracle GoldenGate Looking Behind The Scenes Robert Bialek Principal Consultant Partner BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov
Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS
Enterprise Data Problems in Investment Banks BigData History and Trend Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. 3548 Hypothetical
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Going Deep with Spark Streaming
Going Deep with Spark Streaming Andrew Psaltis (@itmdata) ApacheCon, April 16, 2015 Outline Introduction DStreams Thinking about time Recovery and Fault tolerance Conclusion About Me Andrew Psaltis Data
Creating Big Data Applications with Spring XD
Creating Big Data Applications with Spring XD Thomas Darimont @thomasdarimont THE FASTEST PATH TO NEW BUSINESS VALUE Journey Introduction Concepts Applications Outlook 3 Unless otherwise indicated, these
