Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

Transcription

1 Apache Storm vs. Spark Streaming: Two Stream Processing Platforms compared. DBTA Workshop on Stream Processing, Berne, 3rd December 2014. Guido Schmutz. BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA

2 Guido Schmutz. Working for Trivadis for more than 18 years. Oracle ACE Director for Fusion Middleware and SOA. Co-author of several books. Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data. Member of the Trivadis Architecture Board. Technology Manager at Trivadis. More than 5 years of software development experience. Contact: [email protected] Twitter: gschmutz

3 Our company. Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria. We offer our services in the following strategic business fields: OPERATION. Trivadis Services takes over the interacting operation of your IT systems.

4 Agenda. 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures

5 What is Stream Processing? Infrastructure for continuous data processing. The computational model can be as general as MapReduce, but with the ability to produce low-latency results. Data collected continuously is naturally processed continuously. Also known as Event Processing / Complex Event Processing (CEP).

6 Why Stream Processing? Response latency by interaction style: RPC is synchronous; stream processing takes milliseconds to minutes; batch processing responds later, possibly much later.

7 How to design a Stream Processing System? [Diagram: three designs with increasing decoupling. (1) Event stream -> combined collecting/processing -> result. (2) Event stream -> collecting -> processing -> result. (3) Event stream -> collecting -> queue (persist) -> processing -> result.]

8 How to scale a Stream Processing System? [Diagram: the event stream is handled by collecting threads 1..n, which write events to a queue (persist); processing threads 1..n read from the queue and produce results.]

9 How to scale a Stream Processing System? [Diagram: the event stream is distributed over several collecting processes, each running multiple threads; events are persisted in queues 1..n; several processing processes consume from the queues and produce results.]

10 How to scale a Stream Processing System? [Diagram: collecting processes write events (e) into partitioned queues Q1, Q2, ..., Qn; each queue feeds a chain of processing stages A and B, executed by the threads of processes 1..n.]

11 How to make a (stateful) Stream Processing System reliable? Faults and stragglers are inevitable in large clusters running big data applications. Streaming applications must recover from them quickly. [Diagram: two identical pipelines, each with an event stream, collecting processes, queues and processing stages A and B.]

12 How to make a (stateful) Stream Processing System reliable? Solution 1: using an active/passive system (hot replication). Both systems process the full load. In case of a failure, automatically switch and use the passive system. Stragglers slow down both the active and the passive system. [Diagram: the event stream feeds an active and a passive pipeline, each with collecting processes, queues, processing stages A and B, and its own state. Legend: State = state in-memory and/or on-disk.]

13 How to make a (stateful) Stream Processing System reliable? Solution 2: upstream backup. Nodes buffer sent messages and replay them to a new node in case of failure. Stragglers are treated as failures. [Diagram: a single pipeline with collecting processes, queues and processing stages A and B; upstream nodes keep a buffer, processing nodes keep state. Legend: buffer = buffer for replay, in-memory and/or on-disk; State = state in-memory and/or on-disk.]
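The upstream backup idea can be sketched in a few lines of Python. This is a minimal, single-process illustration and not taken from the talk: the class names `UpstreamNode` and `CountingNode`, the acknowledgement step and the checkpoint copy are all assumptions made for the example.

```python
from collections import OrderedDict


class UpstreamNode:
    """Hypothetical sender that buffers each message until it is acknowledged."""

    def __init__(self):
        self.buffer = OrderedDict()  # message id -> payload, in send order
        self.next_id = 0

    def send(self, payload, downstream):
        msg_id = self.next_id
        self.next_id += 1
        self.buffer[msg_id] = payload  # keep a copy for a possible replay
        downstream.receive(msg_id, payload)
        return msg_id

    def ack(self, msg_id):
        # The downstream state covering this message is safe; drop the copy.
        self.buffer.pop(msg_id, None)

    def replay(self, downstream):
        # Re-send everything that was never acknowledged, in original order.
        for msg_id, payload in self.buffer.items():
            downstream.receive(msg_id, payload)


class CountingNode:
    """Hypothetical stateful processing node that counts words."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def receive(self, msg_id, word):
        self.counts[word] = self.counts.get(word, 0) + 1


upstream = UpstreamNode()
node = CountingNode()

first = upstream.send("storm", node)
checkpoint = dict(node.counts)  # state persisted up to and including `first`
upstream.ack(first)

upstream.send("spark", node)    # not yet acknowledged
upstream.send("storm", node)    # not yet acknowledged

# Assume `node` fails here: a new node restores the checkpoint and the
# upstream node replays its buffer.
replacement = CountingNode(checkpoint)
upstream.replay(replacement)
recovered = replacement.counts
```

After the replay, `recovered` holds the same counts the failed node had, because only the unacknowledged messages are re-sent on top of the checkpointed state.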

14 Processing Models
Batch Processing: familiar concept of processing data en masse; generally incurs high latency.
(Event-) Stream Processing: a one-at-a-time processing model; a datum is processed as it arrives; sub-second latency; difficult to process state data efficiently.
Micro-Batching: a special case of batch processing with very small (tiny) batch sizes; a nice mix between batching and streaming, at the cost of latency; gives stateful computation, making windowing an easy task.
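To make the micro-batching model concrete, the following sketch groups timestamped events into fixed-length batches. The function name `micro_batches`, the event format and the batch interval of 0.5 seconds are illustrative assumptions, not an API of Storm or Spark.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of `batch_interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # All events whose timestamps fall into the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


# Timestamps are in seconds since the start of the stream.
events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = micro_batches(events, 0.5)
```

In a one-at-a-time model each of the four events would be handed to the processing logic individually; here the logic sees three small batches, which is what trades some latency for simpler stateful and windowed computation.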

15 Message Delivery Semantics
At most once [0,1]: messages may be lost; messages are never redelivered.
At least once [1..n]: messages will never be lost, but messages may be redelivered (might be OK if the consumer can handle it).
Exactly once [1]: messages are never lost; messages are never redelivered; perfect message delivery; incurs higher latency for transactional semantics.
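The remark that at-least-once delivery "might be OK if the consumer can handle it" usually means the consumer is idempotent. The sketch below simulates a redelivery and compares a naive consumer with one that deduplicates by message id. All names (`deliver_at_least_once`, `IdempotentCounter`) and the message format are assumptions for illustration only.

```python
def deliver_at_least_once(messages, redeliver_ids):
    """Yield every message once, and a second time for ids whose ack was lost."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in redeliver_ids:
            yield msg_id, payload  # duplicate caused by a lost acknowledgement


class IdempotentCounter:
    """Consumer that ignores message ids it has already processed."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, amount):
        if msg_id in self.seen:
            return  # duplicate delivery, already applied
        self.seen.add(msg_id)
        self.total += amount


messages = [(1, 10), (2, 20), (3, 30)]

# A naive consumer double-counts the redelivered message 2.
naive_total = sum(amount for _, amount in deliver_at_least_once(messages, {2}))

# The idempotent consumer applies every message exactly once.
counter = IdempotentCounter()
for msg_id, amount in deliver_at_least_once(messages, {2}):
    counter.handle(msg_id, amount)
```

The deduplicating consumer ends up with the same total as under exactly-once delivery, at the cost of remembering the ids it has seen.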

16 Requirements dictate the choice
Latency: is performance of the streaming application paramount?
Development cost: is it desired to have similar code bases for batch and stream processing (=> lambda architecture)?
Message delivery guarantees: is there high importance on processing every single record, or is some nominal amount of data loss acceptable?
Process fault tolerance: is high availability of primary concern?

18 Apache Storm. A platform for doing analysis on streams of data as they come in, so you can react to data as it happens. A highly distributed real-time computation system. Provides general primitives to do real-time computation. Simplifies working with queues & workers. Scalable and fault-tolerant. Complementary to Hadoop. Written in Clojure, supports Java and Clojure. Originated at BackType, acquired by Twitter in 2011. Open sourced late 2011. Part of the Apache Incubator since September 2013.

19 Apache Storm Core concepts
Tuple: core data structure in Storm. Immutable set of key/value pairs. You can think of Storm tuples as events. Values must be serializable.
Stream: key abstraction of Storm. An unbounded sequence of tuples that can be processed in parallel by Storm. Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID. Each stream also has an associated schema for the tuples that will flow through it.
[Diagram: a stream drawn as a sequence of tuples (T).]

20 Apache Storm Core concepts
Topology: wires data and functions via a DAG (directed acyclic graph). Executes on many machines, similar to a MapReduce job in Hadoop.
Spout: source of data streams (tuples). Can be run in reliable and unreliable mode.
Bolt: consumes 1+ streams and potentially produces new streams. Complex operations often require multiple steps and thus multiple bolts. Calculate, filter, aggregate, join, talk to a database.
[Diagram: spouts are the sources of streams A and B. One bolt subscribes to A and emits C; one bolt subscribes to A and emits D; one bolt subscribes to A & B and emits nothing; one bolt subscribes to C & D and emits nothing.]

21 Storm: How does it work? [Diagram: the tweet "NFL: Peyton Manning and Denver's elite offense fall flat in #Superbowl XLVIII ow.ly/tdqzn #seahawks #broncos" enters the Twitter Spout (filtering on #Superbowl). Two Split Sentence bolts emit the words Peyton, Superbowl, Superbowl, NFL, Manning, ... to two Word Count bolts.]

22 Storm: How does it work? [Diagram, next step: the Word Count bolts increment their counters. INCR Peyton: Peyton = 1. INCR Superbowl: Superbowl = 1. INCR Superbowl. INCR NFL: NFL = 1. INCR Manning: Manning = 1.]

23 Storm: How does it work? [Diagram, final step: the Word Count bolts forward their counters to a Report bolt, which shows Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1.]
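The word-count flow from the three preceding slides can be imitated with plain Python generators. This is only a single-process sketch of the data flow, not Storm code: the function names stand in for the spout and bolts, and the input is a simplified, hypothetical tweet text.

```python
from collections import Counter


def tweet_spout(tweets):
    """Stand-in for the Twitter spout: emits one tuple per tweet."""
    for tweet in tweets:
        yield {"tweet": tweet}


def split_sentence(stream):
    """Stand-in for the Split Sentence bolt: emits one tuple per word."""
    for tup in stream:
        for word in tup["tweet"].split():
            yield {"word": word}


def word_count(stream):
    """Stand-in for the Word Count bolt: increments a counter per word."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1  # corresponds to the INCR steps on the slide
    return counts


def report(counts):
    """Stand-in for the Report bolt: returns the final counters."""
    return dict(counts)


tweets = ["NFL Peyton Manning Superbowl Superbowl"]
result = report(word_count(split_sentence(tweet_spout(tweets))))
```

In a real topology each stage would run as several parallel tasks on different machines; the chaining of generators only shows which component consumes the output of which.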

24 Storm - Topology
[Diagram: Twitter Spout -> (shuffle) -> Split Sentence bolts -> (fields) -> Word Count bolts -> (global) -> Report bolt.]
Each spout or bolt runs N instances in parallel.
Shuffle grouping: random grouping.
Fields grouping: grouped by value, such that equal values result in the same task.
All grouping: replicates to all tasks.
Global grouping: makes all tuples go to one task.
None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to.
Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple.
Local or Shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any. Otherwise falls back to shuffle grouping behavior.
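The difference between the groupings comes down to how a target task is chosen for each tuple. The sketch below illustrates four of them; the function names are hypothetical, and the use of a CRC32 hash modulo the task count is an assumption for the example, not a statement about Storm's internal hashing.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick a random task, so load is spread evenly over time."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field, num_tasks):
    """Pick the task from the field value: equal values go to the same task."""
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def global_grouping():
    """Send every tuple to a single task (task 0 in this sketch)."""
    return 0


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


rng = random.Random(42)
words = ["superbowl", "nfl", "superbowl", "peyton", "superbowl"]

# With fields grouping every occurrence of a word reaches the same counter task.
fields_targets = [fields_grouping({"word": w}, "word", 4) for w in words]

# With shuffle grouping the same word may reach different tasks.
shuffle_targets = [shuffle_grouping(4, rng) for _ in words]
```

This is why the word-count topology uses fields grouping in front of the Word Count bolts: a counter is only correct if all occurrences of a word arrive at the same task.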

25 Storm - Creating Topology

26 Using a NoSQL database for storing results (keeping state with counter type columns). [Diagram: from the Twitter stream, the tweet "NFL: Peyton Manning and Denver's elite offense fall flat in #SuperBowl XLVIII ow.ly/tdqzn #seahawks #broncos" enters the Twitter Spout (filtering on #Superbowl). Two Hashtag Splitter bolts emit superbowl, superbowl, seahawks, broncos to two Hashtag Counter bolts, which issue INCR superbowl, INCR superbowl, INCR seahawks, INCR broncos against the database; the resulting counters include seahawks = 1 and broncos = 1.]

27 Storm Trident. High-level abstraction on top of Storm. Simplifies building topologies. Core data model is the stream, processed as a series of batches (micro-batches). The stream is partitioned among the nodes in the cluster.
5 kinds of operations in Trident:
Operations that apply locally to each partition and cause no network transfer
Repartitioning operations that don't change the contents
Aggregation operations that do network transfer
Operations on grouped streams
Merges and joins

28 Storm Trident - Creating Topology. [Diagram: Twitter stream -> tweet -> Twitter Spout -> tweet -> Hashtag Splitter bolt -> hashtag -> (local) -> Hashtag Normalizer -> hashtag -> (groupBy) -> Persistent Aggregate bolt.]

29 Trident Concepts - Function. Takes in a set of input fields and emits zero or more tuples as output. The fields of the output tuple are appended to the original input tuple in the stream. If a function emits no tuples, the original input tuple is filtered out. Otherwise, the input tuple is duplicated for each output tuple.
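These function semantics can be mimicked in Python to show the append, filter and duplicate behavior. The helper `apply_function` and the hashtag extractor are hypothetical names for this sketch and do not reproduce Trident's Java API.

```python
def apply_function(stream, input_field, function, output_field):
    """Apply `function` to one input field and append each emitted value."""
    for tup in stream:
        # Zero emitted values drop the input tuple; several duplicate it.
        for value in function(tup[input_field]):
            yield {**tup, output_field: value}


def extract_hashtags(text):
    """Emit one normalized hashtag per '#word' in the text."""
    return [w[1:].lower() for w in text.split() if w.startswith("#") and len(w) > 1]


tweets = [
    {"tweet": "fall flat in #SuperBowl #seahawks"},
    {"tweet": "no tags here"},
]
result = list(apply_function(tweets, "tweet", extract_hashtags, "hashtag"))
```

The first tweet appears twice in `result`, once per hashtag, each time with the original `tweet` field plus the appended `hashtag` field; the second tweet emits nothing and is therefore filtered out.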

30 Storm Core vs. Storm Trident
Criterion | Core Storm | Storm Trident
Community | > 100 contributors | > 100 contributors
Adoption | *** | *
Language Options | Java, Clojure, Scala, Python, Ruby, ... | Java, Clojure, Scala
Processing Models | Event-Streaming | Micro-Batching
DSL | No | Yes
Stateful Ops | No | Yes
Distributed RPC | Yes | Yes
Delivery Guarantees | At most once / At least once | Exactly once
Latency | sub-second | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN

32 Apache Spark. Apache Spark is a fast and general engine for large-scale data processing. The hot trend in Big Data! Based on the 2007 Microsoft Dryad paper. Written in Scala, supports Java, Python, SQL and R. Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Runs everywhere: on Hadoop, Mesos, standalone or in the cloud. One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations. Originally developed in 2009 in UC Berkeley's AMPLab. Open sourced in 2010, since 2014 part of the Apache Software Foundation.

33 Apache Spark
Spark Core: general execution engine for the Spark platform. In-memory computing capabilities deliver speed. General execution model supports a wide variety of use cases. DAG-based. Ease of development: native APIs in Java, Scala and Python.
Spark Streaming: runs a streaming computation as a series of very small, deterministic batch jobs. Batch size as low as ½ sec, latency of about 1 sec. Exactly-once semantics. Potential for combining batch and streaming processing in the same system. Started in 2012, first alpha release in 2013.

34 Apache Spark - Generality
Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
Core Runtime: Spark Core API and Execution Model
Cluster Resource Managers: Spark Standalone, Mesos, YARN
Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly.

35 Apache Spark Core concepts
Resilient Distributed Dataset (RDD): core Spark abstraction. Collections of objects (partitions) spread across the cluster. Partitions can be stored in-memory or on-disk (local). Enables parallel processing on data sets. Built through parallel transformations. Immutable, recomputable, fault tolerant. Contains the transformation history ("lineage") for the whole data set.
Operations: stateless transformations (map, filter, groupBy) and actions (count, collect, save).
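The interplay of immutability, lineage and lazy evaluation can be shown with a toy class. `LazyDataset` is a hypothetical stand-in written for this sketch; it is not the Spark API and ignores partitioning and distribution entirely.

```python
class LazyDataset:
    """Toy immutable dataset that records its transformation history."""

    def __init__(self, source, lineage=()):
        self._source = source          # re-readable input, e.g. a range or list
        self.lineage = tuple(lineage)  # ordered (kind, function) steps

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        # Actions execute the recorded transformations from the source.
        data = list(self._source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        return len(self.collect())


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

steps = [kind for kind, _ in evens_squared.lineage]
values = evens_squared.collect()
```

Because every derived dataset carries its full lineage and the source is never modified, a lost result can be recomputed at any time by calling the action again, which is the idea behind "recomputable, fault tolerant".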

36 RDD Lineage Example. [Diagram: HDFS file input 1 is read with SparkContext.hadoopFile() into a HadoopRDD, then filter() yields a FilteredRDD and map() a MappedRDD. HDFS file input 2 is read with SparkContext.hadoopFile() into a HadoopRDD, then map() yields a MappedRDD. These transformations are lazy. join() combines both branches into a ShuffledRDD. The action saveAsHadoopFile() executes the transformations and writes the HDFS file output.] Adapted from Chris Fregly.

37 RDD Execution Example. [Diagram: several FileRDDs, each split into partitions 1 to 5, are transformed partition by partition through groupByKey(), filter(), map() and join() into ShuffledRDDs and a MappedRDD with corresponding partitions.]

38 Apache Spark Streaming Core concepts
Discretized Stream (DStream): core Spark Streaming abstraction. Micro-batches of RDDs. Operations similar to RDD.
Input DStreams: represent the stream of raw data received from streaming sources. Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc. Custom sources can be easily written for custom data sources.
Operations: same as Spark Core, plus additional stateful transformations (window, reduceByWindow).
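A windowed transformation over micro-batches can be sketched without Spark. The function `reduce_by_window` below is a hypothetical pure-Python stand-in; it measures the window length and slide interval in number of batches rather than in time, which is a simplification made for this example.

```python
from collections import Counter


def reduce_by_window(batches, window_length, slide_interval):
    """Count words over a sliding window of the most recent micro-batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        # The window covers the last `window_length` batches up to `end`.
        window = batches[max(0, end - window_length):end]
        counts = Counter()
        for batch in window:
            counts.update(batch)
        results.append(dict(counts))
    return results


# Each inner list is one micro-batch of words.
batches = [["storm"], ["spark", "storm"], ["spark"], ["kafka"]]
windowed = reduce_by_window(batches, window_length=2, slide_interval=1)
```

Each entry of `windowed` is the word count over the two most recent batches, so a word drops out of the result as soon as its batch leaves the window.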

39 Discretized Stream (DStream). [Diagram: along a time axis (time 1, time 2, time 3, ..., time n), the input stream of messages is cut into a DStream, where each interval holds an RDD with message 1 ... message n. The transformation lineage map() turns it into a MappedDStream, where each interval holds f(message 1) ... f(message n). Actions such as saveAsHadoopFiles() trigger Spark jobs and produce result 1 ... result n per interval.] Adapted from Chris Fregly.

40 Spark Streaming Example

41 Storm Core vs. Storm Trident vs. Spark Streaming
Criterion | Core Storm | Storm Trident | Spark Streaming
Community | > 100 contributors | > 100 contributors | > 80 contributors
Adoption | *** | * | *
Language Options | Java, Clojure, Scala, Python, Ruby, ... | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
DSL | No | Yes | Yes
Stateful Ops | No | Yes | Yes
Distributed RPC | Yes | Yes | No
Delivery Guarantees | At most once / At least once | Exactly once | Exactly once
Latency | sub-second | seconds | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE

43 Unified Log. That's what most people think about logs:
[0/Jul/01:13::6-0800] "GET /wp-admin/images/date-button.gif HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver= HTTP/1.1"
[0/Jul/01:13::8-0800] "POST /wp-admin/admin-ajax.php HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/post.php HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/admin-ajax.php HTTP/1.1"
But this is what we mean here by log: a structured log (records are numbered beginning with 0, based on the order they are written), aka commit log or journal. [Diagram: a sequence of numbered records, from the 1st record to the next record written.]

44 Central Unified Log for (real-time) subscription. Take all the organization's data and put it into a central log for subscription.
Properties of the Unified Log:
Unified: enterprise-wide, single deployment
Append-only: events are appended, no update in place => immutable
Ordered: each event has an offset, which is unique within a shard
Fast: should be able to handle thousands of messages / sec
Distributed: lives on a cluster of machines
[Diagram: a collector writes to the log; consumer system A reads at time = 6 and consumer system B reads at time = 10.]

45 Apache Kafka - Overview. A distributed publish-subscribe messaging system. Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, ...). Initially developed at LinkedIn, now part of Apache. Does not follow JMS standards and does not use the JMS API. Kafka maintains feeds of messages in topics. [Diagram: several producers write to the Kafka cluster and several consumers read from it. Anatomy of a topic: a topic consists of partitions (starting with partition 0); writes are appended to each partition, ordered from old to new.]

46 Apache Kafka - Motivation. LinkedIn's motivation for Kafka was: "A unified platform for handling all the real-time data feeds a large company might have."
Must haves:
High throughput to support high-volume event feeds.
Support real-time processing of these feeds to create new, derived feeds.
Support large data backlogs to handle periodic ingestion from offline systems.
Support low-latency delivery to handle more traditional messaging use cases.
Guarantee fault tolerance in the presence of machine failures.

47 Apache Kafka - Performance. Kafka at LinkedIn: 10+ billion writes per day; 172k messages per second (average); 55+ billion messages per day to real-time consumers. Up to 2 million writes/sec on 3 cheap machines, using 3 producers on 3 different machines.

48 Apache Kafka - Partition offsets. Offset: messages in the partitions are each assigned a unique (per partition) and sequential ID called the offset. Consumers track their pointers via (offset, partition, topic) tuples. [Diagram: consumer group C1 reading from the partitions.]
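Offset tracking can be illustrated with a toy in-memory partition. The classes `PartitionLog` and `Consumer` are hypothetical and only model the bookkeeping described on the slide; they are not the Kafka client API.

```python
class PartitionLog:
    """Toy append-only partition: the offset is the position in the list."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]


class Consumer:
    """Toy consumer that tracks its own position per (topic, partition)."""

    def __init__(self):
        self.positions = {}  # (topic, partition) -> next offset to read

    def poll(self, topic, partition, log, max_records=10):
        key = (topic, partition)
        offset = self.positions.get(key, 0)
        records = log.read(offset, max_records)
        self.positions[key] = offset + len(records)
        return records


log = PartitionLog()
offsets = [log.append(f"reading-{i}") for i in range(5)]

fast, slow = Consumer(), Consumer()
fast_records = fast.poll("meter", 0, log)               # reads everything
slow_records = slow.poll("meter", 0, log, max_records=2)  # reads only two
```

Both consumers read the same partition independently: the log itself keeps no per-consumer state, and each consumer resumes from its own stored offset on the next poll.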

49 Apache Kafka - Two options for log cleanup
Retaining a window of data: ideal for event data. The window can be defined in time (days) or space (GBs); defaults to 1 week.
Retaining a complete log (log compaction): ideal for keyed data. Keeps a space-efficient complete log of changes. Log compaction runs in the background. Ensures that at least the last known value for each message key within the log is always retained.
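The effect of log compaction can be shown on a small keyed log. The function `compact` is a hypothetical one-pass illustration of the retention rule (keep the latest value per key); it does not model Kafka's background cleaner, segments or tombstones.

```python
def compact(log):
    """Keep only the latest record per key, preserving the original offsets."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


# Keyed log of (key, value) records; the list index is the offset.
log = [
    ("meter-1", 10),
    ("meter-2", 7),
    ("meter-1", 12),
    ("meter-3", 3),
    ("meter-2", 9),
]
compacted = compact(log)
```

The compacted log is shorter but still contains the last known value for every key, so a consumer replaying it from the beginning can rebuild the current state.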

50 Data Flow Graphs using Unified Log. Stream processing allows for computing feeds off of other feeds. Derived feeds are no different than the original feeds they are computed off. Single deployment of the Unified Log, but logically different feeds. [Diagram: the Meter Readings Collector writes the Raw Meter Readings feed. An Enrich / Transform step combines it with Customer data into the Meter with Customer feed. Aggregate by Minute steps derive the Meter by Minute and Meter by Customer by Minute feeds. Persist steps store the feeds.]

51 Agenda. 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures 6. Summary

52 Architectural Pattern: Standalone Event Stream Processing. [Diagram: social media streams and the Internet of Things form the event cloud. Events enter through the Enterprise Event Bus (ingress) into Event Processing (ESP / CEP), which uses rules from a Business Rule Management System and a State Store / Event Store. Results go to a Result Store and, via the Enterprise Event Bus and the Enterprise Service Bus, to analytical applications and the DB.]

53 Architectural Pattern: Event Stream Processing as part of Lambda Architecture. [Diagram: social media streams and the Internet of Things form the event cloud. Events enter through the Enterprise Event Bus (ingress). One path leads into the Hadoop Big Data infrastructure (HDFS, Map/Reduce) with its own Result Store. The other path leads into Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store. Results reach analytical applications and the DB via the Enterprise Event Bus and the Enterprise Service Bus.]

54 Architectural Pattern: Event Stream Processing as part of Kappa Architecture. [Diagram: social media streams and the Internet of Things form the event cloud. Events enter through the Enterprise Event Bus (ingress) into Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store. The Hadoop Big Data infrastructure (HDFS) is used for replay. Results reach analytical applications and the DB via the Enterprise Service Bus.]

55 Questions and answers... Guido Schmutz, Technology Manager. BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA