Apache Storm vs. Spark Streaming: Two Stream Processing Platforms Compared



Transcription:

Apache Storm vs. Spark Streaming: Two Stream Processing Platforms Compared
DBTA Workshop on Stream Processing, Berne, 3.12.2014 (3rd December 2014)
Guido Schmutz
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA

Guido Schmutz
- Working for Trivadis for more than 18 years
- Oracle ACE Director for Fusion Middleware and SOA
- Co-author of different books
- Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
- Member of Trivadis Architecture Board
- Technology Manager @ Trivadis
- More than 5 years of software development experience
- Contact: guido.schmutz@trivadis.com
- Blog: http://guidoschmutz.wordpress.com
- Twitter: gschmutz

Our company
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields:
OPERATION: Trivadis Services takes over the interacting operation of your IT systems.

Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures

What is Stream Processing?
- Infrastructure for continuous data processing
- The computational model can be as general as MapReduce, but with the ability to produce low-latency results
- Data collected continuously is naturally processed continuously
- Also known as Event Processing / Complex Event Processing (CEP)
Uniform Handling of Event Streams - Unified Log Architecture

Why Stream Processing?
[Figure: response latency of RPC compared with stream processing. Labels: Milliseconds to minutes; Synchronous; Later. Possibly much later.]

How to design a Stream Processing System?
[Diagram: three design variants. (1) An event stream feeds a combined collecting/processing step that produces a result. (2) Collecting and processing are separate steps. (3) A persistent queue sits between collecting and processing.]

How to scale a Stream Processing System?
[Diagram: the event stream is collected by threads 1 to n, events pass through a persistent queue, and processing threads 1 to n produce results.]

How to scale a Stream Processing System?
[Diagram: several collecting processes write events to persistent queues 1 to n, and several processing processes read from the queues and produce results.]

How to scale a Stream Processing System?
[Diagram: collecting processes write events (e) to queues Q1 to Qn; each queue feeds a chain of processing stages A and B.]

How to make a (stateful) Stream Processing System reliable?
- Faults and stragglers are inevitable in large clusters running big data applications
- Streaming applications must recover from them quickly
[Diagram: two event-stream pipelines, each with collecting processes, queues (Q) and processing stages A and B.]

How to make a (stateful) Stream Processing System reliable?
Solution 1: using an active/passive system (hot replication)
- Both systems process the full load
- In case of a failure, automatically switch and use the passive system
- Stragglers slow down both the active and the passive system
[Diagram: an active and a passive copy of the pipeline (collecting processes, queues, processing stages A and B), each keeping its own state in memory and/or on disk.]

How to make a (stateful) Stream Processing System reliable?
Solution 2: Upstream backup
- Nodes buffer sent messages and replay them to a new node in case of failure
- Stragglers are treated as failures
[Diagram: the pipeline (collecting processes, queues, processing stages A and B) with a buffer for replay and the processing state, both held in memory and/or on disk.]
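
The upstream backup idea can be illustrated with a small, self-contained Python sketch. It is a toy model, not code from Storm or Spark: the class names `UpstreamNode` and `SummingNode` and their methods are made up for illustration, and a real system would also persist the buffer and detect failures itself.

```python
from collections import OrderedDict


class UpstreamNode:
    """Hypothetical sender that keeps every message buffered until it is acknowledged."""

    def __init__(self):
        self.buffer = OrderedDict()  # message id -> payload, in send order
        self.next_id = 0

    def send(self, payload, downstream):
        msg_id = self.next_id
        self.next_id += 1
        self.buffer[msg_id] = payload  # keep a copy for a possible replay
        downstream.receive(msg_id, payload)
        return msg_id

    def ack(self, msg_id):
        # The downstream node no longer needs this message to rebuild its state.
        self.buffer.pop(msg_id, None)

    def replay(self, new_downstream):
        # Resend everything still buffered, in the original order.
        for msg_id, payload in self.buffer.items():
            new_downstream.receive(msg_id, payload)


class SummingNode:
    """Hypothetical stateful consumer that keeps a running total in memory."""

    def __init__(self):
        self.total = 0

    def receive(self, msg_id, payload):
        self.total += payload


upstream = UpstreamNode()
primary = SummingNode()
for value in (5, 7, 11):
    upstream.send(value, primary)

# The primary fails before acknowledging anything, so its in-memory state is lost.
replacement = SummingNode()
upstream.replay(replacement)
print(replacement.total)  # the replacement has rebuilt the same total
```

The buffer only shrinks when `ack` is called, so how often the downstream node acknowledges (for example after persisting its state) determines how much has to be replayed after a failure.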

Processing Models
Batch Processing
- Familiar concept of processing data en masse
- Generally incurs high latency
(Event-) Stream Processing
- A one-at-a-time processing model
- A datum is processed as it arrives
- Sub-second latency
- Difficult to process state data efficiently
Micro-Batching
- A special case of batch processing with very small (tiny) batch sizes
- A nice mix between batching and streaming
- At the cost of latency
- Gives stateful computation, making windowing an easy task
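
As a rough illustration of micro-batching, the following Python sketch groups timestamped events into consecutive batches of a fixed interval. The function name `micro_batches` and the sample events are assumptions made for this example; intervals without events are simply skipped here, which a real engine may handle differently.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # All events whose timestamp falls into the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = micro_batches(events, batch_interval=0.5)
print(batches)  # [['a', 'b'], ['c'], ['d']]
```

Each batch can then be handed to ordinary batch code, which is why latency is bounded below by the batch interval.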

Message Delivery Semantics
At most once [0,1]
- Messages may be lost
- Messages are never redelivered
At least once [1..n]
- Messages will never be lost
- But messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
- Messages are never lost
- Messages are never redelivered
- Perfect message delivery
- Incurs higher latency for transactional semantics
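
The remark that redelivery "might be OK if the consumer can handle it" can be made concrete with a short Python sketch. It simulates at-least-once delivery and compares a naive consumer with one that deduplicates by message id. All names (`deliver_at_least_once`, `naive_total`, `idempotent_total`) are hypothetical, and the lost acknowledgement is simulated rather than produced by a real broker.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield (msg_id, payload) pairs; a message whose ack is lost is sent again."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in lost_acks:
            yield msg_id, payload  # redelivery after the missing acknowledgement


def naive_total(deliveries):
    # Counts every delivery, so redelivered messages are counted twice.
    return sum(payload for _, payload in deliveries)


def idempotent_total(deliveries):
    # Remembers processed ids, so a redelivered message has no further effect.
    seen = set()
    total = 0
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        total += payload
    return total


messages = [(1, 10), (2, 20), (3, 30)]
deliveries = list(deliver_at_least_once(messages, lost_acks={2}))
print(naive_total(deliveries))       # 80: message 2 was counted twice
print(idempotent_total(deliveries))  # 60: duplicates are ignored
```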

Requirements dictate the choice
- Latency: Is performance of the streaming application paramount?
- Development cost: Is it desired to have similar code bases for batch and stream processing? => lambda architecture
- Message delivery guarantees: Is there high importance on processing every single record, or is some normal amount of data loss acceptable?
- Process fault tolerance: Is high availability of primary concern?

Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures

Apache Storm
- A platform for doing analysis on streams of data as they come in, so you can react to data as it happens
- A highly distributed real-time computation system
- Provides general primitives to do real-time computation
- Simplifies working with queues and workers
- Scalable and fault-tolerant
- Complementary to Hadoop
- Written in Clojure, supports Java, Clojure
- Originated at BackType, acquired by Twitter in 2011
- Open sourced late 2011
- Part of Apache Incubator since September 2013

Apache Storm: Core concepts
Tuple
- Core data structure in Storm
- Immutable set of key/value pairs
- You can think of Storm tuples as events
- Values must be serializable
Stream
- Key abstraction of Storm
- An unbounded sequence of tuples that can be processed in parallel by Storm
- Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
- Each stream also has an associated schema for the tuples that will flow through it

Apache Storm: Core concepts
Topology
- Wires data and functions via a DAG (directed acyclic graph)
- Executes on many machines, similar to a MapReduce job in Hadoop
Spout
- Source of data streams (tuples)
- Can be run in reliable and unreliable mode
Bolt
- Consumes 1+ streams and potentially produces new streams
- Complex operations often require multiple steps and thus multiple bolts
- Calculate, filter, aggregate, join, talk to a database
[Diagram: two spouts act as stream sources (one labelled as the source of stream B). One bolt subscribes to A and emits C, one bolt subscribes to A and emits D, one bolt subscribes to A and B and emits nothing, and one bolt subscribes to C and D and emits nothing.]

Storm: How does it work?
[Diagram: a Twitter Spout filtering on #Superbowl receives the tweet "NFL: Peyton Manning and Denver's elite offense fall flat in #Superbowl XLVIII ow.ly/tdqzn #seahawks #broncos #Superbowl". Two Split Sentence bolts emit the words (Peyton, Superbowl, Superbowl, NFL, Manning, ...) to two Word Count bolts.]
CAS Big Data - FH Bern, Stream and Event Processing: Event Streams - Apache Storm

Storm: How does it work?
[Diagram, next step: for each incoming word the Word Count bolts increment a counter (INCR Peyton, INCR Superbowl, INCR Superbowl, INCR NFL, INCR Manning), giving Peyton = 1, Superbowl = 1 after its first increment, NFL = 1 and Manning = 1.]

Storm: How does it work?
[Diagram, final step: the Word Count bolts send their counts to a Report bolt, which shows Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1.]
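
The word-count flow from these slides can be mimicked in plain Python. This is only a single-process sketch of the data flow (spout, split, count, report), not Storm code: `split_sentence` and `WordCount` are illustrative stand-ins for the bolts, and the two input sentences are assumed to be the word groups shown in the diagram.

```python
from collections import Counter


def split_sentence(sentence):
    """Stand-in for the Split Sentence bolt: emit one word per tuple."""
    for word in sentence.split():
        yield word


class WordCount:
    """Stand-in for the Word Count bolt: one INCR per incoming word."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return word, self.counts[word]


# Stand-in for the spout: the sentences arriving from the stream.
sentences = ["Peyton Superbowl", "Superbowl NFL Manning"]

counter = WordCount()
for sentence in sentences:
    for word in split_sentence(sentence):
        counter.execute(word)

# Stand-in for the Report bolt.
report = dict(counter.counts)
print(report)
```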

Storm - Topology
[Diagram: Twitter Spout, then (shuffle grouping) two Split Sentence bolts, then (fields grouping) two Word Count bolts, then (global grouping) one Report bolt.]
Each spout or bolt runs N instances in parallel.
- Shuffle grouping: random grouping
- Fields grouping: grouped by value, such that equal values result in the same task
- All grouping: replicates to all tasks
- Global grouping: makes all tuples go to one task
- None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
- Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
- Local or Shuffle grouping: similar to the shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
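
To show what the groupings mean for task selection, here is a minimal Python sketch. It is not Storm's implementation: the function names are invented, and CRC32 is used only as a convenient deterministic hash for the fields grouping, which is an assumption of this example.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick a random task for each tuple."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, fields, num_tasks):
    """Equal values in the chosen fields always map to the same task."""
    key = "|".join(str(tup[field]) for field in fields).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping():
    """Send every tuple to one single task."""
    return 0


rng = random.Random(42)
words = ["superbowl", "seahawks", "superbowl", "broncos"]
for word in words:
    tup = {"word": word}
    print(word, shuffle_grouping(4, rng), fields_grouping(tup, ["word"], 4))
```

The fields grouping is what makes the word count correct: both occurrences of the same word reach the same Word Count task, so a single counter sees all of them.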

Storm - Creating Topology

Using a NoSQL database for storing results (keeping state with counter type columns)
[Diagram: the Twitter stream delivers the tweet "NFL: Peyton Manning and Denver's elite offense fall flat in #SuperBowl XLVIII ow.ly/tdqzn #seahawks #broncos #Superbowl" to a Twitter Spout filtering on #Superbowl. Hashtag Splitter bolts emit superbowl, superbowl, seahawks and broncos. Hashtag Counter bolts issue INCR superbowl, INCR superbowl, INCR seahawks and INCR broncos against the database, which holds the counters superbowl = 1, seahawks = 1, broncos = 1.]

Storm Trident
- High-level abstraction on top of Storm
- Simplifies building topologies
- Core data model is the stream
  - Processed as a series of batches (micro-batches)
  - Stream is partitioned among the nodes in the cluster
- 5 kinds of operations in Trident
  - Operations that apply locally to each partition and cause no network transfer
  - Repartitioning operations that don't change the contents
  - Aggregation operations that do network transfer
  - Operations on grouped streams
  - Merges and joins

Storm Trident - Creating Topology
[Diagram: the Twitter stream delivers tweets to a Twitter Spout. One bolt runs the Hashtag Splitter and, locally, the Hashtag Normalizer; after a groupBy on the hashtag a second bolt runs the Persistent Aggregate.]

Trident Concepts - Function
- Takes in a set of input fields and emits zero or more tuples as output
- Fields of the output tuple are appended to the original input tuple in the stream
- If a function emits no tuples, the original input tuple is filtered out
- Otherwise the input tuple is duplicated for each output tuple
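
The three rules above (append, filter, duplicate) can be sketched in a few lines of Python. This models the semantics only and is not the Trident API: `apply_function` and `split_hashtags` are hypothetical names, and tuples are represented as plain dictionaries.

```python
def apply_function(stream, input_fields, function, output_fields):
    """Apply a Trident-style function to every tuple of a stream."""
    for tup in stream:
        args = [tup[field] for field in input_fields]
        # One output tuple per emitted value; zero emits drop the input tuple.
        for emitted in function(*args):
            yield {**tup, **dict(zip(output_fields, emitted))}


def split_hashtags(text):
    """Emit one (hashtag,) value per hashtag found in the text."""
    for token in text.split():
        if token.startswith("#"):
            yield (token[1:].lower(),)


stream = [
    {"tweet": "fall flat in #SuperBowl #seahawks"},
    {"tweet": "no tags here"},
]
result = list(apply_function(stream, ["tweet"], split_hashtags, ["hashtag"]))
for tup in result:
    print(tup)
```

The first tweet appears twice in the output, once per hashtag, each time with a new `hashtag` field appended; the second tweet emits nothing and therefore disappears.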

Storm Core vs. Storm Trident
                    | Core Storm                              | Storm Trident
Community           | > 100 contributors                      | > 100 contributors
Adoption            | ***                                     | *
Language Options    | Java, Clojure, Scala, Python, Ruby, ... | Java, Clojure, Scala
Processing Models   | Event-Streaming                         | Micro-Batching
DSL                 | No                                      | Yes
Stateful Ops        | No                                      | Yes
Distributed RPC     | Yes                                     | Yes
Delivery Guarantees | At most once / At least once            | Exactly once
Latency             | sub-second                              | seconds
Platform            | Storm Cluster, YARN                     | Storm Cluster, YARN

Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures

Apache Spark
- Apache Spark is a fast and general engine for large-scale data processing
- The hot trend in Big Data!
- Based on the 2007 Microsoft Dryad paper
- Written in Scala, supports Java, Python, SQL and R
- Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
- One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
- Originally developed in 2009 in UC Berkeley's AMPLab
- Open sourced in 2010; part of the Apache Software Foundation since 2014

Apache Spark
Spark Core
- General execution engine for the Spark platform
- In-memory computing capabilities deliver speed
- General execution model supports a wide variety of use cases
- DAG-based
- Ease of development: native APIs in Java, Scala and Python
Spark Streaming
- Runs a streaming computation as a series of very small, deterministic batch jobs
- Batch size as low as ½ sec, latency of about 1 sec
- Exactly-once semantics
- Potential for combining batch and streaming processing in the same system
- Started in 2012, first alpha release in 2013

Apache Spark - Generality
- Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
- Core Runtime: Spark Core API and Execution Model
- Cluster Resource Managers: Spark Standalone, Mesos, YARN
- Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly: http://slidesha.re/11pp7fv

Apache Spark: Core concepts
Resilient Distributed Dataset (RDD)
- Core Spark abstraction
- Collections of objects (partitions) spread across the cluster
- Partitions can be stored in memory or on disk (local)
- Enables parallel processing on data sets
- Built through parallel transformations
- Immutable, recomputable, fault tolerant
- Contains the transformation history ("lineage") for the whole data set
Operations
- Stateless transformations (map, filter, groupBy)
- Actions (count, collect, save)
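
How lazy transformations, actions and lineage fit together can be shown with a toy class in Python. `ToyRDD` is a made-up, single-machine illustration and not the Spark API: it has no partitions and no cluster, and it only records transformations until an action asks for a result.

```python
class ToyRDD:
    """Toy dataset that records its lineage and computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source           # zero-argument callable returning the base data
        self.lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, func):
        # Transformation: nothing is computed, the step is only recorded.
        return ToyRDD(self._source, self.lineage + (("map", func),))

    def filter(self, predicate):
        return ToyRDD(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        # Action: read the source and replay the recorded lineage.
        data = list(self._source())
        for kind, func in self.lineage:
            if kind == "map":
                data = [func(item) for item in data]
            else:
                data = [item for item in data if func(item)]
        return data

    def count(self):
        return len(self.collect())


reads = []


def read_numbers():
    reads.append("read")  # lets us observe when the source is actually read
    return range(10)


squares_of_evens = ToyRDD(read_numbers).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(len(reads))          # 0: the transformations alone did not touch the source
result = squares_of_evens.collect()
print(result)              # [0, 4, 16, 36, 64]
```

Because the lineage is kept, a lost result can be recomputed from the source by replaying the same steps, which is the idea behind "recomputable, fault tolerant".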

RDD Lineage Example
[Diagram: HDFS File Input 1 is read with SparkContext.hadoopFile() into a HadoopRDD, then filter() gives a FilteredRDD and map() gives a MappedRDD. HDFS File Input 2 is read with SparkContext.hadoopFile() into a HadoopRDD, then map() gives a MappedRDD. Both branches are combined by join() into a ShuffledRDD. The transformations are lazy; the action SparkContext.saveAsHadoopFile() executes them and writes the HDFS File Output.]
Adapted from Chris Fregly: http://slidesha.re/11pp7fv

RDD Execution Example
[Diagram: FileRDDs with several partitions each are transformed partition by partition. groupByKey() and join() produce ShuffledRDDs, filter() produces a FileRDD and map() produces a MappedRDD.]

Apache Spark Streaming: Core concepts
Discretized Stream (DStream)
- Core Spark Streaming abstraction
- Micro-batches of RDDs
- Operations similar to RDD
Input DStreams
- Represent the stream of raw data received from streaming sources
- Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc.
- Custom sources can be easily written for custom data sources
Operations
- Same as Spark Core
- Additional stateful transformations (window, reduceByWindow)
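
A windowed reduction over micro-batches can be sketched in plain Python. This is a simplified model in the spirit of reduceByWindow and not the Spark Streaming API: the function name `reduce_by_window` is chosen for this example, and the window length and slide interval are given as numbers of batches rather than as durations.

```python
import functools
import operator


def reduce_by_window(batches, reduce_func, window_length, slide_interval):
    """Reduce all values of the last window_length batches, every slide_interval batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        # Flatten the batches that fall into the current window.
        window = [value for batch in batches[end - window_length:end] for value in batch]
        if window:
            results.append(functools.reduce(reduce_func, window))
    return results


batches = [[1, 2], [3], [4, 5], [6]]
sums = reduce_by_window(batches, operator.add, window_length=2, slide_interval=1)
print(sums)  # [6, 12, 15]
```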

Discretized Stream (DStream)
[Diagram: time increases from time 1 to time n. The input stream of messages is cut into a DStream, one RDD per time step (RDD @time 1 ... RDD @time n, each holding message 1 ... message n). The transformation lineage applies map() and yields a MappedDStream whose RDDs hold f(message 1) ... f(message n). Actions such as saveAsHadoopFiles() trigger Spark jobs and produce result 1 ... result n for each time step.]
Adapted from Chris Fregly: http://slidesha.re/11pp7fv

Spark Streaming Example

Storm Core vs. Storm Trident vs. Spark Streaming
                    | Core Storm                              | Storm Trident        | Spark Streaming
Community           | > 100 contributors                      | > 100 contributors   | > 80 contributors
Adoption            | ***                                     | *                    | *
Language Options    | Java, Clojure, Scala, Python, Ruby, ... | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models   | Event-Streaming                         | Micro-Batching       | Micro-Batching, Batch (Spark Core)
DSL                 | No                                      | Yes                  | Yes
Stateful Ops        | No                                      | Yes                  | Yes
Distributed RPC     | Yes                                     | Yes                  | No
Delivery Guarantees | At most once / At least once            | Exactly once         | Exactly once
Latency             | sub-second                              | seconds              | seconds
Platform            | Storm Cluster, YARN                     | Storm Cluster, YARN  | YARN, Mesos, Standalone, DataStax EE

Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures

Unified Log
That's what most people think about logs:
137.9.78.45 - - [0/Jul/01:13::6-0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 00 111
137.9.78.45 - - [0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-0805 HTTP/1.1" 00 13593
137.9.78.45 - - [0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-0805 HTTP/1.1" 00 101114
137.9.78.45 - - [0/Jul/01:13::8-0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 00 30747
137.9.78.45 - - [0/Jul/01:13::40-0800] "POST /wp-admin/post.php HTTP/1.1" 30-
137.9.78.45 - - [0/Jul/01:13::40-0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 00 73160
137.9.78.45 - - [0/Jul/01:13::41-0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304-
137.9.78.45 - - [0/Jul/01:13::41-0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-0805 HTTP/1.1" 304-
137.9.78.45 - - [0/Jul/01:13::41-0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 00 30809
But this is what we mean here by "log":
- A structured log (records are numbered beginning with 0, based on the order they are written)
- Also known as commit log or journal
[Diagram: a row of numbered records starting at 0; the first record is on the left and the next record is written at the right end.]

Central Unified Log for (real-time) subscription
Take all the organization's data and put it into a central log for subscription.
Properties of the Unified Log:
- Unified: enterprise, single deployment
- Append-only: events are appended, no update in place => immutable
- Ordered: each event has an offset, which is unique within a shard
- Fast: should be able to handle thousands of messages / sec
- Distributed: lives on a cluster of machines
[Diagram: a collector writes to the end of the numbered log; Consumer System A reads at position 6 (time = 6) and Consumer System B reads at position 10 (time = 10).]

Apache Kafka - Overview
- A distributed publish-subscribe messaging system
- Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, ...)
- Initially developed at LinkedIn, now part of Apache
- Does not follow JMS standards and does not use the JMS API
- Kafka maintains feeds of messages in topics
[Diagram: producers write to a Kafka cluster and consumers read from it. Anatomy of a topic: several partitions (starting with partition 0), each an ordered sequence of numbered messages from old to new, with writes appended at the new end.]

Apache Kafka - Motivation
LinkedIn's motivation for Kafka was: "A unified platform for handling all the real-time data feeds a large company might have."
Must-haves:
- High throughput to support high-volume event feeds
- Support real-time processing of these feeds to create new, derived feeds
- Support large data backlogs to handle periodic ingestion from offline systems
- Support low-latency delivery to handle more traditional messaging use cases
- Guarantee fault tolerance in the presence of machine failures

Apache Kafka - Performance
Kafka at LinkedIn
- 10+ billion writes per day
- 17k messages per second (average)
- 55+ billion messages per day to real-time consumers
Up to 2 million writes/sec on 3 cheap machines
- Using 3 producers on 3 different machines
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Apache Kafka - Partition offsets
- Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
- Consumers track their pointers via (offset, partition, topic) tuples
[Diagram: consumer group C1 reading from the partitions.]
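
The role of offsets can be illustrated with a small in-memory model in Python. It does not use a Kafka client: `PartitionedLog` and `Consumer` are hypothetical classes for this sketch, a single topic is assumed, and consumers keep their offsets in a local dictionary.

```python
class PartitionedLog:
    """Toy append-only log with a fixed number of partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        # The offset is the position of the message within its partition.
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        return self.partitions[partition][offset:]


class Consumer:
    """Toy consumer that tracks its own offset per partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = {p: 0 for p in range(len(log.partitions))}

    def poll(self, partition):
        records = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(records)  # advance the pointer
        return records


log = PartitionedLog(num_partitions=2)
log.append(0, "a")
log.append(0, "b")
log.append(1, "x")

consumer_a = Consumer(log)
consumer_b = Consumer(log)
first_poll = consumer_a.poll(0)   # consumer A reads "a" and "b"
log.append(0, "c")
second_poll = consumer_a.poll(0)  # consumer A continues after its last offset
late_poll = consumer_b.poll(0)    # consumer B starts from offset 0 on its own
print(first_poll, second_poll, late_poll)
```

Because the log itself is never modified by reading, each consumer can move through it at its own pace, or reset its offset to re-read older messages.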

Apache Kafka - Two options for log cleanup
Retaining a window of data
- Ideal for event data
- The window can be defined in time (days) or space (GBs); defaults to 1 week
Retaining a complete log (log compaction)
- Ideal for keyed data
- Keeps a space-efficient complete log of changes
- Log compaction runs in the background
- Ensures that at least the last known value for each message key within the log of data is always retained
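
The effect of log compaction can be sketched with a short Python function. It is a simplified, offline model and not Kafka's cleaner: the function name `compact` and the sample records are assumptions, and the whole log is compacted in one pass.

```python
def compact(log):
    """Keep only the latest record per key, with its original offset."""
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(log)
print(compacted)  # [(2, 'k1', 'c'), (3, 'k3', 'd'), (4, 'k2', 'e')]
```

The surviving records keep their order and offsets, so a consumer replaying the compacted log still ends up with the last known value for every key.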

Data Flow Graphs using Unified Log
- Stream processing allows for computing feeds off of other feeds
- Derived feeds are no different from the original feeds they are computed from
- Single deployment of the Unified Log, but logically different feeds
[Diagram: a collector writes meter readings to the feed "Raw Meter Readings". An Enrich / Transform step joins them with customer data into "Meter with Customer". Aggregate-by-minute steps produce "Meter by Minute" and "Meter by Customer by Minute". Persist steps store the feeds.]

Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
6. Summary

Architectural Pattern: Standalone Event Stream Processing
[Diagram components: Social Media Streams, Internet of Things and Event Cloud as event sources; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with a Business Rule Management System (rules), a State Store / Event Store and a Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB.]

Architectural Pattern: Event Stream Processing as part of Lambda Architecture
[Diagram components: Social Media Streams, Internet of Things and Event Cloud as event sources; Hadoop Big Data Infrastructure with HDFS, Map/Reduce and a Result Store; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB.]

Architectural Pattern: Event Stream Processing as part of Kappa Architecture
[Diagram components: Social Media Streams, Internet of Things and Event Cloud as event sources; Hadoop Big Data Infrastructure with HDFS used for replay; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store; Enterprise Service Bus; Analytical Applications; DB.]

Questions and answers...
Guido Schmutz, Technology Manager
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA