Streaming items through a cluster with Spark Streaming

Size: px

Start display at page:

Download "Streaming items through a cluster with Spark Streaming"

Annis Snow
10 years ago
Views:

1 Streaming items through a cluster with Spark Streaming Tathagata TD CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015

2 Who am I? > Project Management Committee (PMC) member of Apache Spark > Lead developer of Spark Streaming > Formerly in AMPLab, UC Berkeley > Software developer at Databricks > Databricks was started by creators of Spark to provide Spark-as-a-service in the cloud

3 Big Data

4 Big Streaming Data

5 Why process Big Streaming Data? Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets

6 How to Process Big Streaming Data Ingest data Process data Store results Raw Tweets > Ingest Receive and buffer the streaming data > Process Clean, extract, transform the data > Store Store transformed data for consumption

buffer the streaming data > Process Clean, extract,

7 How to Process Big Streaming Data Ingest data Process data Store results Raw Tweets > For big streams, every step requires a cluster > Every step requires a system that is designed for it

8 Stream Ingestion Systems Ingest data Process data Store results Raw Tweets Amazon Kinesis > Kafka popular distributed pub-sub system > Kinesis Amazon managed distributed pub-sub > Flume like a distributed data pipe

popular distributed pub-sub system > Kinesis Amazon

9 Stream Ingestion Systems Ingest data Process data Store results Raw Tweets > Spark Streaming most demanded > Storm most widely deployed (as of now ;) ) > Samza gaining popularity in certain scenarios

demanded > Storm most widely deployed (as of now

10 Stream Ingestion Systems Ingest data Process data Store results Raw Tweets > File systems HDFS, Amazon S3, etc. > Key-value stores HBase, Cassandra, etc. > Databases MongoDB, MemSQL, etc.

12 Producers and Consumers Producer 1 (topicx, data1) (topicy, data2) (topicx, data1) (topicx, data3) Topic X Consumer Producer 2 (topicx, data3) Kafka Cluster (topicy, data2) Topic Y Consumer > Producers publish data tagged by topic > Consumers subscribe to data of a particular topic

data3) Kafka Cluster (topicy, data2) Topic Y Consumer > Producers

13 Topics and Partitions > Topic = category of message, divided into partitions > Partition = ordered, numbered stream of messages > Producer decides which (topic, partition) to put each message in

14 Topics and Partitions > Topic = category of message, divided into partitions > Partition = ordered, numbered stream of messages > Producer decides which (topic, partition) to put each message in > Consumer decides which (topic, partition) to pull messages from - High-level consumer handles fault-recovery with Zookeeper - Simple consumer low-level API for greater control

message in > Consumer decides which (topic, partition) to pull messages from - High-level

15 How to process Kafka messages? Ingest data Process data Store results Raw Tweets > Incoming tweets received in distributed manner and buffered in Kafka > How to process them?

16 treaming

17 What is Spark Streaming? Scalable, fault-tolerant stream processing system High-level API joins, windows, often 5x less code Fault-tolerant Exactly-once semantics, even for stateful ops Integration Integrate with MLlib, SQL, DataFrames, GraphX Kafka Flume HDFS Kinesis Twitter File systems Databases Dashboards

windows, often 5x less code Fault-tolerant Exactly-once semantics, even for

18 How does Spark Streaming work? > Receivers chop up data streams into batches of few seconds > Spark processing engine processes each batch and pushes out the results to external data stores data streams Receivers batches as RDDs results as RDDs

> Spark processing engine processes each batch and pushes

19 Spark Programming Model > Resilient distributed datasets (RDDs) - Distributed, partitioned collection of objects - Manipulated through parallel transformations (map, filter, reducebykey, ) - All transformations are lazy, execution forced by actions (count, reduce, take, ) - Can be cached in memory across cluster - Automatically rebuilt on failure

filter, reducebykey, ) - All transformations are lazy, execution forced by actions

20 Spark Streaming Programming Model > Discretized Stream (DStream) - Represents a stream of data - Implemented as a infinite sequence of RDDs > DStreams API very similar to RDD API - Functional APIs in - Create input DStreams from Kafka, Flume, Kinesis, HDFS, - Apply transformations

RDDs > DStreams API very similar to RDD API - Functional APIs in -

21 Example Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1)) StreamingContext is the star)ng point of all streaming func)onality Batch interval, by which streams will be chopped up

22 Example Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1)) val tweets = TwitterUtils.createStream(ssc, auth) Input DStream TwiCer Streaming API t t+1 t+2 tweets DStream replicated and stored in memory as RDDs

23 Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatmap(status => gettags(status)) transformed DStream transforma0on: modify data in one DStream to create another DStream t t+1 t+2 tweets DStream flatmap flatmap flatmap hashtags Dstream [#cat, #dog, ] new RDDs created for every batch

24 Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatmap(status => gettags(status)) hashtags.saveastextfiles("hdfs://...") output opera0on: to push data to external storage tweets DStream t t+1 t+2 hashtags DStream flatmap flatmap flatmap save save save every batch saved to HDFS

25 Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatmap(status => gettags(status)) hashtags.foreachrdd(hashtagrdd => {... }) foreachrdd: do whatever you want with the processed data tweets DStream t t+1 t+2 hashtags DStream flatmap flatmap flatmap foreach foreach foreach Write to a database, update analy)cs UI, do whatever you want

26 Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatmap(status => gettags(status)) hashtags.foreachrdd(hashtagrdd => {... }) all of this was just setup for what to do when streaming data is receiver ssc.start() this actually starts the receiving and processing

27 What s going on inside? > Receiver buffers tweets in Executors memory > Spark Streaming Driver launches tasks to process tweets Raw Tweets Twitter Receiver Spark Cluster Buffered Tweets Buffered Tweets launch tasks to process tweets Driver running DStreams Executors

28 What s going on inside? Kafka Cluster Spark Cluster receive data in parallel Receiver Receiver Receiver Executors launch tasks to process data Driver running DStreams

29 Performance Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency Cluster Thhroughput (GB/s) Grep 1 sec 2 sec # Nodes in Cluster Cluster Throughput (GB/s) WordCount 1 sec 2 sec # Nodes in Cluster

30 Window-based Transformations val tweets = TwitterUtils.createStream(ssc, auth) val hashtags = tweets.flatmap(status => gettags(status)) val tagcounts = hashtags.window(minutes(1), Seconds(5)).countByValue() sliding window opera)on window length sliding interval window length DStream of data sliding interval

31 Arbitrary Stateful Computations Specify function to generate new state based on previous state and new data - Example: Maintain per-user mood as state, and update it with their tweets def updatemood(newtweets, lastmood) => newmood val moods = tweetsbyuser.updatestatebykey(updatemood _)

32 Integrates with Spark Ecosystem Spark Spark SQL MLlib GraphX Streaming Spark Core

33 Combine batch and streaming processing > Join data streams with static data sets // Create data set from Hadoop file val dataset = sparkcontext.hadoopfile( file ) // Join each batch in stream with dataset kafkastream.transform { batchrdd => Spark Spark SQL MLlib GraphX Streaming } batchrdd.join(dataset)filter(...) Spark Core

34 Combine machine learning with streaming > Learn models offline, apply them online // Learn model offline val model = KMeans.train(dataset,...) // Apply model online on stream Spark Spark SQL MLlib GraphX Streaming kafkastream.map { event => } model.predict(event.feature) Spark Core

35 Combine SQL with streaming > Interactively query streaming data with SQL // Register each batch in stream as table kafkastream.map { batchrdd => Spark Spark SQL MLlib GraphX batchrdd.registertemptable("latestevents") Streaming } // Interactively query table sqlcontext.sql("select * from latestevents") Spark Core

36 100+ known industry deployments

37 Why are they adopting Spark Streaming? Easy, high-level API Unified API across batch and streaming Integration with Spark SQL and MLlib Ease of operations 37

38 Freeman Lab, Janelia Farm Spark Streaming and MLlib to analyze neural activities Laser microscope scans Zebrafish brainà Spark Streaming à interactive visualization à laser ZAP to kill neurons!

39 Freeman Lab, Janelia Farm Streaming machine learning algorithms on time series data of every neuron Upto 2TB/hour and increasing with brain size Upto 80 HPC nodes

40 Streaming Machine Learning Algos > Streaming Linear Regression > Streaming Logistic Regression > Streaming KMeans

41 Okay okay, how do I start off? > Online Streaming Programming Guide > Streaming examples org/apache/spark/examples/streaming

Beyond Hadoop with Apache Spark and BDAS

Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared