Real-Time Data Processing Using Spark Streaming

Hari Shreedharan, Software Engineer @ Cloudera
- Committer/PMC Member, Apache Flume
- Committer, Apache Sqoop
- Contributor, Apache Spark
- Author, Using Flume (O'Reilly)
Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates:
- Exponential data growth from mobile, web, and social
- Connected devices: 9B in 2012, growing to 50B by 2020
- Over 1 trillion sensors by 2020
- Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
- Value can degrade quickly, so capture it immediately
- Move from reactive analysis to direct operational impact
- Unlocks new competitive advantages
- Requires a completely new approach...
Use Cases Across Industries

- Credit: Identify fraudulent transactions as soon as they occur.
- Healthcare: Continuously monitor patient vital signs and proactively identify at-risk patients.
- Transportation: Dynamic re-routing of traffic or vehicle fleets.
- Manufacturing: Identify equipment failures and react instantly; perform proactive maintenance.
- Retail: Dynamic inventory management; real-time in-store offers and recommendations.
- Surveillance: Identify threats and intrusions in real time.
- Consumer Internet & Mobile: Optimize user engagement based on the user's current behavior.
- Digital Advertising & Marketing: Optimize and personalize content based on real-time information.
From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem has evolved with it:
- Past: Big Data = Volume + Variety; batch processing; time to insight measured in hours
- Present: Big Data = Volume + Variety + Velocity; batch + stream processing; time to insight measured in seconds
Key Components of Streaming Architectures

- Data ingestion & transportation service (Kafka, Flume)
- Real-time stream processing engine
- Real-time data serving
- Cross-cutting: security, system management, data management & integration
Canonical Stream Processing Architecture

Data Sources -> Data Ingest (Kafka, Flume) -> Kafka -> App 1, App 2, ... -> HDFS, HBase
Spark: Easy and Fast Big Data

Easy to develop:
- Rich APIs in Java, Scala, and Python
- Interactive shell
- 2-5x less code

Fast to run:
- General execution graphs
- In-memory storage
- Up to 10x faster on disk, 100x in memory
Spark Architecture

A Driver coordinates multiple Workers; each Worker caches its partition of the Data in RAM.
RDDs

RDD = Resilient Distributed Dataset:
- An immutable representation of data; operations on one RDD create a new one
- A memory caching layer that stores data in a distributed, fault-tolerant cache
- Created by parallel transformations on data in stable storage
- Materialized lazily

Two observations:
- Can fall back to disk when the dataset does not fit in memory
- Provides fault tolerance through the concept of lineage
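As a hedged sketch of these ideas (the path, app name, and filter predicate are assumptions, not from the deck), the following builds an RDD from stable storage, derives a new RDD through a lazy transformation, caches it, and only materializes it when an action runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RDDSketch"))

// Transformations are lazy: each one records lineage and returns a new RDD.
val lines  = sc.textFile("hdfs://...")          // RDD backed by stable storage
val errors = lines.filter(_.contains("ERROR"))  // new RDD; `lines` is unchanged
errors.cache()                                  // keep in the distributed cache

// Actions force materialization; lost partitions are recomputed from lineage.
val count = errors.count()
```

If a cached partition is lost (or doesn't fit in memory), Spark recomputes it from its lineage rather than replicating the data, which is what makes the cache fault tolerant.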
Spark Streaming

An extension of Apache Spark's core API for stream processing. The framework provides:
- Fault tolerance
- Scalability
- High throughput
Spark Streaming

- Incoming data is represented as Discretized Streams (DStreams)
- The stream is broken down into micro-batches
- Each micro-batch is an RDD, so code can be shared between batch and streaming
Micro-batch Architecture

val tweets = ssc.twitterStream()
val hashtags = tweets.flatMap(status => getTags(status))
hashtags.saveAsHadoopFiles("hdfs://...")

The tweets DStream and the derived hashtags DStream are each a sequence of RDDs (batch @ t, batch @ t+1, batch @ t+2, ...); the flatMap and the save run on every batch. The stream is composed of small (1-10s) batch computations.
Use DStreams for Windowing Functions
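The deck does not show a windowing example; as a hedged sketch, a sliding-window count over the hashtags DStream from the previous slide might look like this (the 60s window and 10s slide are arbitrary assumptions):

```scala
import org.apache.spark.streaming.Seconds

// Count each hashtag over the last 60 seconds, re-evaluated every 10 seconds.
// Window and slide durations must be multiples of the batch interval.
val tagCounts = hashtags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

tagCounts.print()
```

Because each window is itself just a (larger) RDD, all the usual RDD operations apply to the windowed stream.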
Spark Streaming

- Runs as a Spark job; YARN or standalone mode for scheduling
- YARN has KDC (Kerberos) integration
- Use the same code for real-time Spark Streaming and for batch Spark jobs
- Integrates natively with messaging systems such as Flume, Kafka, and ZeroMQ
- Easy to write Receivers for custom messaging systems
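Since a streaming application runs as an ordinary Spark job, it is launched the same way; a hedged sketch of a YARN submission in the Spark 1.x style (the class and jar names are hypothetical):

```shell
# Submit a streaming application to YARN in cluster mode.
spark-submit \
  --master yarn-cluster \
  --class com.example.StreamingApp \
  streaming-app.jar
```

Unlike a batch job, the streaming job runs until it is stopped, so cluster resources stay allocated for its lifetime.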
Sharing Code between Batch and Streaming

- Streaming generates RDDs periodically
- Any code that operates on RDDs can therefore be used in streaming as well

A library function that filters ERRORs:

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}
Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(...)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dstream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dstream.transform(rdd => filterErrors(rdd))
filtered.saveAsTextFiles(...)
Reliability

- Received data is automatically persisted to HDFS via a Write Ahead Log (WAL) to prevent data loss
  - Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark configuration
- When the AM dies, the application is restarted by YARN
  - Received, acked, but unprocessed data (data that made it into blocks) is replayed from the WAL
- Reliable Receivers can replay data from the original source, if required
  - Un-acked data is replayed from the source
  - The Kafka and Flume receivers bundled with Spark are examples
- Reliable Receivers + WAL = no data loss on driver or receiver failure!
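As a sketch, enabling the WAL might look like the following (the app name and batch interval are hypothetical; the WAL needs a checkpoint directory on a fault-tolerant filesystem such as HDFS):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReliableStreamingApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://...")  // the WAL lives under the checkpoint directory
```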
Kafka Connectors

Reliable Kafka DStream:
- Stores received data in a Write Ahead Log on HDFS for replay
- No data loss
- Stable and supported!

Direct Kafka DStream:
- Uses the low-level API to pull data from Kafka
- Replays from Kafka on driver failure
- No data loss
- Experimental
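A hedged sketch of creating both kinds of Kafka stream (the ZooKeeper and broker addresses, consumer group, and topic name are all hypothetical):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: pair with the WAL for no data loss.
val kafkaStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

// Direct stream: no receiver; Spark tracks offsets and replays from Kafka itself.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker-host:9092"), Set("events"))
```

The direct stream avoids the WAL entirely because Kafka itself acts as the replay log.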
Flume Connector

Flume Polling DStream:
- Add the Spark sink (available from Maven) to Flume's plugin directory
- The Flume Polling Receiver polls the sink to receive data
- Replays received data from the WAL on HDFS
- No data loss
- Stable and supported!
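A hedged sketch of creating the polling stream (the host and port are hypothetical and must match the Spark sink configured in the Flume agent):

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Pull-based receiver: Spark polls the Spark sink running inside the Flume agent.
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 4545)
```

Because Spark pulls at its own rate and acks only after storing the data, events stay buffered in the Flume channel until they are safely received.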
Spark Streaming Use Cases

- Real-time dashboards: show approximate results in real time; reconcile periodically with the source of truth using Spark
- Joins of multiple streams: time-based or count-based windows; combine multiple sources of input to produce composite data
- Re-use RDDs created by Streaming in other Spark jobs
What Is Coming?

- Run on secure YARN for more than 7 days!
- Better monitoring and alerting: batch-level and task-level monitoring
- SQL on Streaming: run SQL-like queries on top of streams (medium/long term)
- Python! Limited support coming in Spark 1.3
Current Spark Project Status

- 400+ contributors and 50+ companies contributing, including Databricks, Cloudera, Intel, Yahoo!, etc.
- Dozens of production deployments
- Spark Streaming survived Netflix's Chaos Monkey: production ready!
- Included in CDH!
More Info

- CDH Docs: http://www.cloudera.com/content/cloudera-content/clouderadocs/cdh5/latest/cdh5-installation-guide/cdh5ig_spark_installation.html
- Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
- Apache Spark homepage: http://spark.apache.org/
- GitHub: https://github.com/apache/spark
Thank You!

hshreedharan@cloudera.com

15% discount code for Cloudera Training: PNWCUG_15
university.cloudera.com