Real Time Data Processing using Spark Streaming

Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera

1 Real Time Data Processing using Spark Streaming
Hari Shreedharan, Software Engineer @ Cloudera
- Committer/PMC Member, Apache Flume
- Committer, Apache Sqoop
- Contributor, Apache Spark
- Author, Using Flume (O'Reilly)

2 Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
- Exponential data growth from mobile, web, social
- Connected devices: 9B in 2012 to 50B by 2020
- Over 1 trillion sensors by 2020
- Datacenter IP traffic growing at CAGR of 25%
How can we harness data in real time?
- Value can quickly degrade: capture value immediately
- From reactive analysis to direct operational impact
- Unlocks new competitive advantages
- Requires a completely new approach...

3 Use Cases Across Industries
- Credit: Identify fraudulent transactions as soon as they occur.
- Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.
- Transportation: Dynamic re-routing of traffic or vehicle fleet.
- Manufacturing: Identify equipment failures and react instantly. Perform proactive maintenance.
- Retail: Dynamic inventory management. Real-time in-store offers and recommendations.
- Surveillance: Identify threats and intrusions in real time.
- Consumer Internet & Mobile: Optimize user engagement based on the user's current behavior.
- Digital Advertising & Marketing: Optimize and personalize content based on real-time information.

4 From Volume and Variety to Velocity
Big Data has evolved
- Past: Big-Data = Volume + Variety
- Present: Big-Data = Volume + Variety + Velocity
Hadoop Ecosystem evolves as well
- Past: Batch Processing, Time to Insight of Hours
- Present: Batch + Stream Processing, Time to Insight of Seconds

5 Key Components of Streaming Architectures
- Data Ingestion & Transportation Service (Kafka, Flume)
- Real-Time Stream Processing Engine
- Real-Time Data Serving
- Security
- System Management
- Data Management & Integration

6 Canonical Stream Processing Architecture
[Diagram, truncated in the source: Data Sources; Data Ingest (Kafka, Flume); Kafka; App 1, App ...; HDFS; HBase]

7 Spark: Easy and Fast Big Data
Easy to Develop
- Rich APIs in Java, Scala, Python
- Interactive shell
- 2-5× less code
Fast to Run
- General execution graphs
- In-memory storage
- Up to 10× faster on disk, 100× in memory

8 Spark Architecture
[Diagram: a Driver and three Workers, each Worker holding Data in RAM]

9 RDDs
RDD = Resilient Distributed Datasets
- Immutable representation of data
- Operations on one RDD create a new one
- Memory caching layer that stores data in a distributed, fault-tolerant cache
- Created by parallel transformations on data in stable storage
- Lazy materialization
Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
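The two ideas of lazy materialization and lineage can be illustrated with a small plain-Scala model. This is only a sketch and not how Spark implements RDDs: the names LineageSketch, Dataset, Source, Mapped and Filtered are hypothetical, and nothing here is distributed or cached. Each transformation merely records its parent, and compute() rebuilds the data by walking back to the stable source, which is the property that would let a lost partition be recomputed.

```scala
// Toy model of RDD-style lineage (illustrative only, not Spark's implementation).
object LineageSketch {
  // A dataset is either a stable source or a transformation of a parent dataset.
  sealed trait Dataset[A] {
    // Rebuild the contents by walking the lineage back to the source.
    def compute(): Vector[A]

    // Transformations are lazy: they only record a new lineage node.
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  }

  // Stands in for data in stable storage; `load` is invoked on every compute().
  final case class Source[A](load: () => Vector[A]) extends Dataset[A] {
    def compute(): Vector[A] = load()
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Vector[B] = parent.compute().map(f)
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
  }
}
```

In this model, building a chain such as source.map(...).filter(...) touches no data; the source is read only when compute() is called, and calling compute() again simply replays the same lineage.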

10 Spark Streaming
Extension of Apache Spark's Core API, for Stream Processing.
The Framework Provides
- Fault Tolerance
- Scalability
- High Throughput

11 Spark Streaming
- Incoming data is represented as Discretized Streams (DStreams)
- The stream is broken down into micro-batches
- Each micro-batch is an RDD, so code can be shared between batch and streaming

12 Micro-batch Architecture
    val tweets = ssc.twitterStream()
    val hashtags = tweets.flatMap(status => getTags(status))
    hashtags.saveAsHadoopFiles("hdfs://...")
[Diagram: at each batch time t, t+1, t+2, a batch of the tweets DStream goes through flatMap to produce a batch of the hashtags DStream, which is then saved]
Stream composed of small (1-10s) batch computations
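To make the micro-batch idea concrete without a Spark dependency, here is a plain-Scala sketch. It is a model under stated assumptions, not Spark code: MicroBatchSketch, Event, discretize and hashtagsPerBatch are hypothetical names, getTags is a guessed stand-in for the helper on the slide, timestamps are assumed to be non-negative milliseconds, and intervals with no events are simply omitted.

```scala
// Plain-Scala model of discretizing a stream into micro-batches (illustrative only).
object MicroBatchSketch {
  final case class Event(timeMs: Long, text: String)

  // Group events into consecutive intervals of `batchMs`, keyed by interval start time.
  // Assumes non-negative timestamps; intervals without events do not appear in the result.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => (e.timeMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)

  // Hypothetical stand-in for the slide's getTags: whitespace-separated tokens starting with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").filter(_.startsWith("#")).toSeq

  // The same flatMap is applied independently to every batch, mirroring the diagram.
  def hashtagsPerBatch(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    discretize(events, batchMs).map { case (start, batch) =>
      (start, batch.flatMap(e => getTags(e.text)))
    }
}
```

The point of the sketch is that the per-batch function knows nothing about streaming: it is an ordinary collection transformation that happens to be run once per interval.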

13 Use DStreams for Windowing Functions
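The slide gives no detail on windowing, so the following is only a rough plain-Scala model of the idea: a window covers the last few micro-batches and slides forward by a fixed number of batches. WindowSketch and countByWindow are hypothetical names, and window length and slide are expressed in batches rather than in time. Note one difference from a real stream: Scala's sliding can return a shorter final window when the batches do not divide evenly.

```scala
// Plain-Scala model of a sliding window over micro-batches (illustrative only).
object WindowSketch {
  // Each inner Seq is one micro-batch of words.
  // Returns one word-count map per window position.
  def countByWindow(
      batches: Seq[Seq[String]],
      windowLen: Int, // window length, in number of batches
      slide: Int      // how many batches the window advances each step
  ): Seq[Map[String, Int]] =
    batches
      .sliding(windowLen, slide)
      .map { window =>
        // Merge the batches in this window, then count occurrences per word.
        window.flatten.groupBy(identity).map { case (word, hits) => (word, hits.size) }
      }
      .toVector
}
```

With four batches, a window of two batches and a slide of one, this yields three overlapping windows, each counted independently.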

14 Spark Streaming
- Runs as a Spark job
- YARN or standalone for scheduling
  - YARN has KDC integration
- Use the same code for real-time Spark Streaming and for batch Spark jobs.
- Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ.
- Easy to write Receivers for custom messaging systems.

15 Sharing Code between Batch and Streaming
- Streaming generates RDDs periodically
- Any code that operates on RDDs can therefore be used in streaming as well
Library that filters ERRORS:
    def filterErrors(rdd: RDD[String]): RDD[String] = {
      rdd.filter(s => s.contains("ERROR"))
    }

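16 Sharing Code between Batch and Streaming
Spark:
    val lines = sc.textFile(...)
    val filtered = filterErrors(lines)
    filtered.saveAsTextFile(...)
Spark Streaming:
    val dstream = FlumeUtils.createStream(ssc, "...", 4435)
    // The slide treats the stream as strings; real Flume events would first
    // need their bodies decoded to String.
    // transform returns a new DStream, whereas foreachRDD returns Unit,
    // so transform is what allows the result to be saved afterwards.
    val filtered = dstream.transform((rdd: RDD[String]) => filterErrors(rdd))
    filtered.saveAsTextFiles(...)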

17 Reliability
- Received data is automatically persisted to an HDFS Write Ahead Log (WAL) to prevent data loss
  - Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf
- When the AM (YARN ApplicationMaster) dies, the application is restarted by YARN
- Received, ack-ed and unprocessed data is replayed from the WAL (data that made it into blocks)
- Reliable Receivers can replay data from the original source, if required
  - Un-acked data is replayed from the source
  - The Kafka and Flume receivers bundled with Spark are examples
- Reliable Receivers + WAL = No data loss on driver or receiver failure!
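The replay behaviour described above can be sketched with a toy write-ahead log in plain Scala. This is an in-memory illustration of the principle only, not Spark's WAL: WalSketch, WriteAheadLog, append, markProcessed and replay are hypothetical names, and a real log would live on durable storage such as HDFS.

```scala
// Toy in-memory write-ahead log (illustrative only, not Spark's implementation).
object WalSketch {
  final class WriteAheadLog[A] {
    private var entries = Vector.empty[A]
    private var processed = 0 // number of leading entries already processed

    // Persist the record first; only after this may the receiver ack it to the source.
    def append(record: A): Unit = {
      entries = entries :+ record
    }

    // Called after `n` further records have been processed successfully.
    def markProcessed(n: Int): Unit = {
      processed = math.min(entries.size, processed + n)
    }

    // After a restart, everything that was logged but not yet processed is replayed.
    def replay(): Vector[A] = entries.drop(processed)
  }
}
```

Records that never reached the log were never ack-ed, so in this model they are the ones a reliable receiver would have to fetch again from the original source.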

18 Kafka Connectors
Reliable Kafka DStream
- Stores received data to the Write Ahead Log on HDFS for replay
- No data loss
- Stable and supported!
Direct Kafka DStream
- Uses the low-level API to pull data from Kafka
- Replays from Kafka on driver failure
- No data loss
- Experimental

19 Flume Connector
Flume Polling DStream
- Add the Spark sink from Maven to Flume's plugin directory
- The Flume Polling Receiver polls the sink to receive data
- Replays received data from the WAL on HDFS
- No data loss
- Stable and supported!

20 Spark Streaming Use-Cases
- Real-time dashboards
  - Show approximate results in real time
  - Reconcile periodically with the source of truth using Spark
- Joins of multiple streams
  - Time-based or count-based windows
  - Combine multiple sources of input to produce composite data
- Re-use RDDs created by Streaming in other Spark jobs.

21 What is coming?
- Run on Secure YARN for more than 7 days!
- Better monitoring and alerting
  - Batch-level and task-level monitoring
- SQL on Streaming
  - Run SQL-like queries on top of Streaming (medium to long term)
- Python!
  - Limited support coming in Spark

22 Current Spark project status
- 400+ contributors and 50+ companies contributing
  - Includes: Databricks, Cloudera, Intel, Yahoo! etc.
- Dozens of production deployments
- Spark Streaming survived Netflix Chaos Monkey: production ready!
- Included in CDH!

23 More Info..
- CDH Docs:
- Cloudera Blog:
- Apache Spark homepage:
- GitHub:

24 Thank you
15% Discount Code for Cloudera Training: PNWCUG_15
university.cloudera.com
