Streaming items through a cluster with Spark Streaming. Tathagata "TD" Das (@tathadas). CME 323: Distributed Algorithms and Optimization. Stanford, May 6, 2015
Who am I? > Project Management Committee (PMC) member of Apache Spark > Lead developer of Spark Streaming > Formerly in AMPLab, UC Berkeley > Software developer at Databricks > Databricks was started by creators of Spark to provide Spark-as-a-service in the cloud
Big Data
Big Streaming Data
Why process Big Streaming Data? Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets
How to Process Big Streaming Data Ingest data Process data Store results Raw Tweets > Ingest Receive and buffer the streaming data > Process Clean, extract, transform the data > Store Store transformed data for consumption
How to Process Big Streaming Data Ingest data Process data Store results Raw Tweets > For big streams, every step requires a cluster > Every step requires a system that is designed for it
Stream Ingestion Systems Ingest data Process data Store results Raw Tweets Amazon Kinesis > Kafka popular distributed pub-sub system > Kinesis Amazon managed distributed pub-sub > Flume like a distributed data pipe
Stream Processing Systems Ingest data Process data Store results Raw Tweets > Spark Streaming most in demand > Storm most widely deployed (as of now ;) ) > Samza gaining popularity in certain scenarios
Stream Storage Systems Ingest data Process data Store results Raw Tweets > File systems HDFS, Amazon S3, etc. > Key-value stores HBase, Cassandra, etc. > Databases MongoDB, MemSQL, etc.
Producers and Consumers [diagram: Producer 1 and Producer 2 publish (topic, data) pairs such as (topicX, data1) and (topicY, data2) into a Kafka cluster; one consumer reads Topic X, another reads Topic Y] > Producers publish data tagged by topic > Consumers subscribe to data of a particular topic
Topics and Partitions > Topic = category of message, divided into partitions > Partition = ordered, numbered stream of messages > Producer decides which (topic, partition) to put each message in
Topics and Partitions > Topic = category of message, divided into partitions > Partition = ordered, numbered stream of messages > Producer decides which (topic, partition) to put each message in > Consumer decides which (topic, partition) to pull messages from - High-level consumer handles fault-recovery with Zookeeper - Simple consumer low-level API for greater control
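The topic/partition layout above can be sketched with a toy in-memory broker. This is plain Scala for illustration only, not the real Kafka client API; all names (`ToyBroker`, `publish`, `pull`) are made up for this sketch. It shows the two decisions the slide describes: the producer picks the (topic, partition) for each message (here by hashing the key), and the consumer pulls from a chosen (topic, partition) at an offset it tracks itself, as with Kafka's simple consumer.

```scala
import scala.collection.mutable

// Toy in-memory model of Kafka's topic/partition layout (illustration only,
// not the Kafka API). Each (topic, partition) pair is an ordered log.
class ToyBroker(partitionsPerTopic: Int) {
  private val logs = mutable.Map.empty[(String, Int), mutable.ArrayBuffer[String]]

  // Producer side: the producer decides the partition, here by hashing the key,
  // so all messages with the same key land in the same ordered partition.
  def publish(topic: String, key: String, message: String): Unit = {
    val partition = math.abs(key.hashCode) % partitionsPerTopic
    logs.getOrElseUpdate((topic, partition), mutable.ArrayBuffer.empty) += message
  }

  // Consumer side: pull from a chosen (topic, partition) starting at an offset
  // the consumer tracks itself (the "simple consumer" style of the slide).
  def pull(topic: String, partition: Int, offset: Int): Seq[String] =
    logs.getOrElse((topic, partition), mutable.ArrayBuffer.empty).drop(offset).toSeq
}

val broker = new ToyBroker(partitionsPerTopic = 2)
broker.publish("tweets", key = "user1", message = "hello")
broker.publish("tweets", key = "user1", message = "world")

// Same key => same partition, messages stay in publish order.
val part = math.abs("user1".hashCode) % 2
val pulled = broker.pull("tweets", part, offset = 0)
```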
How to process Kafka messages? Ingest data Process data Store results Raw Tweets > Incoming tweets received in distributed manner and buffered in Kafka > How to process them?
Spark Streaming
What is Spark Streaming? Scalable, fault-tolerant stream processing system. High-level API: joins, windows, often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integration: integrates with MLlib, SQL, DataFrames, GraphX. Kafka Flume HDFS Kinesis Twitter File systems Databases Dashboards
How does Spark Streaming work? > Receivers chop up data streams into batches of a few seconds > Spark processing engine processes each batch and pushes out the results to external data stores data streams Receivers batches as RDDs results as RDDs
Spark Programming Model > Resilient distributed datasets (RDDs) - Distributed, partitioned collection of objects - Manipulated through parallel transformations (map, filter, reduceByKey, ...) - All transformations are lazy, execution forced by actions (count, reduce, take, ...) - Can be cached in memory across cluster - Automatically rebuilt on failure
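The lazy-transformation contract above can be mimicked with Scala's lazy collection views, no Spark required. This is a stand-in, not the RDD API itself: `map` and `filter` only record a plan, and nothing executes until an action (here `sum`) forces it, which the counter makes visible.

```scala
// Mimic RDD laziness with a Scala view: transformations build a plan,
// an action forces evaluation. The counter proves nothing ran early.
var evaluations = 0
val plan = (1 to 10).view
  .map { x => evaluations += 1; x * 2 }  // "transformation": recorded, not run
  .filter(_ > 10)                        // also lazy

val evalsBeforeAction = evaluations      // still 0: no action yet
val total = plan.sum                     // "action": forces the whole pipeline
```

The same deferred-execution idea is what lets Spark fuse transformations into single passes over the data and rebuild lost partitions from the recorded lineage.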
Spark Streaming Programming Model > Discretized Stream (DStream) - Represents a stream of data - Implemented as an infinite sequence of RDDs > DStreams API very similar to RDD API - Functional APIs in Scala, Java, and Python - Create input DStreams from Kafka, Flume, Kinesis, HDFS, ... - Apply transformations
Example Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1)) StreamingContext is the starting point of all streaming functionality Batch interval, by which streams will be chopped up
Example Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1)) val tweets = TwitterUtils.createStream(ssc, auth) Input DStream Twitter Streaming API batch @ t batch @ t+1 batch @ t+2 tweets DStream replicated and stored in memory as RDDs
Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatMap(status => getTags(status)) transformed DStream transformation: modify data in one DStream to create another DStream batch @ t batch @ t+1 batch @ t+2 tweets DStream flatMap flatMap flatMap hashtags DStream [#cat, #dog, ...] new RDDs created for every batch
Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatMap(status => getTags(status)) hashtags.saveAsTextFiles("hdfs://...") output operation: to push data to external storage tweets DStream batch @ t batch @ t+1 batch @ t+2 hashtags DStream flatMap flatMap flatMap save save save every batch saved to HDFS
Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatMap(status => getTags(status)) hashtags.foreachRDD(hashtagRDD => {... }) foreachRDD: do whatever you want with the processed data tweets DStream batch @ t batch @ t+1 batch @ t+2 hashtags DStream flatMap flatMap flatMap foreach foreach foreach Write to a database, update analytics UI, do whatever you want
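The "DStream = infinite sequence of RDDs" model behind these slides can be sketched with plain Scala collections: represent the stream as a sequence of batches, and a DStream transformation like flatMap is simply the per-element transformation applied to every batch independently. The tweet texts and the `getTags` helper here are made up for the sketch; no Spark is involved.

```scala
// Toy model: a DStream is a sequence of batches; flatMap on the DStream
// means flatMap on each batch's "RDD", producing one new batch per interval.
def getTags(status: String): Seq[String] =
  status.split("\\s+").filter(_.startsWith("#")).toSeq

val tweetBatches: Seq[Seq[String]] = Seq(
  Seq("I love my #cat", "walking the #dog"),  // batch @ t
  Seq("#cat videos all day")                  // batch @ t+1
)

// tweets.flatMap(getTags) in miniature: applied batch by batch
val hashtagBatches = tweetBatches.map(batch => batch.flatMap(getTags))
```

An output operation like saveAsTextFiles or foreachRDD then runs once per batch on each of these per-interval results.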
Example Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None) val hashtags = tweets.flatMap(status => getTags(status)) hashtags.foreachRDD(hashtagRDD => {... }) all of this was just setup for what to do when streaming data is received ssc.start() this actually starts the receiving and processing
What's going on inside? > Receiver buffers tweets in the Executors' memory > Spark Streaming Driver launches tasks to process tweets Raw Tweets Twitter Receiver Spark Cluster Buffered Tweets Buffered Tweets launch tasks to process tweets Driver running DStreams Executors
What's going on inside? Kafka Cluster Spark Cluster receive data in parallel Receiver Receiver Receiver Executors launch tasks to process data Driver running DStreams
Performance Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency [charts: cluster throughput (GB/s) vs. number of nodes, for Grep (up to ~6 GB/s) and WordCount (up to ~3 GB/s), each at 1 sec and 2 sec batch intervals]
Window-based Transformations val tweets = TwitterUtils.createStream(ssc, auth) val hashtags = tweets.flatMap(status => getTags(status)) val tagCounts = hashtags.window(Minutes(1), Seconds(5)).countByValue() sliding window operation: window length, sliding interval [diagram: a window of the given length slides over the DStream by the sliding interval]
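Window semantics can be sketched on plain collections: with the stream modeled as a sequence of batches, a windowed count is just a sliding group of consecutive batches, flattened and counted. This is a toy stand-in for window(...).countByValue(), not the Spark implementation; the data and toy window size (3 batches, sliding by 1) are made up.

```scala
// Toy windowed count: window length = 3 batches, sliding interval = 1 batch.
// Each output combines the last 3 batches and counts tag occurrences,
// mirroring hashtags.window(...).countByValue().
val tagBatches: Seq[Seq[String]] = Seq(
  Seq("#cat"), Seq("#dog", "#cat"), Seq("#cat"), Seq("#dog")
)

val windowLength = 3  // measured in batches here
val windowedCounts: Seq[Map[String, Int]] =
  tagBatches.sliding(windowLength).map { window =>
    window.flatten.groupBy(identity).map { case (tag, occs) => tag -> occs.size }
  }.toSeq
```

Note that consecutive windows overlap heavily; Spark exploits this with incremental variants (e.g. reduceByKeyAndWindow with an inverse function) instead of recounting each window from scratch.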
Arbitrary Stateful Computations Specify function to generate new state based on previous state and new data - Example: Maintain per-user mood as state, and update it with their tweets def updateMood(newTweets, lastMood) => newMood val moods = tweetsByUser.updateStateByKey(updateMood _)
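The updateStateByKey contract can be modeled in plain Scala by folding batches into a per-key state map: for every batch, the update function sees a key's new values plus its previous state (None on first sight) and returns the new state. The "mood" here is simplified to a running tweet count, and all names are made up for the sketch.

```scala
// Toy updateStateByKey: state = per-user running tweet count (a stand-in
// for "mood"). The update function matches the slide's shape:
// (new values for the key, previous state) => new state.
def updateCount(newTweets: Seq[String], lastCount: Option[Int]): Int =
  lastCount.getOrElse(0) + newTweets.size

val tweetsByUserBatches: Seq[Seq[(String, String)]] = Seq(
  Seq("alice" -> "so happy!", "bob" -> "meh"),  // batch @ t
  Seq("alice" -> "still happy!")                // batch @ t+1
)

val finalState: Map[String, Int] =
  tweetsByUserBatches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
    val byUser = batch.groupBy(_._1).map { case (u, pairs) => u -> pairs.map(_._2) }
    // keys with no new data keep their old state; keys with data get updated
    state ++ byUser.map { case (user, tweets) =>
      user -> updateCount(tweets, state.get(user))
    }
  }
```

In Spark this state map is itself an RDD, checkpointed periodically so it can be rebuilt after a failure, which is what makes the stateful operation exactly-once.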
Integrates with Spark Ecosystem Spark Spark SQL MLlib GraphX Streaming Spark Core
Combine batch and streaming processing > Join data streams with static data sets // Create data set from Hadoop file val dataset = sparkContext.hadoopFile("file") // Join each batch in stream with the data set kafkaStream.transform { batchRDD => batchRDD.join(dataset).filter(...) }
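The transform-and-join pattern above can be sketched with plain collections: load the static data set once, then join every streamed batch against it. This is a toy stand-in (an inner join via map lookup), not Spark's join; the user/country data is invented for the example.

```scala
// Toy stream-static join: the static set is loaded once, each batch of
// (user, event) pairs is inner-joined against it, mirroring
// kafkaStream.transform { rdd => rdd.join(dataset) }.
val staticData: Map[String, String] = Map("u1" -> "US", "u2" -> "FR")

val eventBatches: Seq[Seq[(String, String)]] = Seq(
  Seq("u1" -> "click", "u3" -> "view"),  // u3 has no match, dropped by the join
  Seq("u2" -> "click")
)

val joinedBatches = eventBatches.map { batch =>
  batch.collect { case (user, event) if staticData.contains(user) =>
    (user, event, staticData(user))
  }
}
```

In Spark the static RDD would typically be cached (or broadcast) so the join does not reload it every batch interval.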
Combine machine learning with streaming > Learn models offline, apply them online // Learn model offline val model = KMeans.train(dataset, ...) // Apply model online on stream kafkaStream.map { event => model.predict(event.feature) }
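"Learn offline, apply online" can be shown in miniature without MLlib: a trained k-means model is just its cluster centers, and online prediction assigns each streamed point to the nearest center. The centers and points below are invented; this is a plain-Scala stand-in for KMeansModel.predict applied per batch.

```scala
// Toy offline model: two 1-D cluster centers "trained" ahead of time.
val centers = Seq(0.0, 10.0)

// Online prediction: index of the nearest center (what KMeans predict does).
def predictCluster(x: Double): Int =
  centers.indices.minBy(i => math.abs(x - centers(i)))

// Apply the fixed model to each streamed batch, mirroring
// kafkaStream.map { event => model.predict(event.feature) }.
val pointBatches = Seq(Seq(1.0, 9.0), Seq(4.0))
val predictions  = pointBatches.map(_.map(predictCluster))
```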
Combine SQL with streaming > Interactively query streaming data with SQL // Register each batch in stream as table kafkaStream.foreachRDD { batchRDD => batchRDD.registerTempTable("latestEvents") } // Interactively query table sqlContext.sql("select * from latestEvents")
100+ known industry deployments
Why are they adopting Spark Streaming? Easy, high-level API Unified API across batch and streaming Integration with Spark SQL and MLlib Ease of operations
Neuroscience @ Freeman Lab, Janelia Farm Spark Streaming and MLlib to analyze neural activities Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons! http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Neuroscience @ Freeman Lab, Janelia Farm Streaming machine learning algorithms on time series data of every neuron Up to 2TB/hour and increasing with brain size Up to 80 HPC nodes http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Streaming Machine Learning Algos > Streaming Linear Regression > Streaming Logistic Regression > Streaming KMeans http://www.jeremyfreeman.net/share/talks/spark-summit-east-2015/#/algorithms-repeat
Okay okay, how do I start off? > Online Streaming Programming Guide http://spark.apache.org/docs/latest/streaming-programming-guide.html > Streaming examples https://github.com/apache/spark/tree/master/examples/src/main/scala/ org/apache/spark/examples/streaming