Going Deep with Spark Streaming




Going Deep with Spark Streaming Andrew Psaltis (@itmdata) ApacheCon, April 16, 2015

Outline Introduction DStreams Thinking about time Recovery and Fault tolerance Conclusion

About Me Andrew Psaltis, Data Engineer @ Shutterstock. Fun outside of Shutterstock: sometimes rambles here: @itmdata. Author of Streaming Data. Dreaming about streaming since 2008. Conference speaker. Content provider for SkillSoft. Lacrosse crazed.

Introduction

Why Streaming? "Without stream processing there's no big data and no Internet of Things" - Dana Sandu, SQLstream. Operational efficiency: 1 extra mph for a locomotive on its daily route can lead to $200M in savings (Norfolk Southern). Tracking behavior: McDonald's (Netherlands) realized a 700% increase in offer redemptions using personalized advertising based on location, weather, previous purchases, and preferences. Predicting machine failure: GE monitors over 5,500 assets from 70+ customer sites globally and can predict failure and determine when something needs maintenance. Improving traffic safety and efficiency: according to the EU Commission, congestion in EU urban areas costs ~€100 billion, or 1 percent of EU GDP, annually.

What is Spark Streaming? Provides efficient, fault-tolerant, stateful stream processing. Provides a simple API for implementing complex algorithms. Integrates with Spark's batch and interactive processing. Integrates with other Spark extensions.
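To make the API concrete, here is a minimal, self-contained sketch of a Spark Streaming program (the socket host and port are placeholders; word counting is just an illustrative workload):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Batch interval of 1 second: each second of input becomes one RDD
    val ssc = new StreamingContext(conf, Seconds(1))

    // A basic (built-in) source: lines of text from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until stopped or an error occurs
  }
}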

High-level Architecture

DStreams

Discretized Streams (DStreams) The basic abstraction provided by Spark Streaming: a continuous series of RDDs.

DStreams 3 Things we want to do Ingest Transform Output

Input DStreams (Ingestion) There are 3 ways to get data in: basic sources, advanced sources, custom sources.

Basic Input DStreams Basic sources: built-in (file system, socket, Akka actors) and non-built-in (Avro, CSV, ...). Not reliable.

Advanced Input DStreams Advanced sources: Twitter, Kafka, Flume, Kinesis, MQTT, ... Require an external library. May be reliable or unreliable.
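For example, a receiver-based Kafka stream is created through the external spark-streaming-kafka library; a hedged sketch (the ZooKeeper address, consumer group, and topic name are placeholders, and ssc is the StreamingContext from the earlier sketch):

import org.apache.spark.streaming.kafka.KafkaUtils

// Returns a DStream of (key, message) pairs read from Kafka
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",       // ZooKeeper quorum
  "my-consumer-group",  // consumer group id
  Map("events" -> 1))   // topic -> number of receiver threads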

Custom Input DStreams Implement two classes: an InputDStream and a Receiver (sketched below).

Custom Input DStream

Custom Receiver
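A minimal custom receiver sketch, based on the standard Receiver API (the socket source, class name, and reconnect behavior are illustrative assumptions, not the code from the talk):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called when the receiver starts; must not block, so the
  // read loop runs on its own thread.
  def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // The read loop checks isStopped, so nothing to clean up here.
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // hand one record to Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

The receiver is then wired into the graph with ssc.receiverStream(new SocketLineReceiver("localhost", 9999)).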

Receiver Reliability Two types of receivers Unreliable Receiver Reliable Receiver

Receiver Reliability Unreliable Receiver Simple to implement No fault-tolerance Data loss when receiver fails

Receiver Reliability Reliable Receiver Complexity depends on the source Strong fault-tolerance guarantees (zero data loss) Data source must support acknowledgement

Input DStream and Receiver

Creating DStreams 2 ways to create DStreams: from an input streaming source, or by transforming a DStream.

Creating a DStream via Transformation Transformations modify data from one DStream to another (e.g., map turns an input DStream into an output DStream). Two general classifications: standard RDD operations (map, countByValue, reduceByKey, join, ...) and stateful operations (window, updateStateByKey, transform, countByValueAndWindow, ...).

Transforming the input - Standard Operation

val myStream = createCustomStream(streamingContext)
val events = myStream.map(...)

(diagram: MyReceiver feeds batches @ t, t+1, t+2 into MyInputDStream; map on each batch produces the events DStream)

Stateful Operation - UpdateStateByKey Provides a way for you to maintain arbitrary state while continuously updating it. For example: in-session advertising, tracking Twitter sentiment.

Stateful Operation - UpdateStateByKey Need to do two things to leverage it: define the state (this can be any arbitrary data) and define the update function (this needs to know how to update the state using the previous state and new values). Requires checkpointing to be configured.

Using updateStateByKey Maintain per-user mood as state, and update it with his/her tweets:

moods = tweets.updateStateByKey(tweet => updateMood(tweet))
updateMood(newTweets, lastMood) => newMood

(diagram: the tweets DStream at t-1, t, t+1, t+2, t+3 continuously updating the moods state DStream)
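The same pattern spelled out as a runnable sketch with a concrete state type: a running count per key stands in for the mood state, since the mood types are not shown (lines is the socket DStream from the earlier sketch; the checkpoint path is a placeholder):

// newValues holds this batch's values for a key; state holds the
// previous state (None the first time a key is seen).
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

ssc.checkpoint("hdfs:///checkpoints/app") // required for stateful operations

val pairs = lines.flatMap(_.split(" ")).map((_, 1))
val runningCounts = pairs.updateStateByKey(updateCount _)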

Transform Allows arbitrary RDD-to-RDD functions to be applied on a DStream:

transform(transformFunc: RDD[T] => RDD[U]): DStream[U]

Example: we want to eliminate noise words from crawled documents:

val noiseWordRDD = ssc.sparkContext.newAPIHadoopRDD(...)
val cleanedDStream = crawledCorpus.transform(rdd => {
  rdd.join(noiseWordRDD).filter(...)
})

Outputting data

val myStream = createCustomStream(streamingContext)
val events = myStream.map(...)
events.countByValue().foreachRDD { ... }

(diagram: the custom receiver feeds batches @ t, t+1, t+2 through myStream; map yields the events DStream, and foreachRDD runs once per batch)
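When the output goes to an external system, a common pattern is to open connections inside foreachPartition rather than per record or on the driver; a sketch, where ConnectionPool is a hypothetical helper standing in for whatever client your sink uses:

events.countByValue().foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, created on the executor,
    // never serialized from the driver.
    val conn = ConnectionPool.getConnection() // hypothetical helper
    partition.foreach(record => conn.send(record.toString))
    ConnectionPool.returnConnection(conn)
  }
}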

From Streaming Program to Spark jobs

myStream = createCustomStream(...)
events = myStream.map(...)
events.countByValue().foreachRDD { ... }

(diagram: the DStream graph built from this program is instantiated once per batch interval as an RDD graph over the blocks received in that interval; the output operation produces 1 Spark job per batch)

Thinking about time

Thinking about time Windowing (tumbling, sliding). Stream time vs. event time. Out-of-order data.

Windowing Common Types Tumbling Sliding

Tumbling (Count) Windowing

Tumbling (Temporal) Windowing

Sliding Window

Spark Streaming -- Sliding Windowing Two types supported: Incremental Non-Incremental

Non-Incremental Sliding Windowing

reduceByKeyAndWindow((a, b) => a + b, Seconds(5), Seconds(1))

Incremental Sliding Windowing

reduceByKeyAndWindow((a, b) => a + b, (a, b) => a - b, Seconds(5), Seconds(1))
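Side by side, assuming pairs is a DStream of (word, count) tuples and ssc is the StreamingContext, both variants compute counts over a 5-second window sliding every second:

// Non-incremental: re-reduces the full 5-second window on every slide.
val windowed = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, Seconds(5), Seconds(1))

// Incremental: adds the new 1-second slice and subtracts the slice that
// fell out of the window; needs an inverse function and checkpointing.
ssc.checkpoint("hdfs:///checkpoints/app") // placeholder path
val windowedInc = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  (a: Int, b: Int) => a - b,
  Seconds(5), Seconds(1))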

More thinking about time Stream time vs. event time: stream time is the time when the record arrives in the streaming system; event time is the time the event was generated, not when it entered the system. Spark Streaming uses stream time. Out-of-order data: does it matter to your application? How do you deal with it?

Handling Out of Order Data Imagine we want to track ad impressions between time t and t+1. Continuous Analytics Over Discontinuous Streams: http://www.eecs.berkeley.edu/~franklin/papers/sigmod10krishnamurthy.pdf

Recovery and Fault Tolerance

Recovery Checkpointing: metadata checkpointing and data checkpointing.
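The usual way to make a driver recoverable is to build the context through StreamingContext.getOrCreate, so a restart picks up from the checkpoint; a hedged sketch (the checkpoint directory is a placeholder, and conf is the SparkConf from the earlier sketch):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/app" // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir) // enables metadata and data checkpointing
  // ... define sources, transformations, and outputs here ...
  ssc
}

// Fresh start: calls createContext(). After a driver failure: rebuilds
// the context and its DStream graph from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()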

Recovery (diagram: recomputation from a long lineage without checkpointing vs. recovery from checkpointed state with checkpointing)

Recovery Checkpointing too frequently: HDFS writes will slow things down. Too infrequently: lineage and task sizes grow. Default setting: a multiple of the batch interval, at least 10 seconds. Recommendation: a checkpoint interval of 5-10 times the sliding interval.
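The interval can be tuned per DStream; a minimal sketch, for a stateful stream sliding every 2 seconds (the moods stream from the updateStateByKey slide is used for illustration):

// Checkpoint this DStream's data every 10 seconds, i.e. 5x the slide.
moods.checkpoint(Seconds(10))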

Fault Tolerance All properties of RDDs still apply. We are trying to protect against two things: failure of a worker and failure of the driver node. Semantics: at most once, at least once, exactly once. Where we need to think about it: receivers, transformations, output.

Conclusion Introduction High-level Architecture DStreams Thinking about time Recovery and Fault tolerance

Thank you Andrew Psaltis @itmdata psaltis.andrew@gmail.com https://www.linkedin.com/in/andrewpsaltis