Real Time Data Processing using Spark Streaming




Real Time Data Processing using Spark Streaming
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume; Committer, Apache Sqoop; Contributor, Apache Spark
Author, Using Flume (O'Reilly)

Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates:
- Exponential data growth from mobile, web, and social
- Connected devices: 9B in 2012 to 50B by 2020
- Over 1 trillion sensors by 2020
- Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real-time?
- Value can quickly degrade, so capture value immediately
- From reactive analysis to direct operational impact
- Unlocks new competitive advantages
- Requires a completely new approach...

Use Cases Across Industries
- Credit: identify fraudulent transactions as soon as they occur.
- Healthcare: continuously monitor patient vital stats and proactively identify at-risk patients.
- Transportation: dynamic re-routing of traffic or vehicle fleets.
- Manufacturing: identify equipment failures and react instantly; perform proactive maintenance.
- Retail: dynamic inventory management; real-time in-store offers and recommendations.
- Surveillance: identify threats and intrusions in real-time.
- Consumer Internet & Mobile: optimize user engagement based on the user's current behavior.
- Digital Advertising & Marketing: optimize and personalize content based on real-time information.

From Volume and Variety to Velocity
Big Data has evolved, and the Hadoop ecosystem has evolved with it.
- Past: Big Data = Volume + Variety. Batch processing, with a time to insight of hours.
- Present: Big Data = Volume + Variety + Velocity. Batch + stream processing, with a time to insight of seconds.

Key Components of Streaming Architectures
- Data ingestion and transportation service (e.g., Kafka, Flume)
- Real-time stream processing engine
- Real-time data serving
Cutting across all layers: security, system management, and data management & integration.

Canonical Stream Processing Architecture
Data sources feed an ingest layer (Kafka, Flume), which delivers events through Kafka to stream processing applications (App 1, App 2, ...), with results served from HDFS and HBase.

Spark: Easy and Fast Big Data
Easy to develop:
- Rich APIs in Java, Scala, and Python
- Interactive shell
- 2-5x less code
Fast to run:
- General execution graphs
- In-memory storage
- Up to 10x faster on disk, 100x in memory

Spark Architecture
A Driver program coordinates multiple Workers; each Worker holds its partition of the data in RAM.

RDDs
RDD = Resilient Distributed Dataset
- Immutable representation of data: operations on one RDD create a new one.
- Memory caching layer that stores data in a distributed, fault-tolerant cache.
- Created by parallel transformations on data in stable storage.
- Lazy materialization.
Two observations:
a. Can fall back to disk when the data set does not fit in memory.
b. Provides fault tolerance through the concept of lineage.
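A minimal sketch of these properties in the interactive shell (the HDFS path is illustrative; `sc` is the SparkContext the shell provides):

```scala
// Transformations are lazy: building these RDDs runs nothing yet.
val lines  = sc.textFile("hdfs:///data/app-logs")   // RDD backed by stable storage
val errors = lines.filter(_.contains("ERROR"))      // new RDD; 'lines' is unchanged
errors.cache()                                      // keep in the distributed, fault-tolerant cache

// An action triggers materialization; if a cached partition is lost,
// it is recomputed from stable storage via the recorded lineage.
val count = errors.count()
```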

Spark Streaming
An extension of Apache Spark's core API for stream processing. The framework provides:
- Fault tolerance
- Scalability
- High throughput

Spark Streaming
- Incoming data is represented as Discretized Streams (DStreams).
- The stream is broken down into micro-batches.
- Each micro-batch is an RDD, so code can be shared between batch and streaming.

Micro-batch Architecture

val tweets = ssc.twitterStream()
val hashtags = tweets.flatMap(status => getTags(status))
hashtags.saveAsHadoopFiles("hdfs://...")

At each interval the tweets DStream produces a batch (batch @ t, batch @ t+1, batch @ t+2); flatMap on each batch yields the corresponding batch of the hashtags DStream, which is then saved. The stream is composed of small (1-10s) batch computations.

Use DStreams for Windowing Functions 13
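The slide above only names windowing; a sketch of what a windowed computation looks like, counting hashtags over a sliding window (assuming `hashtags` is the DStream[String] from the previous slide):

```scala
import org.apache.spark.streaming.Seconds

// Count each hashtag over the last 60 seconds, recomputed every 10 seconds.
// Window and slide durations must be multiples of the batch interval.
val counts = hashtags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
```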

Spark Streaming
- Runs as a Spark job; YARN or standalone mode for scheduling (YARN has KDC integration).
- Use the same code for real-time Spark Streaming and for batch Spark jobs.
- Integrates natively with messaging systems such as Flume, Kafka, and ZeroMQ.
- Easy to write Receivers for custom messaging systems.

Sharing Code between Batch and Streaming
Streaming generates RDDs periodically, so any code that operates on RDDs can be used in streaming as well. A library function that filters ERRORs:

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

Sharing Code between Batch and Streaming
Spark:

val lines = sc.textFile(...)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dstream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dstream.transform((rdd: RDD[String], time: Time) => {
  filterErrors(rdd)
})
filtered.saveAsTextFiles(...)

Reliability
- Received data is automatically persisted to HDFS: a Write Ahead Log (WAL) prevents data loss. Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf.
- When the AM dies, the application is restarted by YARN.
- Received, ack-ed, and unprocessed data is replayed from the WAL (data that made it into blocks).
- Reliable Receivers can replay data from the original source, if required: un-acked data is replayed from the source. The Kafka and Flume receivers bundled with Spark are examples.
- Reliable Receivers + WAL = no data loss on driver or receiver failure!
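A minimal sketch of enabling the WAL (the app name and checkpoint path are illustrative; the WAL also requires a checkpoint directory on a fault-tolerant filesystem such as HDFS):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("reliable-streaming-app")                        // illustrative name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // persist received data

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")            // illustrative path
```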

Kafka Connectors
Reliable Kafka DStream:
- Stores received data to the Write Ahead Log on HDFS for replay.
- No data loss.
- Stable and supported!
Direct Kafka DStream:
- Uses the low-level API to pull data from Kafka.
- Replays from Kafka on driver failure.
- No data loss.
- Experimental.
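A sketch of creating each connector with the spark-streaming-kafka module of this era (broker and ZooKeeper addresses, group id, and topic name are all illustrative):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based (reliable) DStream: pair it with the WAL for no data loss.
val received = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

// Direct DStream: pulls from Kafka itself and replays from Kafka on driver failure.
val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker-host:9092"), Set("events"))
```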

Flume Connector
Flume Polling DStream:
- Add the Spark sink from Maven to Flume's plugin directory.
- The Flume polling receiver polls the sink to receive data.
- Replays received data from the WAL on HDFS.
- No data loss.
- Stable and supported!
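A sketch of the Spark side of the polling connector (host and port are illustrative and must match the Spark sink configured in the Flume agent):

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Pull-based: Spark polls the Flume agent's Spark sink for batches of events.
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)

// Decode each event body to a String and print a sample of each batch.
flumeStream.map(event => new String(event.event.getBody.array()))
  .print()
```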

Spark Streaming Use Cases
- Real-time dashboards: show approximate results in real-time; reconcile periodically with a source of truth using Spark.
- Joins of multiple streams: time-based or count-based windows; combine multiple sources of input to produce composite data.
- Re-use RDDs created by Streaming in other Spark jobs.

What is Coming?
- Run on secure YARN for more than 7 days!
- Better monitoring and alerting: batch-level and task-level monitoring.
- SQL on Streaming: run SQL-like queries on top of streams (medium to long term).
- Python! Limited support coming in Spark 1.3.

Current Spark Project Status
- 400+ contributors and 50+ companies contributing, including Databricks, Cloudera, Intel, Yahoo!, etc.
- Dozens of production deployments.
- Spark Streaming survived the Netflix Chaos Monkey: production ready!
- Included in CDH!

More Info
- CDH docs: http://www.cloudera.com/content/cloudera-content/clouderadocs/cdh5/latest/cdh5-installation-guide/cdh5ig_spark_installation.html
- Cloudera blog: http://blog.cloudera.com/blog/category/spark/
- Apache Spark homepage: http://spark.apache.org/
- GitHub: https://github.com/apache/spark

Thank you!
hshreedharan@cloudera.com
15% discount code for Cloudera training: PNWCUG_15
university.cloudera.com