Chapter 5: Data Streams
Today's Lesson
- Data Streams & Data Stream Management Systems
- Data Stream Models: Insert-Only, Insert-Delete, Additive
- Streaming Methods: Sliding Windows & Ageing, Data Synopsis
- Concepts & Tools: Micro-Batching with Apache Spark Streaming, Real-time with Apache Storm
Data Streams
Data Streams
Definition: A data stream can be seen as a continuous and potentially infinite stochastic process in which events occur independently of one another.
- Huge amount of data
- Data objects cannot be stored
- Single scan only
Data Streams: Key Characteristics
- The data elements in the stream arrive online.
- The system has no control over the order in which data elements arrive (either within a data stream or across multiple data streams).
- Data streams are potentially unbounded in size.
- Once an element has been processed, it is discarded or archived.
Data Stream Management System
[Architecture diagram: data streams arrive over time at a stream processor, which evaluates standing queries and ad-hoc queries using limited working storage, writes to archival storage, and emits output streams.]
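A minimal sketch (not from the slides) of the two query types against limited working storage; the running-maximum standing query, the window capacity of 100, and the average ad-hoc query are all illustrative assumptions:

from collections import deque

working_storage = deque(maxlen=100)   # limited working storage (assumed capacity)
running_max = float("-inf")           # state maintained for the standing query

def process(element):
    # Standing query: evaluated continuously as each element streams in
    global running_max
    working_storage.append(element)
    if element > running_max:
        running_max = element         # result pushed to the output stream

def ad_hoc_query():
    # Ad-hoc query: answered on demand, but only from the working storage
    return sum(working_storage) / len(working_storage)

process(3); process(7)
print(running_max, ad_hoc_query())    # -> 7 5.0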
Data Stream Models: Insert-Only Model
Once an element is seen, it cannot be changed.
[Diagram: elements pass through the stream processor over time and are never modified afterwards.]
Data Stream Models: Insert-Delete Model
Elements can be deleted or updated.
[Diagram: elements pass through the stream processor over time; some are later deleted or replaced by updated values.]
Data Stream Models: Additive Model
Each element is an increment to the previous version of the given data object.
[Diagram: incoming increments are added to the stored values of their data objects as they pass through the stream processor.]
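A minimal sketch (not from the slides) contrasting the insert-only and additive models; the (key, value) event format and the example stream are assumptions for illustration:

# Hypothetical stream of (object id, value) events
stream = [("a", 2), ("b", 3), ("a", 5)]

# Insert-only model: once an element has been seen, it cannot be changed
insert_only = {}
for key, value in stream:
    insert_only.setdefault(key, value)    # later events never overwrite

# Additive model: each event is an increment to the object's previous version
additive = {}
for key, value in stream:
    additive[key] = additive.get(key, 0) + value

print(insert_only)   # {'a': 2, 'b': 3}
print(additive)      # {'a': 7, 'b': 3}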
Streaming Methods
Huge amount of data vs. limited resources in space: it is impractical to store all data.
Solutions:
- Store summaries of previously seen data
- Forget stale data
But: there is a trade-off between storage space and the ability to provide precise query answers.
Streaming Methods: Sliding Windows
Idea: Keep the most recent stream elements in main memory and discard older ones.
- Timestamp-based: the window covers all elements that arrived within the last window length of time and moves forward by the sliding interval.
- Sequence-based: the window covers the last N elements, regardless of their arrival times.
[Diagrams: a window of fixed length sliding along the data stream, parameterized by window length and sliding interval.]
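A small sketch of a sequence-based sliding window, assuming a window length of N = 3 elements and a windowed average as the query; Python's collections.deque discards the oldest element automatically:

from collections import deque

N = 3                      # window length in number of elements (assumed)
window = deque(maxlen=N)   # keeps only the N most recent elements

for element in [5, 1, 4, 2, 8, 3]:                   # hypothetical stream
    window.append(element)                           # oldest element is discarded
    print(list(window), sum(window) / len(window))   # windowed average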
Streaming Methods: Ageing
Idea: Keep only a summary in main memory and discard objects as soon as they are processed.
Multiply the summary by a decay factor after each time epoch, or after a certain number of elements has arrived.
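A minimal sketch of the ageing idea, assuming a running sum as the summary and an illustrative decay factor of 0.9 applied after each epoch:

decay = 0.9     # decay factor per time epoch (assumed value)
summary = 0.0   # the only state kept in main memory

epochs = [[3, 1], [2], [5, 5]]    # hypothetical elements arriving per epoch
for elements in epochs:
    for element in elements:
        summary += element        # fold each element in, then discard it
    summary *= decay              # age the summary after the epoch
    print(summary)                # older contributions fade away over time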
Streaming Methods
High velocity of incoming data vs. limited resources in time: it is impossible to process all data.
Solutions:
- Data reduction
- Data approximation
But: there is a trade-off between processing speed and the ability to provide precise query answers.
Streaming Methods: Sampling
- Select a subset of the data to reduce the amount of data to process.
- Difficulty: obtaining a representative sample.
- Simplest form: random sampling, e.g. Reservoir Sampling or Min-Wise Sampling.

Reservoir Sampling Algorithm
input: stream S, size M of the reservoir
begin
  insert the first M objects into the reservoir;
  foreach subsequent object o do
    let i be the position of o in the stream;
    draw r, a random integer in range 1..i;
    if r <= M then
      delete an instance from the reservoir at random;
      insert o into the reservoir;

Load Shedding: discard some fraction of the data if the arrival rate of the stream might overload the system.
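A runnable Python version of the reservoir sampling pseudocode above; the example stream and reservoir size are illustrative:

import random

def reservoir_sample(stream, m):
    # Maintain a uniform random sample of size m over a stream of unknown length
    reservoir = []
    for i, obj in enumerate(stream, start=1):     # i = position of obj in the stream
        if i <= m:
            reservoir.append(obj)                 # fill the reservoir with the first m objects
        elif random.randint(1, i) <= m:           # keep obj with probability m/i
            reservoir[random.randrange(m)] = obj  # replace a random instance
    return reservoir

print(reservoir_sample(range(1, 1001), 10))       # 10 elements drawn uniformly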
Streaming Methods: Data Synopsis & Histograms
- Summaries of data objects are often used to reduce the amount of data, e.g. microclusters that describe groups of similar objects.
- Histograms are used to approximate the frequency distribution of element values.
- Commonly used by query optimizers (e.g. for range queries).
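A small sketch of an equi-width histogram maintained over a stream; the bin count, value range, and example values are assumptions:

n_bins, lo, hi = 10, 0.0, 100.0                # assumed value range [0, 100)
width = (hi - lo) / n_bins
counts = [0] * n_bins                          # approximate frequency distribution

for value in [3.2, 57.0, 57.4, 86.1, 99.9]:    # hypothetical stream values
    b = min(int((value - lo) / width), n_bins - 1)
    counts[b] += 1

# A range query such as "how many values fall in [50, 60)?" is answered from bin 5
print(counts[5])   # -> 2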
Overview of techniques to build a summary (reduced representation) of a sequence of numeric attributes:
- DFT (Discrete Fourier Transform)
- DWT (Discrete Wavelet Transform)
- SVD (Singular Value Decomposition)
- APCA (Adaptive Piecewise Constant Approximation)
- PAA (Piecewise Aggregate Approximation)
- PLA (Piecewise Linear Approximation)
[Figure: the same example sequence approximated by each of the six techniques.]
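As one concrete instance of these techniques, here is a sketch of PAA, which replaces each of n equal-length segments by its mean; the example sequence is illustrative and assumed to divide evenly into segments:

def paa(seq, n_segments):
    # Piecewise Aggregate Approximation: one mean value per equal-length segment
    seg = len(seq) // n_segments           # assumes len(seq) % n_segments == 0
    return [sum(seq[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]

print(paa([2, 5, 8, 9, 7, 4, -1, 1], 4))   # -> [3.5, 8.5, 5.5, 0.0]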
Discrete Wavelet Transform (DWT)
Idea: A sequence is represented as a linear combination of basis wavelet functions.
- The wavelet transform decomposes a signal into several groups of coefficients at different scales.
- Small coefficients can be eliminated with only small errors when reconstructing the signal: keep only the first coefficients.
- Often, Haar wavelets are used (easy to implement).
[Figure: a sequence X, its approximation X' from the leading coefficients, and the Haar basis functions Haar 0 through Haar 7.]
Example: step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into the Haar wavelet representation H = [4, 2, 2, -1]:
- h1 = 4 = mean(8, 4, 1, 3)
- h2 = 2 = mean(8, 4) - h1
- h3 = 2 = (8 - 4) / 2
- h4 = -1 = (1 - 3) / 2
(Lossless) reconstruction of the original sequence (stream) from the Haar wavelet representation:
x1 = h1 + h2 + h3 = 8, x2 = h1 + h2 - h3 = 4, x3 = h1 - h2 + h4 = 1, x4 = h1 - h2 - h4 = 3, i.e. X = {8, 4, 1, 3}.
Haar Wavelet Transform Algorithm
input: sequence S = <x1, ..., xn> of even length
output: sequence of wavelet coefficients
begin
  transform S into a sequence of two-component vectors <(a1, d1), ..., (a_n/2, d_n/2)>, where each pair (x_2i-1, x_2i) is multiplied by the matrix 1/2 * [[1, 1], [1, -1]], i.e. a_i = (x_2i-1 + x_2i) / 2 and d_i = (x_2i-1 - x_2i) / 2;
  separate the sequences A = <a1, ..., a_n/2> and D = <d1, ..., d_n/2>;
  recursively transform sequence A;

Worked example for S = <2, 5, 8, 9, 7, 4, -1, 1>:
Step 1: <(2+5)/2, (8+9)/2, (7+4)/2, (-1+1)/2> = <3.5, 8.5, 5.5, 0> and <(2-5)/2, (8-9)/2, (7-4)/2, (-1-1)/2> = <-1.5, -0.5, 1.5, -1>
Step 2: <(3.5+8.5)/2, (5.5+0)/2> = <6, 2.75> and <(3.5-8.5)/2, (5.5-0)/2> = <-2.5, 2.75>
Step 3: <(6+2.75)/2> = <4.375> and <(6-2.75)/2> = <1.625>
Wavelet coefficients: <4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1>
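A compact Python sketch of the algorithm, following the slide's averaging scheme; it reproduces the worked example above:

def haar(seq):
    # Recursive Haar transform: pairwise averages are transformed further,
    # pairwise half-differences are kept as detail coefficients
    if len(seq) == 1:
        return list(seq)
    averages = [(a + b) / 2 for a, b in zip(seq[0::2], seq[1::2])]
    details = [(a - b) / 2 for a, b in zip(seq[0::2], seq[1::2])]
    return haar(averages) + details

print(haar([2, 5, 8, 9, 7, 4, -1, 1]))
# -> [4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1.0]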
Spark Streaming
- Spark's streaming framework, built on top of Spark's Core API
- Data ingestion from several different data sources
- Stream processing can be combined with other Spark libraries (e.g. Spark MLlib)
Spark Streaming Workflow:
1. The streaming engine receives data from input streams.
2. The data stream is divided into several micro-batches, i.e. sequences of RDDs.
3. The micro-batches are processed by the Spark engine.
4. The result is a data stream of batches of processed data.
Spark Streaming
- DStreams (Discretized Streams) as basic abstraction
- Any operation applied on a DStream translates to operations on the underlying RDDs (computed by the Spark engine)
- StreamingContext objects as starting points:

sc = SparkContext(master, appName)
ssc = StreamingContext(sc, 1)  # params: SparkContext, batch interval in seconds
General schedule for a Spark Streaming application:
1. Define the StreamingContext ssc.
2. Define the input sources by creating input DStreams.
3. Define the streaming computations by applying transformations and output operations to DStreams.
4. Start receiving data and processing it using ssc.start().
5. Wait for the processing to be stopped (manually or due to any error) using ssc.awaitTermination().
6. The processing can be stopped manually using ssc.stop().
Spark Streaming

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch
# interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD of this DStream to the console
wordCounts.pprint()

# Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()
Spark Streaming
- Support of window operations, with two basic parameters: windowLength and slideInterval
- Support of many transformations for windowed DStreams

# Reduce the last 30 seconds of data, every 10 seconds
# (PySpark's reduceByKeyAndWindow also expects an inverse reduce function;
# passing None disables incremental computation)
winWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, None, 30, 10)
Apache Storm
- Alternative to Spark Streaming
- Supports real-time processing (per tuple rather than per micro-batch)
- Three abstractions: Spouts, Bolts, Topologies
Apache Storm
Spouts:
- Source of streams
- Typically read from queuing brokers (e.g. Kafka, RabbitMQ)
- Can also generate their own data or read from external sources (e.g. Twitter)
Bolts:
- Process any number of input streams
- Produce any number of output streams
- Hold most of the logic of the computations (functions, filters, ...)
Apache Storm
Topologies:
- Network of spouts and bolts
- Each edge represents a bolt subscribing to the output stream of some other spout or bolt
- A topology is an arbitrarily complex multi-stage stream computation
Apache Storm
Streams: the core abstraction in Storm
- A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion.
- Tuples can contain standard types like integers, floats, shorts, booleans, strings, and so on.
- Custom types can be used if a custom serializer is defined.
- A stream grouping defines how a stream should be partitioned among a bolt's tasks.
Apache Storm

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

// 4 tasks spread across 2 executors; shuffle grouping distributes the tuples
// randomly across the bolt's tasks, so each task gets an equal share
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology("mytopology", conf, topologyBuilder.createTopology());

[Diagram: the resulting topology deployed on two worker processes, each running several executors that host one or more tasks.]
Further Reading
- Joao Gama: Knowledge Discovery from Data Streams (http://www.liaad.up.pt/area/jgama/datastreamscrc.pdf)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark - Lightning-Fast Big Data Analysis
- http://spark.apache.org/docs/latest/streaming-programming-guide.html
- http://storm.apache.org/documentation/concepts.html