Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets. Andrew Psaltis




About Me Recently started working at Ensighten on the Agile Marketing Platform. Prior to that, spent 4.5 years at Webtrends working on Streaming and Real-time Visitor Analytics, where I first fell in love with Spark.

Where are we going? Why Spark and Spark Streaming, or how I fell for Spark. Give a brief overview of the architecture in mind. Give a birds-eye view of Kafka. Discuss Spark Streaming. Walk through some clickstream examples. Discuss getting data out of Spark Streaming.

How I fell for Spark On Oct 14th/15th 2012, three worlds collided: Felix Baumgartner jumped from space (Oct 14, 2012). Viewing of the jump resulted in the single largest hour in 15 years of history -- analytics engines crashed analyzing the data. Spark standalone mode was announced, no longer requiring Mesos. Quickly tested Spark and it was love at first sight.

Why Spark Streaming? Already had Storm running in production providing an event analytics stream. Wanted to deliver an aggregate analytics stream. Wanted exactly-once semantics. OK with second-scale latency. Wanted state management for computations. Wanted to combine with Spark RDDs.

Generic Streaming Data Pipeline [Diagram: Browsers → Collection Tier → Message Queueing Tier → Analysis Tier → In-Memory Data Store → Data Access Tier]

Demo Streaming Data Pipeline [Diagram: Browser (msnbc) / Browser Log Replayer → Kafka → Spark Streaming → Kafka → WebSocket Server]

Apache Kafka Overview An Apache project initially developed at LinkedIn. Distributed publish-subscribe messaging system, specifically designed for real-time activity streams. Does not follow the JMS standards, nor does it use the JMS APIs. Key features: persistent messaging; high throughput, low overhead; uses ZooKeeper for forming a cluster of nodes; supports both queue and topic semantics.

Kafka decouples data-pipelines

What is Spark Streaming? Extends Spark for doing large-scale stream processing. Efficient and fault-tolerant stateful stream processing. Integrates with Spark's batch and interactive processing. Provides a simple batch-like API for implementing complex algorithms.

Programming Model A Discretized Stream, or DStream, is a series of RDDs representing a stream of data. API very similar to RDDs. Input - DStreams can be created either from live streaming data or by transforming other DStreams. Operations: Transformations and Output Operations.
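The DStream model described above can be mimicked in plain Java with no Spark dependency (ToyDStream and its batch layout are purely illustrative, not Spark's API): a stream is just a series of small batches, and a transformation is the same operation applied to every batch in turn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of a DStream: a series of micro-batches, each standing in for one RDD.
public class ToyDStream<T> {
    final List<List<T>> batches;

    ToyDStream(List<List<T>> batches) { this.batches = batches; }

    // A DStream transformation is an RDD-style operation applied to every batch.
    <R> ToyDStream<R> map(Function<T, R> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> batch : batches) {
            List<R> mapped = new ArrayList<>();
            for (T item : batch) mapped.add(f.apply(item));
            out.add(mapped);
        }
        return new ToyDStream<>(out);
    }

    public static void main(String[] args) {
        // Two batches of raw "visitorId\turl" lines, as in the Kafka examples later.
        ToyDStream<String> events = new ToyDStream<>(Arrays.asList(
            Arrays.asList("v1\t/home", "v2\t/news"),
            Arrays.asList("v1\t/sports")));
        ToyDStream<String> urls = events.map(line -> line.split("\t")[1]);
        System.out.println(urls.batches); // prints [[/home, /news], [/sports]]
    }
}
```

The point of the sketch: nothing about the per-item logic knows it is streaming; the batching is what turns a batch API into a stream API.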

Input - DStream Data Sources Many sources out of the box: HDFS, Kafka, Flume, Twitter, TCP sockets, Akka actors, ZeroMQ. Easy to add your own.

Operations - Transformations Allow you to build new streams from existing streams. RDD-like operations: map, flatMap, filter, countByValue, reduce, groupByKey, reduceByKey, sortByKey, join, etc. Window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.
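What a windowed count such as countByValueAndWindow computes can be sketched in plain Java, with no Spark involved (the method and its list-of-batches input are illustrative stand-ins): count each value across the last N batches, where in Spark N would be the window length divided by the batch interval, recomputed at every sliding interval.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowCount {
    // Count each value across the last `windowLength` batches; in Spark this
    // result would be re-evaluated once per sliding interval as batches arrive.
    static Map<String, Long> countOverWindow(List<List<String>> batches, int windowLength) {
        Map<String, Long> counts = new HashMap<>();
        int start = Math.max(0, batches.size() - windowLength);
        for (List<String> batch : batches.subList(start, batches.size())) {
            for (String value : batch) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

With batches [["/home", "/home"], ["/news"], ["/home"]] and a window of 2 batches, only the last two batches are counted, so "/home" and "/news" each count once.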

Operations - Output Operations Your way to send data to the outside world. Out of the box support for: print - prints on the driver's screen; foreachRDD - arbitrary operation on every RDD; saveAsObjectFiles; saveAsTextFiles; saveAsHadoopFiles.

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs. The live stream is chopped into batches, with batch sizes down to 1/2 second. [Pipeline diagram: Browser (msnbc) / Browser Log Replayer → Kafka → Spark Streaming → Kafka → WebSocket Server; processed results produced by Spark]

Clickstream Examples PageViews per batch. PageViews by URL over time. Top N PageViews over time. Keeping a current session up to date. Joining the current session with historical data.

Example - Create Stream from Kafka

    JavaPairDStream<String, String> messages = KafkaUtils.createStream(...);
    JavaDStream<Tuple2<String, String>> events = messages.map(
        new Function<Tuple2<String, String>, Tuple2<String, String>>() {
          @Override
          public Tuple2<String, String> call(Tuple2<String, String> tuple2) {
            String[] parts = tuple2._2().split("\t");
            return new Tuple2<>(parts[0], parts[1]);
          }
        });

[Diagram: Kafka consumer → messages DStream → events DStream; at each batch interval (t, t+1, t+2) createStream produces a batch that map transforms, stored in memory as an RDD (immutable, distributed)]

Example - PageViews per batch

    JavaPairDStream<String, Long> pageCounts = events.map(
        new Function<Tuple2<String, String>, String>() {
          @Override
          public String call(Tuple2<String, String> pageView) {
            return pageView._2();
          }
        }).countByValue();

[Diagram: events DStream batches at t, t+1, t+2; map and countByValue applied per batch, producing the pageCounts DStream]

Example - PageViews per URL over time Window-based transformations:

    JavaPairDStream<String, Long> slidingPageCounts = events.map(
        new Function<Tuple2<String, String>, String>() {
          @Override
          public String call(Tuple2<String, String> pageView) {
            return pageView._2();
          }
        }).countByValueAndWindow(new Duration(30000),  // window length
                                 new Duration(5000))   // sliding interval
          .reduceByKey(new Function2<Long, Long, Long>() {
            @Override
            public Long call(Long aLong, Long aLong2) {
              return aLong + aLong2;
            }
          });

[Diagram: a window of the given length slides over the DStream of data at the sliding interval]

Example - Top N PageViews

    JavaPairDStream<Long, String> swappedCounts = slidingPageCounts.map(
        new PairFunction<Tuple2<String, Long>, Long, String>() {
          public Tuple2<Long, String> call(Tuple2<String, Long> in) {
            return in.swap();
          }
        });

    JavaPairDStream<Long, String> sortedCounts = swappedCounts.transform(
        new Function<JavaPairRDD<Long, String>, JavaPairRDD<Long, String>>() {
          public JavaPairRDD<Long, String> call(JavaPairRDD<Long, String> in) {
            return in.sortByKey(false);
          }
        });

Example - Updating the Current Session Specify a function to generate new state based on the previous state and new data:

    Function2<List<PageView>, Optional<Session>, Optional<Session>> updateFunction =
        new Function2<List<PageView>, Optional<Session>, Optional<Session>>() {
          @Override
          public Optional<Session> call(List<PageView> values, Optional<Session> state) {
            Session updatedSession = ... // update the session
            return Optional.of(updatedSession);
          }
        };

    JavaPairDStream<String, Session> currentSessions =
        pageViews.updateStateByKey(updateFunction);
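The per-key contract behind updateStateByKey can be sketched without Spark (SessionState, updateAll, and the reduction of a Session to a page-view count are all illustrative simplifications): fold each key's new values from the current batch into its previous state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SessionState {
    // For each key, combine the batch's new values with the previous state to
    // produce the new state; here a Session is reduced to a page-view count.
    static Map<String, Long> updateAll(Map<String, Long> previousState,
                                       Map<String, List<String>> newPageViews) {
        Map<String, Long> nextState = new HashMap<>(previousState);
        newPageViews.forEach((visitorId, views) -> {
            long prior = Optional.ofNullable(previousState.get(visitorId)).orElse(0L);
            nextState.put(visitorId, prior + views.size()); // "update the session"
        });
        return nextState;
    }
}
```

Keys with no new page views keep their old state untouched, which mirrors how Spark carries state forward between batches.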

Example - Join the current session with history

    JavaPairDStream<String, Session> currentSessions = ...
    JavaPairDStream<String, Session> historicalSessions = <RDD from Spark>

currentSessions looks like:
    Tuple2<String, Session>("visitorId-1", "{Current-Session}")
    Tuple2<String, Session>("visitorId-2", "{Current-Session}")

historicalSessions looks like:
    Tuple2<String, Session>("visitorId-1", "{Historical-Session}")
    Tuple2<String, Session>("visitorId-2", "{Historical-Session}")

    JavaPairDStream<String, Tuple2<Session, Session>> joined =
        currentSessions.join(historicalSessions);

Where are we? [Pipeline diagram: Browser (msnbc) / Browser Log Replayer → Kafka → Spark Streaming (live stream chopped into batches down to 1/2 second) → Kafka → WebSocket Server; processed results produced by Spark]

Getting the data out Spark Streaming currently only supports: print, foreachRDD, saveAsObjectFiles, saveAsTextFiles, saveAsHadoopFiles. [Pipeline diagram: Browser (msnbc) / Browser Log Replayer → Kafka → Spark Streaming → Kafka → WebSocket Server delivering the processed results]

Example - foreachRDD

    sortedCounts.foreachRDD(
        new Function<JavaPairRDD<Long, String>, Void>() {
          public Void call(JavaPairRDD<Long, String> rdd) {
            Map<String, Long> top10List = new HashMap<>();
            for (Tuple2<Long, String> t : rdd.take(10)) {
              top10List.put(t._2(), t._1());
            }
            kafkaProducer.sendTopN(top10List);
            return null;
          }
        });

WebSockets Provides a standard way to get data out. When the client connects: read from Kafka and start streaming. When they disconnect: close the Kafka consumer.
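The connect/disconnect lifecycle above can be sketched as plain Java (SocketBridge and the stubbed consumer are hypothetical; a real bridge would hold an actual Kafka consumer and a WebSocket session): opening the socket starts the Kafka read, closing it releases the consumer.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SocketBridge {
    // Kafka is stubbed with a flag; a real bridge would hold a consumer handle.
    static class StubKafkaConsumer {
        final AtomicBoolean running = new AtomicBoolean(false);
        void start() { running.set(true); }   // begin reading the topic
        void close() { running.set(false); }  // release the consumer
    }

    final StubKafkaConsumer consumer = new StubKafkaConsumer();

    void onClientConnect()    { consumer.start(); }  // socket opened: start streaming
    void onClientDisconnect() { consumer.close(); }  // socket closed: stop
}
```

Tying the consumer's lifetime to the socket's keeps Kafka resources from leaking when browsers come and go.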

Summary Spark Streaming works well for clickstream analytics. But: still no good out-of-the-box output operations for a stream; multi-tenancy needs to be thought through; how do you stop a job?