Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets. Andrew Psaltis

1 Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets Andrew Psaltis

2 About Me: Recently started working at Ensighten on the Agile Marketing Platform. Before that, worked for 4.5 years at Webtrends on Streaming and Realtime Visitor Analytics, where I first fell in love with Spark.

3 Where are we going? Why Spark and Spark Streaming, or how I fell for Spark. A brief overview of the architecture in mind. A bird's-eye view of Kafka. Spark Streaming. A walk through some clickstream examples. Getting data out of Spark Streaming.

4 How I fell for Spark: On Oct 14th/15th, 2012, three worlds collided. Felix Baumgartner jumped from space (Oct 14, 2012). Viewing of the jump resulted in the single largest hour in a 15-year history, and the analytics engines crashed analyzing the data. Spark Standalone was announced, no longer requiring Mesos. I quickly tested Spark and it was love at first sight.

5 Why Spark Streaming? Already had Storm running in production, providing an event analytics stream. Wanted to deliver an aggregate analytics stream. Wanted exactly-once semantics. OK with second-scale latency. Wanted state management for computations. Wanted to combine with Spark RDDs.

6 Generic Streaming Data Pipeline: Browser -> Collection Tier -> Message Queueing Tier -> Analysis Tier -> In-Memory Data Store -> Data Access Tier -> Browser

7 Demo Streaming Data Pipeline: Browser (msnbc) -> Log Replayer -> Kafka -> Spark Streaming -> Kafka -> WebSocket Server -> Browser

8 Apache Kafka Overview: An Apache project initially developed at LinkedIn. A distributed publish-subscribe messaging system, specifically designed for real-time activity streams. Does not follow JMS standards or use JMS APIs. Key features: persistent messaging; high throughput with low overhead; uses ZooKeeper to form a cluster of nodes; supports both queue and topic semantics.
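The "queue and topic semantics" point can be illustrated with a toy in-memory log. This is a minimal sketch, not the Kafka client API: the class `ToyTopic` and its methods are hypothetical, and it assumes one shared read position per consumer group (real Kafka gets the same effect by assigning partitions to the consumers in a group). Consumers in the same group split the messages (queue semantics), while each separate group reads every message (topic semantics). Messages stay in the log after being read, which mirrors persistent messaging.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory topic; hypothetical, NOT the Kafka client API. */
final class ToyTopic {
    // Append-only log: reading a message never removes it.
    private final List<String> log = new ArrayList<>();
    // Next unread offset for each consumer group (assumption: one offset per group).
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    /** Producers append to the end of the log. */
    void publish(String message) {
        log.add(message);
    }

    /** Returns the group's next unread message, or null if the group is caught up. */
    String poll(String group) {
        int offset = groupOffsets.getOrDefault(group, 0);
        if (offset >= log.size()) {
            return null;
        }
        groupOffsets.put(group, offset + 1);
        return log.get(offset);
    }

    public static void main(String[] args) {
        ToyTopic clicks = new ToyTopic();
        clicks.publish("visitor-1\t/home");
        clicks.publish("visitor-2\t/news");
        // Queue semantics: two workers in the group "analytics" share the messages.
        System.out.println("worker A: " + clicks.poll("analytics"));
        System.out.println("worker B: " + clicks.poll("analytics"));
        // Topic semantics: another group starts from the beginning of the log.
        System.out.println("archiver: " + clicks.poll("archive"));
    }
}
```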

9 Kafka decouples data-pipelines

10 Kafka decouples data-pipelines

11 What is Spark Streaming? Extends Spark for large-scale stream processing. Efficient and fault-tolerant stateful stream processing. Integrates with Spark's batch and interactive processing. Provides a simple batch-like API for implementing complex algorithms.

12 Programming Model: A Discretized Stream, or DStream, is a series of RDDs representing a stream of data, with an API very similar to that of RDDs. Input: DStreams can be created either from live streaming data or by transforming other DStreams. Operations: transformations and output operations.

13 Input DStream Data Sources: Many sources out of the box: HDFS, Kafka, Flume, Twitter, TCP sockets, Akka actors, ZeroMQ. Easy to add your own.

14 Operations - Transformations: Allow you to build new streams from existing streams. RDD-like operations: map, flatMap, filter, countByValue, reduce, groupByKey, reduceByKey, sortByKey, join, etc. Window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.

15 Operations - Output Operations: Your way to send data to the outside world. Out-of-the-box support for: print (prints on the driver's screen), foreachRDD (arbitrary operation on every RDD), saveAsObjectFiles, saveAsTextFiles, saveAsHadoopFiles.

16 Discretized Stream Processing: Run a streaming computation as a series of very small, deterministic batch jobs. Spark Streaming chops the live stream into batches, with batch sizes down to 1/2 sec, and Spark processes each batch and returns the processed results. Pipeline: Browser (msnbc) -> Log Replayer -> Kafka -> Spark Streaming -> Kafka -> WebSocket Server -> Browser
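The idea of chopping a live stream into small batches can be sketched in plain Java. This is only an illustration of discretization, not Spark code: the class `Discretizer` is hypothetical, and it assumes each event carries a non-negative timestamp in milliseconds and is assigned to a batch by that timestamp (Spark Streaming itself forms batches from data as it is received).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that groups event timestamps into fixed-size batches. */
final class Discretizer {
    /**
     * Maps each batch start time (ms) to the timestamps that fall in
     * [batchStart, batchStart + batchIntervalMs). Assumes timestamps >= 0.
     */
    static Map<Long, List<Long>> toBatches(List<Long> timestampsMs, long batchIntervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            // Integer division rounds down to the start of the enclosing interval.
            long batchStart = (ts / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Half-second batches, matching the "down to 1/2 sec" figure on the slide.
        System.out.println(toBatches(List.of(0L, 499L, 500L, 1200L), 500L));
    }
}
```

Each map entry plays the role of one small, deterministic batch job's input.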

17 Clickstream Examples: PageViews per batch. PageViews by URL over time. Top N PageViews over time. Keeping a current session up to date. Joining the current session with historical sessions.

18 Example: Create Stream from Kafka

    JavaPairDStream<String, String> messages = KafkaUtils.createStream(...);
    JavaDStream<Tuple2<String, String>> events = messages.map(
        new Function<Tuple2<String, String>, Tuple2<String, String>>() {
          @Override
          public Tuple2<String, String> call(Tuple2<String, String> tuple2) {
            // The message value is a tab-separated line; keep its first two fields.
            String parts[] = tuple2._2().split("\t");
            return new Tuple2<>(parts[0], parts[1]);
          }});

Diagram: at each batch interval (t, t+1, t+2), createStream turns data from the Kafka consumer into a batch of the messages DStream, and map turns that batch into a batch of the events DStream. Each batch is stored in memory as an RDD (immutable, distributed).
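The map step on this slide indexes `parts[0]` and `parts[1]` directly, which would throw on a line with fewer than two fields. A defensive version of the same parsing can be sketched in plain Java. The class `ClickParser` is hypothetical, and treating the two fields as (visitor id, URL) is an assumption; the slides do not name them.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

/** Hypothetical parser for tab-separated click lines. */
final class ClickParser {
    /**
     * Returns (first field, second field) for a well-formed line, or
     * Optional.empty() when the line is null or lacks two non-empty fields.
     * Assumption: the fields are (visitorId, url).
     */
    static Optional<Map.Entry<String, String>> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] parts = line.split("\t");
        if (parts.length < 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new SimpleImmutableEntry<>(parts[0], parts[1]));
    }

    public static void main(String[] args) {
        System.out.println(parse("visitor-1\t/home"));
        System.out.println(parse("not a click line"));
    }
}
```

In a streaming job, the empty results could be filtered out before counting so that one malformed record does not fail the batch.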

19 Example: PageViews per batch

    JavaPairDStream<String, Long> pageCounts = events.map(
        new Function<Tuple2<String, String>, String>() {
          @Override
          public String call(Tuple2<String, String> pageView) {
            // Keep only the page (second field) of each event.
            return pageView._2();
          }}).countByValue();

Diagram: for each batch (t, t+1, t+2) of the events DStream, map followed by countByValue produces the corresponding batch of the pageCounts DStream.
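What countByValue computes for a single batch can be shown with ordinary collections. This is a plain-Java sketch of the semantics only; the class `BatchPageCounts` is hypothetical and nothing here is distributed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-batch equivalent of countByValue. */
final class BatchPageCounts {
    /** Counts how many times each URL occurs in one batch. */
    static Map<String, Long> countByValue(List<String> urls) {
        Map<String, Long> counts = new TreeMap<>();
        for (String url : urls) {
            // Start at 1 for a new URL, otherwise add 1 to the running count.
            counts.merge(url, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByValue(List.of("/home", "/news", "/home")));
    }
}
```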

20 Example: PageViews per URL over time (window-based transformations)

    JavaPairDStream<String, Long> slidingPageCounts = events.map(
        new Function<Tuple2<String, String>, String>() {
          @Override
          public String call(Tuple2<String, String> pageView) {
            return pageView._2();
          }})
        // window length = 30000 ms, sliding interval = 5000 ms
        .countByValueAndWindow(new Duration(30000), new Duration(5000))
        .reduceByKey(new Function2<Long, Long, Long>() {
          @Override
          public Long call(Long aLong, Long aLong2) {
            return aLong + aLong2;
          }});

Diagram: a DStream of data, with the window length and the sliding interval marked.
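The window arithmetic can be sketched in plain Java. The sketch assumes the batch interval equals the 5000 ms sliding interval, so a 30000 ms window covers the six most recent batches; the slide does not state the batch interval. The class `SlidingPageCounts` is hypothetical. It also suggests, as far as can be told from the semantics, that the counts per window already come out with one entry per URL, so the trailing reduceByKey on the slide may not change the result.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sliding-window counter over per-batch page counts. */
final class SlidingPageCounts {
    private final int batchesPerWindow;
    private final Deque<Map<String, Long>> window = new ArrayDeque<>();

    /** Assumption: the window slides by exactly one batch interval. */
    SlidingPageCounts(long windowMs, long batchMs) {
        if (batchMs <= 0 || windowMs <= 0 || windowMs % batchMs != 0) {
            throw new IllegalArgumentException("window must be a positive multiple of the batch interval");
        }
        this.batchesPerWindow = (int) (windowMs / batchMs);
    }

    /** Adds the newest batch, drops the batch that left the window, returns window totals. */
    Map<String, Long> slide(Map<String, Long> batchCounts) {
        window.addLast(batchCounts);
        if (window.size() > batchesPerWindow) {
            window.removeFirst();
        }
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> batch : window) {
            batch.forEach((url, n) -> totals.merge(url, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // 30 s window over 5 s batches, as on the slide.
        SlidingPageCounts counts = new SlidingPageCounts(30000, 5000);
        System.out.println(counts.slide(Map.of("/home", 3L)));
        System.out.println(counts.slide(Map.of("/home", 1L, "/news", 2L)));
    }
}
```

Re-summing every batch in the window keeps the sketch short; an incremental version would add the newest batch and subtract the one that expired.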

21 Example: Top N PageViews

    // Swap (url, count) to (count, url) so that the count becomes the key.
    JavaPairDStream<Long, String> swappedCounts = slidingPageCounts.map(
        new PairFunction<Tuple2<String, Long>, Long, String>() {
          public Tuple2<Long, String> call(Tuple2<String, Long> in) {
            return in.swap();
          }});
    // Sort every RDD of the stream by count, in descending order.
    JavaPairDStream<Long, String> sortedCounts = swappedCounts.transform(
        new Function<JavaPairRDD<Long, String>, JavaPairRDD<Long, String>>() {
          public JavaPairRDD<Long, String> call(JavaPairRDD<Long, String> in) {
            return in.sortByKey(false);
          }});
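The swap-then-sort trick exists because sortByKey orders by key. On a single batch held in memory, the same ranking can be sketched by sorting entries on their value. The class `TopPages` is hypothetical, and breaking ties by URL is an added assumption to make the order deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory Top N over (url, count) pairs. */
final class TopPages {
    /** Returns up to n entries with the highest counts, ties broken by URL. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey()));
        int limit = Math.min(Math.max(n, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        System.out.println(topN(Map.of("/home", 5L, "/news", 9L, "/about", 1L), 2));
    }
}
```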

22 Example: Updating Current Session. Specify a function to generate new state based on previous state and new data.

    Function2<List<PageView>, Optional<Session>, Optional<Session>> updateFunction =
        new Function2<List<PageView>, Optional<Session>, Optional<Session>>() {
          @Override
          public Optional<Session> call(List<PageView> values, Optional<Session> state) {
            Session updatedSession = ... // update the session
            return Optional.of(updatedSession);
          }
        };
    JavaPairDStream<String, Session> currentSessions = pageView.updateStateByKey(updateFunction);
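The contract of such an update function can be sketched without Spark: for each key it receives the new values from the current batch plus the previous state, and returns the next state. Everything here is hypothetical: the `Session` fields, the class `SessionTracker`, and the use of `java.util.Optional` as a stand-in for the Optional type used by the Spark Java API of the time. The sketch follows the documented updateStateByKey behavior that the function also runs for keys with no new values and that an empty result drops the key.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical plain-Java model of per-key state updates. */
final class SessionTracker {
    /** Minimal stand-in for the slide's Session type; fields are assumptions. */
    static final class Session {
        final int pageViews;
        final String lastUrl;

        Session(int pageViews, String lastUrl) {
            this.pageViews = pageViews;
            this.lastUrl = lastUrl;
        }
    }

    /** New state from the batch's URLs and the previous state (if any). */
    static Optional<Session> update(List<String> newUrls, Optional<Session> state) {
        if (newUrls.isEmpty()) {
            return state; // nothing new for this key: keep the session as it is
        }
        int seenSoFar = state.map(s -> s.pageViews).orElse(0);
        String lastUrl = newUrls.get(newUrls.size() - 1);
        return Optional.of(new Session(seenSoFar + newUrls.size(), lastUrl));
    }

    /** One round of updates over every key with old state or new data. */
    static Map<String, Session> applyBatch(Map<String, Session> sessions,
                                           Map<String, List<String>> batch) {
        Set<String> keys = new TreeSet<>(sessions.keySet());
        keys.addAll(batch.keySet());
        Map<String, Session> next = new TreeMap<>();
        for (String key : keys) {
            Optional<Session> updated = update(
                    batch.getOrDefault(key, List.of()),
                    Optional.ofNullable(sessions.get(key)));
            updated.ifPresent(s -> next.put(key, s)); // empty result drops the key
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Session> s1 = applyBatch(Map.of(), Map.of("visitorId-1", List.of("/home", "/news")));
        Map<String, Session> s2 = applyBatch(s1, Map.of("visitorId-1", List.of("/sports")));
        System.out.println(s2.get("visitorId-1").pageViews + " views, last " + s2.get("visitorId-1").lastUrl);
    }
}
```

A real update function would typically also expire idle sessions by returning an empty Optional, so that state does not grow without bound.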

23 Example: Join current session with history

    JavaPairDStream<String, Session> currentSessions = ...
    JavaPairDStream<String, Session> historicalSessions = <RDD from Spark>

    currentSessions looks like:
    Tuple2<String, Session>("visitorId-1", "{Current-Session}")
    Tuple2<String, Session>("visitorId-2", "{Current-Session}")

    historicalSessions looks like:
    Tuple2<String, Session>("visitorId-1", "{Historical-Session}")
    Tuple2<String, Session>("visitorId-2", "{Historical-Session}")

    JavaPairDStream<String, Tuple2<Session, Session>> joined = currentSessions.join(historicalSessions);
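The shape of the joined result can be illustrated with an inner join over two maps. This is a sketch of the semantics only, under the assumption of at most one value per key on each side (a join on pair RDDs also handles repeated keys). The class `SessionJoin` is hypothetical, and sessions are represented as plain strings, like the placeholders on the slide.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical inner join of two keyed collections. */
final class SessionJoin {
    /** Keeps only keys present on both sides and pairs their values as (left, right). */
    static <K, A, B> Map<K, Map.Entry<A, B>> join(Map<K, A> left, Map<K, B> right) {
        Map<K, Map.Entry<A, B>> joined = new LinkedHashMap<>();
        for (Map.Entry<K, A> e : left.entrySet()) {
            B match = right.get(e.getKey());
            if (match != null) {
                joined.put(e.getKey(), new SimpleImmutableEntry<>(e.getValue(), match));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> current = Map.of("visitorId-1", "{Current-Session}");
        Map<String, String> historical = Map.of(
                "visitorId-1", "{Historical-Session}",
                "visitorId-3", "{Historical-Session}");
        // Only visitorId-1 appears on both sides, so only it survives the join.
        System.out.println(join(current, historical));
    }
}
```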

24 Where are we? Pipeline: Browser (msnbc) -> Log Replayer -> Kafka -> Spark Streaming -> Kafka -> WebSocket Server -> Browser. Spark Streaming chops the live stream into batches (batch sizes down to 1/2 sec) and Spark returns the processed results.

25 Getting the data out: Spark Streaming currently only supports print, foreachRDD, saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles. Pipeline: Browser (msnbc) -> Log Replayer -> Kafka -> Spark Streaming -> Kafka -> WebSocket Server -> Browser, with the processed results flowing from Spark Streaming into the second Kafka stage.

26 Example: foreachRDD

    sortedCounts.foreachRDD(
        new Function<JavaPairRDD<Long, String>, Void>() {
          public Void call(JavaPairRDD<Long, String> rdd) {
            // Collect the first 10 (count, url) pairs of the sorted RDD as url -> count.
            Map<String, Long> top10List = new HashMap<>();
            for (Tuple2<Long, String> t : rdd.take(10)) {
              top10List.put(t._2(), t._1());
            }
            // Publish the result for this batch to Kafka.
            kafkaProducer.sendTopN(top10List);
            return null;
          }
        });
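One detail worth noting about this step: a HashMap does not preserve insertion order, so the ranking produced by the sort is not carried by the map itself. A minimal sketch of an order-preserving alternative follows. The class `TopNMessage`, its methods, and the tab-and-newline message format are all hypothetical; the slides do not show what sendTopN serializes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helpers that keep the Top N ranking intact on the way out. */
final class TopNMessage {
    /** Takes the first n (count, url) pairs and keeps their order as url -> count. */
    static Map<String, Long> toRankedMap(List<Map.Entry<Long, String>> sortedCounts, int n) {
        int limit = Math.min(Math.max(n, 0), sortedCounts.size());
        Map<String, Long> ranked = new LinkedHashMap<>(); // iteration order = insertion order
        for (Map.Entry<Long, String> e : sortedCounts.subList(0, limit)) {
            ranked.put(e.getValue(), e.getKey());
        }
        return ranked;
    }

    /** Serializes the ranking as one "url<TAB>count" line per page (assumed format). */
    static String toMessage(Map<String, Long> ranked) {
        StringJoiner lines = new StringJoiner("\n");
        ranked.forEach((url, count) -> lines.add(url + "\t" + count));
        return lines.toString();
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> sorted = List.of(
                Map.entry(9L, "/news"), Map.entry(5L, "/home"), Map.entry(1L, "/about"));
        System.out.println(toMessage(toRankedMap(sorted, 2)));
    }
}
```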

27 WebSockets: Provide a standard way to get data out. When the client connects, read from Kafka and start streaming. When the client disconnects, close the Kafka consumer.
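The connect and disconnect handling can be sketched as a small registry that ties one consumer handle to each connected client. This is an illustration only: the class `ClientStreams` and its method names are hypothetical, the consumer is modeled as a plain `AutoCloseable` rather than a real Kafka consumer, and wiring the two callbacks into an actual WebSocket server is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical registry: one consumer per connected WebSocket client. */
final class ClientStreams {
    private final Map<String, AutoCloseable> consumers = new ConcurrentHashMap<>();
    private final Supplier<AutoCloseable> consumerFactory;

    /** The factory stands in for "open a Kafka consumer and start streaming". */
    ClientStreams(Supplier<AutoCloseable> consumerFactory) {
        this.consumerFactory = consumerFactory;
    }

    /** Call from the server's "client connected" callback. */
    void onConnect(String clientId) {
        // computeIfAbsent avoids opening a second consumer for the same client.
        consumers.computeIfAbsent(clientId, id -> consumerFactory.get());
    }

    /** Call from the server's "client disconnected" callback. */
    void onDisconnect(String clientId) {
        AutoCloseable consumer = consumers.remove(clientId);
        if (consumer == null) {
            return; // unknown or already disconnected client
        }
        try {
            consumer.close();
        } catch (Exception e) {
            throw new IllegalStateException("failed to close consumer for " + clientId, e);
        }
    }

    int activeClients() {
        return consumers.size();
    }

    public static void main(String[] args) {
        ClientStreams streams = new ClientStreams(() -> () -> System.out.println("consumer closed"));
        streams.onConnect("browser-1");
        System.out.println("active clients: " + streams.activeClients());
        streams.onDisconnect("browser-1");
    }
}
```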

28 Summary: Spark Streaming works well for clickstream analytics, but there are still no good out-of-the-box output operations for a stream. Multi-tenancy needs to be thought through. How do you stop a job?
