Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses
Peter R. Pietzuch <prp@doc.ic.ac.uk>
Large-Scale Distributed Systems Group, Department of Computing, Imperial College London
http://lsds.doc.ic.ac.uk
EIT Digital Summer School on Cloud and Big Data 2015, Stockholm, Sweden
Growth of Big Data Analytics
Big Data analytics: gaining value from data
Web analytics, fraud detection, systems management, network monitoring, business dashboards, ...
Need to enable more users to perform data analytics
Programming Language Popularity
Programming Models for Big Data?
Distributed dataflow frameworks tend to favour functional, declarative programming models: MapReduce, SQL, Pig, DryadLINQ, Spark, ...
This simplifies consistency and fault tolerance
Domain experts tend to write imperative programs: Java, Matlab, C++, R, Python, Fortran, ...
Example: Recommender Systems
Recommendations based on past user behaviour through collaborative filtering (cf. Netflix, Amazon, ...)
Customer activity on website: e.g. User A rates item "iphone" 3, rates another item 5
Up-to-date recommendations: e.g. recommend Apple Watch to User A
Distributed dataflow graph (e.g. MapReduce, Hadoop, Spark, Dryad, Naiad, ...) exploits data-parallelism on cluster of machines
Collaborative Filtering in Java

User-Item matrix (UI), updated with new ratings:
            Item-A  Item-B
  User-A      4       5
  User-B      0       5

Co-Occurrence matrix (CO), multiplied with a user row for a recommendation:
            Item-A  Item-B
  Item-A      1       1
  Item-B      1       2

  Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(coocc, userItem);
  }

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coocc.multiply(userRow);
    return userRec;
  }
Collaborative Filtering in Spark (Java)

  // Build the recommendation model using ALS
  int rank = 10;
  int numIterations = 20;
  MatrixFactorizationModel model =
      ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

  // Evaluate the model on rating data
  JavaRDD<Tuple2<Object, Object>> userProducts = ratings.map(
    new Function<Rating, Tuple2<Object, Object>>() {
      public Tuple2<Object, Object> call(Rating r) {
        return new Tuple2<Object, Object>(r.user(), r.product());
      }
    }
  );

  JavaPairRDD<Tuple2<Integer, Integer>, Double> predictions =
    JavaPairRDD.fromJavaRDD(
      model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(
        new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
          public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
            return new Tuple2<Tuple2<Integer, Integer>, Double>(
              new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
          }
        }
      ));

  JavaRDD<Tuple2<Double, Double>> ratesAndPreds =
    JavaPairRDD.fromJavaRDD(ratings.map(
      new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
        public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
          return new Tuple2<Tuple2<Integer, Integer>, Double>(
            new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
        }
      }
    )).join(predictions).values();
Collaborative Filtering in Spark (Scala)

  // Build the recommendation model using ALS
  val rank = 10
  val numIterations = 20
  val model = ALS.train(ratings, rank, numIterations, 0.01)

  // Evaluate the model on rating data
  val usersProducts = ratings.map {
    case Rating(user, product, rate) => (user, product)
  }
  val predictions = model.predict(usersProducts).map {
    case Rating(user, product, rate) => ((user, product), rate)
  }
  val ratesAndPreds = ratings.map {
    case Rating(user, product, rate) => ((user, product), rate)
  }.join(predictions)

All data immutable
No fine-grained model updates
Stateless MapReduce Model

Data model: (key, value) pairs over partitioned data on a distributed file system

Two processing functions:
  map(k1, v1) → list(k2, v2)
  reduce(k2, list(v2)) → list(v3)

Dataflow: map tasks (M) → shuffle → reduce tasks (R)

Benefits:
Simple programming model
Transparent parallelisation
Fault-tolerant processing
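The two functions above can be sketched in plain Java as a single-process word count. This simulates the map, shuffle and reduce phases of the model; it is not the Hadoop API, and all class and method names here are illustrative:

```java
import java.util.*;

// Word count expressed via the two MapReduce functions:
// map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v3).
public class WordCount {

    // map: (lineNo, line) -> [(word, 1), ...]
    static List<Map.Entry<String, Integer>> map(long lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> (word, count)
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: runs map on every record, shuffles by key, then reduces.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big big data")));
    }
}
```

Note that both functions are stateless: all information flows through the (key, value) pairs, which is what makes transparent parallelisation and re-execution after failure straightforward.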
Big Data Programming for the Masses

Our goals:
Imperative Java programming model for big data apps
High throughput through data-parallel execution on cluster
Fault tolerance against node failures

  System     Mutable State  Large State  Low Latency  Iteration
  MapReduce  No             n/a          No           No
  Spark      No             n/a          No           Yes
  Storm      No             n/a          Yes          No
  Naiad      Yes            No           Yes          Yes
  SDG        Yes            Yes          Yes          Yes
Stateful Dataflow Graphs (SDGs)

1. Annotated Java program (@Partitioned, @Partial, @Global, ...): Program.java
2. Static program analysis: data-parallel Stateful Dataflow Graph (SDG)
3. SEEP distributed dataflow framework: dynamic scale out and checkpoint-based fault tolerance on cluster
4. Experimental evaluation results
State as First Class Citizen

Tasks process data; dataflows represent data; State Elements (SEs) represent state

State elements (SEs) represent in-memory data structures (e.g. the user-item matrix)
SEs are mutable
Tasks have local access to arbitrary state in SEs
SEs can be shared between tasks
Challenges with Large State

Mutable state leads to concise algorithms but complicates scaling and fault tolerance

  Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

Big Data problem: matrices become large; state will not fit into a single node

Challenge: how to handle distributed state?
Distributed Mutable State

State Elements support two abstractions for distributed mutable state:
Partitioned SEs: tasks access partitioned state by key
Partial SEs: tasks can access replicated state
(I) Partitioned State Elements

Partitioned SE: key space [0-N] split into disjoint partitions, e.g. [0-k] and [(k+1)-N]

User-Item matrix (UI), accessed by key:
            Item-A  Item-B
  User-A      4       5
  User-B      0       5

State partitioned according to partitioning key
Dataflow routed according to hash function, e.g. hash(msg.id)
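The routing rule above can be sketched in a few lines of Java. This is only an illustration of hash-based partitioning, not the SEEP implementation; the class and method names are made up:

```java
// Routes a message to the state partition that owns its key: the same hash
// function decides both which partition of the user-item matrix holds a row
// and which task instance receives the message.
public class Partitioner {
    private final int numPartitions;

    public Partitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // hash(msg.id) mod numPartitions; floorMod keeps the result non-negative.
    public int partitionFor(int key) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(4);
        // All updates for user 42 reach the same partition, so the task
        // owning that partition can mutate the row without coordination.
        System.out.println(p.partitionFor(42));
    }
}
```

Because every key deterministically maps to one partition, tasks never need distributed locks for partitioned state: each row of the user-item matrix has exactly one writer.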
(II) Partial State Elements

Partial SEs are replicated (when partitioning is impossible); tasks have local access

Access to partial SEs is either local or global:
Local access: data sent to one instance
Global access: data sent to all instances
State Synchronisation with Partial SEs

Reading all partial SE instances results in a set of partial values
A barrier collects the partial values from all instances
Requires application-specific merge logic: a merge task reconciles state and updates the partial SEs
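A merge task of this kind can be sketched as follows. Element-wise addition is one plausible merge for partial recommendation vectors, but the slide's point is that the merge logic is application-specific; the class name and the choice of addition are illustrative assumptions, not the SEEP API:

```java
import java.util.*;

// Merge logic for partial SEs: each task instance holds a replica (a
// "partial value"); a global read collects all replicas at a barrier and
// the merge task reconciles them into one value.
public class Merge {

    // Corresponds to merge(@Collection Vector[] v) in the annotated program,
    // sketched here over double[] instead of a Vector class.
    public static double[] merge(List<double[]> partials) {
        double[] merged = new double[partials.get(0).length];
        for (double[] partial : partials) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];   // app-specific reconciliation
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two partial values collected from two task instances.
        List<double[]> partials = List.of(new double[]{1, 2},
                                          new double[]{3, 4});
        System.out.println(Arrays.toString(merge(partials))); // [4.0, 6.0]
    }
}
```

The barrier matters because the merge must see every replica exactly once; merging a stale or missing partial value would silently corrupt the reconciled state.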
SDG for Collaborative Filtering

Figure 1: Stateful dataflow graph for the CF algorithm

New ratings flow into node n1 (task elements updateUserItem and updateCoocc, with state elements userItem and coocc); recommendation requests flow into n2 (getUserVec, getRecVec); node n3 runs the merge task and emits the recommendation result

The SDG does not require global coordination between nodes during recovery, and it uses dirty state to minimise the impact of checkpointing on the data processing path
SDG for Logistic Regression

Dataflow: training items flow into a train task that updates a partial weights SE and a merge task; items to classify flow through a classify task that reads the weights and emits the result

Requires support for iteration
Stateful Dataflow Graphs (SDGs)

2. Static program analysis: from annotated Java program (Program.java with @Partitioned, @Partial, @Global, ...) to data-parallel SDG
Partitioned State Annotation

@Partitioned field annotation indicates partitioned state; dataflow routed by hash(msg.id)

  @Partitioned Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(coocc, userItem);
  }

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coocc.multiply(userRow);
    return userRec;
  }
Partial State and Global Annotations

@Partial field annotation indicates partial state
@Global annotates a variable to indicate access to all partial instances

  @Partitioned Matrix userItem = new Matrix();
  @Partial Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(@Global coocc, userItem);
  }
Partial and Collection Annotations

@Collection annotation indicates merge logic

  @Partitioned Matrix userItem = new Matrix();
  @Partial Matrix coocc = new Matrix();

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    @Partial Vector puRec = @Global coocc.multiply(userRow);
    Vector userRec = merge(puRec);
    return userRec;
  }

  Vector merge(@Collection Vector[] v) { /* ... */ }
Java2SDG: Translation Process

1. Annotated Program.java
2. Soot framework: live variable analysis extracts TEs, SEs and state access patterns through static code analysis
3. Javassist: TE and SE access code assembly; generation of runnable code using TE and SE connections
4. SEEP runnable
Stateful Dataflow Graphs (SDGs)

3. SEEP distributed dataflow framework: dynamic scale out and checkpoint-based fault tolerance on cluster
Scale Out and Fault Tolerance for SDGs

High/bursty input rates → exploit data-parallelism through partitioning of state
Large-scale deployment → handle node failures (loss of state after node failure)
Dataflow Framework Managing State

Expose state as an external entity to be managed by the distributed dataflow framework

Framework has state management primitives to:
Backup and recover state elements
Partition state elements

Integrated mechanism for scale out and failure recovery: node recovery and scale out with state support
What is State?

Processing state: in-memory data structures updated by tasks (e.g. the user-item matrix)
Buffer state: output tuples buffered for downstream nodes (Data ts1, Data ts2, ...)
State Management Primitives

Checkpoint: makes state available to the framework; attaches the timestamp ts of the last processed data
Backup: moves a copy of state from one node to another
Restore: loads a backed-up copy of state on a node
Partition: splits state (A → A1, A2) to scale out tasks
State Primitive: Checkpointing

Challenge: efficient checkpointing of large state in Java?
No updates allowed while state is being checkpointed
Checkpointing state should not impact the data processing path

Asynchronous, lock-free checkpointing with dirty state:
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile dirty state
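The three steps above can be sketched as follows. This is a deliberately simplified single-flag version to show the idea of dirty state, not the SEEP implementation; a production version must handle the races between the flag check and the map update that this sketch glosses over:

```java
import java.util.*;
import java.util.concurrent.*;

// Dirty-state checkpointing: 1. freeze the live map, 2. route concurrent
// updates to a dirty map, 3. reconcile dirty state once the checkpoint is
// written. Class and method names are illustrative.
public class CheckpointableState {
    private final Map<String, Integer> live = new ConcurrentHashMap<>();
    private final Map<String, Integer> dirty = new ConcurrentHashMap<>();
    private volatile boolean checkpointing = false;

    public void update(String key, int value) {
        // While a checkpoint is in flight, updates go to the dirty state,
        // so the frozen copy stays consistent and processing never blocks.
        (checkpointing ? dirty : live).put(key, value);
    }

    public Map<String, Integer> checkpoint() {
        checkpointing = true;                        // 1. freeze mutable state
        Map<String, Integer> snapshot = new HashMap<>(live);
        // ... the snapshot would now be shipped to the backup node ...
        live.putAll(dirty);                          // 3. reconcile dirty state
        dirty.clear();
        checkpointing = false;
        return snapshot;
    }
}
```

The point of the design is that the data processing path keeps accepting updates for the whole duration of the checkpoint; only the reconciliation step touches both maps.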
State Primitives: Backup and Restore

Backup: a checkpoint tagged with timestamp t2 is copied from node A to node B
Restore: the copy is loaded on the target node; tuples after t2 remain in the buffer for replay
State Primitives: Partition

Processing state modelled as (key, value) dictionary
State partitioned according to key k: e.g. key space [0-n] split into A1 = [0-x] and A2 = [x-n]
Same key (e.g. userId) used to partition the streams
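The partition primitive can be sketched over a sorted dictionary. Splitting at key x yields two disjoint partitions A1 = [0, x) and A2 = [x, n) that can be placed on different nodes; the class name and split convention are illustrative assumptions:

```java
import java.util.*;

// Partition primitive: processing state is a (key, value) dictionary;
// splitting it at a key produces two disjoint partitions for scale out.
public class StatePartitioner {

    public static List<NavigableMap<Integer, String>> partition(
            NavigableMap<Integer, String> state, int splitKey) {
        // headMap/tailMap give disjoint views; copy them so each partition
        // can be shipped to its own node independently.
        NavigableMap<Integer, String> a1 =
            new TreeMap<>(state.headMap(splitKey, false)); // keys < splitKey
        NavigableMap<Integer, String> a2 =
            new TreeMap<>(state.tailMap(splitKey, true));  // keys >= splitKey
        return List.of(a1, a2);
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> userState =
            new TreeMap<>(Map.of(1, "rowA", 5, "rowB", 9, "rowC"));
        List<NavigableMap<Integer, String>> parts = partition(userState, 5);
        System.out.println(parts.get(0).keySet() + " | " + parts.get(1).keySet());
    }
}
```

Because the streams are partitioned by the same key, every tuple for a given userId still reaches exactly the node that now owns that slice of the dictionary.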
Failure Recovery and Scale Out

Node A sends data to node B. Two cases:
Node B fails → recover
Node B becomes a bottleneck → scale out
Recovering Failed Nodes

Periodically, stateful tasks checkpoint and back up their state to a designated upstream backup node
On failure, the backed-up state is used to recover quickly: the state is restored onto a new node
State restored and unprocessed data replayed from the upstream buffer
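The recovery step can be sketched as restore plus replay: the new node loads the last checkpoint (tagged with the timestamp ts of the last processed tuple) and the upstream node replays only buffered tuples with timestamp greater than ts. The names and the additive update are illustrative, not the SEEP API:

```java
import java.util.*;

// Recovery = restore checkpoint + replay buffered tuples newer than the
// checkpoint timestamp. The state update (summing ratings per key) stands
// in for whatever the recovered task actually computes.
public class Recovery {

    public record Tuple(long ts, String key, int value) {}

    public static Map<String, Integer> recover(Map<String, Integer> checkpoint,
                                               long checkpointTs,
                                               List<Tuple> upstreamBuffer) {
        Map<String, Integer> state = new HashMap<>(checkpoint);
        for (Tuple t : upstreamBuffer) {
            // Tuples up to checkpointTs are already reflected in the
            // checkpoint; only later ones must be reprocessed.
            if (t.ts() > checkpointTs) {
                state.merge(t.key(), t.value(), Integer::sum);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, Integer> checkpoint = Map.of("user-A", 4);
        List<Tuple> buffer = List.of(
            new Tuple(1, "user-A", 4),   // pre-checkpoint, skipped
            new Tuple(2, "user-A", 1));  // post-checkpoint, replayed
        System.out.println(recover(checkpoint, 1, buffer)); // {user-A=5}
    }
}
```

Attaching ts to every checkpoint is what makes the upstream buffer trimmable: everything at or before ts can be discarded, everything after it is exactly the replay set.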
Scaling Out Tasks

For scale out, the backup node already has the state elements to be parallelised
The state is partitioned and restored onto a new node (B → B, B1)
Finally, the upstream node replays unprocessed data to bring the checkpointed state up to date
Distributed M-to-N Backup/Recovery

Challenge: fast recovery?
Backups are large and cannot be stored in memory
Large writes to disk over the network have high cost

M-to-N distributed backup and parallel recovery:
Partition state and back up to multiple nodes
Recover state to multiple nodes
Stateful Dataflow Graphs (SDGs)

4. Experimental evaluation results
Throughput: Logistic Regression

100 GB training dataset for classification
Deployed on Amazon EC2 (m1.xlarge VMs with 4 vCPUs and 16 GB RAM)

[Chart: throughput (GB/s, 0-60) over number of nodes (25-100), comparing SDG and Spark]

SDGs have comparable throughput to Spark despite mutable state
Mutable State Access: Collaborative Filtering

Collaborative filtering while varying the state read/write ratio (get vs. add rating requests)
Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)

[Chart: throughput (1,000 requests/s, 0-20) and latency (ms, log scale 100-1000) over workload read/write ratios from 1:5 to 5:1]

SDGs serve fresh results over large mutable state
Elasticity: Linear Road Benchmark

Linear Road Benchmark [VLDB'04]: network of toll roads of size L; input rate increases over time; SLA: results < 5 secs
Deployed on Amazon EC2 (c1 and m1 xlarge instances)
Scales to L=350 with 60 VMs (L=512 is the highest reported result in the literature [VLDB'12])

[Chart: input rate and throughput (tuples/s, x100k) and number of VMs (0-60) over time (0-2000 seconds)]

SDGs can scale dynamically based on workload
Large State Size: Key/Value Store

Increase state size in a distributed key/value store

[Chart: throughput (million requests/s, 0-2) and latency (ms, log scale 1-1000) over aggregated memory (50-200 GB)]

SDGs can support online services with mutable state
Summary

Programming models for Big Data matter: logic is increasingly pushed into bespoke APIs, and existing models do not support fine-grained mutable state

Stateful Dataflow Graphs support mutable state: automatic translation of annotated Java programs to SDGs

SDGs introduce new challenges in terms of parallelism and failure recovery: automatic state partitioning and checkpoint-based recovery

SEEP available on GitHub: https://github.com/lsds/seep/

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management", SIGMOD'13
Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Making State Explicit for Imperative Big Data Processing", USENIX ATC'14

Thank you! Any Questions?
Peter Pietzuch <prp@doc.ic.ac.uk>
http://lsds.doc.ic.ac.uk