Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses
Peter R. Pietzuch <prp@doc.ic.ac.uk>
Large-Scale Distributed Systems Group, Department of Computing, Imperial College London
http://lsds.doc.ic.ac.uk
EIT Digital Summer School on Cloud and Big Data 2015, Stockholm, Sweden
Growth of Big Data Analytics
Big Data analytics: gaining value from data
Web analytics, fraud detection, systems management, network monitoring, business dashboards, ...
Need to enable more users to perform data analytics
Programming Language Popularity
Programming Models for Big Data?
Distributed dataflow frameworks tend to favour functional, declarative programming models: MapReduce, SQL, Pig, DryadLINQ, Spark, ...
This simplifies consistency and fault tolerance
Domain experts tend to write imperative programs: Java, Matlab, C++, R, Python, Fortran, ...
Example: Recommender Systems
Recommendations based on past user behaviour through collaborative filtering (cf. Netflix, Amazon, ...)
Customer activity on website: e.g. User A rates item "iphone" 3, rates another item 5
Up-to-date recommendations: e.g. recommend Apple Watch to User A
Distributed dataflow graph (e.g. MapReduce, Hadoop, Spark, Dryad, Naiad, ...) exploits data-parallelism on cluster of machines
Collaborative Filtering in Java

User-Item matrix (UI), updated with new ratings:
            Item-A  Item-B
  User-A      4       5
  User-B      0       5

Co-Occurrence matrix (CO), multiplied with a user row for a recommendation:
            Item-A  Item-B
  Item-A      1       1
  Item-B      1       2

  Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(coocc, userItem);
  }

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coocc.multiply(userRow);
    return userRec;
  }
Collaborative Filtering in Spark (Java)

  // Build the recommendation model using ALS
  int rank = 10;
  int numIterations = 20;
  MatrixFactorizationModel model =
      ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

  // Evaluate the model on rating data
  JavaRDD<Tuple2<Object, Object>> userProducts = ratings.map(
    new Function<Rating, Tuple2<Object, Object>>() {
      public Tuple2<Object, Object> call(Rating r) {
        return new Tuple2<Object, Object>(r.user(), r.product());
      }
    }
  );

  JavaPairRDD<Tuple2<Integer, Integer>, Double> predictions =
    JavaPairRDD.fromJavaRDD(
      model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(
        new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
          public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
            return new Tuple2<Tuple2<Integer, Integer>, Double>(
              new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
          }
        }
      ));

  JavaRDD<Tuple2<Double, Double>> ratesAndPreds =
    JavaPairRDD.fromJavaRDD(ratings.map(
      new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
        public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
          return new Tuple2<Tuple2<Integer, Integer>, Double>(
            new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
        }
      }
    )).join(predictions).values();
Collaborative Filtering in Spark (Scala)

  // Build the recommendation model using ALS
  val rank = 10
  val numIterations = 20
  val model = ALS.train(ratings, rank, numIterations, 0.01)

  // Evaluate the model on rating data
  val usersProducts = ratings.map {
    case Rating(user, product, rate) => (user, product)
  }
  val predictions = model.predict(usersProducts).map {
    case Rating(user, product, rate) => ((user, product), rate)
  }
  val ratesAndPreds = ratings.map {
    case Rating(user, product, rate) => ((user, product), rate)
  }.join(predictions)

All data immutable
No fine-grained model updates
Stateless MapReduce Model

Data model: (key, value) pairs over partitioned data on a distributed file system

Two processing functions:
  map(k1, v1) → list(k2, v2)
  reduce(k2, list(v2)) → list(v3)

Dataflow: map tasks (M) → shuffle → reduce tasks (R)

Benefits:
Simple programming model
Transparent parallelisation
Fault-tolerant processing
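The two functions above can be sketched in plain Java as a single-process word count. This simulates the map, shuffle and reduce phases of the model; it is not the Hadoop API, and all class and method names here are illustrative:

```java
import java.util.*;

// Word count expressed via the two MapReduce functions:
// map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v3).
public class WordCount {

    // map: (lineNo, line) -> [(word, 1), ...]
    static List<Map.Entry<String, Integer>> map(long lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> (word, count)
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: runs map on every record, shuffles by key, then reduces.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big big data")));
    }
}
```

Note that both functions are stateless: all information flows through the (key, value) pairs, which is what makes transparent parallelisation and re-execution after failure straightforward.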
Big Data Programming for the Masses

Our goals:
Imperative Java programming model for big data apps
High throughput through data-parallel execution on cluster
Fault tolerance against node failures

  System     Mutable State  Large State  Low Latency  Iteration
  MapReduce  No             n/a          No           No
  Spark      No             n/a          No           Yes
  Storm      No             n/a          Yes          No
  Naiad      Yes            No           Yes          Yes
  SDG        Yes            Yes          Yes          Yes
Stateful Dataflow Graphs (SDGs)

1. Annotated Java program (@Partitioned, @Partial, @Global, ...): Program.java
2. Static program analysis: data-parallel Stateful Dataflow Graph (SDG)
3. SEEP distributed dataflow framework: dynamic scale out and checkpoint-based fault tolerance on cluster
4. Experimental evaluation results
State as First Class Citizen

Tasks process data; dataflows represent data; State Elements (SEs) represent state

State elements (SEs) represent in-memory data structures (e.g. the user-item matrix)
SEs are mutable
Tasks have local access to arbitrary state in SEs
SEs can be shared between tasks
Challenges with Large State

Mutable state leads to concise algorithms but complicates scaling and fault tolerance

  Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

Big Data problem: matrices become large; state will not fit into a single node

Challenge: how to handle distributed state?
Distributed Mutable State

State Elements support two abstractions for distributed mutable state:
Partitioned SEs: tasks access partitioned state by key
Partial SEs: tasks can access replicated state
(I) Partitioned State Elements

Partitioned SE: key space [0-N] split into disjoint partitions, e.g. [0-k] and [(k+1)-N]

User-Item matrix (UI), accessed by key:
            Item-A  Item-B
  User-A      4       5
  User-B      0       5

State partitioned according to partitioning key
Dataflow routed according to hash function, e.g. hash(msg.id)
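The routing rule above can be sketched in a few lines of Java. This is only an illustration of hash-based partitioning, not the SEEP implementation; the class and method names are made up:

```java
// Routes a message to the state partition that owns its key: the same hash
// function decides both which partition of the user-item matrix holds a row
// and which task instance receives the message.
public class Partitioner {
    private final int numPartitions;

    public Partitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // hash(msg.id) mod numPartitions; floorMod keeps the result non-negative.
    public int partitionFor(int key) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(4);
        // All updates for user 42 reach the same partition, so the task
        // owning that partition can mutate the row without coordination.
        System.out.println(p.partitionFor(42));
    }
}
```

Because every key deterministically maps to one partition, tasks never need distributed locks for partitioned state: each row of the user-item matrix has exactly one writer.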
(II) Partial State Elements

Partial SEs are replicated (when partitioning is impossible); tasks have local access

Access to partial SEs is either local or global:
Local access: data sent to one instance
Global access: data sent to all instances
State Synchronisation with Partial SEs

Reading all partial SE instances results in a set of partial values
A barrier collects the partial values from all instances
Requires application-specific merge logic: a merge task reconciles state and updates the partial SEs
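A merge task of this kind can be sketched as follows. Element-wise addition is one plausible merge for partial recommendation vectors, but the slide's point is that the merge logic is application-specific; the class name and the choice of addition are illustrative assumptions, not the SEEP API:

```java
import java.util.*;

// Merge logic for partial SEs: each task instance holds a replica (a
// "partial value"); a global read collects all replicas at a barrier and
// the merge task reconciles them into one value.
public class Merge {

    // Corresponds to merge(@Collection Vector[] v) in the annotated program,
    // sketched here over double[] instead of a Vector class.
    public static double[] merge(List<double[]> partials) {
        double[] merged = new double[partials.get(0).length];
        for (double[] partial : partials) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];   // app-specific reconciliation
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two partial values collected from two task instances.
        List<double[]> partials = List.of(new double[]{1, 2},
                                          new double[]{3, 4});
        System.out.println(Arrays.toString(merge(partials))); // [4.0, 6.0]
    }
}
```

The barrier matters because the merge must see every replica exactly once; merging a stale or missing partial value would silently corrupt the reconciled state.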
SDG for Collaborative Filtering

Figure 1: Stateful dataflow graph for the CF algorithm

New ratings flow into node n1 (task elements updateUserItem and updateCoocc, with state elements userItem and coocc); recommendation requests flow into n2 (getUserVec, getRecVec); node n3 runs the merge task and emits the recommendation result

The SDG does not require global coordination between nodes during recovery, and it uses dirty state to minimise the impact of checkpointing on the data processing path
SDG for Logistic Regression

Dataflow: training items flow into a train task that updates a partial weights SE and a merge task; items to classify flow through a classify task that reads the weights and emits the result

Requires support for iteration
Stateful Dataflow Graphs (SDGs)

2. Static program analysis: from annotated Java program (Program.java with @Partitioned, @Partial, @Global, ...) to data-parallel SDG
Partitioned State Annotation

@Partitioned field annotation indicates partitioned state; dataflow routed by hash(msg.id)

  @Partitioned Matrix userItem = new Matrix();
  Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(coocc, userItem);
  }

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coocc.multiply(userRow);
    return userRec;
  }
Partial State and Global Annotations

@Partial field annotation indicates partial state
@Global annotates a variable to indicate access to all partial instances

  @Partitioned Matrix userItem = new Matrix();
  @Partial Matrix coocc = new Matrix();

  void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCooccurrence(@Global coocc, userItem);
  }
Partial and Collection Annotations

@Collection annotation indicates merge logic

  @Partitioned Matrix userItem = new Matrix();
  @Partial Matrix coocc = new Matrix();

  Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    @Partial Vector puRec = @Global coocc.multiply(userRow);
    Vector userRec = merge(puRec);
    return userRec;
  }

  Vector merge(@Collection Vector[] v) { /* ... */ }
Java2SDG: Translation Process

1. Annotated Program.java
2. Soot framework: live variable analysis extracts TEs, SEs and state access patterns through static code analysis
3. Javassist: TE and SE access code assembly; generation of runnable code using TE and SE connections
4. SEEP runnable
Stateful Dataflow Graphs (SDGs)

3. SEEP distributed dataflow framework: dynamic scale out and checkpoint-based fault tolerance on cluster
Scale Out and Fault Tolerance for SDGs

High/bursty input rates → exploit data-parallelism through partitioning of state
Large-scale deployment → handle node failures (loss of state after node failure)
Dataflow Framework Managing State

Expose state as an external entity to be managed by the distributed dataflow framework

Framework has state management primitives to:
Backup and recover state elements
Partition state elements

Integrated mechanism for scale out and failure recovery: node recovery and scale out with state support
What is State?

Processing state: in-memory data structures updated by tasks (e.g. the user-item matrix)
Buffer state: output tuples buffered for downstream nodes (Data ts1, Data ts2, ...)
State Management Primitives

Checkpoint: makes state available to the framework; attaches the timestamp ts of the last processed data
Backup: moves a copy of state from one node to another
Restore: loads a backed-up copy of state on a node
Partition: splits state (A → A1, A2) to scale out tasks
State Primitive: Checkpointing

Challenge: efficient checkpointing of large state in Java?
No updates allowed while state is being checkpointed
Checkpointing state should not impact the data processing path

Asynchronous, lock-free checkpointing with dirty state:
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile dirty state
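The three steps above can be sketched as follows. This is a deliberately simplified single-flag version to show the idea of dirty state, not the SEEP implementation; a production version must handle the races between the flag check and the map update that this sketch glosses over:

```java
import java.util.*;
import java.util.concurrent.*;

// Dirty-state checkpointing: 1. freeze the live map, 2. route concurrent
// updates to a dirty map, 3. reconcile dirty state once the checkpoint is
// written. Class and method names are illustrative.
public class CheckpointableState {
    private final Map<String, Integer> live = new ConcurrentHashMap<>();
    private final Map<String, Integer> dirty = new ConcurrentHashMap<>();
    private volatile boolean checkpointing = false;

    public void update(String key, int value) {
        // While a checkpoint is in flight, updates go to the dirty state,
        // so the frozen copy stays consistent and processing never blocks.
        (checkpointing ? dirty : live).put(key, value);
    }

    public Map<String, Integer> checkpoint() {
        checkpointing = true;                        // 1. freeze mutable state
        Map<String, Integer> snapshot = new HashMap<>(live);
        // ... the snapshot would now be shipped to the backup node ...
        live.putAll(dirty);                          // 3. reconcile dirty state
        dirty.clear();
        checkpointing = false;
        return snapshot;
    }
}
```

The point of the design is that the data processing path keeps accepting updates for the whole duration of the checkpoint; only the reconciliation step touches both maps.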
State Primitives: Backup and Restore

Backup: a checkpoint tagged with timestamp t2 is copied from node A to node B
Restore: the copy is loaded on the target node; tuples after t2 remain in the buffer for replay
State Primitives: Partition

Processing state modelled as (key, value) dictionary
State partitioned according to key k: e.g. key space [0-n] split into A1 = [0-x] and A2 = [x-n]
Same key (e.g. userId) used to partition the streams
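The partition primitive can be sketched over a sorted dictionary. Splitting at key x yields two disjoint partitions A1 = [0, x) and A2 = [x, n) that can be placed on different nodes; the class name and split convention are illustrative assumptions:

```java
import java.util.*;

// Partition primitive: processing state is a (key, value) dictionary;
// splitting it at a key produces two disjoint partitions for scale out.
public class StatePartitioner {

    public static List<NavigableMap<Integer, String>> partition(
            NavigableMap<Integer, String> state, int splitKey) {
        // headMap/tailMap give disjoint views; copy them so each partition
        // can be shipped to its own node independently.
        NavigableMap<Integer, String> a1 =
            new TreeMap<>(state.headMap(splitKey, false)); // keys < splitKey
        NavigableMap<Integer, String> a2 =
            new TreeMap<>(state.tailMap(splitKey, true));  // keys >= splitKey
        return List.of(a1, a2);
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> userState =
            new TreeMap<>(Map.of(1, "rowA", 5, "rowB", 9, "rowC"));
        List<NavigableMap<Integer, String>> parts = partition(userState, 5);
        System.out.println(parts.get(0).keySet() + " | " + parts.get(1).keySet());
    }
}
```

Because the streams are partitioned by the same key, every tuple for a given userId still reaches exactly the node that now owns that slice of the dictionary.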
Failure Recovery and Scale Out

Node A sends data to node B. Two cases:
Node B fails → recover
Node B becomes a bottleneck → scale out
Recovering Failed Nodes

Periodically, stateful tasks checkpoint and back up their state to a designated upstream backup node
On failure, the backed-up state is used to recover quickly: the state is restored onto a new node
State restored and unprocessed data replayed from the upstream buffer
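The recovery step can be sketched as restore plus replay: the new node loads the last checkpoint (tagged with the timestamp ts of the last processed tuple) and the upstream node replays only buffered tuples with timestamp greater than ts. The names and the additive update are illustrative, not the SEEP API:

```java
import java.util.*;

// Recovery = restore checkpoint + replay buffered tuples newer than the
// checkpoint timestamp. The state update (summing ratings per key) stands
// in for whatever the recovered task actually computes.
public class Recovery {

    public record Tuple(long ts, String key, int value) {}

    public static Map<String, Integer> recover(Map<String, Integer> checkpoint,
                                               long checkpointTs,
                                               List<Tuple> upstreamBuffer) {
        Map<String, Integer> state = new HashMap<>(checkpoint);
        for (Tuple t : upstreamBuffer) {
            // Tuples up to checkpointTs are already reflected in the
            // checkpoint; only later ones must be reprocessed.
            if (t.ts() > checkpointTs) {
                state.merge(t.key(), t.value(), Integer::sum);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, Integer> checkpoint = Map.of("user-A", 4);
        List<Tuple> buffer = List.of(
            new Tuple(1, "user-A", 4),   // pre-checkpoint, skipped
            new Tuple(2, "user-A", 1));  // post-checkpoint, replayed
        System.out.println(recover(checkpoint, 1, buffer)); // {user-A=5}
    }
}
```

Attaching ts to every checkpoint is what makes the upstream buffer trimmable: everything at or before ts can be discarded, everything after it is exactly the replay set.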
Scaling Out Tasks

For scale out, the backup node already has the state elements to be parallelised
The state is partitioned and restored onto a new node (B → B, B1)
Finally, the upstream node replays unprocessed data to bring the checkpointed state up to date
Distributed M-to-N Backup/Recovery

Challenge: fast recovery?
Backups are large and cannot be stored in memory
Large writes to disk over the network have high cost

M-to-N distributed backup and parallel recovery:
Partition state and back up to multiple nodes
Recover state to multiple nodes
Stateful Dataflow Graphs (SDGs)

4. Experimental evaluation results
Throughput: Logistic Regression

100 GB training dataset for classification
Deployed on Amazon EC2 (m1.xlarge VMs with 4 vCPUs and 16 GB RAM)

[Chart: throughput (GB/s, 0-60) over number of nodes (25-100), comparing SDG and Spark]

SDGs have comparable throughput to Spark despite mutable state
Mutable State Access: Collaborative Filtering

Collaborative filtering while varying the state read/write ratio (get vs. add rating requests)
Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)

[Chart: throughput (1,000 requests/s, 0-20) and latency (ms, log scale 100-1000) over workload read/write ratios from 1:5 to 5:1]

SDGs serve fresh results over large mutable state
Elasticity: Linear Road Benchmark

Linear Road Benchmark [VLDB'04]: network of toll roads of size L; input rate increases over time; SLA: results < 5 secs
Deployed on Amazon EC2 (c1 and m1 xlarge instances)
Scales to L=350 with 60 VMs (L=512 is the highest reported result in the literature [VLDB'12])

[Chart: input rate and throughput (tuples/s, x100k) and number of VMs (0-60) over time (0-2000 seconds)]

SDGs can scale dynamically based on workload
Large State Size: Key/Value Store

Increase state size in a distributed key/value store

[Chart: throughput (million requests/s, 0-2) and latency (ms, log scale 1-1000) over aggregated memory (50-200 GB)]

SDGs can support online services with mutable state
Summary

Programming models for Big Data matter: logic is increasingly pushed into bespoke APIs, and existing models do not support fine-grained mutable state

Stateful Dataflow Graphs support mutable state: automatic translation of annotated Java programs to SDGs

SDGs introduce new challenges in terms of parallelism and failure recovery: automatic state partitioning and checkpoint-based recovery

SEEP available on GitHub: https://github.com/lsds/seep/

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management", SIGMOD'13
Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Making State Explicit for Imperative Big Data Processing", USENIX ATC'14

Thank you! Any Questions?
Peter Pietzuch <prp@doc.ic.ac.uk>
http://lsds.doc.ic.ac.uk