CIEL: A universal execution engine for distributed data-flow computing




Reviewing: CIEL: A universal execution engine for distributed data-flow computing. Presented by Niko Stahl for R202

Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work 7. Conclusion

Motivation MapReduce Dryad

Motivation MapReduce/Dryad have shortcomings: 1. They are designed to maximize throughput, not to minimize latency. 2. They perform scheduling before running the algorithm, so the resulting schedule is static. This makes MapReduce/Dryad inappropriate for iterative algorithms.

Goals Design a distributed execution framework that can 1. efficiently run iterative algorithms 2. provide a simple interface 3. offer transparent fault tolerance


CIEL's Computation Model The key feature of CIEL is a dynamic task graph. Primitives of the model: 1. Object: An unstructured sequence of bytes (code, libraries, data, etc.). 2. Reference: The location where an object is stored. 3. Task: A computation that executes completely on a single machine. Tasks can publish results and spawn other tasks.
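
The three primitives can be sketched in a few lines of Python. This is only an illustration of the model on this slide, not CIEL's implementation: the names Reference, Task, Runtime, publish, spawn and countdown are all hypothetical, and everything runs in one process. The task graph is dynamic because a running task may add new tasks, here by delegating its own output to a child.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Reference:
    """Names an object; the runtime maps the name to bytes once it exists."""
    name: str


@dataclass
class Task:
    """A computation that must eventually cause `output` to be published."""
    output: Reference
    run: Callable[["Runtime", "Task"], None]
    deps: List[Reference] = field(default_factory=list)


class Runtime:
    """Hypothetical single-process stand-in for a cluster."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}  # object store: name -> bytes
        self.pending: List[Task] = []        # tasks not yet executed

    def publish(self, ref: Reference, data: bytes) -> None:
        self.objects[ref.name] = data

    def spawn(self, task: Task) -> None:
        self.pending.append(task)

    def run_all(self) -> None:
        while self.pending:
            # Pick any task whose dependencies have all been published.
            ready = next(t for t in self.pending
                         if all(d.name in self.objects for d in t.deps))
            self.pending.remove(ready)
            ready.run(self, ready)


def countdown(n: int, acc: int) -> Callable[[Runtime, Task], None]:
    """Either publish the result or spawn a child that inherits the output."""
    def run(rt: Runtime, task: Task) -> None:
        if n == 0:
            rt.publish(task.output, str(acc).encode())
        else:
            rt.spawn(Task(task.output, countdown(n - 1, acc + n)))
    return run


rt = Runtime()
result = Reference("result")
rt.spawn(Task(result, countdown(3, 0)))
rt.run_all()
print(rt.objects["result"])  # b'6'
```

The number of tasks is not known when the job is submitted; it depends on the data (here, on n), which is what a static schedule cannot express.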

An example task graph

System Architecture The master maintains the current state of the task graph in the object and task tables. The master schedules by lazily evaluating output objects and pairing runnable tasks with idle workers. Workers execute tasks and store objects.
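
Lazy evaluation here can be read as: start from the object the job must produce and walk backwards through the task table, touching only the tasks that the result actually needs. The sketch below assumes a simplified representation (plain dictionaries and string identifiers); runnable_tasks, assign and the table layout are hypothetical names, not CIEL's data structures.

```python
from typing import Dict, List, Set, Tuple


def runnable_tasks(target: str,
                   producers: Dict[str, str],
                   inputs: Dict[str, List[str]],
                   stored: Set[str]) -> List[str]:
    """Walk back from `target` and return tasks whose inputs are all stored.

    producers: object name -> task that produces it (a toy task table)
    inputs:    task -> object names it reads
    stored:    objects that already exist (a toy object table)
    """
    runnable: List[str] = []
    seen: Set[str] = set()
    stack = [target]
    while stack:
        obj = stack.pop()
        if obj in stored or obj in seen:
            continue
        seen.add(obj)
        task = producers[obj]
        missing = [i for i in inputs[task] if i not in stored]
        if missing:
            stack.extend(missing)   # evaluate the missing inputs first
        else:
            runnable.append(task)   # every input exists: ready to run
    return runnable


def assign(runnable: List[str], idle_workers: List[str]) -> List[Tuple[str, str]]:
    """Pair runnable tasks with idle workers; leftovers wait for the next round."""
    return list(zip(runnable, idle_workers))


producers = {"out": "reduce", "m1": "map1", "m2": "map2", "unused": "other"}
inputs = {"reduce": ["m1", "m2"], "map1": ["in1"], "map2": ["in2"], "other": ["in1"]}
stored = {"in1", "in2"}

ready = runnable_tasks("out", producers, inputs, stored)
print(sorted(ready))            # ['map1', 'map2']
print(assign(ready, ["w1"]))    # one pair; the other task waits
```

The task named other is never considered because nothing on the path to out depends on its output.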

Skywriting A simple programming interface to CIEL

Task Creation in Skywriting Task creation is the distinctive feature that facilitates data-dependent control flow. Two essential ways to create tasks in Skywriting: 1. spawn(f, args = [...]) spawns a child task that computes f(args), and returns a reference to its result. Explicit task creation. 2. * (unary dereference operator that applies to a reference) loads the referenced data and evaluates to the resulting data structure. Implicit task creation.

Implicit Task Creation with * Problem: CIEL tasks are non-blocking, but dereferencing future objects requires waiting for tasks to complete. Solution: Implicit creation of a continuation task, which depends on the dereferenced object and the current execution stack.
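
One way to picture the continuation mechanism is with Python generators: a task yields the reference it wants to dereference, the engine parks the suspended generator until that object exists, and then resumes it with the value. This is an analogy only, under stated assumptions: Engine, Ref, put, double and main are hypothetical names, and the suspended state lives in memory here, whereas the slide describes a continuation task that depends on the saved execution stack.

```python
from typing import Any, Dict, Generator, List, Tuple


class Ref:
    """A reference to an object that may not exist yet."""
    def __init__(self, name: str) -> None:
        self.name = name


class Engine:
    def __init__(self) -> None:
        self.store: Dict[str, Any] = {}
        # Entries: (output reference, suspended task, value to resume with)
        self.ready: List[Tuple[Ref, Generator, Any]] = []
        # Object name -> continuations waiting for that object
        self.waiting: Dict[str, List[Tuple[Ref, Generator]]] = {}
        self.counter = 0

    def _fresh(self) -> Ref:
        self.counter += 1
        return Ref(f"obj{self.counter}")

    def put(self, value: Any) -> Ref:
        """Store existing data and return a reference to it."""
        ref = self._fresh()
        self.store[ref.name] = value
        return ref

    def spawn(self, fn, *args) -> Ref:
        """Explicit task creation: returns a future reference immediately."""
        out = self._fresh()
        self.ready.append((out, fn(self, *args), None))
        return out

    def run(self) -> None:
        while self.ready:
            out, gen, value = self.ready.pop(0)
            try:
                wanted = gen.send(value)      # run until the next dereference
            except StopIteration as done:
                self._publish(out, done.value)
                continue
            if wanted.name in self.store:     # already available: resume later
                self.ready.append((out, gen, self.store[wanted.name]))
            else:                             # park the continuation, do not block
                self.waiting.setdefault(wanted.name, []).append((out, gen))

    def _publish(self, out: Ref, value: Any) -> None:
        self.store[out.name] = value
        for waiter_out, waiter in self.waiting.pop(out.name, []):
            self.ready.append((waiter_out, waiter, value))


def double(engine: Engine, ref: Ref):
    x = yield ref          # plays the role of *ref
    return 2 * x


def main(engine: Engine, inp: Ref):
    a = engine.spawn(double, inp)
    b = engine.spawn(double, a)   # depends on a future object
    result = yield b              # main is suspended here, not blocked
    return result + 1


engine = Engine()
out = engine.spawn(main, engine.put(5))
engine.run()
print(engine.store[out.name])  # 21
```

While main is parked, the engine is free to run the two double tasks; no worker sits idle waiting for a result.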

Running a simple script


Fault Tolerance Client: Trivial, since no driver program is required. Worker: Monitored by the master (similar to Dryad). Master: Master state can be derived from the set of active jobs. This is accomplished with persistent logging and with object table reconstruction from the workers.
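
The master-recovery idea can be sketched as two steps: replay a persistent log to rebuild the task table, then let the workers report what they hold to rebuild the object table. The log format (one JSON record per line), the function names and the report shape are assumptions made for this illustration; the slide does not specify them.

```python
import io
import json
from typing import Dict, List, Set, Tuple


def log_task(log: io.StringIO, task_id: str, inputs: List[str], output: str) -> None:
    """Append one task record to the (hypothetical) persistent log."""
    log.write(json.dumps({"task": task_id, "inputs": inputs, "output": output}) + "\n")


def recover(log_text: str,
            worker_reports: Dict[str, List[str]]
            ) -> Tuple[Dict[str, dict], Dict[str, Set[str]], List[str]]:
    """Rebuild master state after a restart.

    Returns the task table, the object table (object -> workers holding it)
    and the tasks whose output no worker reported, which still need to run.
    """
    task_table: Dict[str, dict] = {}
    for line in log_text.splitlines():
        record = json.loads(line)
        task_table[record["task"]] = record

    object_table: Dict[str, Set[str]] = {}
    for worker, objects in worker_reports.items():
        for obj in objects:
            object_table.setdefault(obj, set()).add(worker)

    todo = [task_id for task_id, record in task_table.items()
            if record["output"] not in object_table]
    return task_table, object_table, todo


log = io.StringIO()
log_task(log, "map1", ["in1"], "m1")
log_task(log, "reduce", ["m1"], "out")

# After a simulated master crash, one worker reports the objects it stores.
tasks, objects, todo = recover(log.getvalue(), {"w1": ["in1", "m1"]})
print(todo)  # ['reduce']
```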


Experiment I: Grep How does CIEL compare to Hadoop? Hadoop polls for tasks once every 5 seconds. [This has changed since 2011. See patch: MAPREDUCE-1906] Hadoop runs mandatory setup and cleanup for each job. Note Hadoop's weaker performance for small tasks.

Experiment II: k-means How does CIEL compare to Hadoop (Apache Mahout) for iterative algorithms? Hadoop does not perform cross-job optimisations. Each iteration is an independent job. CIEL prefers workers that have consumed the same data in previous iterations, which leads to better data locality.
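
A locality preference of this kind can be sketched as a scoring rule: among the idle workers, pick the one that already holds the most input bytes for the task. The rule, the function name pick_worker and the table shapes are assumptions for illustration; the slide only says that CIEL prefers workers that consumed the same data before.

```python
from typing import Dict, List, Set


def pick_worker(task_inputs: List[str],
                idle_workers: List[str],
                locations: Dict[str, Set[str]],
                sizes: Dict[str, int]) -> str:
    """Return the idle worker holding the most input bytes locally."""
    def local_bytes(worker: str) -> int:
        return sum(sizes[obj] for obj in task_inputs
                   if worker in locations.get(obj, set()))
    return max(idle_workers, key=local_bytes)


# After iteration 1 of k-means, each worker still holds the partition it read.
locations = {"part0": {"w1"}, "part1": {"w2"}, "centroids": {"w1", "w2"}}
sizes = {"part0": 100, "part1": 100, "centroids": 1}

# Iteration 2 task for partition 1: w2 holds 101 local bytes, w1 only 1.
print(pick_worker(["part1", "centroids"], ["w1", "w2"], locations, sizes))  # w2
```

When every iteration is a separate job, as on the Hadoop side of this comparison, no such table survives from one iteration to the next.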

Experiment III: Dynamic programming (DP) CIEL can distribute partially parallelizable tasks that do not cleanly fall into the MapReduce format.
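
To see why such problems are only partially parallelizable, assume the classic grid dependency pattern, in which block (i, j) needs blocks (i-1, j), (i, j-1) and (i-1, j-1). The slide does not state the pattern, so this is an assumption. Under it, only blocks on the same anti-diagonal are independent, and the available parallelism grows and then shrinks as the wavefront moves across the grid. The function name wavefronts is hypothetical.

```python
from typing import List, Tuple


def wavefronts(rows: int, cols: int) -> List[List[Tuple[int, int]]]:
    """Group grid blocks by anti-diagonal d = i + j.

    Blocks in one group do not depend on each other, so they could run as
    parallel tasks; each group depends only on earlier groups.
    """
    return [
        [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]
        for d in range(rows + cols - 1)
    ]


for step, blocks in enumerate(wavefronts(2, 3)):
    print(step, blocks)
# 0 [(0, 0)]
# 1 [(0, 1), (1, 0)]
# 2 [(0, 2), (1, 1)]
# 3 [(1, 2)]
```

In a dynamic task graph each block can simply name its neighbours' outputs as dependencies, so a block becomes runnable as soon as its neighbours finish, without a barrier between wavefronts.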

Goals (revisited) Design a distributed execution framework that can 1. efficiently run iterative algorithms [dynamic task graph] 2. provide a simple interface [Skywriting] 3. offer transparent fault tolerance [Master fault tolerance]

Related Work Pregel: Google's distributed execution engine for graph algorithms. [designed primarily for graph algorithms] HaLoop: The task scheduler is made loop-aware by adding caching mechanisms. [lacks fault tolerance] Apache Mahout: Uses Hadoop as its execution engine, and a driver program runs iterative algorithms. [lacks master fault tolerance + requires driver program]

Conclusion What are CIEL's significant contributions? Iterative algorithms can be a single job. Therefore, there is no driver program running outside of the cluster. Dynamic task graph: a task can spawn further tasks. Fault tolerance for the master.