Workflow Management System for Stratosphere




Thesis presentation by Suryamita Harindrari, September 5th, 2014
Thesis advisor: Asterios Katsifodimos, PhD
Thesis supervisor: Prof. Dr. Volker Markl
Database & Information Management (DIMA), Technische Universität Berlin

Agenda
- Background
  - Workflow & Workflow Management System
  - Control Flow vs Data Flow
  - Related Work
- Motivation
- Approach
  - Stage 1: Translating AST to Control Flow Graph
    - Abstract Syntax Tree (AST)
    - Control Flow Graph
  - Stage 2: Adding Data Flow to the Control Flow Graph
    - Data Flow Analysis
  - Stage 3: Generate Code for Underlying System
- Evaluation: Productivity & Generality
- Conclusion
- Future Work

Workflows & Workflow Management System
- Big Data Analytics → complex applications that process large datasets on distributed resources
- Workflow:
  - Automates procedures that would otherwise have to be carried out manually [Deelman et al, 2009]
  - A sequence of steps or computations [Crobak, 2012]
- Workflow Management System (WMS):
  - Defines, manages and executes workflows
  - The order of execution is driven by a computer representation of the workflow logic [Hollingsworth et al, 1993]

Simple Workflow vs Complex Workflow
Figures: Promoter Identification Workflow [Ludäscher et al, 2005]; ETL Process Workflow [Crobak, 2012]

Taxonomy of a Workflow
Figure: Workflow Taxonomy [Yu et al, 2005]

Data Flow vs Control Flow
Data Flow
- Related work on data flow systems: Hadoop MR, Stratosphere, Pig, Hive, Jet
- Limitations:
  - No support for control structures
  - Low-level optimized code → reduced productivity
  - High overhead of learning a new language, e.g. Pig Latin
Control Flow
- Related work on workflow systems: Oozie, Luigi, Azkaban, Kepler, Spark
- Limitations:
  - Markup languages → cumbersome
  - Graphical representation → limited
  - Task and data dependencies are defined manually

Motivation
Problem
- Stratosphere → does not support control flow outside UDFs
- Existing workflow systems → dependencies are specified manually
Solution
- A WMS that automatically detects the control flow and data dependencies between tasks from pure program code
- An intuitive way for the programmer to define the workflow
Goals
- Design and develop a WMS that works on top of Stratosphere
- Define a workflow domain-specific language (DSL) to make defining workflows easier

Workflow Design: Our Taxonomy
Figure: The Design of Our Workflow System

Approach
Translate the program code into target code:
- Translate the user program to an Intermediate Representation (IR): a Control Flow Graph (CFG)
- Add data flow to the CFG
- Generate code for the underlying system
- The WMS executes the jobs

Stage 1 Part 1: Translate User Program to AST
- The compiler constructs a sequence of Intermediate Representations (IR), which can take a variety of forms
- Abstract Syntax Tree (AST) → data structure that represents program constructs
  - Each node in the AST represents an operator
  - The children of a node represent the operands of that operator
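The operator/operand structure of an AST can be illustrated with a minimal sketch. The types `Expr`, `Num`, `Var` and `BinOp` are hypothetical stand-ins invented for this example; they are not the Scala compiler's tree classes and not part of the thesis.

```scala
// Hypothetical mini-AST (assumption for illustration only).
sealed trait Expr
final case class Num(value: Int) extends Expr
final case class Var(name: String) extends Expr
// An operator node; its children `left` and `right` are the operands.
final case class BinOp(op: String, left: Expr, right: Expr) extends Expr

object AstSketch {
  // AST of the expression (a + 1000) * 2
  val tree: Expr = BinOp("*", BinOp("+", Var("a"), Num(1000)), Num(2))

  // Evaluate bottom-up: operands first, then the operator node.
  def eval(e: Expr, env: Map[String, Int]): Int = e match {
    case Num(v)           => v
    case Var(n)           => env(n)
    case BinOp("+", l, r) => eval(l, env) + eval(r, env)
    case BinOp("*", l, r) => eval(l, env) * eval(r, env)
    case BinOp(op, _, _)  =>
      throw new IllegalArgumentException(s"unsupported operator: $op")
  }

  // Render the tree back into fully parenthesised source form.
  def show(e: Expr): String = e match {
    case Num(v)          => v.toString
    case Var(n)          => n
    case BinOp(op, l, r) => s"(${show(l)} $op ${show(r)})"
  }
}
```

For instance, `AstSketch.show(AstSketch.tree)` renders the tree as a parenthesised expression, and `AstSketch.eval` walks the same structure to compute a value.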

Grammar Definition & AST Representation
Figure: Grammar definition supported by our DSL

Our Tool: Scala AST
- Reuse the Scala AST provided by the Scala compiler
- Scala Macros
  - Compile-time metaprogramming
  - Expand trees at compile time, letting programmers inspect and manipulate the AST within the compilation scope
- Scala AST classes [Stocker, 2010]
  - Block: list of statements followed by an expression that gives the return value
  - ValDef: immutable and mutable variable definitions
  - Assign: non-initial assignments to variables
  - If: consists of cond, thenp, and elsep sub-trees
  - LabelDef: represents an iteration statement

Generating AST from User Program
Sample program in our workflow DSL:

  val e1 = DataSource("..")
  val e2 = DataSource("..")
  var e3: DataSet[(String, Int, Int)] = null
  var i = 0
  while (i < 0) {
    if (e1.map(x => x._2) > 50)
      e3 = e1.map { x => (x._1, x._2 + 1000, x._3) }
    else
      e3 = e2.map { x => (x._1, x._2 + 1500, x._3) }
    i = i + 1
  }
  val e4 = e3.write( )
  e4

Stage 1 Part 2: Generate Control Flow Graph from AST
Control Flow Graph
- Directed graph in which the nodes represent basic blocks and the edges represent control flow paths [Allen, 1970]
- Basic blocks → sequences of instructions or statements that are always executed together
- Edges represent the possible flow of control from the end of one basic block to the beginning of another
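As a rough sketch of how basic blocks and control flow edges can be derived from a statement tree, the following builds a CFG for sequences, if statements and while loops. The statement types (`Simple`, `IfStmt`, `WhileStmt`), the `Cfg` class and the block numbering are assumptions made for this example; this is one possible construction, not the algorithm from the thesis.

```scala
import scala.collection.mutable

// Hypothetical statement tree (assumption; not the Scala compiler AST).
sealed trait Stmt
final case class Simple(text: String) extends Stmt
final case class IfStmt(cond: String, thenp: List[Stmt], elsep: List[Stmt]) extends Stmt
final case class WhileStmt(cond: String, body: List[Stmt]) extends Stmt

// Blocks are keyed by id; edges are (from, to) pairs.
final case class Cfg(blocks: Map[Int, List[String]], edges: Set[(Int, Int)])

object CfgSketch {
  def build(program: List[Stmt]): Cfg = {
    val blocks = mutable.Map[Int, List[String]]()
    val edges = mutable.Set[(Int, Int)]()
    var nextId = 0
    def newBlock(): Int = { val id = nextId; nextId += 1; blocks(id) = Nil; id }
    def append(b: Int, s: String): Unit = { blocks(b) = blocks(b) :+ s }

    // Emits `stmts` starting in block `cur`; returns the block where control continues.
    def go(stmts: List[Stmt], cur: Int): Int = stmts.foldLeft(cur) {
      case (b, Simple(t)) =>
        append(b, t); b
      case (b, IfStmt(c, t, e)) =>
        append(b, s"if ($c)")
        val thenStart = newBlock()
        val elseStart = newBlock()
        edges += (b -> thenStart)
        edges += (b -> elseStart)
        val thenEnd = go(t, thenStart)
        val elseEnd = go(e, elseStart)
        val join = newBlock() // both branches meet here
        edges += (thenEnd -> join)
        edges += (elseEnd -> join)
        join
      case (b, WhileStmt(c, body)) =>
        val header = newBlock() // the loop condition gets its own block
        append(header, s"while ($c)")
        edges += (b -> header)
        val bodyStart = newBlock()
        edges += (header -> bodyStart)
        val bodyEnd = go(body, bodyStart)
        edges += (bodyEnd -> header) // back edge
        val exit = newBlock()
        edges += (header -> exit)
        exit
    }
    go(program, newBlock())
    Cfg(blocks.toMap, edges.toSet)
  }

  // Shape of the sample workflow program: init, loop with if/else, final write.
  val sample: List[Stmt] = List(
    Simple("init e1, e2, e3, i"),
    WhileStmt("loop condition on i", List(
      IfStmt("condition on e1", List(Simple("e3 = e1.map(..)")), List(Simple("e3 = e2.map(..)"))),
      Simple("i = i + 1"))),
    Simple("e4 = e3.write(..)"))
}
```

With this construction the sample program yields seven blocks (numbered 0 to 6 here), including a back edge from the block holding `i = i + 1` to the loop header.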

CFG for Various Statements

Generated CFG from AST

Create CFG from AST: Algorithm (1 of 2)

Create CFG from AST: Algorithm (2 of 2)

Stage 2: Generate CF-Enriched Data Flow
Data Flow Analysis [Lam et al, 2006]
- The CFG lacks the transmission of information through program variables
- Derive information about the flow of data along program execution paths
- Traverse the CFG to detect data dependencies
- Add a second type of edge that represents the data dependencies between blocks

Generate Def-Use Pair
- Compute the set of variables defined, def_B, and the set of variables used, use_B, in each block B of the CFG
- Association between a block and a variable of the program:
  - def(B, v) holds, for a variable v and a vertex B, if B defines v
  - use(B, v) holds, for a variable v and a vertex B, if B uses the value of v
- Generate the def-use pair information for each block in G(V, E)
- Add an edge from block B1 to block B2 that depicts the data flow of variable v, given that def(B1, v) reaches use(B2, v)
- def(B1, v) reaches use(B2, v) when there is a definition-clear path from B1 to B2, i.e. a path on which no intermediate block redefines v
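The reaching condition above can be sketched as a graph search: starting from the successors of a defining block, follow control flow edges, record every block that uses the variable, and stop at any block that redefines it. The names `Block`, `Dfe` and `dataFlowEdges` are hypothetical, and the sketch assumes that a block which both uses and defines a variable (such as `i = i + 1`) reads it before writing it. The block numbers 0 to 6 are an assumed numbering of the sample program, not the labels from the slides.

```scala
// Hypothetical types (assumption for illustration only).
final case class Block(id: Int, defs: Set[String], uses: Set[String])
final case class Dfe(from: Int, to: Int, variable: String) // data flow edge

object DefUseSketch {
  def dataFlowEdges(blocks: List[Block], edges: Set[(Int, Int)]): Set[Dfe] = {
    val byId = blocks.map(b => b.id -> b).toMap
    val succ: Map[Int, Set[Int]] =
      edges.groupBy(_._1).map { case (from, es) => from -> es.map(_._2) }

    // All uses of `v` reachable from `start` along definition-clear paths.
    def reach(start: Block, v: String): Set[Dfe] = {
      var found = Set.empty[Dfe]
      var visited = Set.empty[Int]
      var frontier = succ.getOrElse(start.id, Set.empty[Int]).toList
      while (frontier.nonEmpty) {
        val id = frontier.head
        frontier = frontier.tail
        if (!visited(id)) {
          visited += id
          val b = byId(id)
          if (b.uses(v)) found += Dfe(start.id, id, v)
          // A redefinition of v ends the definition-clear path.
          if (!b.defs(v)) frontier = succ.getOrElse(id, Set.empty[Int]).toList ++ frontier
        }
      }
      found
    }
    (for (b <- blocks; v <- b.defs) yield reach(b, v)).flatten.toSet
  }

  // Assumed block structure of the sample workflow program.
  val blocks: List[Block] = List(
    Block(0, Set("e1", "e2", "e3", "i"), Set()), // initial definitions
    Block(1, Set(), Set("i")),                   // while condition
    Block(2, Set(), Set("e1")),                  // if condition
    Block(3, Set("e3"), Set("e1")),              // then branch
    Block(4, Set("e3"), Set("e2")),              // else branch
    Block(5, Set("i"), Set("i")),                // i = i + 1
    Block(6, Set("e4"), Set("e3")))              // final write
  val edges: Set[(Int, Int)] =
    Set(0 -> 1, 1 -> 2, 2 -> 3, 2 -> 4, 3 -> 5, 4 -> 5, 5 -> 1, 1 -> 6)
}
```

Note that the initial definition of `e3` in block 0 still reaches the final write, because the loop body may execute zero times.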

CFG with Def-Use Pair

  val e1 = DataSource("..")
  val e2 = DataSource("..")
  var e3: DataSet[(String, Int, Int)] = null
  var i = 0
  while (i < 0) {
    if (e1.map(x => x._2) > 50)
      e3 = e1.map { x => (x._1, x._2 + 1000, x._3) }
    else
      e3 = e2.map { x => (x._1, x._2 + 1500, x._3) }
    i = i + 1
  }
  val e4 = e3.write( )
  e4

Adding Data Flow to the CFG
Output: G(V, E, DFE)

Control-Flow-Enriched Data Flow
Figure: control-flow-enriched data flow graph (blocks numbered 1 to 7)

Stage 3: Generate Code for Underlying System
- Assumption: the generated code runs only on systems that offer the set of primitives currently supported by Stratosphere
- Transform each block in G(V, E, DFE) into a Stratosphere job
- Output: Stratosphere jobs to be executed in the WMS, ordered according to the dependencies defined in the IR
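One way to picture how a WMS could run one job per block in dependency order is a small driver that executes the job of the current block and then follows the control flow edge chosen by the branch outcome. Everything here is an assumption made for illustration: `Job`, `Next`, `Goto`, `Branch`, `Stop` and `execute` are hypothetical names, jobs are plain functions over an in-memory map, and nothing below uses the Stratosphere API.

```scala
// Hypothetical job: transforms the workflow state (assumption, not Stratosphere).
final case class Job(name: String, run: Map[String, Int] => Map[String, Int])

// What happens after a block's job has finished.
sealed trait Next
final case class Goto(block: Int) extends Next
final case class Branch(cond: Map[String, Int] => Boolean, ifTrue: Int, ifFalse: Int) extends Next
case object Stop extends Next

object DriverSketch {
  // Runs jobs block by block; returns the executed job names and the final state.
  def execute(jobs: Map[Int, Job], next: Map[Int, Next], entry: Int): (List[String], Map[String, Int]) = {
    @annotation.tailrec
    def loop(block: Int, state: Map[String, Int], trace: List[String]): (List[String], Map[String, Int]) = {
      val job = jobs(block)
      val newState = job.run(state)
      val newTrace = job.name :: trace
      next(block) match {
        case Goto(b)         => loop(b, newState, newTrace)
        case Branch(c, t, f) => loop(if (c(newState)) t else f, newState, newTrace)
        case Stop            => (newTrace.reverse, newState)
      }
    }
    loop(entry, Map.empty, Nil)
  }

  // A loop that runs its body three times, then a final sink job.
  val jobs: Map[Int, Job] = Map(
    0 -> Job("init", _ => Map("i" -> 0, "x" -> 0)),
    1 -> Job("header", s => s),
    2 -> Job("body", s => s + ("x" -> (s("x") + 10)) + ("i" -> (s("i") + 1))),
    3 -> Job("sink", s => s))
  val next: Map[Int, Next] = Map(
    0 -> Goto(1),
    1 -> Branch(s => s("i") < 3, 2, 3),
    2 -> Goto(1),
    3 -> Stop)
}
```

Calling `DriverSketch.execute(DriverSketch.jobs, DriverSketch.next, 0)` walks the graph from the entry block; the loop header decides after each iteration whether the body job or the sink job runs next.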

Code Generation Algorithm (1 of 2)
- Each incoming DFE to a block → the Stratosphere job of that block requires as input the data or variable carried by the DFE
- Each outgoing DFE from a block → the Stratosphere job of that block needs to output the variable carried by the DFE
- The WMS automatically selects which job to run

Code Generation Algorithm (2 of 2)
- J → sequence of Stratosphere jobs j(I, O)
- I → data sources: the set of all input variables of the job
- O → data sinks: the set of all output variables of the job
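The mapping from data flow edges to job inputs and outputs can be sketched as follows: the variables on a block's incoming DFEs form its input set I, and the variables on its outgoing DFEs form its output set O. `Dfe`, `JobSpec` and `jobSpecs` are hypothetical names, and the example edges are an assumed set for the sample program rather than output of the thesis implementation.

```scala
// Hypothetical types (assumption for illustration only).
final case class Dfe(from: Int, to: Int, variable: String)
final case class JobSpec(block: Int, inputs: Set[String], outputs: Set[String])

object JobSpecSketch {
  // One job specification j(I, O) per block, derived from the data flow edges.
  def jobSpecs(blockIds: List[Int], dfes: Set[Dfe]): List[JobSpec] =
    blockIds.map { id =>
      JobSpec(
        id,
        inputs = dfes.collect { case Dfe(_, `id`, v) => v },  // incoming DFEs
        outputs = dfes.collect { case Dfe(`id`, _, v) => v }) // outgoing DFEs
    }

  // Assumed data flow edges of the sample workflow program (blocks 0 to 6).
  val exampleDfes: Set[Dfe] = Set(
    Dfe(0, 2, "e1"), Dfe(0, 3, "e1"), Dfe(0, 4, "e2"), Dfe(0, 6, "e3"),
    Dfe(0, 1, "i"), Dfe(0, 5, "i"), Dfe(3, 6, "e3"), Dfe(4, 6, "e3"),
    Dfe(5, 1, "i"), Dfe(5, 5, "i"))
}
```

In this example the job for block 3 would read `e1` and write `e3`, while the job for block 6 only reads `e3`.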

Evaluation: Productivity
Use case: Ingestion Process

Oozie vs Workflow DSL Implementation (1 of 2)
Oozie implementation
- Specify two XML definitions, one for the main process and one for the subprocess
- Each XML definition contains the action nodes and decision nodes of the overall workflow
- The input and output directories of each subprocess are also defined manually in the XML definition
Figure: part of the Oozie implementation of the SubDirectory subprocess [Source: http://www.infoq.com/articles/oozieexample]

Oozie vs Workflow DSL Implementation (2 of 2)
Workflow DSL implementation
- Specify one workflow definition for both the main process and the subprocess
- Intuitive → e.g. the fork node in the main process can be replaced by a general while-style iteration
- The body of the iteration is the subprocess itself → conditional branching based on the directory information

  var temp = new Directories()
  var dirList = temp.get
  var i: Int = 0
  while (i < temp.getsize) {
    var dir = new DirInfo(dirList(i))
    var dirage = dir.getage
    var dirsize = dir.getsize
    if (if (dirage < 1) dirsize > 23 else dirsize > 0) {
      if (dirage > 6 || dirsize > 23) {
        var ingest = ingestfile(dir.getname)
        var archive = archivefile(dir.getname)
      } else {
        var reminder = sendreminder(dir.getname)
      }
    }
    i = i + 1
  }

Evaluation: Generality
- High-level declarative interface that currently targets only Stratosphere
- Deeply embedded in Scala: same syntax and semantics, with some restrictions
- A program written in our DSL could be compiled to other underlying platforms, e.g. Spark, which can understand the general-style if and while statements supported by our DSL

Logistic Regression in Spark & Workflow DSL
Spark:

  val data = spark.hdfsTextFile(...).map(readPoint).cache()
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) {
    var gradient = spark.accumulator(Vector.zeros(D))
    data.foreach(p => {
      val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
      gradient += scale * p.x
    })
    w -= gradient.value
  }
  println("final w: " + w)

Our workflow DSL:

  val data = spark.hdfsTextFile(...).map(readPoint).cache()
  var w = Vector.random(D)
  var i = 0
  while (i < ITERATIONS) {
    w -= data.map(p => {
      val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
      scale * p.x
    }).reduce(_+_)
    i = i + 1
  }
  println("final w: " + w)

Source: http://laser.inf.ethz.ch/2013/material/joseph/laser-joseph-6.pdf

Conclusion
- Defined a workflow DSL that enables programmers to implement their algorithms
  - Deeply embedded in Scala → avoids overhead for the programmer
- Generates a control-flow-enriched data flow and target code from the user program via static analysis of the program code
  - Static analysis of the Scala code detects the control flow and data dependencies
- Increases productivity compared to an implementation in an existing WMS (Oozie)
- Extensible to run on top of other frameworks

Future Work
- Extend the grammar of our DSL, e.g. with for-comprehensions
- Extend our DSL to other frameworks
  - Possible to generate the code or job scripts of the workflow for any execution framework
  - Run programs written in our DSL on multiple platforms

References
[Deelman et al, 2009] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528-540, 2009.
[Hollingsworth et al, 1993] David Hollingsworth. Workflow Management Coalition: The workflow reference model. Workflow Management Coalition, Hampshire, UK, 68, 1993.
[Ludäscher et al, 2005] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039-1065, 2006.
[Yu et al, 2005] Jia Yu and Rajkumar Buyya. A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3(3-4):171-200, 2005.
[Stocker, 2010] Mirko Stocker. Scala Refactoring. PhD thesis, HSR Hochschule für Technik Rapperswil, 2010.
[Lam et al, 2006] Monica Lam, Ravi Sethi, JD Ullman, and Alfred Aho. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006.
[Kelly, 2011] Peter M Kelly. Applying functional programming theory to the design of workflow engines. 2011.

[Ackermann et al, 2012] Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. Jet: An embedded DSL for high performance big data processing. In International Workshop on End-to-end Management of Big Data (BigData 2012), number EPFL-CONF-181673, 2012.
[Alexandrov et al, 2014] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, et al. The Stratosphere platform for big data analytics. The VLDB Journal, pages 1-26, 2014.
[Allen, 1970] Frances E Allen. Control flow analysis. In ACM SIGPLAN Notices, volume 5, pages 1-19. ACM, 1970.
[Ewen et al, 2012] Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. Spinning fast iterative data flows. Proceedings of the VLDB Endowment, 5(11):1268-1279, 2012.
[Burmako, 2013] Eugene Burmako. Scala macros: Let our powers combine!: On how rich syntax and static types work with metaprogramming. In Proceedings of the 4th Workshop on Scala, page 3. 2013.
[Islam et al, 2012] Mohammad Islam, Angelo K Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, and Alejandro Abdelnur. Oozie: Towards a scalable workflow management system for Hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, page 4. ACM, 2012.
[Crobak, 2012] http://www.crobak.org/2012/07/workflow-engines-for-hadoop