The Stratosphere Big Data Analytics Platform

Similar documents
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Apache Flink Next-gen data analysis. Kostas

Apache Flink. Fast and Reliable Large-Scale Data Processing

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Unified Big Data Analytics Pipeline. 连 城

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data looks Tiny from the Stratosphere

Spark: Cluster Computing with Working Sets

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Research in Berlin BBDC and Apache Flink

Hadoop IST 734 SS CHUNG

Scaling Out With Apache Spark. DTL Meeting Slides based on

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Unified Big Data Processing with Apache Spark. Matei

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data and Scripting Systems build on top of Hadoop

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

MapReduce (in the cloud)

Spark and the Big Data Library

Managing large clusters resources

Big Data and Scripting Systems beyond Hadoop

Challenges for Data Driven Systems

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

MLlib: Scalable Machine Learning on Spark

Beyond Hadoop with Apache Spark and BDAS

Apache Hadoop. Alexandru Costan

Ali Ghodsi Head of PM and Engineering Databricks

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

BIG DATA What it is and how to use?

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Large Scale Graph Processing with Apache Giraph

Workshop on Hadoop with Big Data

CIEL A universal execution engine for distributed data-flow computing

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Hadoop & Spark Using Amazon EMR

Spark: Making Big Data Interactive & Real-Time

Big Data Analytics. Lucas Rego Drumond

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

How Companies are! Using Spark

Spark. Fast, Interactive, Language- Integrated Cluster Computing

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data and Apache Hadoop s MapReduce

Distributed Filesystems

Apache Spark and Distributed Programming

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Energy Efficient MapReduce

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Hadoop and Map-Reduce. Swati Gore

Big Data With Hadoop

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Apache Hama Design Document v0.6

Moving From Hadoop to Spark

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Hadoop Design and k-means Clustering

Architectures for Big Data Analytics A database perspective

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Big Data Processing with Google s MapReduce. Alexandru Costan

Chapter 7. Using Hadoop Cluster and MapReduce

Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

Interactive data analytics drive insights

BIG DATA AND ANALYTICS

Brave New World: Hadoop vs. Spark

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Introduction to Parallel Programming and MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 2

Bringing Big Data Modelling into the Hands of Domain Experts

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

CSE-E5430 Scalable Cloud Computing Lecture 11

Hadoop Job Oriented Training Agenda

Map Reduce & Hadoop Recommended Text:

What is Analytic Infrastructure and Why Should You Care?

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

International Journal of Emerging Technology & Research

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Data Mining in the Swamp

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

HPC ABDS: The Case for an Integrating Apache Big Data Stack

How To Choose A Data Flow Pipeline From A Data Processing Platform

Big Data Analytics Hadoop and Spark

Big Data Analytics: Where is it Going and How Can it Be Taught at the Undergraduate Level?

Transcription:

The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44

Big Data small data big data Amir H. Payberah (SICS) Stratosphere June 4, 2014 2 / 44

Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand. Amir H. Payberah (SICS) Stratosphere June 4, 2014 3 / 44

Where Does Big Data Come From? Amir H. Payberah (SICS) Stratosphere June 4, 2014 4 / 44

Big Data Market Driving Factors The number of web pages indexed by Google, which were around one million in 1998, have exceeded one trillion in 2008, and its expansion is accelerated by appearance of the social networks. Mining big data: current status, and forecast to the future [Wei Fan et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, 2014 5 / 44

Big Data Market Driving Factors The amount of mobile data traffic is expected to grow to 10.8 Exabyte per month by 2016. Worldwide Big Data Technology and Services 2012-2015 Forecast [Dan Vesset et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, 2014 6 / 44

Big Data Market Driving Factors More than 65 billion devices were connected to the Internet by 2010, and this number will go up to 230 billion by 2020. The Internet of Things Is Coming [John Mahoney et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, 2014 7 / 44

Big Data Market Driving Factors Many companies are moving towards using Cloud services to access Big Data analytical tools. Amir H. Payberah (SICS) Stratosphere June 4, 2014 8 / 44

Big Data Market Driving Factors Open source communities Amir H. Payberah (SICS) Stratosphere June 4, 2014 9 / 44

Amir H. Payberah (SICS) Stratosphere June 4, 2014 10 / 44

Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, 2014 11 / 44

Hadoop Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, 2014 12 / 44

Spark Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, 2014 13 / 44

Stratosphere Big Data Analytics Stack * Under development ** Stratosphere community is thinking to integrate with Mahout Amir H. Payberah (SICS) Stratosphere June 4, 2014 14 / 44

Amir H. Payberah (SICS) Stratosphere June 4, 2014 15 / 44

What is Stratosphere? An efficient distributed general-purpose data analysis platform. Built on top of HDFS and YARN. Focusing on ease of programming. Amir H. Payberah (SICS) Stratosphere June 4, 2014 16 / 44

Project Status Research project started in 2009 by TU Berlin, HU Berlin, and HPI An Apache Incubator project 25 contributors v0.5 is released Amir H. Payberah (SICS) Stratosphere June 4, 2014 17 / 44

Motivation MapReduce programming model has not been designed for complex operations, e.g., data mining. Very expensive, i.e., always goes to disk and HDFS. Amir H. Payberah (SICS) Stratosphere June 4, 2014 18 / 44

Solution Extends MapReduce with more operators. Support for advanced data flow graphs. In-memory and out-of-core processing. Amir H. Payberah (SICS) Stratosphere June 4, 2014 19 / 44

Stratosphere Programming Model Amir H. Payberah (SICS) Stratosphere June 4, 2014 20 / 44

Stratosphere Programming Model A program is expressed as an arbitrary data flow consisting of transformations, sources and sinks. Amir H. Payberah (SICS) Stratosphere June 4, 2014 21 / 44

Transformations Higher-order functions that execute user-defined functions in parallel on the input data. Amir H. Payberah (SICS) Stratosphere June 4, 2014 22 / 44

Transformations Higher-order functions that execute user-defined functions in parallel on the input data. Amir H. Payberah (SICS) Stratosphere June 4, 2014 22 / 44

Transformations: Map All pairs are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, 2014 23 / 44

Transformations: Map All pairs are independently processed. val input: DataSet[(Int, String)] =... val mapped = input.flatmap { case (value, words) => words.split(" ") } Amir H. Payberah (SICS) Stratosphere June 4, 2014 23 / 44

Transformations: Reduce Pairs with identical key are grouped. Groups are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, 2014 24 / 44

Transformations: Reduce Pairs with identical key are grouped. Groups are independently processed. val input: DataSet[(String, Int)] =... val reduced = input.groupby(_._1).reducegroup(words => words.minby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, 2014 24 / 44

Transformations: Join Performs an equi-join on the key. Join candidates are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, 2014 25 / 44

Transformations: Join Performs an equi-join on the key. Join candidates are independently processed. val counts: DataSet[(String, Int)] =... val names: DataSet[(Int, String)] =... val join = counts.join(names).where(_._2).isequalto(_._1).map { case (l, r) => l._1 + "and" + r._2 } Amir H. Payberah (SICS) Stratosphere June 4, 2014 25 / 44

Transformations: and More... Groups each input on key Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Transformations: and More... Groups each input on key Builds a Cartesian Product Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Transformations: and More... Groups each input on key Builds a Cartesian Product Merges two or more input data sets, keeping duplicates Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

Example: Word Count val input = TextFile(textInput) val words = input.flatmap { line => line.split(" ") map(word => (word, 1)) } val counts = words.groupby(_._1).reduce((w1, w2) => (w1._1, w1._2 + w2._2)) val output = counts.write(wordsoutput, CsvOutputFormat()) Amir H. Payberah (SICS) Stratosphere June 4, 2014 27 / 44

Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) (1) Partitioned hash-join - Hash/Sort tables into N buckets on join key - Send buckets to parallel machines - Join every pair of buckets Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } (2) Broadcast hash-join - Broadcast the smaller table joined2 = small.join(joined1) - Avoids the network overhead.where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) (3) Grouping/Aggregation reuses the partitioning from step (1) - no shuffle Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

What about Iteration? Amir H. Payberah (SICS) Stratosphere June 4, 2014 29 / 44

Iterative Algorithms Algorithms that need iterations: Clustering, e.g., K-Means Gradient descent, e.g., Logistic Regression Graph algorithms, e.g., PageRank... Amir H. Payberah (SICS) Stratosphere June 4, 2014 30 / 44

Iteration Loop over the working data multiple times. Amir H. Payberah (SICS) Stratosphere June 4, 2014 31 / 44

Iteration Loop over the working data multiple times. Iterations with hadoop Slow: using HDFS Everything has to be read over and over again Amir H. Payberah (SICS) Stratosphere June 4, 2014 31 / 44

Types of Iterations Two types of iteration at stratosphere: Bulk iteration Delta iteration Both operators repeatedly invoke the step function on the current iteration state until a certain termination condition is reached. Amir H. Payberah (SICS) Stratosphere June 4, 2014 32 / 44

Bulk Iteration In each iteration, the step function consumes the entire input, and computes the next version of the partial solution. A new version of the entire model in each iteration. Amir H. Payberah (SICS) Stratosphere June 4, 2014 33 / 44

Bulk Iteration - Example // 1st 2nd 10th map(1) -> 2 map(2) -> 3... map(10) -> 11 map(2) -> 3 map(3) -> 4... map(11) -> 12 map(3) -> 4 map(4) -> 5... map(12) -> 13 map(4) -> 5 map(5) -> 6... map(13) -> 14 map(5) -> 6 map(6) -> 7... map(14) -> 15 Amir H. Payberah (SICS) Stratosphere June 4, 2014 34 / 44

Bulk Iteration - Example // 1st 2nd 10th map(1) -> 2 map(2) -> 3... map(10) -> 11 map(2) -> 3 map(3) -> 4... map(11) -> 12 map(3) -> 4 map(4) -> 5... map(12) -> 13 map(4) -> 5 map(5) -> 6... map(13) -> 14 map(5) -> 6 map(6) -> 7... map(14) -> 15 val input: DataSet[Int] =... def step(partial: DataSet[Int]) = partial.map(a => a + 1) val numiter = 10; val iter = input.iterate(numiter, step) Amir H. Payberah (SICS) Stratosphere June 4, 2014 34 / 44

Delta Iteration Only parts of the model change in each iteration. Amir H. Payberah (SICS) Stratosphere June 4, 2014 35 / 44

Delta Iteration - Example Amir H. Payberah (SICS) Stratosphere June 4, 2014 36 / 44

Delta Iteration vs. Bulk Iteration Computations performed in each iteration for connected communities of a social graph. Amir H. Payberah (SICS) Stratosphere June 4, 2014 37 / 44

Stratosphere Execution Engine Amir H. Payberah (SICS) Stratosphere June 4, 2014 38 / 44

Stratosphere Architecture Master-worker model Job Manager: handles job submission and scheduling. Task Manager: executes tasks received from JM. All operators start in-memory and gradually go out-of-core. Amir H. Payberah (SICS) Stratosphere June 4, 2014 39 / 44

Job Graph and Execution Graph Jobs are expressed as data flows. Job graphs are transformed into the execution graph. Execution graphs consist information to schedule and execute a job. Amir H. Payberah (SICS) Stratosphere June 4, 2014 40 / 44

Channels Channels transfer serialized records in buffers. Pipelined Online transfer to receiver. Materialized Sender writes result to disk, and afterwards it transferred to receiver. Used in recovery. Amir H. Payberah (SICS) Stratosphere June 4, 2014 41 / 44

Fault Tolerance Task failure compensated by backup task deployment. Track the execution graph back to the latest available result, and recompute the failed tasks. Amir H. Payberah (SICS) Stratosphere June 4, 2014 42 / 44

Amir H. Payberah (SICS) Stratosphere June 4, 2014 43 / 44

http://stratosphere.eu Amir H. Payberah (SICS) Stratosphere June 4, 2014 44 / 44