Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Similar documents
CSE-E5430 Scalable Cloud Computing Lecture 2

Apache Flink Next-gen data analysis. Kostas

Hadoop Ecosystem B Y R A H I M A.

Big Data and Apache Hadoop s MapReduce

Advanced Data Management Technologies

Challenges for Data Driven Systems

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop implementation of MapReduce computational model. Ján Vaňo

Managing large clusters resources

Open source large scale distributed data management with Google s MapReduce and Bigtable

Large scale processing using Hadoop. Ján Vaňo

Large-Scale Data Processing

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Open source Google-style large scale data analysis with Hadoop

Unified Big Data Analytics Pipeline. 连 城

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

MapReduce (in the cloud)

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Hadoop IST 734 SS CHUNG

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

CIEL A universal execution engine for distributed data-flow computing

How Companies are! Using Spark

Extending Hadoop beyond MapReduce

Unified Big Data Processing with Apache Spark. Matei

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Data processing goes big

Jeffrey D. Ullman slides. MapReduce for data intensive computing

How To Create A Data Visualization With Apache Spark And Zeppelin

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Scalable Cloud Computing

Big Data With Hadoop

Beyond Hadoop with Apache Spark and BDAS

BIG DATA What it is and how to use?

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

Hadoop and Map-Reduce. Swati Gore

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Dominik Wagenknecht Accenture

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Introduction to Hadoop

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

The Internet of Things and Big Data: Intro

Lecture 10 - Functional programming: Hadoop and MapReduce

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data Analysis and HADOOP

L1: Introduction to Hadoop

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Big Data Course Highlights

Real-time Big Data Analytics with Storm

Scaling Out With Apache Spark. DTL Meeting Slides based on

Big Data and Industrial Internet

Moving From Hadoop to Spark

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data Processing with Google s MapReduce. Alexandru Costan

COURSE CONTENT Big Data and Hadoop Training

Ali Ghodsi Head of PM and Engineering Databricks

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Big Data Analytics Hadoop and Spark

The Stratosphere Big Data Analytics Platform

Big Data and Scripting Systems build on top of Hadoop

Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Internals of Hadoop Application Framework and Distributed File System

Big Data Too Big To Ignore

Hadoop in the Enterprise

The Berkeley AMPLab - Collaborative Big Data Research

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Workshop on Hadoop with Big Data

MapReduce with Apache Hadoop Analysing Big Data

HiBench Introduction. Carson Wang Software & Services Group

Big Data Analytics. Lucas Rego Drumond

What is cloud computing?

CS 294: Big Data System Research: Trends and Challenges

A Brief Introduction to Apache Tez

Apache Hadoop Ecosystem

Shark Installation Guide Week 3 Report. Ankush Arora

Chase Wu New Jersey Ins0tute of Technology

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Transcription:

Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft Research 2008/09 2008 Technical Staff Member IBM Research & Adobe Systems Graduate student IIT Kanpur 2

Big Data Systems Collect data Process data Provide services E.g. Web- crawl, Click- streams, social network graph E.g. PageRank, clustering, machine learning algorithms E.g. search, recommendations, spam detection Raw data System Information In this course! 3

Big Data Systems Data Analytics Batch processing MapReduce Spark Stream processing D- Streams Graph processing Pregel/Giraph GraphX Query processing Pig or HIVE Data Management File- system GFS/HDFS Distributed database BigTable or Hbase Spanner Resource Management Cluster manager Mesos or YARN Co- ordination service ZooKeeper or Chubby

My approach Big Data Systems Application programmer System architect 5

My lectures Date Topic 19/10/15 MapReduce & Pig 26/10/15 GFS 02/11/15 BigTable 18/01/16 Spark 25/01/16 Pregel 01/02/16 Spanner 6

Data- Intensive Computing with MapReduce/Pig Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II, 2015

How much data? processes 20 PB a day (2008) crawls 20B web pages a day (2012) >100 PB of user data + 500 TB/day (8/2012) >10 PB data, 75B DB calls per day (6/2012) Distributed Systems! S3: 449B objects, peak 290k request/second (7/2011) 1T objects (6/2012)

Data- center Cluster of 100s of thousands of machines 9

In today s class How to easily write parallel applications on distributed computing systems? 1. MapReduce 2. Pig A high- level language built on top of MapReduce 10

Design challenges How to parallelize application logic? How to communicate? How to synchronize? How to perform load balancing? How to handle faults? How to schedule jobs? For each and every application! Design Implement Optimize Debug Maintain 11

The power of abstraction Application MapReduce Library Parallelization Fault- tolerance Communication Synchronization Load balancing Scheduling 12

MapReduce Programming model Programmer writes two methods: Map & Reduce Run- time library Takes care of everything else! 13

MapReduce programming model Inspired from functional programming Data- parallel application logic Programmer s interface: Map(key, value) à (key, value) Reduce(key, <value>) à (key,value) 14

MapReduce run- time system (InK1, InV1) (InK2, InV2) (InK3, InV3) (InK4, InV4) Input M M1 M M2 M M3 M M4 Map tasks (K1, V1) (K2, V2) (K1, V3) (K2, V4) Map outputs (K1, <V1,V3>) R R1 (K2, <V2,V4>) R R2 Reduce tasks Output (OK1, OV1) (OK2, OV2) Shuffle and sort 15

An example: word- count Input: Given a corpus of documents, such as Wikipedia Output: Count the frequency of each distinct word 16

MapReduce for word- count map(string key, string value) //key: document name //value: document contents for each word w in value EmitIntermediate(w, 1 ); reduce(string key, iterator values) //key: word //values: list of counts int results = 0; for each v in values result += ParseInt(v); Emit(key, AsString(result)); 17

Word- count example (Doc1, the ) (Doc2, for ) (Doc3, the ) (Doc4, for ) Input ( the, 1 ) M M1 ( for, 1 ) M M2 ( the, 1 ) M M3 ( for, 1 ) M M4 Map outputs Map tasks ( the, < 1, 1 >) R R1 ( for, < 1, 1 >) R R2 Reduce tasks Output ( the, 2 ) ( for, 2 ) Shuffle and sort 18

MapReduce software stack Master Distributed software MapReduce library Distributed file system: GFS/HDFS 19

MapReduce software stack Master Job tracker (MapReduce) Namenode (HDFS) Task tracker Task tracker Task tracker Task tracker Data node Data node Data node Data node 20

Runtime execution 21

Design challenges revisited Parallelization Communication Synchronization Load balancing Faults & semantics Scheduling 22

References MapReduce [OSDI 04] and YARN [SoCC 13] Original M/R, and the next generation of M/R Dryad [EuroSys 07] Generalized framework for data- parallel computations Spark [NSDI 12] In- memory distributed data parallel computing 23

Limitations of MapReduce Graph algorithms Pregel [SIGMOD 10], GraphX [OSDI 14] Iterative algorithms Haloop [VLDB 10], CIEL [NSDI 11] Stream processing Low latency D- stream [SOSP 13], Naiad [SOSP 13], Storm, S4 Low- level abstraction for common data analysis tasks! Pig [SIGMOD 10], Shark [SIGMOD 13], DryadLINQ [OSDI 08] 24

Motivation for Pig Programmers are lazy! (they don t even wish to write Map and Reduce) 25

Data analysis tasks Common operations: Filter, join, group- by, sort, etc. MapReduce offers a low- level primitive Requires repeated re- implementation of these operators The power of abstraction! Design once and reuse 26

Pig Latin Distributed dataflow queries Pig Latin = SQL- kind queries + Distributed execution 27

Pig architecture Pig Latin First MapReduce job M R M M R script Pipelined Pig compiler Second MapReduce job M R M M R MapReduce Runtime 28

Overview of the compilation process Pig compiler Logical plan Physical plan MapReduce plan 29

An example 30

Example: contd. 31

Example: contd. 32

Advantages of staged- compilation SQL query optimizations MapReduce specific optimizations Refer Pig papers for details [SIGMOD 08, VLDB 09] 33

Related systems Apache HIVE Built on top of MapReduce DryadLINQ [OSDI 08] or SCOPE [VLDB 08] Built on top of Dryad Shark [SIGMOD 13] Built on top of Spark 34

Summary Data- intensive computing with MapReduce Data- parallel programming model Runtime library to handle all low- level details Pig: high- level abstraction for common tasks Resources: Hadoop: http://hadoop.apache.org/ Spark: https://spark.apache.org/ Dryad: http://research.microsoft.com/en- us/projects/dryad/ 35

Thanks! pramod.bhatotia@tu- dresden.de 36