From GWS to MapReduce: Google's Cloud Technology in the Early Days


Large-Scale Distributed Systems
From GWS to MapReduce: Google's Cloud Technology in the Early Days
Part II: MapReduce in a Datacenter
COMP6511A Spring 2014, HKUST
Lin Gu (lingu@ieee.org)

MapReduce/Hadoop

Around 2004, Google invented MapReduce to parallelize computation over large data sets; it has been a key component of Google's technology foundation ever since. Around 2008, Yahoo! developed Hadoop, the open-source variant of MapReduce. After 2008, MapReduce/Hadoop became a key technology component in cloud computing. In 2010, the U.S. granted the MapReduce patent to Google.

Data-Intensive Computation: The MapReduce Approach

MapReduce: parallel computing for Web-scale data.

Map is a higher-order function: it applies a given function to every element of a list, and the result is a new list. For example, squaring each element of (1 2 3 4 5) yields (1 4 9 16 25):

    (map (lambda (x) (* x x)) '(1 2 3 4 5))  =>  '(1 4 9 16 25)
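The same map in Python, for comparison (a minimal sketch of the Scheme call above):

    # Apply a function to every element of a list, producing a new list.
    squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))
    print(squares)  # [1, 4, 9, 16, 25]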

Data-Intensive Computation: The MapReduce Approach

Reduce is also a higher-order function, like fold: it aggregates the elements of a list. An accumulator is set to an initial value; the function is applied to each list element and the accumulator, and the result is stored back in the accumulator. This repeats for every item in the list, and the result is the final value of the accumulator. Folding + with initial value 0 over (1 2 3 4 5) produces the running sums 0, 1, 3, 6, 10, 15:

    (fold + 0 '(1 2 3 4 5))  =>  15
    (fold * 1 '(1 2 3 4 5))  =>  120
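The same folds in Python, using functools.reduce (a minimal sketch mirroring the two Scheme calls above):

    from functools import reduce

    # Fold + over the list, accumulator initialized to 0.
    print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))  # 15
    # Fold * over the list, accumulator initialized to 1.
    print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))  # 120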

MapReduce Programming: The MapReduce Approach

Massively parallel processing made simple. Example: word count. Map parses a document and generates <word, 1> pairs; Reduce receives all pairs for a specific word and counts (sums) them.

    Map:
      // D is a document
      for each word w in D
        output <w, 1>

    Reduce:
      // invoked once for each key w
      count = 0
      for each input item
        count = count + 1
      output <w, count>
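As a sanity check, here is a tiny single-machine Python sketch of the same logic (illustrative only; a real MapReduce library handles input splitting, shuffling, and fault tolerance):

    from collections import defaultdict

    def map_fn(document):
        # Emit a <word, 1> pair for every word in the document.
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all the 1s emitted for this word.
        return (word, sum(counts))

    docs = ["the quick brown fox", "the lazy dog"]
    groups = defaultdict(list)            # stands in for the shuffle step
    for d in docs:
        for w, one in map_fn(d):
            groups[w].append(one)
    print([reduce_fn(w, c) for w, c in groups.items()])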

Workflow

1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 megabytes (MB) each. It then starts up many copies of the program on a cluster of machines.

2. One of the copies of the program is the master; the rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

Workflow

3. A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value (KV) pairs produced by the Map function are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.

Workflow

5. A reduce worker processes one partition of the data. When a reduce worker r is notified by the master about the locations of intermediate data, r uses RPCs to read the data in its partition from the local disks of the map workers. After all intermediate data are read, r sorts and groups them by the intermediate keys.

6. Each reduce worker iterates over the sorted intermediate KV pairs in its partition and, for each unique intermediate key encountered, passes the key and the associated set of values to the Reduce function. The output of the Reduce function is appended to a temporary output file on GFS.

7. A completed reduce task renames its temporary output file to the output file of its partition. When all tasks have completed, MapReduce returns to the user code.
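To make steps 1 through 7 concrete, the following toy single-process Python simulation mimics the library's side of the flow: hash-partitioning intermediate pairs into R regions, then sorting and grouping within each partition. All names and inputs are illustrative; a real implementation distributes this across machines and handles failures.

    # Toy simulation of the MapReduce execution flow (single process).
    M, R = 3, 2                              # M map tasks, R reduce tasks
    splits = ["a b a", "b c", "a c c"]       # M input splits (step 1)

    # Step 3: each map task emits intermediate KV pairs.
    def map_task(split):
        return [(w, 1) for w in split.split()]

    # Step 4: partition each map task's output into R regions by key hash.
    regions = [[] for _ in range(R)]
    for split in splits:
        for k, v in map_task(split):
            regions[hash(k) % R].append((k, v))

    # Steps 5-6: each reduce task sorts its partition, groups by key, reduces.
    output = {}
    for r in range(R):
        for k, v in sorted(regions[r]):
            output[k] = output.get(k, 0) + v
    print(output)  # e.g. {'a': 3, 'b': 2, 'c': 3}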

Programming

How do we write a MapReduce program to generate inverted indices? To sort? How do we express more sophisticated logic? What happens if some workers (slaves) fail, or the master fails? One way to express an inverted index is sketched below.
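An inverted index fits the model naturally: Map emits <word, documentID> pairs, and Reduce collects, for each word, the sorted list of documents containing it. A minimal single-machine Python sketch (the document set and helper names are illustrative):

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit <word, doc_id> for every distinct word in the document.
        for word in set(text.split()):
            yield (word, doc_id)

    def reduce_fn(word, doc_ids):
        # Posting list for this word: sorted, de-duplicated document IDs.
        return (word, sorted(set(doc_ids)))

    docs = {1: "big data systems", 2: "big clusters"}
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for w, d in map_fn(doc_id, text):
            groups[w].append(d)
    print(dict(reduce_fn(w, ds) for w, ds in groups.items()))

Sorting can likewise be expressed by mapping each record to its sort key: because the framework delivers each reduce partition in key order, a range-based partitioning function yields globally sorted output across partitions.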

MapReduce Limitations

MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient and best solution to program execution in datacenters?

MapReduce has its discontents. DeWitt and Stonebraker ("MapReduce: A major step backwards") argue that MapReduce is far less sophisticated and efficient than parallel query processing. But MapReduce is a parallel processing framework, not a database system or a query language, and it is possible to use MapReduce to implement some parallel query processing functions.

What are the real limitations?
- Inefficient for general programming (and not designed for that)
- Hard to handle data with complex dependences, frequent updates, etc.
- High overhead, bursty I/O, and difficulty handling long streaming data
- Limited opportunity for optimization

Problems of MapReduce/Hadoop

Slow: cannot support low-latency, interactive, or real-time programs.

Very slow if the algorithm:
- processes data in a graph model
- contains non-trivial dependences among operations
- walks through multiple iterations
- follows a complex data/control flow

Not programmable if the algorithm invokes recursion of the Map and Reduce steps; this requires an external glue language and takes unpredictable execution time.

Computing Capability

Z. Ma and L. Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010.

Compilation time in seconds (as read from the chart):

                    gcc (one node)   mrcc/hadoop   mrcc/mrlite
    Linux kernel    2936             9044          1419
    ImageMagick     506              653           312
    Xen tools       50               128           65

Using Hadoop, parallel compilation is even slower than sequential compilation!

More Solutions Coming

Dryad/DryadLINQ, Megastore/Spanner, Pregel, OpenCL, RAMCloud. The proliferation of solutions in this area, however, indicates that there is not yet a solid, general-purpose technology for cloud systems. Challenges remain in constructing sophisticated big-data processing capability, wide-area computation, and very large database systems.

Spark/RDD

RDD: Resilient Distributed Datasets. Computation happens in memory, and operators transform the data; Spark is up to 20 times faster than Hadoop on some iterative workloads.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. of NSDI '12.

RDD is not a general-purpose instrument for big-data computation, and we still need empirical data on larger-scale workloads to understand its scalability. Nonetheless, in-memory computation will be increasingly important in big-data processing.
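For flavor, a word count in Spark's Python API looks like the following (a sketch assuming a running PySpark environment and an input file at the hypothetical path data.txt). The cache() call keeps the RDD in memory, which is what lets iterative workloads avoid re-reading from disk:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    lines = sc.textFile("data.txt")              # hypothetical input path
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .cache())                     # keep the RDD in memory
    print(counts.take(10))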

Inside MapReduce-Style Computation

Network activities under a MapReduce/Hadoop workload (Hadoop: the open-source implementation of MapReduce). Setup: processing 116.8 GB of input data with 3 servers (20 cores); network activities captured with Xen virtual machines.

MapReduce Workflow

- Initial data is split into 64 MB blocks
- Blocks are computed and the results stored locally
- The master is informed of the result locations
- R reducers retrieve data from the mappers
- Final output is written

The process is communication intensive.

Inside MapReduce

Packet reception under a MapReduce/Hadoop workload shows large data volume and bursty network traffic; this behavior is widely observed in MapReduce workloads.

(Figure: packet reception on a slave server)

Inside MapReduce

(Figure: packet reception on the master server)

Inside MapReduce

(Figure: packet transmission on the master server)

Datacenter Networking

A common datacenter topology, from top to bottom: the Internet; the core layer (layer-3 routers); the aggregation layer (layer-2/3 switches); the access layer (layer-2 switches); and the servers.