Big Data Processing in the Cloud. Shadi Ibrahim, Inria, Rennes - Bretagne Atlantique Research Center



Data is ONLY as useful as the decisions it enables

What is needed? Efficient processing of Big Data is constrained by: available computation/storage power » Cloud computing allows users to lease computing and storage resources in a Pay-As-You-Go manner; and the programming model » MapReduce offers a simple yet scalable model.

Cloud Computing

New Buzzword

Cloud Computing: Any situation in which computing is done in a remote location (out in the clouds), rather than on your desktop or portable device. You tap into that computing power over an Internet connection. Source: http://www.businessweek.com

Cloud Premise Ubiquity of access » all you need is a browser. Ease of management » no configuration or backup is needed from the user. Less investment » an affordable enterprise solution, deployed on a pay-per-use basis for the hardware, with systems software provided by the cloud providers.

Cloud Characteristics

Cloud services are delivered through data centers that are built on computer and storage virtualization technologies.
Abstraction of the hardware from the service; abstraction of the software from the service.
Services are hosted in the Cloud, somewhere.
Pay as you go, with no-lock access to Cloud resources.
Cloud resources can be provisioned and scaled On Demand, using web-based and programmatic interfaces which hide all the complexity of the system and make it transparent to the users.

Cloud Challenges

Data Governance: Enterprise data is sensitive (it requires access monitoring and protection), and enterprises want full capabilities to govern it. Cloud providers guarantee that customer data are safe and that access to data is restricted and protected. More challenges: wiretapping? The Patriot Act? Privacy and geographic boundaries?

Manageability: Management issues include auto-scaling and load balancing. Recent cloud providers (infrastructures and platforms) do not have great management capabilities, so these need to be built on top of the existing cloud infrastructures/platforms. RightScale is one of the early pioneers in this space.

Monitoring: Monitoring, whether it is for performance or availability, is critical to any IT shop. It should not cover only CPU or memory, but also the performance of transactions, disk I/O, and more; CPU and memory usage are misleading most of the time in virtual environments. Hyperic's CloudStatus was one of the first to recognize this issue and develop a solution for it. RightScale's solution can also provide monitoring for the virtual machines under its management.

Reliability & Availability: There are almost no clearly defined SLAs provided by cloud providers today. Can you imagine enterprises signing cloud computing contracts without clearly defined SLAs?

Virtualization Security: Many IT people still believe that the hypervisor and virtual machines are safe. Recent presentations at Black Hat have demonstrated that we shouldn't sleep so tight at night. What if regulated data is migrated to an unprotected rack? What about a machine image that has been lying in some obscure location in the cloud: can you be sure it is secure when it reappears on the network? Does it have the latest patches?

Cloud Types Cloud types and services, H. Jin, S. Ibrahim, T. Bell, W. Gao, D. Huang, S. Wu, Handbook of Cloud Computing, 2010 42

Cloud roles and services Cloud types and services, H. Jin, S. Ibrahim, T. Bell, W. Gao, D. Huang, S. Wu, Handbook of Cloud Computing, 2010 43

Cloud roles and services IaaS PaaS SaaS Cloud types and services, H. Jin, S. Ibrahim, T. Bell, W. Gao, D. Huang, S. Wu, Handbook of Cloud Computing, 2010 44

Amazon Web Services 45

AWS Family Cloud types and services, H. Jin, S. Ibrahim, T. Bell, W. Gao, D. Huang, S. Wu, Handbook of Cloud Computing, 2010 46

Amazon Consumers 47

NYTimes made its articles from 1851-1922 available to the public online, a total of 11 million articles. They sometimes had to take several TIFF images and scale and glue them together to create one PDF version of an article. They used 100 EC2 instances to complete the job in just under 24 hours. They started with 4 TB of data uploaded into S3, and the conversion process created another 1.5 TB.

Case Study 315,000 paying customers already, and 288,000,000 photos 49

Animoto was regularly using around 40 virtual machines, hosted in a cloud infrastructure. With the almost instant success that Facebook's organic growth provided, Animoto needed to scale up to an astounding 5,000 servers in four days. 50

Cloud Computing 51

The Google File System (GFS), adapted from Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, SOSP '03

Operating Environment Component Failure» It's the norm, not the exception» Expect: Application, OS bugs Human error Hardware failure: Disks Memory Power Supply» Need for: Constant monitoring Error detection Fault tolerance Automatic Recovery The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03

Non- standard data needs Files are huge» Typically multi- GB» Data sets can be many TB» Cannot deal with typical file sizes Would mean billions of KB sized files- I/O problems!» Must rethink block sizes The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Non- standard writing File mutation» Data is often appended, rather than overwritten» Once a file is written it is read only, typically read only sequentially» Optimization focuses on appending to large files and atomicity, not on caching The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Design Assumptions Built from cheap commodity hardware Expect large files: 100MB to many GB Support large streaming reads and small random reads Support large, sequential file appends Support producer- consumer queues for many- way merging and file atomicity Sustain high bandwidth by writing data in bulk The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

GFS Architecture 1 master, at least 1 chunkserver, many clients The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Master Responsibilities» Maintain metadata: information like namespace, access control info, mappings from files to chunks, chunk locations» Activity control Chunk lease management Garbage collection Chunk migration» Communication with chunks Heartbeats: periodic syncing with chunkservers» Lack of caching Block size too big (64 MB) Applications stream data (Clients will cache metadata, though) The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Chunks Identified by 64 bit chunk handles» Globally unique and assigned by master at chunk creation time Triple redundancy: copies of chunks kept on multiple chunkservers Chunk size: 64 MB!» Plain Linux file» Advantages Reduces R/W & communication between clients & master Reduces network overhead- persistent TCP connection Reduces size of metadata on master» Disadvantage Hotspots- small files of just one chunk that many clients need The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03
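To make the 64 MB chunk size concrete, here is a minimal sketch (plain Java, not GFS client code; the class and method names are illustrative) of how a client-side library could map a byte offset in a file to a chunk index and an offset within that chunk:

```java
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB, the GFS chunk size

    // Which chunk of the file holds this byte?
    static long chunkIndex(long fileOffset) {
        return fileOffset / CHUNK_SIZE;
    }

    // Where inside that chunk does the byte live?
    static long offsetInChunk(long fileOffset) {
        return fileOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;                              // byte at the 200 MB mark
        System.out.println("chunk index: " + chunkIndex(offset));     // 3
        System.out.println("offset in chunk: " + offsetInChunk(offset) + " bytes"); // 8 MB
    }
}
```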

Metadata 3 main types to keep in memory» File and chunk namespaces» Mappings from files to chunks» Locations of each chunk's replicas First two are also kept persistently in log files to ensure reliability and recoverability Chunk locations are held by the chunkservers; the master polls them for locations with a heartbeat occasionally (and on startup) The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03
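As a rough illustration (hypothetical Java types, not Google's implementation), the master could hold these three kinds of metadata roughly as follows, with only the first two backed by the persistent operation log:

```java
import java.util.*;

public class MasterMetadata {
    // 1. File and chunk namespaces, 2. file path -> ordered list of chunk handles
    //    (both of these are also persisted in the operation log)
    Map<String, List<Long>> fileToChunks = new HashMap<>();

    // 3. chunk handle -> chunkservers currently holding a replica
    //    (NOT persisted: rebuilt from chunkserver reports at startup and via heartbeats)
    Map<Long, Set<String>> chunkLocations = new HashMap<>();

    // Heartbeat/poll handler: refresh what this chunkserver claims to hold.
    void onChunkReport(String chunkserver, List<Long> heldChunks) {
        for (long handle : heldChunks) {
            chunkLocations.computeIfAbsent(handle, h -> new HashSet<>()).add(chunkserver);
        }
    }
}
```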

System Operations Leases» Mutations are done at all of a chunk's replicas» Master grants a lease to a primary chunk replica, which then decides the serial order in which to update the replicas» Timeout of 60 seconds, which can be extended with the help of heartbeats from the master The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03

Write Control & Data Flow 1. Client asks master for chunkserver 2. Master replies with primary & secondaries (client caches) 3. Client pushes data to all 4. Client sends write to primary 5. Primary forwards serially 6. Secondaries return completed write 7. Primary replies to client The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03

Decoupling Data & flow control are separated» Use network efficiently- pipeline data linearly along chain of chunkservers» Goals Fully utilize each machine s network bandwidth Avoid network bottlenecks & high- latency» Pipeline is carefully chosen Assumes distances determined by IP addresses The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Master operation Executes all namespace operations Manages chunk replicas throughout system» Placement decisions» Creation» Replication Load balancing chunkserver Reclaim unused storage throughout system The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Replica Placement Issues Worries» Hundreds of chunkservers spread over many machine racks» Hundreds of clients accessing from any rack» Aggregate bandwidth of rack < aggregate bandwidth of all machines on rack Want to distribute replicas to maximize data» Scalability» Reliability» Availability The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Where to Put a Chunk Considerations» It is desirable to put chunk on chunkserver with below- average disk space utilization» Limit the number of recent creations on each chunkserver- avoid heavy write traffic» Spread chunks across multiple racks The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03
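A hedged sketch of that placement heuristic (illustrative names and thresholds, not the actual GFS code): filter out chunkservers with above-average disk usage or too many recent creations, then pick replicas on distinct racks.

```java
import java.util.*;

public class ChunkPlacement {
    static class Chunkserver {
        final String id, rack;
        final double diskUtilization;   // fraction of disk in use
        final int recentCreations;      // chunks created here recently
        Chunkserver(String id, String rack, double util, int recent) {
            this.id = id; this.rack = rack;
            this.diskUtilization = util; this.recentCreations = recent;
        }
    }

    static List<Chunkserver> pickReplicas(List<Chunkserver> servers,
                                          int replicas, int maxRecentCreations) {
        double avgUtil = servers.stream()
                .mapToDouble(s -> s.diskUtilization).average().orElse(0.0);
        List<Chunkserver> candidates = new ArrayList<>();
        for (Chunkserver s : servers) {
            // below-average disk usage, and not a recent write hotspot
            if (s.diskUtilization <= avgUtil && s.recentCreations < maxRecentCreations) {
                candidates.add(s);
            }
        }
        candidates.sort(Comparator.comparingDouble(s -> s.diskUtilization));
        List<Chunkserver> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        for (Chunkserver s : candidates) {
            if (chosen.size() == replicas) break;
            if (usedRacks.add(s.rack)) chosen.add(s);   // spread replicas across racks
        }
        return chosen;
    }
}
```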

Re- replication & Rebalancing Re- replication occurs when the number of available chunk replicas falls below a user- defined limit» When can this occur? Chunkserver becomes unavailable Corrupted data in a chunk Disk error Increased replication limit Rebalancing is done periodically by master» Master examines current replica distribution and moves replicas to even disk space usage» Gradual nature avoids swamping a chunkserver The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Collecting Garbage Lazy» Update master log immediately, but» Do not reclaim resources for a bit (lazy part)» Chunk is first renamed, not removed» Periodic scans remove renamed chunks more than a few days old (user- defined) Orphans» Chunks that exist on chunkservers that the master has no metadata for» Chunkserver can freely delete these The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03
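The lazy scheme above can be pictured with a small sketch (hypothetical structures, not GFS code): deletion only renames the file to a hidden name and records the time, and a periodic background scan reclaims hidden files older than the grace period.

```java
import java.time.*;
import java.util.*;

public class LazyDeletion {
    static final Duration GRACE_PERIOD = Duration.ofDays(3); // "a few days", user-tunable

    // hidden (renamed) file -> when it was deleted
    private final Map<String, Instant> hiddenFiles = new HashMap<>();

    // Deletion is just a rename plus a log entry; nothing is reclaimed yet.
    void delete(String path) {
        hiddenFiles.put(".deleted/" + path + "." + Instant.now().toEpochMilli(),
                        Instant.now());
    }

    // Periodic background scan: reclaim hidden files older than the grace period.
    void periodicScan() {
        Instant cutoff = Instant.now().minus(GRACE_PERIOD);
        hiddenFiles.entrySet().removeIf(e -> e.getValue().isBefore(cutoff));
    }
}
```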

Lazy Garbage Pros/Cons Pros» Simple, reliable» Makes storage reclamation a background activity of master- done in batches when the master has the time (lets master focus on responding to clients, putting speed at priority)» A simple safety net against accidental, irreversible deletion Con» Delay in deletion can make it harder to fine tune storage in tight situations The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Fault Tolerance Cannot trust hardware Goal is high availability. How?» Fast recovery» Replication Fast Recovery» Master & chunkservers can restore state and restart in a matter of seconds regardless of failure type» No distinction between failure types at all Chunk Replication» Default to keep replicas on at least 3 different racks The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Scalability Scales with available machines, subject to bandwidth» Rack- and datacenter- aware locality and replica creation policies help Single master has limited responsibility, does not rate- limit system The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Security Basically none Relies on Google s network being private File permissions not mentioned in paper» Individual users / applications must cooperate The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Deployment in Google 50+ GFS clusters Each with thousands of storage nodes Managing petabytes of data GFS is under BigTable, etc. The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Conclusion GFS demonstrates how to support large- scale processing workloads on commodity hardware» design to tolerate frequent component failures» optimize for huge files that are mostly appended and read» feel free to relax and extend FS interface as required» go for simple solutions (e.g., single master) GFS has met Google s storage needs it must be good! The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung,SOSP '03

Google MapReduce Adapted from a Presentation by Walfredo Cirne (Google) "Google Infrastructure for Massive Parallel Processing", Presentation in the industrial track in CCGrid'2007. 77

MapReduce Motivation Introduced by Google in 2004 Big Data @ Google:» 20+ billion web pages x 20KB = 400+ TB One computer can read 30-35 MB/sec from disk» ~4 months to read the web» ~1,000 hard drives just to store the web Even more time per hard drive is needed to actually do something with the data (e.g., data processing) "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.
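A quick back-of-the-envelope check of those figures (assumed, rounded numbers; the point is the order of magnitude, not the exact value):

```java
public class ReadTheWeb {
    public static void main(String[] args) {
        double webBytes = 20e9 * 20e3;   // 20+ billion pages x 20 KB ~= 400 TB
        double diskRate = 35e6;          // one disk streams roughly 30-35 MB/s
        double seconds = webBytes / diskRate;
        System.out.printf("one disk: ~%.0f days (~%.1f months)%n",
                seconds / 86_400, seconds / (86_400 * 30));
        System.out.printf("1,000 disks in parallel: ~%.1f hours%n",
                seconds / 1_000 / 3_600);
    }
}
```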

Solution Spread the work over many machines Good news: easy parallelisation» Reading the web with 1,000 machines takes less than 3 hours Bad news: programming work» Communication and coordination» Debugging» Fault tolerance» Management and monitoring» Optimization Worse news: repeat for every problem you want to solve "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

And the Problem Size is Ever Growing Every Google service sees continuing growth in computational needs More queries» More users, happier users More data» Bigger web, mailbox, blog, etc. Better results» Find the right information, and find it faster! "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 80

Conclusion: Infrastructure is a Real Challenge At Google's scale, building and running a computing infrastructure that is efficient, cost-effective, and easy to use is one of the most challenging technical problems! "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

Typical Computer Multicore machine» ~ 1-2 TB of disk» ~ 4GB- 16GB of RAM Typical machine runs:» Google File System (GFS)» Scheduler daemon for starting user tasks» One or many user tasks Tens of thousands of such machines Problem : What programming model to use as a basis for scalable parallel processing? "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 82

What Is Needed? A simple programming model that applies to many data- intensive computing problems Approach: hide messy details in a runtime library:» Automatic parallelization» Load balancing» Network and disk transfer optimization» Handling of machine failures» Robustness» Improvements to core library benefit all users of library! "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 83

Such a Model is MapReduce! Typical problem solved by MapReduce» Read a lot of data» Map: extract something you care about from each record» Shuffle and Sort» Reduce: aggregate, summarize, filter, or transform» Write the results Outline stays the same, map and reduce change to fit the problem "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 84

More Specifically It is inspired by the Map and Reduce functions (i.e., it borrows from functional programming) Users implement an interface of two primary functions» map(k, v) → <k', v'>*» reduce(k', <v'>*) → <k', v''>* All v' with the same k' are reduced together, and processed in v' order "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.
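The two signatures can be written down as a minimal Java sketch (generic interfaces for illustration only; the real Google library is a C++ framework that also handles splitting, shuffling, and fault tolerance):

```java
import java.util.List;
import java.util.function.BiConsumer;

// map(k, v) -> list of (k', v') pairs
interface MapFn<K, V, K2, V2> {
    void map(K key, V value, BiConsumer<K2, V2> emit);
}

// reduce(k', list of v') -> list of (k', v'') pairs (often a single value per key)
interface ReduceFn<K2, V2, V3> {
    void reduce(K2 key, List<V2> values, BiConsumer<K2, V3> emit);
}
```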

More Specifically [diagram: Input → Map phase → Shuffle → Reduce phase → Output] map(k, v) → <k', v'>* Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line) "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

More Specifically [diagram: Input → Map phase → Shuffle → Reduce phase → Output] After the map phase is over, all the intermediate values for a given output key are combined together into a list "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

More Specifically [diagram: Input → Map phase → Shuffle → Reduce phase → Output] reduce(k', <v'>*) → <k', v''>* reduce() combines those intermediate values into one or more final values per key (usually only one) "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

MapReduce:Scheduling One master, many workers» Input data split into M map tasks (typically 64 MB in size)» Reduce phase partitioned into R reduce tasks» Tasks are assigned to workers dynamically» Per worker input size = GFS chunk size!» Often: M=200,000; R=4,000; workers=2,000 "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 89

MapReduce:Scheduling Master assigns each map task to a free worker» Considers locality of data to worker when assigning task» Worker reads task input (often from local disk!)» Worker produces R local files containing intermediate k/v pairs Master assigns each reduce task to a free worker» Worker reads intermediate k/v pairs from map workers» Worker sorts & applies user s Reduce op to produce the output "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 90
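The "R local files" part relies on a partition function that routes every intermediate key to one of the R reduce tasks; hash(key) mod R is the usual default in MapReduce-style systems (a sketch, not Google's code):

```java
public class Partitioner {
    // Route an intermediate key to one of the R reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), numReduceTasks);
    }

    public static void main(String[] args) {
        int r = 4_000; // "often R = 4,000"
        System.out.println("'hadoop' -> reduce task " + partitionFor("hadoop", r));
    }
}
```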

MapReduce:Features Exploit Parallelization:» Tasks are executed in Parallel Fault tolerance» Re- execute only the tasks on Failed machines Exploit data locality» Co- locate data and computation power therefore avoid network bottleneck "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 91

Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: the reduce phase can't start until the map phase is completely finished. "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007.

Fault Tolerance On worker failure» Detects failure via periodic heartbeats» Re- executes completed & in- progress map() tasks» Re- executes in- progress reduce() tasks Master notices particular input key/values that cause crashes in map(), and skips those values on re- execution. "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 93

Backup Task Slow workers significantly lengthen completion time» Other jobs consuming resources on machine» Bad disks with soft errors transfer data very slowly» Weird things: processor caches disabled (!!) Solution: Near end of phase, spawn backup copies of tasks Whichever one finishes first "wins" Effect: Dramatically shortens job completion time 94

Dealing with failures 95

Locality optimization Master scheduling policy:» Asks GFS for locations of replicas of input file blocks» Map input is typically split into 64MB blocks (== GFS block size)» Map tasks scheduled so a GFS input block replica is on the same machine or the same rack Effect: Thousands of machines read input at local disk speed Without this, rack switches limit the read rate
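The scheduling preference can be summarized as a tiny scoring function (illustrative only): a worker that holds a replica of the input block locally beats one in the same rack, which beats a remote read.

```java
import java.util.Set;

public class Locality {
    // Higher score = cheaper input read for this map task on this worker.
    static int localityScore(String workerHost, String workerRack,
                             Set<String> replicaHosts, Set<String> replicaRacks) {
        if (replicaHosts.contains(workerHost)) return 2; // data-local: read from local disk
        if (replicaRacks.contains(workerRack)) return 1; // rack-local: stay under one switch
        return 0;                                        // remote: crosses rack switches
    }
}
```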

Actual Google MapReduce Example is written in pseudo- code Actual implementation is in C++, using a MapReduce library Bindings for Python and Java exist via interfaces True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.) "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 97

Widely Applicable at Google distributed grep distributed sort term- vector per host document clustering machine learning web access log stats web link- graph reversal inverted index construction statistical machine translation... "Google Infrastructure for Massive Parallel Processing", Walfredo Cirne, Presentation in the industrial track in CCGrid'2007. 98

The Colored Squares Counter [figure walkthrough: the input is split; each split goes through map; the intermediate pairs are shuffled/sorted by color; the reducers then emit one count per color, here 7, 7, and 4]

Word Count Example 106
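A self-contained, in-memory rendering of word count in the MapReduce style (a didactic Java sketch, not the Google or Hadoop API): map emits (word, 1) for every word, the shuffle groups the 1s by word, and reduce sums them.

```java
import java.util.*;
import java.util.stream.*;

public class WordCount {
    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog", "the fox");

        // Map phase: each input line -> a stream of (word, 1) pairs
        List<Map.Entry<String, Integer>> intermediate = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffle/sort: group all intermediate values by key (the word)
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: sum the 1s for each word and emit (word, count)
        grouped.forEach((word, ones) -> System.out.println(
                word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
        // prints: brown 1, dog 1, fox 2, lazy 1, quick 1, the 3
    }
}
```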

MapReduce: Implementations 107


Acknowledgments Walfredo Cirne (Google) Sanjay Ghemawat (Google) Howard Gobioff (Google) Shun-Tak Leung (Google)

Thank you! Shadi Ibrahim shadi.ibrahim@inria.fr 110