MapReduce (in the cloud)




How to painlessly process terabytes of data
by Irina Gordei

Presentation Outline
- What is MapReduce?
- Example
- How it works
- MapReduce in the cloud
- Conclusion
- Demo

Motivation: Large-Scale Data Processing

Process many terabytes of raw data:
- documents found by a web crawl
- web request logs

Produce various kinds of derived data:
- inverted indices
- various representations of the graph structure of documents
- summaries of the number of pages crawled per host
- the most frequent queries in a given day

Problem characteristics:
- Input data is large
- Computations need to be parallelized (across thousands of CPUs)
- Data needs to be distributed
- Failures need to be handled

Solution: MapReduce Programming Model

- Restricted parallel programming model meant for large clusters
  - User implements Map() and Reduce()
- Parallel computing framework
  - Libraries take care of EVERYTHING else:
    - Parallelization
    - Fault tolerance
    - Data distribution
    - Load balancing
- Useful model for many practical tasks

Map and Reduce

Functions borrowed from functional programming languages (e.g. Lisp).
- Map(): processes a key/value pair to generate a list of intermediate key/value pairs
- Reduce(): merges all intermediate values associated with the same key

Types:
- map (k1, v1) -> list(k2, v2)
- reduce (k2, list(v2)) -> list(v2)
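To make the type signatures concrete, here is a minimal single-process sketch in Python. It is not how a real MapReduce library is built: the driver `run_mapreduce` and the functions `index_map` and `index_reduce` are hypothetical names chosen for illustration, and the grouping step that a real framework performs across machines is simulated with an in-memory dictionary. The example builds a small inverted index, one of the derived data sets mentioned in the motivation.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: map (k1, v1) -> list(k2, v2), then reduce (k2, list(v2)) -> list(v2)."""
    # Map phase: collect every intermediate (k2, v2) pair, grouped by k2.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: call reduce_fn once per intermediate key, in sorted key order.
    return {k2: list(reduce_fn(k2, values)) for k2, values in sorted(groups.items())}


def index_map(doc_name, text):
    # k1 = document name, v1 = document contents; emit (word, document name).
    for word in text.lower().split():
        yield word, doc_name


def index_reduce(word, doc_names):
    # Merge all document names seen for this word into a sorted, duplicate-free list.
    yield from sorted(set(doc_names))


docs = [("a.txt", "hello world"), ("b.txt", "hello cloud")]
index = run_mapreduce(docs, index_map, index_reduce)
print(index)
```

The user-supplied part is only `index_map` and `index_reduce`; everything in `run_mapreduce` stands in for what the framework would do.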

Example: Counting Words

Map()
- Input: <filename, filetext>
- Parses the file and emits <word, count> pairs, e.g. <"hello", 1>

Reduce()
- Sums all values for the same key and emits <word, totalcount>, e.g. <"hello", (3 5 2 7)> => <"hello", 17>

Example Use of MapReduce

Counting words in a large set of documents:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
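The pseudocode above can be tried out locally with a short Python sketch. This is an illustration under simplifying assumptions, not the library's API: `map_fn`, `reduce_fn` and `word_count` are hypothetical names, everything runs in one process, and counts are passed as strings only to mirror the ParseInt/AsString calls in the pseudocode.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: a list of counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    return str(result)


def word_count(documents):
    # Stand-in for the framework: group intermediate values by word,
    # then call reduce_fn once per word.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    return {word: reduce_fn(word, counts) for word, counts in intermediate.items()}


counts = word_count({"doc1": "hello world hello", "doc2": "hello cloud"})
print(counts)
```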

How MapReduce works

User's To-Do List:
1. Indicate:
   - input and output files
   - M: number of map tasks
   - R: number of reduce tasks
   - W: number of machines
2. Write the map and reduce functions
3. Submit the job

- This requires NO knowledge of parallel/distributed systems!
- What about everything else?

Implementation: Google Computing Environment

Large clusters of commodity PCs connected together with switched Ethernet:
- A typical cluster contains thousands of machines
- Machines are dual-processor x86 systems running Linux, with 2-4 GB of memory each
- Commodity networking: typically 100 Mbit/s or 1 Gbit/s
- IDE disks attached directly to individual machines
- Distributed file system (GFS)

Fault Tolerance: Failures

Worker fails
- The master pings workers periodically
- If a worker does not respond within a certain time, the master marks it as failed
- A map task assigned to a failed worker is reset to the idle state, so it can be rescheduled on another worker

Master fails
- The master writes periodic checkpoints of its data structures
- If the master dies, a new copy can be started from the last checkpointed state
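The worker-failure handling can be sketched as a small piece of bookkeeping on the master. The following Python is only an illustration: the `Task` and `Master` classes, the `timeout` parameter and the explicit `now` timestamps are assumptions made for the example, and it models map tasks only. A real master would also send the pings over the network rather than have them recorded by hand.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: int
    state: str = "idle"            # "idle", "in_progress" or "completed"
    worker: Optional[str] = None   # worker currently responsible for the task


@dataclass
class Master:
    timeout: float                                   # seconds without a ping before a worker is failed
    tasks: list = field(default_factory=list)
    last_seen: dict = field(default_factory=dict)    # worker name -> time of last ping response

    def record_ping(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, task, worker):
        task.state = "in_progress"
        task.worker = worker

    def check_workers(self, now):
        # Mark workers that have been silent for too long as failed.
        failed = {w for w, seen in self.last_seen.items() if now - seen > self.timeout}
        for w in failed:
            del self.last_seen[w]
        # Reset their tasks to idle so they can be rescheduled elsewhere.
        for task in self.tasks:
            if task.worker in failed:
                task.state = "idle"
                task.worker = None
        return failed


master = Master(timeout=10.0, tasks=[Task(0), Task(1)])
master.record_ping("w1", now=0.0)
master.record_ping("w2", now=0.0)
master.assign(master.tasks[0], "w1")
master.assign(master.tasks[1], "w2")
master.record_ping("w2", now=12.0)       # w1 never answers again
failed = master.check_workers(now=15.0)
print(failed, [t.state for t in master.tasks])
```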

Fault Tolerance

Locality
- Network bandwidth is a scarce commodity
- Input file blocks are stored on multiple machines (3 copies)
- The master attempts to schedule a map task on a machine that contains a replica of its input data or, failing that, near a replica (i.e., on a machine attached to the same network switch)

Backup tasks (coping with "stragglers")
- When the computation is almost done, the master schedules backup executions of the remaining in-progress tasks (this significantly reduces completion time for large MapReduce computations)
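The locality preference amounts to a three-step choice, sketched below in Python. The function `pick_worker`, the machine names and the `switch_of` mapping are hypothetical; the sketch only shows the order of preference described above, not the actual scheduler.

```python
def pick_worker(replica_hosts, idle_workers, switch_of):
    """Choose an idle worker for a map task, preferring machines close to the input data."""
    # 1. Best case: an idle worker that stores a replica of the input block.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Next best: an idle worker on the same network switch as some replica.
    replica_switches = {switch_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if switch_of[worker] in replica_switches:
            return worker
    # 3. Otherwise any idle worker will do; the input is read over the network.
    return idle_workers[0] if idle_workers else None


# Hypothetical cluster layout: machine name -> network switch.
switch_of = {"m1": "s1", "m2": "s1", "m3": "s2", "m4": "s3", "m5": "s3", "m6": "s2"}
replicas = {"m1", "m2", "m3"}   # the 3 copies of one input block

print(pick_worker(replicas, ["m4", "m3"], switch_of))  # a replica holder is idle
print(pick_worker(replicas, ["m4", "m6"], switch_of))  # only a same-switch machine is idle
print(pick_worker(replicas, ["m4", "m5"], switch_of))  # no nearby machine is idle
```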

Refinements

- Partitioning function: typically hash(key) mod R, which tends to give balanced partitions. Example scenario: word counting on 10 documents with 5 map tasks and 2 reduce tasks.
- Combiner function: useful when there is significant repetition in intermediate keys, e.g. <the, 1>. The combiner merges such pairs on the map side before they are sent over the network.
- Ordering: within a given partition, intermediate key/value pairs are processed in increasing key order (this simplifies creating sorted output).
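The first two refinements are small enough to sketch directly. In the Python below, `partition` and `combine` are hypothetical helper names, and CRC32 is used as the hash purely as an assumption for the example: Python's built-in hash() of a string changes between runs, so it would not send the same key to the same reduce task reliably.

```python
import zlib
from collections import Counter


def partition(key, num_reduce_tasks):
    # hash(key) mod R: every occurrence of a key goes to the same reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def combine(pairs):
    # Map-side partial merge: collapse repeated keys such as ("the", 1)
    # into a single pair before the data leaves the map worker.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


R = 2  # two reduce tasks, as in the scenario above
map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
for word, count in combined:
    print(word, count, "-> reduce task", partition(word, R))
```

With the combiner, the map worker sends one pair for "the" instead of three, while the reduce function and the final counts stay unchanged.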

MapReduce in the cloud

- Open source
  - Apache Hadoop: http://hadoop.apache.org/
  - Cloud MapReduce (built on top of the Amazon cloud OS using cloud services such as S3/SQS/SimpleDB): http://code.google.com/p/cloudmapreduce/
- Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/

Conclusions

- Simplifies large-scale computations that fit this model
- Allows the user to focus on the problem without worrying about the details
- Computer architecture is not very important
- Portable model

References

- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters (2004).
- Hadoop MapReduce Tutorial: http://hadoop.apache.org/comon/docs/current/mapred_tutorial.html
- J. Mellor-Crummey. Data Analysis with MapReduce: http://www.owlnet.rice.edu/~comp422/lecture-notes/comp422-2010-lecture24-MapReduce.pdf