The MapReduce Framework



Similar documents
MapReduce (in the cloud)

Map Reduce / Hadoop / HDFS

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Introduction to Hadoop

Sriram Krishnan, Ph.D.

Lecture 10 - Functional programming: Hadoop and MapReduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data and Apache Hadoop s MapReduce

Hadoop Framework. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Introduction to Hadoop

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Introduction to Hadoop

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Advanced Data Management Technologies

Map Reduce & Hadoop Recommended Text:

Introduction to Parallel Programming and MapReduce

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

Parallel Processing of cluster by Map Reduce

Hadoop Parallel Data Processing

Data-Intensive Computing with Map-Reduce and Hadoop

Hadoop and Map-reduce computing

Chapter 7. Using Hadoop Cluster and MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 2

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

Hadoop and Map-Reduce. Swati Gore

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

ImprovedApproachestoHandleBigdatathroughHadoop

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Data Science in the Wild

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

HDFS. Hadoop Distributed File System

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Large scale processing using Hadoop. Ján Vaňo

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

A programming model in Cloud: MapReduce

The Hadoop Framework

Open source Google-style large scale data analysis with Hadoop

A Brief Outline on Bigdata Hadoop

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Hadoop implementation of MapReduce computational model. Ján Vaňo

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Developing a MapReduce Application

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

MapReduce and Hadoop Distributed File System V I J A Y R A O

NoSQL and Hadoop Technologies On Oracle Cloud

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

How To Use Hadoop

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Introduction to MapReduce and Hadoop

MapReduce, Hadoop and Amazon AWS

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

CS 378 Big Data Programming. Lecture 2 Map- Reduce

MapReduce Systems. Outline. Computer Speedup. Sara Bouchenak

MAPREDUCE Programming Model

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MapReduce and the New Software Stack

Distributed Text Mining with tm

Apache Hadoop. Alexandru Costan

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Hadoop. Bioinformatics Big Data

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Cloud Computing. Chapter Hadoop

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

MapReduce. Tushar B. Kute,

Lecture Data Warehouse Systems

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

CS 378 Big Data Programming

Cloudera Certified Developer for Apache Hadoop

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Introduction to Cloud Computing

Distributed Data Analysis with Hadoop and R Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.

Distributed Filesystems

BBM467 Data Intensive ApplicaAons

MapReduce with Apache Hadoop Analysing Big Data

Click Stream Data Analysis Using Hadoop

Hadoop IST 734 SS CHUNG

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Open source large scale distributed data management with Google s MapReduce and Bigtable

HADOOP MOCK TEST HADOOP MOCK TEST II

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Transcription:

The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16

Background Google, Yahoo, etc. deal with very large amounts of data (many terabytes) need to process data fairly quickly (within a day, e.g.) use very large numbers of commodity machines (thousands) A cluster at Yahoo. Google developed an infrastructure consisting of the Google distributed file system GFS the MapReduce computational model Other implementations include Hadoop from Apache. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 2 / 16

Requirements and Constraints Want to run on 1,000 10,000 nodes. With that many nodes some will fail some will go down for maintenance Fault tolerance is essential. Want to work on petabytes of data Data will need to be distributed across many disks. Data access speeds will depend on location: local disk will be fastest same rack may be faster than different rack Replication is needed for performance and fault tolerance. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 3 / 16

Requirements and Constraints Want an infrastructure that takes care of management tasks distribution of data management of fault tolerance collecting results For a specific problem user writes a few routines routines plug into the general interface Goal: identify a class of computations that is general enough to cover many problems structured enough to allow development of an infrastructure reasonably easy to tailor to specific problems MapReduce seems to fit this goal reasonably well. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 4 / 16

Google MapReduce Related to two concepts from functional programming: Mapping: applying a function to each element of a structure and returning a comparable structure of results. R/S use the term apply. Reducing or folding: Applying a binary operation, usually associative, often commutative, to an initial element and every successive element of a structure to produce a single reduced result, e.g. a sum. R 2.6.0 has recently introduced some functional programming primitives, including Map and Reduce. The names come from the Lisp world. A useful running example: Counting word frequencies in a collection of documents. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 5 / 16

Google MapReduce MapReduce operations work with key/value pairs, e.g. document name/document content word/count A general MapReduce computation has several components: Input reader: reads input files and divides into chunks for the map function map function: receives key/value pair and emits 0 or more key value pairs. Partition function: allocates output of maps to particular reduce functions. Comparison function: used in sorting map output by keys. reduce function: takes a key and a collection of values and produces a key/value pair. Output writer: writes results to storage. All components can be customized. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 6 / 16

Google MapReduce Often only map and reduce need to be written. For simple word counting, map might look like map(string key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); The key is ignored. The reduce function might look like reduce(string key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 7 / 16

Google MapReduce Google s MapReduce is implemented as a C++ library. Operates on commodity hardware and standard networking. Input data, intermediate results, and final results are stored in GFS. A master scheduler process distributes map, reduce tasks to workers. Fault tolerance: The master pings workers periodically. Workers that do not respond are marked as failed. Jobs assigned to failed workers are rerun. Master failure aborts the computation. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 8 / 16

Apache Hadoop Hadoop is part of the Apache Lucene project for open-source search software. Hadoop is used heavily by Yahoo, among others. There is support for running Hadoop jobs on Amazon EC2/Amazon S3. Hadoop includes a distributed file system, HDFS. a MapReduce framework. a web monitoring interface. Hadoop is written in Java and can be extended in Java. A mechanism for extension via C/C++ is also available. A streaming interface using standard I/O can also be used. The streaming interface is the easiest way to use Python or R. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 9 / 16

Apache Hadoop Word Count Example A Python example from the Wiki is easily adapted to R. I set up a simple test framework on my workstation. Eventually we may wish to add this to beowulf. The streaming interface uses batches of lines from text files as inputs. It requires mapper and reducer executables or scripts. The mapper produces lines of the form key<tab>value for the reducer. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 10 / 16

Apache Hadoop Word Count Example R script mapper.r to read lines from standard input and print word<tab>1 for each word to standard output: #! /usr/bin/env Rscript trimwhitespace <- function(line) gsub("(^ +) ( +$)", "", line) splitintowords <- function(line) unlist(strsplit(line, "[[:space:]]+")) con <- file("stdin", open = "r") while (length(line <- readlines(con, n = 1, warn = FALSE)) > 0) { line <- trimwhitespace(line) words <- splitintowords(line) cat(paste(words, "\t1\n", sep=""), sep="") } close(con) R script reducerer.r to read word/count pairs and emit word/sum pairs is a little longer. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 11 / 16

Apache Hadoop Word Count Example Data are files from project Gutenberg. Steps to running the example: Start up hadoop. Copy data to HDFS. Run MapReduce. Copy results back from HDFS. Shut down hadoop Starting upcodehadoop: setenv HADOOP_INSTALL /home/luke/hadoop/hadoop $HADOOP_INSTALL/bin/start-all.sh Copying data to HDFS: $HADOOP_INSTALL/bin/hadoop dfs -copyfromlocal gutenberg gutenberg Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 12 / 16

Apache Hadoop Word Count Example Running the MapReduce: $HADOOP_INSTALL/bin/hadoop jar \ $HADOOP_INSTALL/contrib/hadoop-streaming.jar \ -mapper /home/luke/hadoop/mapper.r \ -reducer /home/luke/hadoop/reducer.r \ -input gutenberg/* -output gutenberg-output Looking at the results: $HADOOP_INSTALL/bin/hadoop dfs -cat \ gutenberg-output/part-00000 more... Abaft 1 abandon 7 abandoned 7 abandoned, 2... Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 13 / 16

Apache Hadoop Netflix Review Count Example With minor modifications we can count the number of movies reviewed by each customer in the Netflix data. It is useful to change the movie files from 17767: 1428688,3,2005-08-09 656399,3,2005-08-19 1356914,4,2005-05-27 1526449,4,2005-10-20... to 17767,1428688,3,2005-08-09 17767,656399,3,2005-08-19 17767,1356914,4,2005-05-27 17767,1526449,4,2005-10-20... Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 14 / 16

con <- file("stdin", open = "r") while (length(line <- readlines(con, n = 1, warn = FALSE)) > 0) { vals <- unlist(strsplit(line, ",")) cat(vals[2], "\t", 1, "\n", sep="") } close(con) The reducer remains the same. The results for 3 movie files: $HADOOP_INSTALL/bin/hadoop dfs -cat netflix-output/part-00000 more... 1001833 1 1001928 2... 1664010 3 1664458 1... Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 15 / 16 Apache Hadoop Netflix Review Count Example The mapper script nmapper.r is #! /usr/bin/env Rscript

Discussion Many statistical computations can be expressed via MapReduce: simple summaries least squares regression k-means clustering logistic regression (needs a sequence of MapReduce operations) Languages for managing MapReduce computations are in development: Apache PIG project Google Sawzall Some extensions are also under consideration map-reduce-merge A number of frameworks supporting MapReduce are in development. Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 16 / 16