Overview of MapReduce and Hadoop



Why Parallelism?
- Data size is increasing, and single-node architectures are reaching their limits: scanning 100 TB on 1 node @ 100 MB/s takes about 12 days
- Standard, affordable commodity architectures are emerging: clusters of commodity Linux nodes with gigabit Ethernet interconnect

Design Goals for Parallelism
1. Scalability to large data volumes: scanning 100 TB on 1 node @ 100 MB/s takes about 12 days; on a 1000-node cluster it takes about 17 minutes!
2. Cost-efficiency:
- Commodity nodes (cheap, but unreliable)
- Commodity network
- Automatic fault tolerance (fewer admins)
- Easy to use (fewer programmers)
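The arithmetic behind those numbers, as a quick sanity check (decimal units assumed, and perfectly linear speedup, which real clusters only approximate):

    TB, MB = 10**12, 10**6                 # decimal units, as on the slide
    one_node_s = 100 * TB / (100 * MB)     # 1,000,000 seconds on a single node
    print(one_node_s / 86_400)             # ~11.6 days on one node
    print(one_node_s / 1_000 / 60)         # ~16.7 minutes on 1,000 nodes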

What is MapReduce?
- A data-parallel programming model for clusters of commodity machines
- Pioneered by Google, which processes 20 PB of data per day with it
- Popularized by the open-source Hadoop project
- Used by Yahoo!, Facebook, Amazon, ...

What is MapReduce used for?
- At Google: index building for Google Search, article clustering for Google News, statistical machine translation
- At Yahoo!: index building for Yahoo! Search, spam detection for Yahoo! Mail
- At Facebook: data mining, ad optimization, spam detection
- In research: analyzing Wikipedia conflicts (PARC), natural language processing (CMU), bioinformatics (Maryland), astronomical image analysis (Washington), ocean climate simulation (Washington), graph OLAP (NUS), <your application>

Challenges
- Cheap nodes fail, especially if you have many of them
- Commodity network = low bandwidth
- Programming distributed systems is hard

Typical Hadoop Cluster
- An aggregation switch above per-rack switches
- 40 nodes/rack, 1000-4000 nodes per cluster
- 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
- Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
- In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/shh0ro

Hadoop Components
- Distributed file system (HDFS): single namespace for the entire cluster; replicates data 3x for fault tolerance
- MapReduce implementation: executes user jobs specified as map and reduce functions; manages work distribution and fault tolerance

Hadoop Distributed File System
- Files are BIG (100s of GB to TB)
- Typical usage patterns: append-only; data are rarely updated in place; reads are common
- Optimized for large files and sequential reads (why??)
- Files are split into 64-128 MB blocks (called chunks)
- Blocks are replicated (usually 3 times) across several datanodes (also called chunk or slave nodes)
- Chunk nodes are compute nodes too
[Diagram: File1 is split into blocks 1-4; the namenode records the mapping, and each block is stored on three different datanodes]
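As a back-of-the-envelope sketch of what those numbers mean for storage (block size and replication factor are configurable in Hadoop; the defaults below just match the slide's figures):

    import math

    def hdfs_footprint(file_bytes, block_bytes=128 * 2**20, replication=3):
        """Blocks created and raw disk consumed for one HDFS file."""
        blocks = math.ceil(file_bytes / block_bytes)
        # The last block may be partial; HDFS stores only the actual bytes.
        raw_bytes = file_bytes * replication
        return blocks, raw_bytes

    # A 1 GB file: 8 blocks, each stored 3 times, so 3 GB of raw disk.
    print(hdfs_footprint(2**30))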

Hadoop Distributed File System
- A single namenode (master node) stores metadata (file names, block locations, etc.); it may also be replicated
- A client library for file access talks to the master to find the chunk servers, then connects directly to the chunk servers to access data, so the master node is not a bottleneck
- Computation is done at the chunk nodes (close to the data)
[Diagram: the same File1 layout as above, with clients reading blocks directly from the datanodes]

MapReduce Programming Model
- Data type: key-value records; a file is a bag of (key, value) records
- Map function: (K_in, V_in) -> list(K_inter, V_inter)
  Takes a key-value pair and outputs a set of key-value pairs, e.g., the key is the line number and the value is a single line of the file. There is one Map call for every (K_in, V_in) pair.
- Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)
  All values V_inter with the same key K_inter are reduced together, and keys are processed in sorted order. There is one Reduce call per unique key K_inter.

Example: Word Count Execution
Input (one line per mapper): "the quick brown fox" / "the fox ate the mouse" / "how now brown cow"
Map: each mapper emits (word, 1) for every word in its line, e.g., (the, 1), (quick, 1), (brown, 1), (fox, 1), ...
Sort & Shuffle: pairs are grouped by key, and each key is routed to a single reducer.
Reduce: each reducer sums the counts for its keys.
Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)

Example: Word Count

    def mapper(line):
        for word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
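The output(...) calls above are pseudocode. To see the whole pipeline end to end, here is a minimal single-process simulation of map, shuffle (group by key), and reduce; it is a sketch of the model, not the Hadoop API:

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        for word in line.split():
            yield (word, 1)

    def reducer(key, values):
        yield (key, sum(values))

    def run_mapreduce(inputs, mapper, reducer):
        # Map phase: apply the mapper to every input record.
        pairs = [kv for record in inputs for kv in mapper(record)]
        # Shuffle & sort: group the intermediate pairs by key.
        pairs.sort(key=itemgetter(0))
        # Reduce phase: one reducer call per distinct key.
        return [out for key, group in groupby(pairs, key=itemgetter(0))
                    for out in reducer(key, [v for _, v in group])]

    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    print(run_mapreduce(lines, mapper, reducer))
    # [('ate', 1), ('brown', 2), ('cow', 1), ('fox', 2), ..., ('the', 3)]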

Can the Word Count algorithm be improved? Some questions to consider:
- How many key-value pairs are emitted? What is the overhead?
- What if the same word appears several times in the same line/document?
- How could you reduce that number?

An Optimization: The Combiner
- A combiner is a local aggregation function applied to repeated keys produced by the same map
- Works for associative operations like sum, count, max
- Decreases the size of the intermediate data
Example: local counting for Word Count:

    def combiner(key, values):
        output(key, sum(values))

NOTE: For Word Count, this turns out to be exactly what the Reducer does!

Word Count with Combiner
Same input as before, but each mapper now combines its own output locally:
  "the quick brown fox" -> (the, 1), (quick, 1), (brown, 1), (fox, 1)
  "the fox ate the mouse" -> (the, 2), (fox, 1), (ate, 1), (mouse, 1)
  "how now brown cow" -> (how, 1), (now, 1), (brown, 1), (cow, 1)
Shuffle & sort and reduce proceed as before, but fewer pairs cross the network; the final output is unchanged.

Care in Using a Combiner
- Usually the same as the reducer, if the operation is associative and commutative (e.g., sum)
- How about average? How about median?
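Average illustrates the danger: the average of per-node averages is wrong whenever nodes hold different numbers of values. The standard fix is to make the intermediate value an associative, commutative (sum, count) pair; no analogous decomposition exists for the median, which is why it cannot use a combiner in general. A sketch in the same style as the earlier examples (the record format here is hypothetical):

    def mapper(record):
        key, x = record               # e.g., ("AAPL", 12.5) -- hypothetical record
        yield (key, (x, 1))           # value = (partial_sum, partial_count)

    def combiner(key, values):        # safe: (sum, count) pairs add associatively
        s, c = 0, 0
        for ps, pc in values:
            s, c = s + ps, c + pc
        yield (key, (s, c))           # still a (sum, count) pair, safe to re-combine

    def reducer(key, values):
        s, c = 0, 0
        for ps, pc in values:
            s, c = s + ps, c + pc
        yield (key, s / c)            # the true mean, computed once at the end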

The Complete WordCount Program
[Slide shows the full source code: the Map function, the Reduce function, the driver that invokes Map, Combiner & Reduce, and how to run the program as a MapReduce job]

Example 2: Search
- Input: (linenumber, line) records
- Output: lines matching a given pattern
- Map: if line matches pattern: output(line)
- Reduce: identity function
- Alternative?

Example 3: Inverted Index
Source documents:
  hamlet.txt: "to be or not to be"
  12th.txt: "be not afraid of greatness"
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Inverted index:
  afraid -> (12th.txt)
  be -> (12th.txt, hamlet.txt)
  greatness -> (12th.txt)
  not -> (12th.txt, hamlet.txt)
  of -> (12th.txt)
  or -> (hamlet.txt)
  to -> (hamlet.txt)

Example 3: Inverted Index
- Input: (filename, text) records
- Output: the list of files containing each word
- Map:

    for word in text.split():
        output(word, filename)

- Combine: merge the filenames for each word
- Reduce:

    def reduce(word, filenames):
        output(word, sorted(filenames))
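Plugged into the toy run_mapreduce harness from the Word Count sketch above (a simulation, not Hadoop itself), the same logic reads:

    def index_mapper(record):
        filename, text = record
        for word in text.split():
            yield (word, filename)

    def index_reducer(word, filenames):
        yield (word, sorted(set(filenames)))   # dedupe, then sort the file list

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]
    print(run_mapreduce(docs, index_mapper, index_reducer))
    # [('afraid', ['12th.txt']), ('be', ['12th.txt', 'hamlet.txt']), ...]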

Example 4: Most Popular Words
- Input: (filename, text) records
- Output: the 100 words occurring in the most files
- Two-stage solution:
  Job 1: create the inverted index, giving (word, list(file)) records
  Job 2: map each (word, list(file)) record to (count, word), then sort these records by count as in the sort job
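A sketch of the two-stage chain using the same toy harness (it sorts globally rather than via partitioning, so this only illustrates the dataflow, not the distributed sort):

    # Job 1: inverted index, giving (word, [files...]) records.
    index = run_mapreduce(docs, index_mapper, index_reducer)

    # Job 2: map each (word, files) record to (file_count, word);
    # the shuffle sorts by key, so the tail of the output is the top.
    def count_mapper(record):
        word, files = record
        yield (len(files), word)

    def identity_reducer(key, values):
        for v in values:
            yield (key, v)

    ranked = run_mapreduce(index, count_mapper, identity_reducer)
    top_100 = ranked[-100:]    # most widespread words come last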

Note
- The numbers of mappers, reducers, and physical nodes need not be equal
- For different problems, the Map and Reduce functions differ, but the workflow is the same: read a lot of data; Map (extract something you care about); sort and shuffle; Reduce (aggregate, summarize, filter, or transform); write the result

Refinement: Partition Function
- We want to control how keys get partitioned: inputs to map tasks are created by contiguous splits of the input file, but Reduce needs all records with the same intermediate key to end up at the same worker
- The system uses a default partition function: hash(key) mod R
- Sometimes it is useful to override the hash function:
  E.g., hash(hostname(url)) mod R ensures that URLs from the same host end up in the same output file
  E.g., how about sorting?
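The same idea in Python (the real interface is Hadoop's Java Partitioner class; this only shows the arithmetic, and Python's built-in str hash is randomized per process, so a real system would use a stable hash):

    from urllib.parse import urlparse

    R = 4                                        # number of reducers

    def default_partition(key):
        return hash(key) % R                     # default: spread keys uniformly

    def host_partition(url):
        return hash(urlparse(url).hostname) % R  # all URLs of one host -> same reducer

    a = host_partition("http://example.com/page1")
    b = host_partition("http://example.com/page2")
    assert a == b                                # same host, same output file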

Example 5: Sort
- Input: (key, value) records
- Output: the same records, sorted by key
- Map: identity function. Why?
- Reduce: identity function. Why?
- Trick: pick a partitioning function h such that k1 < k2 implies h(k1) <= h(k2)
[Diagram: mappers emit keys such as zebra, cow, pig, aardvark, elephant, ant, bee, sheep, yak; the [A-M] reducer outputs aardvark, ant, bee, cow, elephant and the [N-Z] reducer outputs pig, sheep, yak, zebra, so the concatenated output is globally sorted]
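A monotone partition function in this vein (boundaries would normally come from sampling the key distribution; here they are hard-coded for two reducers and lowercase string keys):

    def range_partition(key, boundaries):
        # Monotone: k1 < k2 implies range_partition(k1) <= range_partition(k2).
        for i, bound in enumerate(boundaries):
            if key < bound:
                return i
        return len(boundaries)

    boundaries = ["n"]                  # reducer 0 gets [a-m], reducer 1 gets [n-z]
    for k in ["aardvark", "cow", "pig", "zebra"]:
        print(k, range_partition(k, boundaries))
    # aardvark 0, cow 0, pig 1, zebra 1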

Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or defaults are used

Not All Tasks Fit the MapReduce Model
- Consider a data set of n observations and k variables, e.g., k different stock symbols or indices (say k = 10,000) and n observations representing stock price signals (up/down) measured at n different times.
- Problem: find very high correlations (ideally with time lags, to be able to make a profit), e.g., if Google is up today, Facebook is likely to be up tomorrow.
- We need to compute k * (k-1) / 2 correlations to solve this problem.
- We cannot simply split the 10,000 stock symbols into 1,000 clusters of 10 symbols each and then use MapReduce: the vast majority of the correlations involve a symbol in one cluster and a symbol in another cluster, and these cross-cluster computations make MapReduce useless in this case.
- The same issue arises if you replace "correlation" by any other function, say f, computed on two variables rather than one.
NOTE: Make sure you pick the right tool!
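The arithmetic that makes the point, using exactly the slide's k and cluster sizes:

    from math import comb

    k, clusters = 10_000, 1_000                 # 1,000 clusters of 10 symbols each
    total_pairs = comb(k, 2)                    # 49,995,000 correlations overall
    within = clusters * comb(k // clusters, 2)  # 1,000 * C(10, 2) = 45,000
    print(within / total_pairs)                 # ~0.0009: over 99.9% of pairs cross clusters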

High-Level Languages on Top of Hadoop
- MapReduce is great: many algorithms can be expressed as a series of MR jobs
- But it's low-level: you must think about keys, values, partitioning, etc.
- Can we capture common job patterns?

Pig
- Started at Yahoo! Research; now runs about 30% of Yahoo!'s jobs
- Features: expresses sequences of MapReduce jobs; data model of nested bags of items; relational (SQL-like) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen development environment for Eclipse
- Example task: suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25. The dataflow: Load Users and Load Pages; Filter Users by age; Join on name; Group on url; Count clicks; Order by clicks; Take top 5.
Example from http://wiki.apache.org/pig-data/attachments/pigtalkspapers/attachments/apacheconeurope09.ppt

Ease of Translation
Notice how naturally the components of the job translate into Pig Latin (roughly, the load/filter/join steps compile to MapReduce Job 1, the group/count steps to Job 2, and the order/limit steps to Job 3):

    Users    = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages    = load 'pages' as (user, url);
    Joined   = join Filtered by name, Pages by user;
    Grouped  = group Joined by url;
    Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted   = order Summed by clicks desc;
    Top5     = limit Sorted 5;
    store Top5 into 'top5sites';

Hive
- Developed at Facebook; used for the majority of Facebook jobs
- A relational data warehouse built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL)
- Can call Hadoop Streaming scripts from HQL
- Supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

    CREATE TABLE page_views(viewtime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'User IP address')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)  -- breaks the table into separate files
    STORED AS SEQUENCEFILE;

Sample query: find all page views coming from xyz.com during March 2008:

    SELECT page_views.*
    FROM page_views
    WHERE page_views.dt >= '2008-03-01'
      AND page_views.dt <= '2008-03-31'
      AND page_views.referrer_url LIKE '%xyz.com';

Hive reads only the partitions with dt in that range instead of scanning the entire table.

Announcement
- Project teams: submit the list of members of your team ASAP
- Mail to tankl@comp.nus.edu.sg

MapReduce Environment
The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group-by-key step in the Sort and Shuffle phase
- Handling machine failures
- Managing the required inter-machine communication

MapReduce Execution Details
- A single master controls job execution on multiple slaves, as well as user scheduling
- Mappers are preferentially placed on the same node or the same rack as their input block: push computation to the data, and minimize network use
- Mappers save their outputs to local disk rather than pushing them directly to reducers; this allows recovery if a reducer crashes
- The output of the reducers is stored in HDFS and may be replicated

Coordinator: Master Node
The master node takes care of coordination:
- Task status: idle, in-progress, or completed
- Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the locations and sizes of its R intermediate files, one per reducer; the master pushes this information to the reducers
- The master pings workers periodically to detect failures

Execution Overview
(1) The user program submits the job to the master.
(2) The master schedules map tasks and reduce tasks on workers.
(3) Map workers read their input splits (split 0 ... split 4).
(4) Map output is written to each worker's local disk.
(5) Completed maps report their output locations to the master.
(6) The master tells each reducer where its input lives.
(7) Reduce workers remote-read the intermediate files.
(8) Each reducer writes one output file (output file 0, output file 1).
Phases: input files -> map phase -> intermediate files (on local disk) -> reduce phase -> output files.

MapReduce with a single reduce task [diagram]

MapReduce with multiple reduce tasks [diagram]

MapReduce with no reduce tasks [diagram]

MapReduce: shuffle and sort [diagram]

Two Additional Notes
- There is a barrier between the map and reduce phases: Reduce cannot start processing until all maps have completed, but copying of intermediate data can begin as soon as a mapper completes
- Keys arrive at each reducer in sorted order; there is no enforced ordering across reducers

How Many Map and Reduce Tasks?
- M map tasks, R reduce tasks
- Rule of thumb: make M much larger than the number of nodes in the cluster; one DFS chunk per map is common. This improves dynamic load balancing and speeds up recovery from worker failures
- Usually R is smaller than M, because the output is spread across R files
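For scale, the one-chunk-per-map rule applied to the 100 TB scan from the earlier slides (128 MB blocks assumed):

    input_bytes = 100 * 10**12          # 100 TB of input
    block_bytes = 128 * 2**20           # one 128 MB DFS chunk per map task
    M = -(-input_bytes // block_bytes)  # ceiling division: ~745,000 map tasks
    print(M)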

Fault Tolerance in MapReduce
1. If a task crashes:
- Retry it on another node. This is okay for a map because it has no dependencies, and okay for a reduce because map outputs are on disk
- If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)
Note: for this and the other fault-tolerance features to work, your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce
2. If a node crashes:
- Relaunch its current tasks on other nodes
- Relaunch any maps the node previously ran. Why? Because their output files were lost along with the crashed node

Fault Tolerance in MapReduce
3. If a task is running slowly (a straggler):
- Launch a second copy of the task on another node
- Take the output of whichever copy finishes first, and kill the other one
- This is critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.

Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures and stragglers
The user focuses on the application, not on the complexities of distributed computing.

Conclusions
- MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance
- Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
- Hive and Pig further simplify programming
- MapReduce is not suitable for all problems, but when it works, it may save you a lot of time

Resources
- Hadoop: http://hadoop.apache.org/core/
- Hadoop docs: http://hadoop.apache.org/core/docs/current/
- Pig: http://hadoop.apache.org/pig
- Hive: http://hadoop.apache.org/hive
- Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

Acknowledgements
This lecture is adapted from slides from multiple sources, including those from the RAD Lab (Berkeley), Stanford, Duke, and Maryland.