MapReduce Details: Optimizing the Reduce Phase with the Combiner

If a Combiner is present, the framework inserts it into the processing pipeline on the nodes that have just finished the Map phase. The Combiner runs after the Map phase, but before the intermediate data is sent to other nodes. It receives only the data produced by the Map phase on its local node, not data from other nodes, and it produces key-value pairs that are then sent to the Reducers. The Combiner can be used in cases where the reduction can already begin without having all the data. The maximum-temperature computation, for example, lends itself very well to this: the Combiner computes the maximum temperature for the data available on the local node, so instead of sending the pairs (1949, 111) and (1949, 78) to the Reducers, only the pair (1949, 111) is sent.

Distributed file system: HDFS

HDFS design decisions:
- Files are stored as chunks of fixed size (64 MB).
- Reliability through replication: each chunk is replicated across 3+ nodes.
- A single master coordinates access and keeps the metadata: simple centralized management.
- No data caching: little benefit, due to large datasets and streaming reads.
- Simple API: some of the issues (e.g., data layout) are pushed onto the client.

[Diagram: an application talks to HDFS through the HDFS client; the HDFS namenode keeps the file namespace (e.g., /foo/bar -> block 3d2f), while HDFS datanodes store the blocks on their local Linux file systems.]
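The effect of the Combiner can be sketched in a few lines of plain Python (Python 3, independent of any Hadoop API; the function name and the sample records are invented for illustration): it collapses the local map output to one pair per key before anything crosses the network.

```python
# Sketch of what a max-temperature Combiner does with the local map output.
# Input: (year, temperature) pairs produced by the Map phase on ONE node.
# Output: a single (year, max_temperature) pair per year, so far fewer
# key-value pairs travel over the network to the Reducers.

def combine_max(map_output):
    local_max = {}
    for year, temp in map_output:
        # keep only the largest temperature seen locally for each year
        if year not in local_max or temp > local_max[year]:
            local_max[year] = temp
    return sorted(local_max.items())

# Map output on the local node: two records for 1949, one for 1950.
pairs = [("1949", 111), ("1949", 78), ("1950", 22)]
print(combine_max(pairs))  # [('1949', 111), ('1950', 22)]
```

Because max is associative and commutative, applying it locally first and again in the Reducer gives the same final result as reducing everything in one place.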

Namenode responsibilities

- Managing the file system namespace: holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
- Coordinating file operations: directs clients to datanodes for reads and writes; no data is moved through the namenode.
- Maintaining overall health: periodic communication with the datanodes, block re-replication and rebalancing, garbage collection.

Putting everything together

Per cluster:
- One Namenode (NN): master node for HDFS (web UI at http://hostname:50070/).
- One Jobtracker (JT): master node for job submission (web UI at http://hostname:50030/).

Per slave machine:
- One Tasktracker (TT): contains multiple task slots.
- One Datanode (DN): serves HDFS data blocks.

[Diagram: the master node runs the jobtracker (MapReduce) and the namenode (HDFS); each slave node runs tasktrackers and a datanode.]

Important counters

Phase            | Measure                                                                             | Counter name
Map              | Number of input records consumed by all mappers                                     | Map input records
Map              | Number of key/value pairs produced by all mappers                                   | Map output records
Shuffle and sort | Number of bytes of map output copied by the shuffle to reducers (may be compressed) | Reduce shuffle bytes
Reduce           | Number of unique keys fed into the reducers                                         | Reduce input groups
Reduce           | Number of key/value pairs produced by all reducers                                  | Reduce output records

Hadoop Streaming

Writing MapReduce in scripting languages (Python, Ruby, ...). To write Map and Reduce functions in languages other than Java, there is the Hadoop Streaming API. It uses Unix standard streams as the interface between Hadoop and your program: your program reads data from standard input and writes data to standard output. All data is in text format, so the original input data needs to be a text file.

Mapper:
- Receives the file to be processed as lines of text.
- Writes the output key-value pairs as lines of text: one pair per line, key and value separated by a tab character.

Reducer:
- Receives the input key-value pairs as lines of text: one key and one value per line, separated by a tab. If a key has multiple values, the key is repeated on several lines.
- Writes the output key-value pairs as lines of text.
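The tab-separated line format used by Hadoop Streaming can be illustrated in a few lines of Python (Python 3 here; the sample records are invented): the mapper emits one key-TAB-value line per pair, and on the reducer side a key with several values simply appears on several consecutive lines of the sorted stream.

```python
# What the mapper writes to standard output: one key-value pair per line,
# key and value separated by a tab character.
pairs = [("1949", "111"), ("1949", "78"), ("1950", "22")]
mapper_output = ["%s\t%s" % (k, v) for k, v in pairs]
print(mapper_output)  # ['1949\t111', '1949\t78', '1950\t22']

# What the reducer reads from standard input: the same line format, but
# sorted by key, so a key with multiple values is repeated on
# consecutive lines.
for line in sorted(mapper_output):
    key, val = line.strip().split("\t")
    print(key, val)
```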

Hadoop Streaming Example: Mapper in Python for maximum temperature

    #!/usr/bin/env python
    #
    # max_temperature_map.py - Calculate maximum temperature from NCDC Global
    # Hourly Data - Mapper part

    import re   # import regular expressions
    import sys  # import system-specific parameters and functions

    # loop through the input, line by line
    for line in sys.stdin:
        # remove leading and trailing whitespace
        val = line.strip()
        # extract values for year, temperature and quality indicator
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        # temperature is valid if not +9999 and quality indicator is
        # one of 0, 1, 4, 5 or 9
        if temp != "+9999" and re.match("[01459]", q):
            print "%s\t%s" % (year, temp)

Hadoop Streaming Example: Reducer in Python for maximum temperature

    #!/usr/bin/env python
    #
    # max_temperature_reduce.py - Calculate maximum temperature from NCDC Global
    # Hourly Data - Reducer part

    import sys

    (last_key, max_val) = (None, -sys.maxint)

    # loop through the input, line by line
    for line in sys.stdin:
        # each line contains a key and a value separated by a tab character
        (key, val) = line.strip().split("\t")
        # Hadoop has sorted the input by key, so we get the values
        # for the same key immediately one after the other.
        # Test if we just got a new key; in that case output the maximum
        # temperature for the previous key and reinitialize the variables.
        # If not, keep calculating the maximum temperature.
        if last_key and last_key != key:
            print "%s\t%s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))

    # we've reached the end of the file, output what is left
    if last_key:
        print "%s\t%s" % (last_key, max_val)
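Streaming scripts are commonly tested locally by piping a sample file through the mapper, a sort, and the reducer by hand. The snippet below simulates that map / sort / reduce flow in-process (Python 3 syntax, unlike the Python 2 listings above, and with a toy record format "year temp" instead of the fixed-width NCDC records):

```python
# Simulate the streaming pipeline: mapper | sort | reducer.

def mapper(lines):
    # emit "year\ttemp" for each toy input record of the form "year temp"
    for line in lines:
        year, temp = line.split()
        yield "%s\t%s" % (year, temp)

def reducer(lines):
    # lines arrive sorted by key; keep a running maximum per key
    last_key, max_val = None, None
    for line in lines:
        key, val = line.strip().split("\t")
        if last_key is not None and key != last_key:
            # new key: emit the maximum for the previous key, then reset
            yield "%s\t%s" % (last_key, max_val)
            last_key, max_val = key, int(val)
        else:
            last_key, max_val = key, (int(val) if max_val is None
                                      else max(max_val, int(val)))
    if last_key is not None:
        # end of input: emit what is left
        yield "%s\t%s" % (last_key, max_val)

records = ["1949 111", "1950 22", "1949 78"]
result = list(reducer(sorted(mapper(records))))
print(result)  # ['1949\t111', '1950\t22']
```

The sort step between the two functions plays the role of Hadoop's shuffle and sort phase, which is what guarantees the reducer sees all values for a key on consecutive lines.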

Python Essential concepts

- A Python script always starts with the line #!/usr/bin/env python
- There is no ; at the end of statements.
- Variables do not need to be declared before they are used:

      temperature = 21.4

- Variables have types (int, float, bool, string, ...), but the type of a variable does not need to be declared; it is automatically derived from the value:

      my_int = 12
      my_float = 21.4
      my_string = "Hello"

- A variable can also have no value at all, by using the built-in constant None:

      my_string = None

- Control structures (if, while, for, ...) use indentation instead of braces { } or keywords (do ... done) to group statements:

      if temperature > 27.5:
          print "It is getting too hot."
          print "Get a drink."
      elif temperature < 2.5:
          print "It is getting too cold."
      else:
          print "Temperature OK."

      words = ['how', 'are', 'you']
      for w in words:
          print w, len(w)

- Assignments can be done in tuples:

      (d, e, f) = (a, b/2, c+3)

Python Essential concepts: String operations

- Read standard input line by line:

      for line in sys.stdin:
          print line

- Remove whitespace at the beginning and end of a string:

      stripped = line.strip()

- Split a string into fields based on a delimiter character:

      (key, val) = str.split("\t")

- Extract a substring:

      substr = str[12:18]

- Match a regular expression:

      if re.match("abc", input):
          print "Found abc in input"

- Formatted output, similar to C: a format string followed by values:

      print "%s\t%s" % (string1, string2)
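The string operations above can be exercised together on one sample line (Python 3 here, so print is a function; the sample line and the slice example are invented):

```python
import re

line = "  1949\t111  \n"
stripped = line.strip()            # removes spaces and the trailing newline
key, val = stripped.split("\t")    # splits into ("1949", "111")
substr = "abcdefghij"[2:5]         # characters 2, 3 and 4: "cde"
match = re.match("[01459]", "1")   # truthy: "1" is in the character class
print("%s\t%s" % (key, val))       # prints the pair back in tab format
```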

Python Essential concepts: Documentation

Python tutorial: https://docs.python.org/2/tutorial/index.html
Python documentation: https://docs.python.org/2.7/