Hadoop MapReduce Felipe Meneses Besson

Similar documents
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

CSE-E5430 Scalable Cloud Computing Lecture 2

The MapReduce Framework

Introduction to Hadoop

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Introduction to MapReduce and Hadoop

Introduction to Cloud Computing

Chapter 7. Using Hadoop Cluster and MapReduce

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Hadoop

Hadoop Architecture. Part 1


marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

MapReduce (in the cloud)

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Cloud Computing: MapReduce and Hadoop

Open source Google-style large scale data analysis with Hadoop

Lecture 10 - Functional programming: Hadoop and MapReduce

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

How To Use Hadoop

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Big Data Analytics* Outline. Issues. Big Data

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Big Data With Hadoop

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Map Reduce & Hadoop Recommended Text:

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Hadoop Parallel Data Processing

Big Data and Apache Hadoop s MapReduce

NoSQL and Hadoop Technologies On Oracle Cloud

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, Seth Ladd

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

MapReduce, Hadoop and Amazon AWS

Parallel Processing of cluster by Map Reduce

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

MapReduce. Tushar B. Kute,

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Map Reduce / Hadoop / HDFS

MapReduce and Hadoop Distributed File System

Hadoop implementation of MapReduce computational model. Ján Vaňo

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Big Data Management and NoSQL Databases

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Application Development. A Paradigm Shift

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Sriram Krishnan, Ph.D.

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Fault Tolerance in Hadoop for Work Migration

ImprovedApproachestoHandleBigdatathroughHadoop

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Hadoop Distributed File System (HDFS) Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

L1: Introduction to Hadoop

Large scale processing using Hadoop. Ján Vaňo

Hadoop IST 734 SS CHUNG

Open source large scale distributed data management with Google s MapReduce and Bigtable

Data-Intensive Computing with Map-Reduce and Hadoop

HADOOP MOCK TEST HADOOP MOCK TEST II

MapReduce Systems. Outline. Computer Speedup. Sara Bouchenak

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani

Introduction to Parallel Programming and MapReduce

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

MapReduce. Introduction and Hadoop Overview. 13 June Lab Course: Databases & Cloud Computing SS 2012

A Performance Analysis of Distributed Indexing using Terrier

CS54100: Database Systems

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

Keywords: Big Data, HDFS, Map Reduce, Hadoop

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

The Hadoop Framework

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Introduction to Cloud Computing

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

MapReduce Job Processing

Hadoop and No SQL Nagaraj Kulkarni

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Transcription:

Hadoop MapReduce Felipe Meneses Besson IME-USP, Brazil July 7, 2010

Agenda What is Hadoop? Hadoop Subprojects MapReduce HDFS Development and tools 2

What is Hadoop? A framework for large-scale data processing (Tom White, 2009): Project of Apache Software Foundation Most written in Java Inspired in Google MapReduce and GFS (Google File System) 3

A brief history 2004: Google published a paper that introduced MapReduce and GFS as a alternative to handle the volume of data to be processed 2005: Doug Cutting integrated MapReduce in the Hadoop 2006: Doug Cutting joins Yahoo! 2008: Cloudera¹ was founded 2009: Hadoop cluster sort 100 terabyte in 173 minutes (on 3400 nodes)² Nowadays, Cloudera company is an active contributor to the Hadoop project and provide Hadoop consulting and commercial products. [1]Cloudera: http://www.cloudera.com [2] Sort Benchmark: http://sortbenchmark.org/ 4

Hadoop Characteristics A scalable and reliable system for shared storage and analyses. It automatically handles data replication and node failure It does the hard work developer can focus on processing data logic Enable applications to work of petabytes of data in parallel 5

Who's using Hadoop Source: Hadoop wiki, September 2009 6

Hadoop Subprojects Apache Hadoop is a collection of related subprojects that fall under the umbrella of infrastructure for distributed computing. All projects are hosted by the Apache Software Foundation. 7

MapReduce MapReduce is a programming model and an associated implementation for processing and generating large data sets (Jeffrey Dean and Sanjay Ghemawat, 2004) Based on a functional programming model A batch data processing system A clean abstraction for programmers Automatic parallelization & distribution Fault-tolerance 8

MapReduce Programming model Users implement the interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list 9

MapReduce Map Function Input: Records from some data source (e.g., lines of files, rows of a databases, ) are associated in the (key, value) pair Example: (filename, content) Output: One or more intermediate values in the (key, value) format Example: (word, number_of_occurrences) 10

MapReduce Map Function map (in_key, in_value) (out_key, intermediate_value) list Source: (Cloudera, 2010) 11

MapReduce Map Function Example: map (k, v): if (isprime(v)) then emit (k, v) ( foo, 7) ( foo, 7) ( test, 10) (nothing) 12

MapReduce Reduce function After map phase is over, all the intermediate values for a given output key are combined together into a list Input: Intermediate values Example: ( A, [42, 100, 312]) Output: usually only one final value per key Example: ( A, 454) 13

MapReduce Reduce Function reduce (out_key, intermediate_value list) out_value list Source: (Cloudera, 2010) 14

MapReduce Reduce Function Example: reduce (k, vals): sum = 0 foreach int v in vals: sum += v emit (k, sum) ( A, [42, 100, 312]) ( A, 454) ( B, [12, 6, -2]) ( B, 16) 15

MapReduce Terminology Job: unit of work that the client wants to be performed Input data + MapReduce program + configuration information Task: part of the job map and reduce tasks Jobtracker: node that coordinates all the jobs in the system by scheduling tasks to run on tasktrackers 16

MapReduce Terminology Tasktracker: nodes that run tasks and send progress reports to the jobtracker Split: fixed-size piece of the input data 17

MapReduce DataFlow Source: (Cloudera, 2010) 18

MapReduce Real Example map (String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); 19

MapReduce Real Example reduce(string key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); 20

MapReduce Combiner function Compress the intermediate values Run locally on mapper nodes after map phase It is like a mini-reduce Used to save bandwidth before sending data to the reducer 21

MapReduce Combiner Function Applied in a mapper machine Source: (Cloudera, 2010) 22

HDFS Hadoop Distributed Filesystem Inspired on GFS Designed to work with very large files Run on commodity hardware Streaming data access Replication and locality 23

HDFS Nodes A Namenode (the master) Manages the filesystem namespace Knows all the blocks location Datanodes (workers) Keep blocks of data Report back to namenode its lists of blocks periodically 24

HDFS Duplication Input data is copied into HDFS is split into blocks Each data blocks is replicated to multiple machines 25

HDFS MapReduce Data flow Source: (Tom White, 2009) 26

Hadoop filesystems 27 Source: (Tom White, 2009)

Development and Tools Hadoop operation modes Hadoop supports three modes of operation: Standalone Pseudo-distributed Fully-distributed More details: http://oreilly.com/other-programming/excerpts/hadooptdg/installing-apache-hadoop.html 28

Development and Tools Java example 29

Development and Tools Java example 30

Development and Tools Java example 31

Development and Tools Guidelines to get started The basic steps for running a Hadoop job are: Compile your job into a JAR file Copy input data into HDFS Execute hadoop passing the jar and relevant args Monitor tasks via Web interface (optional) Examine output when job is complete 32

Development and Tools Api, tools and training Do you want to use a scripting language? http://wiki.apache.org/hadoop/hadoopstreaming http://hadoop.apache.org/core/docs/current/streaming.html Eclipse plugin for MapReduce development http://wiki.apache.org/hadoop/eclipseplugin Hadoop training (videos, exercises, ) http://www.cloudera.com/developers/learn-hadoop/training/ 33

Bibliography Hadoop The definitive guide Tom White (2009). Hadoop The Definitive Guide. O'Reilly, San Francisco, 1st Edition Google Article Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. Available on: http://labs.google.com/papers/mapreduce-osdi04.pdf Hadoop In 45 Minutes or Less Tom Wheeler. Large-Scale Data Processing for Everyone. Available on: http://www.tomwheeler.com/publications/2009/lambda_lounge_hadoop_200910/twheelerhadoop-20091001-handouts.pdf Cloudera Videos and Training http://www.cloudera.com/resources/?type=training 34