MapReduce Tushar B. Kute, http://tusharkute.com

What is MapReduce? MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware, in a reliable manner. MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.

Map and Reduce Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

Map and Reduce The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

The Algorithm A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage. Map stage: The map or mapper's job is to process the input data. Generally the input data is a file or directory stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
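As a concrete illustration (a simple word-count job, not part of the original deck): given the two input lines "deer bear river" and "river deer deer", the map stage emits (deer, 1), (bear, 1), (river, 1) and (river, 1), (deer, 1), (deer, 1); the shuffle stage groups these pairs by key into (bear, [1]), (deer, [1, 1, 1]) and (river, [1, 1]); and the reduce stage sums each list, producing (bear, 1), (deer, 3), (river, 2).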

The MapReduce

Inputs and Outputs The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
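For instance, assuming the default TextInputFormat and the word-count job above (an illustrative assumption, not from the deck), the types instantiate as: (Input) <LongWritable, Text>, the byte offset and the text of one line -> map -> <Text, IntWritable>, a word and the count 1 -> reduce -> <Text, IntWritable>, a word and its total count (Output). Built-in types such as Text, IntWritable, and LongWritable implement WritableComparable, which is why they can serve as keys as well as values.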


Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data resides before any processing takes place.

Terminology
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
TaskTracker - Tracks the task and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Example:

Example: ProcessUnits.java
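The deck shows the source only as a slide image, so below is a minimal reconstruction sketch, not the author's exact code. It assumes the classic org.apache.hadoop.mapred API that ships with hadoop-core-1.2.1 (the jar used in the compilation steps that follow), and it assumes each input line holds a year followed by that year's monthly electricity-consumption readings; the inner class names are illustrative choices.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ProcessUnits {
    // Mapper: split each line into a year and its monthly readings,
    // and emit one (year, units) pair per reading.
    public static class EUnitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String[] fields = value.toString().trim().split("\\s+");
            Text year = new Text(fields[0]);
            for (int i = 1; i < fields.length; i++) {
                output.collect(year, new IntWritable(Integer.parseInt(fields[i])));
            }
        }
    }

    // Reducer: keep the maximum units value seen for each year.
    public static class EUnitReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int max = Integer.MIN_VALUE;
            while (values.hasNext()) {
                max = Math.max(max, values.next().get());
            }
            output.collect(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ProcessUnits.class);
        conf.setJobName("eleunit_max");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(EUnitMapper.class);
        conf.setReducerClass(EUnitReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}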

Compilation and Execution Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop). Follow the steps given below to compile and execute the above program. Step 1 The following command creates a directory to store the compiled Java classes. $ mkdir units

Compilation and Execution Step 2 Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume it is downloaded to /home/hadoop/. Step 3 The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program. $ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java $ jar -cvf units.jar -C units/ .

Compilation and Execution Step 4 The following command is used to create an input directory in HDFS. $HADOOP_HOME/bin/hadoop fs -mkdir input_dir Step 5 The following command is used to copy the input file named sample.txt into the input directory of HDFS. $HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir/ Step 6 The following command is used to verify the files in the input directory. $HADOOP_HOME/bin/hadoop fs -ls input_dir/

Compilation and Execution Step 7 The following command is used to run the Eleunit_max application, taking the input files from the input directory. $HADOOP_HOME/bin/hadoop jar units.jar ProcessUnits input_dir/ output_dir/ Wait a while until the job finishes. After execution, the output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

Compilation and Execution Step 8 The following command is used to verify the resultant files in the output folder. $HADOOP_HOME/bin/hadoop fs -ls output_dir/ Step 9 The following command is used to see the output in the part-00000 file. This file is generated by the MapReduce program. $HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000 Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45

Compilation and Execution Step 10 The following command is used to copy the output folder from HDFS to the local file system for analysis. $HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Commands All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands. Usage: hadoop [--config confdir] COMMAND

Commands
namenode -format - Formats the DFS filesystem.
secondarynamenode - Runs the DFS secondary namenode.
namenode - Runs the DFS namenode.

Commands
datanode - Runs a DFS datanode.
dfsadmin - Runs a DFS admin client.
mradmin - Runs a Map-Reduce admin client.
fsck - Runs a DFS filesystem checking utility.
fs - Runs a generic filesystem user client.
balancer - Runs a cluster balancing utility.

Commands
jobtracker - Runs the MapReduce JobTracker node.
tasktracker - Runs a MapReduce TaskTracker node.
job - Manipulates MapReduce jobs.
queue - Gets information regarding JobQueues.
version - Prints the version.
jar <jar> - Runs a jar file.

Thank you This presentation was created using LibreOffice Impress 4.2.7.2 and can be used freely as per the GNU General Public License.
Web Resources: http://mitu.co.in, http://tusharkute.com
Blogs: http://digitallocha.blogspot.in, http://kyamputar.blogspot.in
tushar@tusharkute.com