Introduction to MapReduce and Hadoop
Jie Tao, Karlsruhe Institute of Technology
jie.tao@kit.edu
Why Map/Reduce?
- Massive data
  - cannot be stored on a single machine
  - takes too long to process serially
- Application developers
  - have rarely dealt with large amounts (petabytes) of data
  - have little experience with parallel programming
- Google proposed a simple programming model, MapReduce, to process large-scale data
  - the first MapReduce library was written at Google in 2003
- Hadoop: a MapReduce implementation for clusters
  - open source
  - widely adopted by large Web companies

J. Tao, MapReduce & Hadoop, 24.09.2010
Introduction to MapReduce
- A programming model for processing large-scale data
  - automatic parallelization
  - friendly to procedural languages
- Data processing is based on a map step and a reduce step
  - a map operation is applied to each logical record in the input to generate intermediate values
  - a reduce operation then merges all intermediate values that share the same key
- Sample use cases
  - Web data processing
  - data mining and machine learning
  - volume rendering
  - biological applications (DNA sequence alignment)
The MapReduce Execution Model
[Figure: input data partitions flow into map tasks; the sorted map output is passed to reduce tasks, which produce the final outputs]
The runtime is responsible for:
- splitting the input data
- sorting the output of the map functions
- passing the output of the map functions to the reduce functions
Simple Example: WordCount
Count the occurrences of each word in a text.

Input (text)    Map output              Shuffle (grouped)     Reduce output
See Bob run     See 1, Bob 1, run 1     See 1, See 1          See 2
See Jim come    See 1, Jim 1, come 1    Bob 1                 Bob 1
Jim leave       Jim 1, leave 1          run 1                 run 1
John come       John 1, come 1          Jim 1, Jim 1          Jim 2
                                        come 1, come 1        come 2
                                        leave 1               leave 1
                                        John 1                John 1
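The three columns above can be simulated end to end in plain Java. The sketch below is illustrative only and uses just the JDK, not the Hadoop API (the real Mapper and Reducer classes follow on the next slides); the class and method names are my own.

```java
import java.util.*;

// Plain-Java simulation of the WordCount data flow: map -> shuffle -> reduce.
public class WordCountSim {

    // Map step: emit a <word, 1> pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Shuffle step: group all intermediate values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce step: sum the values that share a key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    public static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) intermediate.addAll(map(line));
        return reduce(shuffle(intermediate));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList(
            "See Bob run", "See Jim come", "Jim leave", "John come")));
    }
}
```

Running it on the four input lines from the table reproduces the reduce column (See 2, Bob 1, run 1, Jim 2, come 2, leave 1, John 1).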
Programming WordCount
- Write map and reduce methods
- Input and output are <key, value> pairs
  - map: <k1, v1> -> <k2, v2>
  - reduce: <k2, list(v2)> -> <k3, v3>
  - (the input of WordCount has no meaningful key)
- Maps transform input records into intermediate records
- Reduces merge a set of intermediate values that share a key into a smaller set of values
- Implement the Mapper and Reducer interfaces
- MapReduce user interfaces
  - most important: Mapper, Reducer
  - other core interfaces: JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter
The Map Class

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
Job Configuration
- How much data does a map process? Set via the InputFormat in the job configuration:

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    ...
    JobClient.runJob(conf);
  }

- How many maps?
  - driven by the total size of the inputs and the number of processor nodes
  - suggestion: a map should take at least a minute to execute
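The "how many maps" rule can be made concrete: the framework creates roughly one map task per input split, and the split size typically equals the HDFS block size. A minimal plain-Java sketch of that arithmetic, assuming a fixed 64 MB split size rather than querying a real InputFormat:

```java
// Illustrative estimate of the number of map tasks for a job, assuming one
// map per input split with a fixed split size (the 64 MB HDFS block size).
// In reality the InputFormat decides the splits at job-submission time.
public class MapCountEstimate {
    static final long SPLIT_SIZE = 64L * 1024 * 1024; // 64 MB

    // ceil(totalInputBytes / SPLIT_SIZE); empty input needs no maps.
    public static long estimateMaps(long totalInputBytes) {
        if (totalInputBytes == 0) return 0;
        return (totalInputBytes + SPLIT_SIZE - 1) / SPLIT_SIZE;
    }

    public static void main(String[] args) {
        // 10 GB of input at 64 MB per split -> 160 map tasks
        System.out.println(estimateMaps(10L * 1024 * 1024 * 1024));
    }
}
```

With splits this size, each map gets enough work to satisfy the slide's "at least a minute per map" suggestion; many tiny splits would instead drown the job in per-task startup overhead.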
The Reduce Class
- All intermediate values associated with an output key are grouped and passed to the Reducer(s)

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

- Control which keys go to which Reducer by implementing the Partitioner interface
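When no custom Partitioner is supplied, Hadoop's default HashPartitioner routes a key to the reducer whose index is the key's non-negative hash modulo the number of reduce tasks, which is what guarantees that equal keys land on the same reducer. A plain-Java sketch of that logic (the class name here is my own, not Hadoop's):

```java
// Sketch of the routing logic used by Hadoop's default HashPartitioner:
// partition = (hash of key, forced non-negative) mod numReduceTasks.
// Equal keys always produce the same index, so they meet in one reducer.
public class HashPartitionDemo {
    public static int getPartition(String key, int numReduceTasks) {
        // '& Integer.MAX_VALUE' clears the sign bit so the modulo is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"See", "Bob", "run", "Jim"})
            System.out.println(word + " -> reducer " + getPartition(word, 4));
    }
}
```

A custom Partitioner uses the same contract (key and task count in, partition index out) but can encode application knowledge, e.g. range-partitioning sorted keys.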
Run WordCount on Hadoop
- Create the executable:
  $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf /user/xxx/wordcount.jar -C wordcount_classes/ .
- Sample input files:
  $ bin/hadoop dfs -ls /user/xxx/wordcount/input/
  /user/xxx/wordcount/input/file01
  /user/xxx/wordcount/input/file02
- Run the application:
  $ bin/hadoop jar /user/xxx/wordcount.jar WordCount /user/xxx/wordcount/input /user/xxx/wordcount/output
- Output:
  $ bin/hadoop dfs -cat /user/xxx/wordcount/output/part-00000
  Bob 1
  come 2
  Jim 2
Hadoop MapReduce
- The data to process does not fit on one node, so move the computation to the data
- Distributed filesystem with built-in replication and automatic failover in case of failure
- Apache Hadoop provides
  - automatic parallelization and distribution
  - fault tolerance
  - monitoring and status reporting
  - a clean abstraction interface for developers
- Based on the Hadoop Distributed File System (HDFS)
Hadoop MapReduce Architecture
[Figure: a client computer submits a MapReduce job to the JobTracker on the master node; each slave node runs a TaskTracker]
- A single master runs a JobTracker instance
- Multiple cluster nodes each run a TaskTracker instance
- The JobTracker distributes map and reduce tasks to the TaskTrackers
Hadoop MapReduce Data Flow
[Figure: Hadoop MapReduce data flow]
Hadoop Distributed File System
- A scalable, fault-tolerant distributed file system able to run on a cluster of commodity hardware
- A master/slave architecture
  - a single NameNode
  - multiple DataNodes
- Files are broken into 64 MB or 128 MB blocks
  - blocks are replicated across several DataNodes to handle hardware failure (the default replication factor is 3)
- A single file system namespace for the whole cluster
  - the NameNode holds the file system metadata (file names, block locations, replication factor, etc.)
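The block and replication figures above determine how much physical storage a file consumes: a file of S bytes occupies ceil(S / blockSize) blocks, and with replication factor 3 the cluster holds three copies of each block. A small plain-Java sketch of that arithmetic (illustrative only, not part of the HDFS API):

```java
// Illustrative arithmetic for HDFS storage: a file is chopped into fixed-size
// blocks, and every block is stored 'replication' times across DataNodes.
public class HdfsBlockMath {
    // ceil(fileBytes / blockBytes): number of blocks the NameNode tracks
    public static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // total physical block copies stored in the cluster
    public static long totalReplicas(long fileBytes, long blockBytes, int replication) {
        return numBlocks(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // a 200 MB file with 64 MB blocks: 4 blocks, 12 copies at replication 3
        System.out.println(numBlocks(200 * mb, 64 * mb));
        System.out.println(totalReplicas(200 * mb, 64 * mb, 3));
    }
}
```

Note the last block is usually partial (here 200 MB fills three 64 MB blocks plus an 8 MB remainder), which is why large blocks suit HDFS's few-large-files workload better than many small files.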
Hadoop Distributed File System (cont.)
- Data operations via the FS shell, the command line interface:
  $ bin/hadoop dfs -mkdir dirname
  $ bin/hadoop dfs -rmr dirname
  $ bin/hadoop dfs -ls dirname
  $ bin/hadoop dfs -cat filename
- Example: copy the input of WordCount to HDFS
  $ bin/hadoop dfs -put file01 file02 /user/xxx/wordcount/input/
  $ bin/hadoop dfs -ls /user/xxx/wordcount/input/
  Found 2 items
  /user/xxx/wordcount/input/file01
  /user/xxx/wordcount/input/file02
  $ bin/hadoop jar ... /user/xxx/wordcount/input/ ...
Hadoop on Distributed Clusters
- Motivation
  - Hadoop is limited to a single cluster
  - existing distributed data centers already own solid cluster scheduling frameworks
  - large applications demand more computational capacity
- Goal: G-Hadoop, running MapReduce on multiple clusters
  - an ongoing work at SCC/KIT, in collaboration with Indiana University
- Tasks
  - replacing HDFS with Gfarm, a Grid Datafarm
  - integration with the batch systems
g-hadoop vs. Hadoop: Hadoop Architecture View
[Figure: a single Hadoop cluster; one Hadoop MapReduce master and one HDFS master serve many slave nodes, each running a MapReduce slave and an HDFS slave]
g-hadoop vs. Hadoop: g-hadoop Architecture
[Figure: a G-Hadoop master (Hadoop MapReduce master plus a Hadoop Gfarm plugin connected to the GfarmFS master) reaches multiple G-Hadoop enabled clusters via single sign-on; in each cluster a Hadoop TORQUE plugin talks to the TORQUE master, and the slave nodes run Hadoop MapReduce slaves with Gfarm plugins connected to GfarmFS slaves]
Conclusion
- MapReduce is a framework for distributed computing on large data sets
- Hadoop is a widely accepted and used implementation of the MapReduce framework, but
  - it has no load-balancing mechanisms
  - its fault tolerance is based on a simple scheme
- Other implementations of MapReduce on clusters
  - Twister: iterative MapReduce
  - DryadLINQ: a MapReduce framework
- MapReduce on other architectures
  - Mars: MapReduce on NVIDIA graphics processors
  - Phoenix: MapReduce for shared memory
Links
- Hadoop (also MapReduce and HDFS): http://hadoop.apache.org/
- Twister iterative MapReduce: http://www.iterativemapreduce.org/
- 1st MapReduce Workshop (at HPDC 2010): http://graal.ens-lyon.fr/mapreduce/
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters": http://labs.google.com/papers/mapreduce-osdi04.pdf