Introduction to the MapReduce Paradigm and Apache Hadoop
Sriram Krishnan, sriram@sdsc.edu

Programming Model
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The MapReduce library expresses the computation as two functions: Map and Reduce. All data resides in files, e.g. in the Google File System (GFS).

Function Prototypes

map(k1, v1) → list(k2, v2)
The map function takes a key/value pair and generates a list of new key/value pairs.

reduce(k2, list(v2)) → list(v2)
The reduce function takes a key/list pair and generates a list of resulting values.
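
To make these prototypes concrete before introducing Hadoop, here is a minimal, framework-free Java sketch (not part of the original slides; the class and method names are illustrative). It wires a map function and a reduce function together with the group-by-key step that the MapReduce library performs between them:

import java.util.*;

// Illustrative, framework-free sketch of the two prototypes above:
//   map:    (k1, v1)       -> list(k2, v2)
//   reduce: (k2, list(v2)) -> list(v2)
public class MapReduceSketch {

    // Here (k1, v1) = (document name, contents) and (k2, v2) = (word, "1").
    static List<Map.Entry<String, String>> map(String docName, String contents) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(w, "1"));
        }
        return out;
    }

    // Here (k2, list(v2)) = (word, list of counts), reduced to a one-element list.
    static List<String> reduce(String word, List<String> counts) {
        int sum = 0;
        for (String c : counts) {
            sum += Integer.parseInt(c);
        }
        return Collections.singletonList(Integer.toString(sum));
    }

    public static void main(String[] args) {
        // The library's job: group intermediate pairs by key (the shuffle),
        // then hand each (key, list of values) group to reduce.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : map("doc1", "Hello World Bye World")) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
            System.out.println(g.getKey() + " -> " + reduce(g.getKey(), g.getValue()));
        }
    }
}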

Example: Word Count

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits a word and an associated count. The reduce function sums together all counts emitted for a particular word.

Example: Word Count

Input File 1: Hello World Bye World
Input File 2: Hello Hadoop Goodbye Hadoop

1. Map Phase
First Map: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Second Map: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

2. Combiner Phase
First Combiner: <Bye, 1> <Hello, 1> <World, 2>
Second Combiner: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

3. Reduce Phase
Reducer: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

Implementation Details
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function may be specified by the user.
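
In Hadoop's classic (org.apache.hadoop.mapred) API, which the code later in these slides uses, a user-specified partitioning function is expressed by implementing the Partitioner interface. The following is a rough sketch of a hash(key) mod R partitioner for the word-count intermediate types; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative sketch: routes each intermediate key to one of the R reduce
// partitions using hash(key) mod R, as described above.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No per-job configuration is needed for this simple partitioner.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the modulo result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the job setup shown later with conf.setPartitionerClass(WordPartitioner.class), with R controlled by conf.setNumReduceTasks(R); if no partitioner is specified, Hadoop's built-in HashPartitioner applies essentially the same hash(key) mod R rule.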

Execution Overview

Advantages
Scalable and conducive to data-intensive, data-parallel applications. Fault tolerant by design: workers can be restarted on failures. Able to run on non-specialized, commodity hardware.

Common Complaints
Need to write code to get an application to conform to the MapReduce programming model; no way to script queries at run time; no higher-level, SQL-like abstraction, so complicated SQL-type queries are hard to write. Too simplistic: the onus of optimization falls on the programmer, not the database engine.

Apache Hadoop
Hadoop provides an open-source implementation of MapReduce. It uses the Hadoop Distributed File System (HDFS), which is a GFS clone, and has been demonstrated on clusters with 2000 nodes.

HDFS
A distributed file system designed to run on commodity hardware, with many similarities to existing distributed file systems. It is highly fault tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
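
Applications typically reach HDFS either through the hadoop fs command-line shell or through the Java FileSystem API. The following is a minimal sketch of the Java API; the path and file contents are illustrative, and the file system actually used is determined by the cluster configuration:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of writing and reading a file through the Hadoop FileSystem API.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up the cluster's fs settings
    FileSystem fs = FileSystem.get(conf);      // HDFS if so configured, local FS otherwise

    Path file = new Path("/user/demo/hello.txt");  // illustrative path

    // Write a small file.
    FSDataOutputStream out = fs.create(file, true);
    out.writeBytes("Hello HDFS\n");
    out.close();

    // Stream it back, line by line.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}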

HDFS

Hadoop: Modes of Operation
Standalone: by default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Mostly useful for debugging.
Pseudo-distributed: Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Fully distributed: typically involves unpacking the software on all the machines in the cluster. One machine in the cluster is designated as the NameNode and another, exclusively, as the JobTracker; these are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves.
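
Which of these modes a job runs in is decided by configuration rather than code. As a rough sketch, assuming the property names of the Hadoop 0.x/1.x (JobTracker-era) releases that match the API used on the next slides, and with illustrative host names and ports:

import org.apache.hadoop.mapred.JobConf;

// Illustrative sketch: the same job pointed at different "clusters" purely via
// configuration. Property names assume Hadoop 0.x/1.x; hosts/ports are examples.
public class ModeConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ModeConfigSketch.class);

    // Standalone (the default): local file system, in-process local job runner.
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");

    // Pseudo-distributed: HDFS plus a JobTracker, with all daemons on one node.
    // conf.set("fs.default.name", "hdfs://localhost:9000");
    // conf.set("mapred.job.tracker", "localhost:9001");

    // Fully distributed: the same two properties point at the NameNode and
    // JobTracker masters, normally set cluster-wide in the configuration files.
  }
}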

Hadoop Code: Word Count

// Imports (omitted on the original slide):
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // The Map function: tokenizes each input line and emits <word, 1> pairs.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // The Reduce function: sums the counts emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

Hadoop Code: Contd.

  // Job setup and launch.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Types of the output <key, value> pairs.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Mapper, combiner, and reducer; Reduce doubles as the combiner.
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    // Plain-text input and output.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output paths come from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
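
Once compiled and packaged into a JAR (say wordcount.jar, an illustrative name), the job would typically be launched with Hadoop's jar runner, along the lines of: hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the two arguments are paths in HDFS (or in the local file system in standalone mode). Note that the Reduce class also serves as the combiner: because summing counts is associative and commutative, partial sums computed on the map side do not change the final result.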

References
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In Proceedings of OSDI 2004.
The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In Proceedings of SOSP 2003.
Apache Hadoop: http://hadoop.apache.org/
HDFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html