Processing Data with Map Reduce
Allahbaksh Mohammedali Asadullah
Infosys Labs, Infosys Technologies

Content
- Map Function
- Reduce Function
- Why Hadoop
- HDFS
- Map Reduce
- Hadoop
- Some Questions

What is a Map Function?
Map is a classic primitive of functional programming. Map applies a function to each element of a list and returns the modified list.

    function List Map(Function func, List elements) {
        List newelement;
        foreach element in elements {
            newelement.put(apply(func, element))
        }
        return newelement
    }

Example Map Function

    function double increasesalary(double salary) {
        return salary * (1 + 0.15);
    }

    function List<Employee> Map(Function increasesalary, List<Employee> employees) {
        List<Employee> newlist;
        foreach employee in employees {
            // work on a copy so the input list is left untouched
            Employee tempemployee = new Employee(employee);
            tempemployee.income = increasesalary(tempemployee.income);
            newlist.add(tempemployee);
        }
        return newlist;
    }
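The same idea can be sketched in runnable Java. This is only an illustration: the class name MapExample and the generic helper map are made up for this example, and the 15% raise is the one used on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class MapExample {
    // Applies func to every element and returns a new list; the input list is not modified.
    static <A, B> List<B> map(Function<A, B> func, List<A> elements) {
        return elements.stream().map(func).collect(Collectors.toList());
    }

    // The 15% raise from the slide.
    static double increaseSalary(double salary) {
        return salary * (1 + 0.15);
    }

    public static void main(String[] args) {
        List<Double> salaries = Arrays.asList(1000.0, 2000.0);
        List<Double> raised = map(MapExample::increaseSalary, salaries);
        System.out.println(raised);
    }
}
```

Because map never touches shared state, each element could just as well be processed on a different machine, which is the property Hadoop relies on.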

Fold or Reduce Function
Fold/Reduce reduces a list of values to one: it applies a function to a list of elements and returns a single resulting element.

    function Element Reduce(Function func, List elements) {
        Element earlierresult;
        foreach element in elements {
            earlierresult = func(element, earlierresult)
        }
        return earlierresult;
    }

Example Reduce Function

    function double add(double number1, double number2) {
        return number1 + number2;
    }

    function double Reduce(Function add, List<Employee> employees) {
        double totalamount = 0.0;
        foreach employee in employees {
            totalamount = add(totalamount, employee.income);
        }
        return totalamount;
    }
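A minimal Java sketch of the fold, under the assumption that the caller supplies the starting value explicitly. ReduceExample and reduce are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class ReduceExample {
    // Folds the list into one value, carrying the running result from element to element.
    static <T> T reduce(BinaryOperator<T> func, T initial, List<T> elements) {
        T result = initial;
        for (T element : elements) {
            result = func.apply(result, element);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Double> incomes = Arrays.asList(1000.0, 2500.0, 4000.0);
        double total = reduce(Double::sum, 0.0, incomes);
        System.out.println(total); // prints 7500.0
    }
}
```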

I know Map and Reduce. How do I use them?
I will use a library or framework.

Why some framework?
- Lazy to write boilerplate code
- For modularity
- Code reusability

What is the best choice?

Why Hadoop?

Programming Language Support
C++

Who uses it?

Strong Community
Image courtesy: http://goo.gl/15nu3

Commercial Support

Hadoop

Hadoop HDFS

Hadoop Distributed File System
- Large distributed file system on commodity hardware
- 4k nodes, thousands of files, petabytes of data
- Files are replicated so that hard disk failure can be handled easily
- One NameNode and many DataNodes

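Hadoop Distributed File System
[Figure: HDFS architecture. The client sends metadata operations to the NameNode, which holds the metadata (name, replicas, ...). The client reads and writes blocks on the DataNodes, which receive block operations and replicate blocks across Rack 1 and Rack 2.]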

NameNode
- Meta-data in RAM: the entire metadata is kept in main memory
- Metadata consists of
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time
- Transaction log
- NameNode uses heartbeats to detect DataNode failure

DataNode
- Stores the data in the file system
- Stores the meta-data of a block
- Serves data and meta-data to clients
- Pipelining of data, i.e. forwards data to other specified DataNodes
- DataNodes send a heartbeat to the NameNode every three seconds

HDFS Commands
Accessing HDFS:

    hadoop dfs -mkdir mydirectory
    hadoop dfs -cat myfirstfile.txt

Web interface: http://host:port/dfshealth.jsp

Hadoop MapReduce

Map Reduce Diagrammatically
[Figure: input files are divided into input splits (Input Split 0 to Input Split 5); the splits feed the Mappers, and the Reducers write the output files.]
The intermediate file is divided into R partitions by the partitioning function.
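The data flow can be imitated in memory with plain Java collections. This is a single-process sketch, not Hadoop code: WordCountSketch and its methods are hypothetical names, and the shuffle step is simply a sorted map that groups values by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: emit one (word, 1) pair per token of the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all values that share a key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```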

Input Format
InputFormat describes the input specification for a MapReduce job, that is, how the data is to be read from the file system.
- Splits the input file into logical InputSplits, each of which is assigned to a Mapper.
- Provides the RecordReader implementation used to collect input records from the logical InputSplit for processing by the Mapper.
- The RecordReader typically converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for the Mapper and Reducer tasks. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
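How a file might be cut into logical splits can be sketched as follows. This is a deliberate simplification: computeSplits and SplitSketch are made-up names, the split size is passed in by hand, and Hadoop's own FileInputFormat applies further rules (block locations, minimum and maximum split sizes) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Returns {offset, length} pairs that together cover the byte range [0, fileLength).
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte file with 128-byte splits yields offsets 0, 128 and 256.
        for (long[] split : computeSplits(300, 128)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

A split boundary can fall in the middle of a record, which is why the RecordReader, not the split, is responsible for finding record boundaries.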

Creating Your Mapper
- The mapper should implement org.apache.hadoop.mapred.Mapper. The newer API instead has you extend the class org.apache.hadoop.mapreduce.Mapper.
- Extend the org.apache.hadoop.mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
- The main method is map(WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter).
- One instance of your Mapper is initialized per task. It exists in a separate process from all other Mapper instances, so there is no data sharing; static variables will be different for different map tasks.
- Writable: Hadoop defines an interface called Writable for serializable types. Examples: IntWritable, LongWritable, Text.
- WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
- InverseMapper swaps the key and value.

Combiner
- Combiners reduce the number of key/value pairs that are shuffled across the network between mappers and reducers.
- A combiner is a sort of mini reducer that may be applied several times during the map phase, before the new set of key/value pairs is sent to the reducer(s).
- Combiners should be used when the function you want to apply is both commutative and associative.
- Example: WordCount and mean value computation
- Reference: http://goo.gl/iu5kr
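Mean is the subtler of the two examples, because averaging per-mapper averages gives the wrong result. One common way around this, sketched here under the assumption that the combiner emits a (sum, count) pair rather than a finished mean, is shown below. SumCount and MeanCombinerSketch are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class MeanCombinerSketch {
    // Partial aggregate; merging is commutative and associative, so it is safe in a combiner.
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double mean() {
            return sum / count;
        }
    }

    // Combiner step: collapse the values seen by one mapper into a single partial aggregate.
    static SumCount combine(List<Double> values) {
        SumCount acc = new SumCount(0.0, 0);
        for (double value : values) {
            acc = acc.merge(new SumCount(value, 1));
        }
        return acc;
    }

    public static void main(String[] args) {
        SumCount mapper1 = combine(Arrays.asList(1.0, 2.0, 3.0));
        SumCount mapper2 = combine(Arrays.asList(10.0));
        // Reducer step: merge the partials, then divide once at the very end.
        System.out.println(mapper1.merge(mapper2).mean()); // prints 4.0
        // Averaging the two per-mapper means instead would give (2.0 + 10.0) / 2 = 6.0.
    }
}
```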

Partitioner
- The Partitioner controls the partitioning of the keys of the intermediate map outputs.
- The key (or a subset of the key) is used to derive the partition, typically by a hash function.
- The total number of partitions is the same as the number of reduce tasks for the job.
- Some Partitioners are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner and TotalOrderPartitioner.
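Hash partitioning can be sketched in a few lines of plain Java. The formula below is, to my understanding, the one Hadoop's HashPartitioner uses, but treat the class HashPartitionSketch as an illustration rather than the library code.

```java
class HashPartitionSketch {
    // Clears the sign bit so that negative hash codes still map into [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop", "hdfs", "map", "reduce"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```

Equal keys always produce the same partition number, which is what guarantees that all values for one key meet in the same reduce task.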

Creating Your Reducer
- The reducer should implement org.apache.hadoop.mapred.Reducer. The newer API instead has you extend the class org.apache.hadoop.mapreduce.Reducer.
- Extend the org.apache.hadoop.mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
- The main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter).
- All keys and values sent to one partition go to the same reduce task.
- Iterator.next() always returns the same object, filled with different data.
- HashPartitioner partitions the keys based on their hash code.
- IdentityReducer is the default implementation of the Reducer.
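The remark about Iterator.next() returning the same object is easy to trip over, so here is a plain-Java imitation of the effect. Holder is a hypothetical stand-in for a mutable Writable such as Text, and reusingIterator mimics the framework's behaviour; neither is Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReuseSketch {
    // Hypothetical stand-in for a mutable Writable.
    static final class Holder {
        String value;

        Holder copy() {
            Holder h = new Holder();
            h.value = value;
            return h;
        }
    }

    // Refills and returns one shared Holder on every next() call.
    static Iterator<Holder> reusingIterator(List<String> data) {
        final Holder shared = new Holder();
        final Iterator<String> source = data.iterator();
        return new Iterator<Holder>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public Holder next() {
                shared.value = source.next();
                return shared;
            }
        };
    }

    // Keeps every value seen, either as a defensive copy or as the raw reference.
    static List<String> collect(List<String> data, boolean copy) {
        List<Holder> kept = new ArrayList<>();
        Iterator<Holder> values = reusingIterator(data);
        while (values.hasNext()) {
            Holder h = values.next();
            kept.add(copy ? h.copy() : h);
        }
        List<String> out = new ArrayList<>();
        for (Holder h : kept) {
            out.add(h.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println(collect(data, false)); // prints [c, c, c]
        System.out.println(collect(data, true));  // prints [a, b, c]
    }
}
```

The practical rule is to copy a value before storing it if it must outlive the current loop iteration.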

Output Format
OutputFormat is similar to InputFormat. The different types of output format include:
- TextOutputFormat
- SequenceFileOutputFormat
- NullOutputFormat

Mechanics of the whole process
- Configure the input and output
- Configure the Mapper and Reducer
- Specify other parameters, such as the number of map tasks and the number of reduce tasks
- Submit the job through the JobClient

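Example

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Map and Reduce are the job's own mapper and reducer classes
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // runJob blocks until the job finishes; JobClient.submitJob returns immediately
    JobClient.runJob(conf);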

Job Tracker & Task Tracker
[Figure: the master node runs the JobTracker; each slave node runs a TaskTracker, which in turn runs several tasks.]

Job Launch Process
- JobClient determines the proper division of the input into InputSplits.
- Sends the job data to the master JobTracker server.
- Saves the jar and JobConf (serialized to XML) in a shared location and posts the job into a queue.

Job Launch Process (contd.)
- TaskTrackers running on slave nodes periodically query the JobTracker for work.
- Get the job jar from the master node to the data node.
- Launch the main class in a separate JVM: TaskTracker.Child.main()

Small File Problem
What should I do if I have lots of small files? The one-word answer is SequenceFile.
SequenceFile layout: a sequence of key/value records, where the key is the file name and the value is the file content.
- Tar to SequenceFile: http://goo.gl/mkgc7
- Consolidator: http://goo.gl/evvi7

Problem of Large File
What if I have a single big file of 20 GB? The one-word answer is that there is no problem with large files.

SQL Data
What is the way to access SQL data? The one-word answer is DBInputFormat.
DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
Database Access with Hadoop: http://goo.gl/cnobc

    JobConf conf = new JobConf(getConf(), MyDriver.class);
    conf.setInputFormat(DBInputFormat.class);

    // JDBC driver class and connection URL
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:port/dbname");

    // Read the listed fields of the employees table, ordered by employee_id
    String[] fields = { "employee_id", "name" };
    DBInputFormat.setInput(conf, MyRow.class, "employees",
        null /* conditions */, "employee_id", fields);

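    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class MyRow implements Writable, DBWritable {
        private int employeeNumber;
        private String employeeName;

        // Writable: serialize the fields (writeUTF pairs with readUTF below)
        public void write(DataOutput out) throws IOException {
            out.writeInt(employeeNumber);
            out.writeUTF(employeeName);
        }

        // Writable: deserialize the fields in the same order
        public void readFields(DataInput in) throws IOException {
            employeeNumber = in.readInt();
            employeeName = in.readUTF();
        }

        // DBWritable: bind the fields to an SQL statement
        public void write(PreparedStatement statement) throws SQLException {
            statement.setInt(1, employeeNumber);
            statement.setString(2, employeeName);
        }

        // DBWritable: populate the fields from one result row
        public void readFields(ResultSet resultSet) throws SQLException {
            employeeNumber = resultSet.getInt(1);
            employeeName = resultSet.getString(2);
        }
    }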
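The write and readFields pair must agree on the encoding of every field, for instance writeUTF with readUTF. A round trip through an in-memory byte stream is a cheap way to check this. The sketch below uses only java.io; Row and RowRoundTrip are hypothetical names and do not implement the Hadoop interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RowRoundTrip {
    // Stand-in for a Writable row: the same two methods, without the Hadoop interface.
    static final class Row {
        int employeeNumber;
        String employeeName;

        void write(DataOutput out) throws IOException {
            out.writeInt(employeeNumber);
            out.writeUTF(employeeName);
        }

        void readFields(DataInput in) throws IOException {
            employeeNumber = in.readInt();
            employeeName = in.readUTF();
        }
    }

    // Serializes the row to bytes and reads it back into a fresh instance.
    static Row roundTrip(Row row) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            row.write(new DataOutputStream(bytes));
            Row copy = new Row();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Row row = new Row();
        row.employeeNumber = 7;
        row.employeeName = "Asha";
        Row copy = roundTrip(row);
        System.out.println(copy.employeeNumber + " " + copy.employeeName); // prints 7 Asha
    }
}
```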

Question & Answer

Thank You