Developing a MapReduce Application

Oswin Singleton
1 TIE Apache Hadoop Tampere University of Technology, Finland November, 2014

2 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

4 MapReduce is a software framework for processing (large) data sets in a distributed fashion over several machines. Core idea: <key, value> pairs. Almost all data can be mapped into <key, value> pairs, and keys and values may be of any type.
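
As a small illustration of the key/value idea, the sketch below turns a few text records into (key, value) tuples. The record layout (a sensor name and a reading separated by a comma) and the function name `to_pair` are assumptions made for this example; they do not come from the slides.

```python
from typing import Tuple

def to_pair(line: str) -> Tuple[str, float]:
    """Split a hypothetical 'name,reading' record into a (key, value) pair."""
    name, reading = line.split(",", 1)
    return name.strip(), float(reading)

# Made-up sample records; any line-oriented data could be treated the same way.
records = ["s1, 20.5", "s2, 18.0", "s1, 22.1"]
pairs = [to_pair(r) for r in records]
print(pairs)
```

Here the key is a string and the value a float, but nothing in the scheme depends on those particular types.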

8 Write your map and reduce functions. Test them with a small subset of the data; if the test fails, use your IDE's debugger to find the problem. Then run on the full dataset; if that fails, Hadoop provides some debugging tools, e.g. IsolationRunner, which reruns a failed task over the same input on which it failed. Finally, do profiling to tune the performance.
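
The "test with a small subset" step can be sketched without Hadoop at all. The driver below runs a mapper and a reducer in a single process, which makes it easy to step through with an IDE debugger. The names `run_local_job`, `mapper`, and `reducer`, the toy job (longest line per first character), and the sample data are all hypothetical; this is not a Hadoop API.

```python
from collections import defaultdict
from itertools import islice

def run_local_job(records, mapper, reducer):
    """Run map, group-by-key, and reduce in one process (no Hadoop involved)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort keys so the output order is deterministic.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Hypothetical job: longest line length per first character.
def mapper(line):
    if line:
        yield line[0], len(line)

def reducer(key, values):
    return max(values)

# Stand-in for a large input file.
full_input = ["apple", "avocado", "banana", "blueberry", "cherry"]
# Take only the first few records for the initial test run.
sample = list(islice(full_input, 3))
result = run_local_job(sample, mapper, reducer)
print(result)
```

Once the functions behave correctly on the sample, the same logic can be moved into the real job and run on the full dataset.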

16 Hadoop Default Ports: Hadoop uses a handful of ports over TCP. Some are used by Hadoop itself (to schedule jobs, replicate blocks, etc.). Others are directly for users (either via an interposed Java client or via plain old HTTP).

18 Task: Count the word occurrences (frequencies) in a text file (or set of files), using <word, count> as the <key, value> pair. Mapper: emits <word, 1> for each word (no counting at this stage). Shuffle (in between): pairs with the same key are grouped together and passed to a single machine. Reducer: sums up the values (the 1s) that share the same key.
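
The three stages described above can be imitated in plain Python. This is a single-process sketch of the data flow, not Hadoop code; the function names, the sample lines, and the choice to lowercase words before counting are assumptions made for the example.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit (word, 1) for every word; no counting happens here."""
    for word in line.lower().split():  # lowercasing is an assumption of this sketch
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(word, ones):
    """Reducer: sum the 1s collected for one word."""
    return word, sum(ones)

lines = ["the quick brown fox", "the lazy dog", "The fox"]
mapped = [pair for line in lines for pair in map_words(line)]
counts = dict(reduce_counts(word, ones) for word, ones in shuffle(mapped).items())
print(counts)
```

In a real cluster the grouped keys would be spread across several reducer machines; here a single dictionary plays that role.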

22 Tasks

23 Name Node

25 Test the mapper and reducer outside Hadoop. Copy your MapReduce functions and input files to the DFS. Test the mapper and reducer with Hadoop using a small portion of the data. Track the jobs, debug, and do profiling.
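
One way to carry out the first step, testing outside Hadoop, is to treat the mapper and reducer as plain text filters and chain them with a sort in between. The sketch assumes a Hadoop Streaming style interface (tab-separated key and value on each line), which the slides do not specify; the function names and sample input are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' lines, as a streaming-style mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts for consecutive lines that share a key; input must be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

sample = ["to be or not", "to be"]
# sorted() stands in for the sort that happens between the map and reduce phases.
output = list(reducer(sorted(mapper(sample))))
print(output)
```

If the two functions were wrapped in scripts that read standard input, the same check could be done on the command line by piping a small file through the mapper, a sort, and the reducer.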

29 Questions/Comments
