MapReduce Job Processing

Clare Thompson
8 years ago
1 April 17, 2012

2 Background: Hadoop Distributed File System (HDFS)
Hadoop requires a distributed file system (DFS); we use the Hadoop Distributed File System (HDFS).
[Diagram: NameNode; R (Record ID, User ID, Object ID); R's DataChunks (Splits): 64MB, 64MB, 64MB, 64MB; DataNodes]
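As a rough illustration of the 64MB chunks in the diagram, the sketch below cuts a file of a given size into fixed-size chunks. The function name `split_into_chunks` and the 256MB example size are assumptions made for illustration; the slides do not say how large R is or how HDFS places the chunks.

```python
MB = 1024 * 1024


def split_into_chunks(file_size, chunk_size=64 * MB):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


# Hypothetical 256MB file: yields four 64MB chunks, as drawn on the slide.
chunks = split_into_chunks(256 * MB)
print(len(chunks))
```

A file whose size is not a multiple of the chunk size simply ends with one shorter chunk.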

6 Background: Hadoop Core
Hadoop Core consists of one master JobTracker and several TaskTrackers. We assume one TaskTracker per physical machine.
[Diagram: JobTracker (Job Scheduling, Task Scheduling, Reduce Task Scheduling); TaskTrackers running Mappers and Reducers]

11 Background: Hadoop Cluster
In a Hadoop cluster, one machine typically runs both the NameNode and JobTracker tasks and is called the master. The other machines run DataNode and TaskTracker tasks and are called slaves.
[Diagram: Master = NameNode + JobTracker; Slaves = DataNodes + TaskTrackers]

15 Background: MapReduce Job Overview
Next we look at an overview of a typical MapReduce job.
[Diagram: Job Configuration; Distributed Cache; split 1, split 2, split 3, split 4; Map Phase]

16 Background: MapReduce Job Overview
Job-specific variables are first placed in the Job Configuration, which is sent to each task by the JobTracker.

17 Background: MapReduce Job Overview
Large data such as files or libraries are then put in the Distributed Cache, which is copied to each TaskTracker by the JobTracker.

18 Background: MapReduce Job Overview
The JobTracker next assigns each InputSplit to a Mapper task on a TaskTracker; we assume m Mappers and m InputSplits.
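The one-Mapper-per-InputSplit assumption can be sketched as follows. The round-robin placement and the names `assign_splits`, `tt1`, and `tt2` are assumptions made for illustration only; the slides do not describe the placement policy, and a real scheduler would typically also take into account where each split's data is stored.

```python
def assign_splits(splits, task_trackers):
    """Create one mapper task per split and place the tasks round-robin."""
    assignment = {tt: [] for tt in task_trackers}
    for i, split in enumerate(splits):
        # Hypothetical policy: cycle through the TaskTrackers in order.
        tt = task_trackers[i % len(task_trackers)]
        assignment[tt].append(split)
    return assignment


splits = ["split 1", "split 2", "split 3", "split 4"]
assignment = assign_splits(splits, ["tt1", "tt2"])
for tracker, assigned in assignment.items():
    print(tracker, assigned)
```

With m splits there are exactly m mapper tasks, regardless of how many TaskTrackers are available to run them.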

19 Background: MapReduce Job Overview
Each Mapper maps a $(k_1, v_1)$ pair to an intermediate $(k_2, v_2)$ pair and partitions by $k_2$, i.e. $\mathrm{hash}(k_2) = p_i$ for $i \in [1, r]$, where $r$ is the number of Reducers.
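A minimal sketch of the map-and-partition step, using word count as a stand-in job. The mapper `map_fn`, the input line, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration; the slides only state that the partition is chosen by hashing the intermediate key.

```python
import zlib


def map_fn(k1, v1):
    """Word-count style mapper: k1 is a line offset, v1 is the line text."""
    for word in v1.split():
        yield (word, 1)


def partition(k2, r):
    """Map an intermediate key to a partition index in [1, r]."""
    # crc32 is a stand-in hash: unlike Python's built-in hash() on strings,
    # it gives the same value on every run.
    return zlib.crc32(k2.encode("utf-8")) % r + 1


r = 2  # number of Reducers
partitions = {i: [] for i in range(1, r + 1)}
for k2, v2 in map_fn(0, "to be or not to be"):
    partitions[partition(k2, r)].append((k2, v2))

for i, pairs in partitions.items():
    print(i, pairs)
```

Because the partition depends only on the key, every pair with the same key ends up at the same Reducer.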

20 Background: MapReduce Job Overview
An optional Combiner is executed over $(k_2, \mathrm{list}(v_2))$.

21 Background: MapReduce Job Overview
The Combiner aggregates the $v_2$ values for each $k_2$, and a $(k_2, v_2)$ pair is written to a partition on disk.
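The local aggregation step can be sketched like this, assuming a word-count style job where the aggregation is a sum. The function `combine` and the sample mapper output are hypothetical; summing works here because it gives the same result no matter how the values are grouped or ordered.

```python
from collections import defaultdict


def combine(pairs):
    """Aggregate the values of each key in one mapper's output."""
    totals = defaultdict(int)
    for k2, v2 in pairs:
        totals[k2] += v2
    # Sort by key so the result is written out in a predictable order.
    return sorted(totals.items())


mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)
print(combined)  # [('a', 3), ('b', 1)]
print(len(mapper_output), "pairs before,", len(combined), "pairs after")
```

In this example four intermediate pairs shrink to two before anything is copied to a Reducer.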

22 Background: MapReduce Job Overview
The JobTracker assigns two TaskTrackers to run the Reducers; each Reducer copies and sorts its inputs from the corresponding partitions.
[Diagram additions: two Reducers; Shuffle/Sort Phase]

23 Background: MapReduce Job Overview
Combiners reduce communication overhead!
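The copy-and-sort step on the Reducer side can be sketched as a merge of per-mapper runs. The sketch assumes each mapper's partition is already sorted by key; the function `shuffle_sort` and the sample runs are hypothetical and stand in for the data a Reducer would copy from the mappers.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def shuffle_sort(mapper_partitions):
    """Merge sorted runs from several mappers and group the values by key."""
    merged = heapq.merge(*mapper_partitions, key=itemgetter(0))
    return [
        (k2, [v2 for _, v2 in group])
        for k2, group in groupby(merged, key=itemgetter(0))
    ]


# One sorted run per mapper, all destined for the same Reducer.
runs = [
    [("a", 3), ("b", 1)],
    [("a", 1), ("c", 2)],
]
grouped = shuffle_sort(runs)
print(grouped)  # [('a', [3, 1]), ('b', [1]), ('c', [2])]
```

The result has the $(k_2, \mathrm{list}(v_2))$ shape that the Reducer consumes.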

24 Background: MapReduce Job Overview
Each Reducer reduces a $(k_2, \mathrm{list}(v_2))$ to a single $(k_3, v_3)$ and writes the results to a DFS file $o_i$ for $i \in [1, r]$.
[Diagram additions: each Reducer emits (k3, v3) pairs to outputs o1 and o2; Reduce Phase]
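A minimal sketch of the reduce step, again assuming a word-count style job. The names `reduce_fn` and `run_reducers` are hypothetical, and the output files are modelled as entries in an in-memory dictionary rather than real DFS files.

```python
def reduce_fn(k2, values):
    """Reduce one key and its list of values to a single (k3, v3) pair."""
    return (k2, sum(values))


def run_reducers(grouped_by_partition):
    """Run one reducer per partition i and collect its output under 'o<i>'."""
    outputs = {}
    for i, groups in sorted(grouped_by_partition.items()):
        outputs[f"o{i}"] = [reduce_fn(k2, values) for k2, values in groups]
    return outputs


# Hypothetical grouped input for r = 2 Reducers.
grouped_input = {
    1: [("a", [3, 1]), ("c", [2])],
    2: [("b", [1])],
}
outputs = run_reducers(grouped_input)
print(outputs)  # {'o1': [('a', 4), ('c', 2)], 'o2': [('b', 1)]}
```

Each Reducer writes its own output, so a job with r Reducers produces r result files.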
