Introduction to Map/Reduce & Hadoop


1 Introduction to Map/Reduce & Hadoop V. CHRISTOPHIDES Department of Computer Science University of Crete 1 What is MapReduce? MapReduce: a programming model and associated implementation for batch processing of large data sets MapReduce could be implemented on different architectures, but Google proposed it for large clusters of commodity PCs Functional programming meets distributed computing: Automatic parallelization & distribution A clean abstraction for programmers, factoring out many reliability concerns from application logic Fault tolerance Status and monitoring tools MapReduce implementations such as Hadoop differ in details, but the main principles are the same 2

2 Overview Clever abstraction that is a good fit for many real-world problems Programmer focuses on the algorithm itself Runtime system takes care of all the messy details: partitioning of input data, scheduling program execution, handling machine failures, managing inter-machine communication Divide and Conquer 3 MapReduce Implementations Google has a proprietary implementation in C++, with bindings in Java and Python Hadoop is an open-source implementation in Java Development led by Yahoo!, now an Apache project Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, but also A9.com, AOL, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc. The de facto big data processing platform Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc.

3 Hadoop: Storage & Compute on One Platform (figure from "Evolution from Apache Hadoop to the Enterprise Data Hub", A. Awadallah, Co-Founder & CTO of Cloudera, SMDB 2014) 5 Expanding Data Requires A New Approach (figure from the same talk) 6

4 Map Reduce Foundations 7 Map and Reduce: The Origins The programming ideas of Map and Reduce are 40+ years old Present in all Functional Programming Languages, e.g., APL, Lisp and ML Alternate names for Map: Apply-All Higher-order functions take function definitions as arguments, or return a function as output Map and Reduce are higher-order functions Map processes each record individually Reduce processes (combines) the set of all records in a batch 8

5 Map: A Higher Order Function F(x: int) returns r: int Let V be an array of integers W = map(F, V) W[i] = F(V[i]) for all i i.e., apply F to every element of V Examples in Haskell map (+1) [1,2,3,4,5] == [2, 3, 4, 5, 6] map toLower "ABCDEFG12!@#" == "abcdefg12!@#" map (`mod` 3) [1..10] == [1, 2, 0, 1, 2, 0, 1, 2, 0, 1] 9 reduce: A Higher Order Function reduce is also known as fold, accumulate, compress or inject Reduce/fold takes a function and folds it in between the elements of a list

6 Fold-Left in Haskell Definition foldl f z [] = z foldl f z (x:xs) = foldl f (f z x) xs Examples foldl (+) 0 [1..5] == 15 foldl (+) 10 [1..5] == 25 foldl (div) 7 [34,56,12,4,23] == 0 11 Fold-Right in Haskell Definition foldr f z [] = z foldr f z (x:xs) = f x (foldr f z xs) Example foldr (div) 7 [34,56,12,4,23] == 8
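Since the rest of these slides use Java for Hadoop programs, here is a minimal sketch (ours, not from the original slides) of the same two higher-order functions with the java.util.stream API; class and variable names are illustrative:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> v = Arrays.asList(1, 2, 3, 4, 5);
        // map: apply (+1) to every element, cf. map (+1) [1,2,3,4,5]
        List<Integer> w = v.stream().map(x -> x + 1).collect(Collectors.toList());
        // fold-left: seed 0, fold (+) between the elements, cf. foldl (+) 0 [1..5]
        int sum = v.stream().reduce(0, (z, x) -> z + x);
        System.out.println(w + " " + sum);  // prints [2, 3, 4, 5, 6] 15
    }
}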

7 Examples of Map Reduce Computation 13

8 What can I do in MapReduce? Three main functions: ❶ Querying Filtering (distributed grep, etc.) Relational-based (join, selection, projection, etc.) ❷ Summarizing Computing Aggregates (word/record count, Min/Max/Average/Median/Standard deviation, etc.) Data Organization (sort, indexing, etc.) ❸ Analyzing Iterative Message Passing (graph processing)... large datasets in offline mode for boosting other on-line processes 15 Grep Example (diagram: the input data is split; grep runs on each split; the per-split matches are concatenated by cat into all matches) Search input files for a given pattern: e.g., given a list of tweets (username, date, text), determine the tweets that contain a word Map: emits a (filename, line) pair if the pattern is matched Reduce: copies results to output
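As an illustration (ours, not part of the original slides), a minimal sketch of such a grep Map function with the new org.apache.hadoop.mapreduce API; the configuration key grep.pattern and the class name are hypothetical, and an identity reducer would simply copy the matches to the output:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String pattern;

    @Override
    protected void setup(Context context) {
        // The pattern is assumed to be passed in the job configuration
        pattern = context.getConfiguration().get("grep.pattern");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(pattern)) {
            // Emit (filename, matching line)
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(filename), line);
        }
    }
}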

9 Word Count Example (diagram: the input data is split; each split is tokenized and counted; the per-split counts are merged) Read text files and count how often words occur The input is text files The output is a text file, each line: word, tab, count Map: produce pairs of (word, count = 1) from files Reduce: for each word, sum up the counts (i.e., fold) 17 Inverted Index Example (this was Google's original use case) (diagram: the input data is split; each split is parsed and inverted; the per-split results are concatenated into the inverted list) Generate an inverted index of words from a given set of files Map: parses a document and emits <word, docId> pairs Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
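A minimal sketch of the inverted index job (ours, for illustration; class names are invented and the docId is taken to be the input file name):

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
    // parse: tokenize the document and emit <word, docId> pairs
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                context.write(new Text(tok.nextToken()), new Text(docId));
            }
        }
    }

    // inverse: collect, sort, and deduplicate the docIds for each word
    public static class InverseReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> sorted = new TreeSet<String>();
            for (Text d : docIds) {
                sorted.add(d.toString());
            }
            context.write(word, new Text(sorted.toString()));  // <word, list(docId)>
        }
    }
}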

10 Map Reduce (diagram: Input data → MAP → Partitioning Function → REDUCE → Result) Map: accepts an input key/value pair, emits intermediate key/value pairs Reduce: accepts an intermediate key/value* pair, emits output key/value pairs 19 Real World Applications in MapReduce
Google: wide-range applications (grep/sorting, machine learning, clustering, report extraction, graph computation)
Yahoo: data model training, Web map construction, Web log processing using Pig, and much, much more
Amazon: build product search indices
Facebook: Web log processing via both MapReduce and Hive
PowerSet (Microsoft): HBase for natural language search
Twitter: Web log processing using Pig
New York Times: large-scale image conversion
Others (>74): details in (so far, the longest list of applications for MapReduce) 20

11 MapReduce Principle applied to BigData 21 Adapt MapReduce for BigData Always maps/reduces on lists of key/value pairs Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs map() produces one or more intermediate values along with an output key from the input After the map phase is over, all the intermediate values for a given output key are combined together into a list reduce() combines those intermediate values into one or more final values for that same output key (in practice, only one final value per key) 22

12 Adapt MapReduce for BigData Automatic Parallelization: Depending on the size of the RAW INPUT DATA, instantiate multiple map() tasks Depending upon the number of intermediate <key, value> partitions, instantiate multiple reduce() tasks The master program divvies up tasks based on data location: it tries to have map() tasks on the same machine as the physical file data, or at least the same rack 23 Adapt MapReduce for BigData Map/Reduce tasks execute in parallel on a cluster Fault tolerance is built into the framework Specific systems/implementation aspects matter How is data partitioned as input to map How is data serialized between processes Cloud-specific improvements: Handle elasticity Take cluster topology (e.g., node proximity, node size) into account Source: Google Developers 24

13 Execution on Clusters ❶ Input files split (M splits) ❷ Assign Master & Workers ❸ Map tasks ❹ Writing intermediate data to disk (R regions) ❺ Intermediate data read & sort ❻ Reduce tasks ❼ Return 25 Execution Overview Master assigns map and reduce tasks to workers, taking data location into account Mapper reads an assigned file split and writes intermediate key-value pairs to local disk Mapper informs master about result locations, which in turn informs the reducers Reducers pull data from the appropriate mapper disk locations After the map phase is completed, reducers sort their data by key For each key, the Reduce function is executed and the output is appended to the final output file When all reduce tasks are completed, the master wakes up the user program 26

14 MapReduce Data Flow in Hadoop with no Reduce Tasks 27 MapReduce Data Flow in Hadoop with a Single Reduce Task (diagram: map transforms (K1, V1) into (K2, V2); the shuffle groups these into (K2, List<V2>); reduce emits (K3, V3)) 28

15 MapReduce Data Flow in Hadoop with Multiple Reduce Tasks 29 MapReduce Programming Constructs in Hadoop Programmers must specify: map (k, v) → <k', v'>* (1:many) reduce (k', v'*) → <k', v''>* All values with the same key are reduced together Optionally, also: partition (k', number of partitions) → partition for k' Often a simple hash of the key, e.g., hash(k') mod n Divides up key space for parallel reduce operations combine (k', v') → <k', v'>* Mini-reducers that run in memory after the map phase Used as an optimization to reduce network traffic 30

16 The Execution Framework Handles Everything Else Scheduling: assigns workers to map and reduce tasks Data distribution: moves processes to data Synchronization: gathers, sorts, and shuffles intermediate data Errors and faults: detects worker failures and restarts Limited control over data and execution flow All algorithms must be expressed in m, r, c, p You don't know: Where mappers and reducers run When a mapper or reducer begins or finishes Which input a particular mapper is processing Which intermediate key a particular reducer is processing 31 Execution (diagram: Map Phase, Shuffle Phase, Reduce Phase) 32

17 Automatic Parallel Execution in MapReduce (Google) (diagram: master task dispatching work) Usually many more map tasks than machines, e.g., 200K map tasks, 5K reduce tasks, 2K machines 33 Map/Reduce Execution in Clusters Several map or reduce tasks can run on a single computer Each intermediate file is divided into R partitions by the partitioning function Each reduce task corresponds to one partition 34

18 Moving Data From Mappers to Reducers Mapper outputs need to be separated for different reducers Reducers need to collect their data from all mappers and group it by key Keys at each reducer are processed in order! The shuffle & sort phase = a synchronization barrier between the map & reduce phases Collects the output from all map executions Transforms the map output into the reduce input Divides the map output into chunks Often one of the most expensive parts of a MapReduce execution 35 Shuffle and Sort in Map Reduce Map output is spilled to a new disk file when the in-memory buffer is almost full Spill files are merged into a single output file Spill files on disk: partitioned by reduce task, each partition sorted by key Merge happens in memory if the data fits, otherwise also on disk A reduce task starts copying data from a map task as soon as it completes Reduce cannot start working on the data until all mappers have finished and their data has arrived 36

19 Shuffle and Sort in Hadoop Probably the most complex aspect of MapReduce! Map side Map outputs are buffered in memory in a circular buffer When the buffer reaches a threshold, its contents are spilled to disk Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here Reduce side First, map outputs are copied over to the reducer machine Sort is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs here The final merge pass goes directly into the reducer 37 Shuffle and Sort in Hadoop (diagram: the Mapper writes into a circular buffer in memory; spills on disk, with the Combiner applied, are merged into partitioned intermediate files on disk, which are served to this and other reducers, from this and other mappers) 38

20 Tools for Synchronization Cleverly-constructed data structures: bring partial results together Sort order of intermediate keys: controls the order in which reducers process keys Partitioner: controls which reducer processes which keys Preserving state in mappers and reducers: captures dependencies across multiple keys and values 39 Master Data Structures Master keeps track of the status of each map and reduce task and who is working on it Idle, in-progress, or completed Master stores the location and size of the output of each completed map task Pushes information incrementally to workers with in-progress reduce tasks 40

21 Semantics with Failures If map and reduce are deterministic, then the output is identical to a non-faulting sequential execution For non-deterministic operators, different reduce tasks might see the output of different map executions Relies on atomic commit of map and reduce outputs An in-progress task writes its output to a private temp file Mapper: on completion, send names of all temp files to master (master ignores if task already complete) Reducer: on completion, atomically rename temp file to final output file (needs to be supported by the distributed file system) 41 Fault Recovery Workers are commodity (low cost) machines, so failures are frequent: given n workers, the probability of at least one "problem" is 1 - (1-p)^n Workers are pinged by the master periodically Non-responsive workers are marked as failed All tasks in-progress or completed by a failed worker become eligible for rescheduling (reset to idle state) Can be assigned to another mapper Completed tasks are re-executed since their results are stored on the mapper's local disk Reducers are notified about the mapper failure, so that they can read the data from the replacement mapper 42
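To make the formula concrete (illustrative numbers, not from the slides): assuming a p = 0.1% chance of a problem per worker, a cluster of n = 1,000 workers sees at least one problem with probability 1 - 0.999^1000 ≈ 1 - e^(-1) ≈ 63%, so the framework must treat failure as the normal case rather than the exception.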

22 Fault Recovery Failed reducers are identified through pings as well A reducer's in-progress tasks are reset to idle state Can be assigned to another reducer No need to restart completed reduce tasks, because their results are written to the distributed file system Current implementations abort the MapReduce computation on master failure Users re-submit aborted jobs when a new master process is up Alternative: the master writes periodic checkpoints of its data structures so that it can be re-started from the checkpointed state 43 Machines Share Roles So far, a logical view of the cluster In reality Each cluster machine stores data And runs MapReduce workers Lots of storage + compute cycles nearby 44

23 The MapReduce Runtime Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data Handles synchronization Gathers, sorts, and shuffles intermediate data Handles errors and faults Detects worker failures and restarts Everything happens on top of a distributed filesystem 45 Implementing in MapReduce Externally, for the user Write a map function and a reduce function Submit a job; wait for the result No need to know anything about the environment Google: 4000 servers (and their disks), many failures Internally, for the MapReduce system designer Run map in parallel Shuffle: combine map results to produce the reduce input Run reduce in parallel Deal with failures 46

24 Component Overview (diagram) 47 Peta-Bytes Data Processing (diagram) 48

25 Google's MapReduce inspired Yahoo's Hadoop Now an open source (Java) project of Apache, hadoop.apache.org We will use the release of November 2014 Distributes processing of petabytes of data across thousands of commodity machines In 2008, about 100k MapReduce jobs per day in Google over 20 petabytes of data processed per day each job occupies about 400 servers Easy to use, since run-time complexity is hidden from the users 49 Hadoop Key Components MapReduce - distributes applications Hadoop Distributed File System (HDFS) - distributes data Store big files across machines Store each file as a sequence of blocks Blocks of a file are replicated for fault tolerance 50

26 MapReduce vs. Hadoop (table):
Org: Google vs. Yahoo/Apache
Impl: C++ vs. Java
Distributed File Sys: GFS vs. HDFS
Database: Bigtable vs. HBase
Distributed Lock Mgr: Chubby vs. ZooKeeper
History: Dec 2004 Google GFS paper published; July 2005 Nutch uses MapReduce; Jan 2006 Doug Cutting joins Yahoo!; Feb 2006 Becomes Lucene subproject; Apr 2007 Yahoo! on 1000-node cluster; Jan 2008 An Apache Top Level Project; Feb 2008 Yahoo! production search index 51 Google's Problem in 2003: Already Lots of Data Example: 20+ billion web pages x 20KB = 400+ terabytes One computer can read ~40 MB/sec from disk ~four months to read the web ~1,000 hard drives just to store the web Even more to do something with the data: process crawled documents process web request logs build inverted indices construct graph representations of web documents 52

27 The Need: Data & Programs Distribution! A Distributed System: Scalable Fault-Tolerant Easy to Program Applicable to Many Problems Re-discovered by Google with goals: "Reliability has to come from the software" "How can we make it easy to write distributed programs?" 53 How do we get Data to the Workers? (diagram: compute nodes on one side, NAS/SAN storage on the other) What's the problem here? 54

28 Distributed File System Don't move data to workers Move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data local Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is good A distributed file system is the answer GFS (Google File System) HDFS for Hadoop (= GFS clone) 55 Distributed File System Principles Single Namespace for the entire cluster Data Coherency Write-once-read-many access model Clients can only append to existing files Files are broken up into blocks Typically 128 MB block size Each block replicated on multiple DataNodes Intelligent Client Client can find the location of blocks Client accesses data directly from the DataNode 56
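As a quick illustration of these numbers (our arithmetic, assuming HDFS's common default of 3 replicas): a 1 GB file with 128 MB blocks is broken into 8 blocks, and with 3-way replication the cluster stores 24 block replicas spread across DataNodes.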

29 Hadoop Supported File Systems HDFS: Hadoop's own file system Amazon S3 file system Targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure Not rack-aware CloudStore, previously Kosmos Distributed File System Like HDFS, this is rack-aware: an optimization which takes into account the geographic clustering of servers; network traffic between servers in different geographic clusters is minimized FTP Filesystem stored on remote FTP servers Read-only HTTP and HTTPS file systems 57 Hadoop Distributed File System (HDFS) Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle hardware failure Detects failures and recovers from them Optimized for Batch Processing Data locations exposed so that computations can move to where data resides Provides very high aggregate bandwidth User Space, runs on top of the file systems of the underlying OS 58

30 Hadoop Distributed File System (HDFS) Master (NameNode) manages the file system and access to files by clients Runs on the master node of the HDFS cluster Directs DataNodes to perform their low-level I/O tasks Slave (DataNode) manages the storage attached to the node it runs on Runs on each slave machine in the HDFS cluster Does the low-level I/O work Files are stored as many blocks Each block has a block id A block id is associated with several nodes hostname:port (depending on the level of replication) 59 HDFS Architecture (diagram: a Client talks to the NameNode and to DataNodes; a Secondary NameNode tracks cluster membership) NameNode: maps a file to a file-id and a list of DataNodes DataNode: maps a block-id to a physical location on disk SecondaryNameNode: periodic merge of the transaction log 60

31 NameNode Metadata Meta-data in Memory The entire metadata is in main memory No demand paging of meta-data Types of Metadata List of files List of Blocks for each file List of DataNodes for each block File attributes, e.g., creation time, replication factor A Transaction Log, stored in multiple directories (on a local/remote FS) Records file creations, file deletions, etc. 61 DataNode A Block Server Stores data in the local file system (e.g., ext3) Stores meta-data of a block (e.g., CRC) Serves data and meta-data to Clients Block Report Periodically sends a report of all existing blocks to the NameNode Facilitates Pipelining of Data Forwards data to other specified DataNodes 62

32 Block Placement Current Strategy One replica on the local node Second replica on a remote rack Third replica on the same remote rack Additional replicas are randomly placed Clients read from the nearest replica Would like to make this policy pluggable 63 Data Correctness Use Checksums to validate data Use CRC32 File Creation Client computes a checksum per 512 bytes DataNode stores the checksum File access Client retrieves the data and checksum from the DataNode If validation fails, the Client tries other replicas 64
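A minimal sketch of this per-chunk checksumming idea in plain Java (ours, for illustration; HDFS's real implementation differs in detail):

import java.util.zip.CRC32;

public class ChunkChecksum {
    private static final int CHUNK = 512;  // bytes covered by one checksum

    // Compute one CRC32 value per 512-byte chunk, as the client does on file creation
    public static long[] checksums(byte[] data) {
        int chunks = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }
}

On read, the client recomputes the checksums over the retrieved data and compares them against the stored values; a mismatch triggers a retry on another replica.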

33 Primary vs Secondary NameNode NameNode daemon Head interface to the HDFS cluster Records all global metadata A single point of failure Secondary NameNode daemon One per cluster, to monitor the status of HDFS: Not a failover NameNode! Records snapshots of HDFS metadata from the real NameNode to facilitate recovery from failures Can merge update logs in flight Can upload snapshots back to the primary Need to develop a real High Availability (HA) solution 65 Data Pipelining Client retrieves a list of DataNodes on which to place replicas of a block Client writes the block to the first DataNode The first DataNode forwards the data to the next DataNode in the Pipeline When all replicas are written, the Client moves on to write the next block in the file 66

34 Rebalancer Goal: % disk full on DataNodes should be similar Usually run when new DataNodes are added The cluster stays online while the Rebalancer is active The Rebalancer is throttled to avoid network congestion Command line tool 67 Configuring HDFS dfs.name.dir: dir in the namenode to store metadata dfs.data.dir: dir in data nodes to store the data blocks; must be in a local disk partition fs.default.name: URI of the namenode conf/masters: IP address of the master conf/slaves: IP addresses of the slaves (the master should have password-less SSH access to all the nodes) conf/hdfs-site.xml: <property> <name>dfs.name.dir</name> <value>/tmp/hadoop-test/name</value> </property> <property> <name>dfs.data.dir</name> <value>/tmp/hadoop-test/data</value> </property> conf/core-site.xml: <property> <name>fs.default.name</name> <value>hdfs://<namenode-host>:9000/</value> </property> 68

35 User Interface Commands for the HDFS User hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir/myfile.txt Commands for the HDFS Administrator hadoop dfsadmin -report hadoop dfsadmin -decommission datanodename Web Interface WordCount: A Simple Hadoop Example 70

36 Word Count over a Given Set of Web Pages (worked example: inputs "see bob throw" and "see spot run"; map emits see 1, bob 1, throw 1 and see 1, spot 1, run 1; after grouping and summing: bob 1, run 1, see 2, spot 1, throw 1) How can we do word count in parallel? 71 Word Counting with MapReduce (diagram: five documents, Doc1 "Financial, IMF, Economics, Crisis", Doc2 "Financial, IMF, Crisis", Doc3 "Economics, Harry", Doc4 "Financial, Harry, Potter, Film", Doc5 "Crisis, Harry, Potter", are fed to two map tasks M1 and M2, each emitting <term, 1> key-value pairs) Kyuseok Shim (VLDB 2012 TUTORIAL) 72

37 Word Counting with MapReduce (diagram continued: the emitted pairs are grouped per distinct key, e.g., Financial → 1, 1, 1 and Harry → 1, 1, 1, and the reduce tasks sum each list, yielding Financial 3, IMF 2, Economics 2, Crisis 3, Harry 3, Film 1, Potter 2) Before the reduce functions are called, for each distinct key the list of its values is generated Kyuseok Shim (VLDB 2012 TUTORIAL) 73 Basic Hadoop API (the old org.apache.hadoop.mapred API) Mapper void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Reducer/Combiner void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Partitioner int getPartition(K2 key, V2 value, int numPartitions) *Note: forthcoming API changes 74

38 WordCount Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
75 WordCount Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
76

39 Job Configuration The Job object forms the job specification and gives control for running the job Set the mapper and reducer classes to be used Set output key and value classes for the map and reduce functions For reducer: setOutputKeyClass(), setOutputValueClass() For mapper (omit if same as reducer): setMapOutputKeyClass(), setMapOutputValueClass() Can set input types similarly (default is TextInputFormat) Specify the data input path using FileInputFormat.setInputPaths() Can be a single file, a directory (to use all files there), or a file pattern addInputPath() can be called multiple times to add multiple paths Specify the output path using FileOutputFormat.setOutputPath() A single output path, which is a directory for all output files Method waitForCompletion() submits the job and waits for it to finish 77 Job Configuration
public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "WordCount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setCombinerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setNumReduceTasks(2);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
}
78

40 Invocation of WordCount
hadoop dfs -mkdir <hdfs-dir>
hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
hadoop jar hadoop-*-examples.jar WordCount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
79 Hadoop Job Tuning Choose an appropriate number of mappers and reducers Define combiners whenever possible Consider Map output compression Optimize the expensive shuffle phase (between mappers and reducers) by setting its tuning parameters Profiling distributed MapReduce jobs is challenging 80

41 Job Configuration Parameters 190+ parameters in Hadoop Do their settings impact performance? What are ways to set these parameters? Defaults -- are they good enough? Best practices -- the best setting can depend on data, job, and cluster properties Automatic setting (not yet in Hadoop) 81 MapReduce Development Steps ❶ Write Map and Reduce functions Create unit tests ❷ Write a driver program to run a job Can run from the IDE with a small data subset for testing If a test fails, use the IDE for debugging Update unit tests and Map/Reduce functions if necessary ❸ Once the program works on a small test set, run it on the full data set If there are problems, update tests and code accordingly ❹ Fine-tune the code, do some profiling 82

42 Mechanics of Programming Hadoop Jobs 83 Anatomy of a Hadoop Job in MR v1.0 A MapReduce program in Hadoop = a Hadoop job Jobs are divided into map and reduce tasks An instance of a running task is called a task attempt Multiple jobs can be composed into a workflow 84

43 Launching a Hadoop Job The Client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker The JobClient computes input splits (on the client end) Job data (jar, configuration XML) are sent to the JobTracker The JobTracker puts the job data in a shared location and enqueues tasks TaskTrackers poll the JobTracker for tasks TaskTrackers launch a task in a separate Java instance Source: 85 Job Launch: JobClient, JobTracker and TaskTracker The Client submits a MapReduce job through JobClient.runJob() // blocks or JobClient.submitJob() // does not block waitForCompletion() submits the job and polls the JobTracker about progress every sec, outputs to console if changed JobClient: determines the proper division of input into InputSplits and sends job data to the master JobTracker server JobTracker: inserts the jar and JobConf (serialized to XML) in a shared location and posts a JobInProgress to its run queue TaskTrackers running on slaves periodically query the JobTracker for work, retrieve the job-specific jar and config, and launch the task in a separate instance of Java (main() is provided by Hadoop) 86

44 Job Launch: Task and TaskRunner TaskTracker.Child.main(): Sets up the child TaskInProgress attempt Reads the XML configuration Connects back to necessary MapReduce components via RPC Uses TaskRunner to launch the user process TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch the Mapper The Task knows ahead of time which InputSplits it should be mapping Calls the Mapper once for each record retrieved from the InputSplit Running the Reducer is much the same 87 Creating the Mapper One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress It exists in a separate process from all other instances of Mapper: no data sharing! public void map(WritableComparable key, Writable value, Context context) 88

45 Data Types in Hadoop Writable: defines a de/serialization protocol; every data type in Hadoop is a Writable WritableComparable: defines a sort order; all keys must be of this type (but not values) IntWritable, LongWritable, Text: concrete classes for different data types SequenceFiles: binary encoding of a sequence of key/value pairs 89 What is Writable? Hadoop defines its own box classes for strings (Text), integers (IntWritable), etc. All values are instances of Writable All keys are instances of WritableComparable Writing for cache coherency: while (more input exists) { myIntermediate = new Intermediate(input); myIntermediate.process(); export outputs; } 90

46 Getting Data to the Mapper (diagram: input files → InputFormat → InputSplits → RecordReaders → Mappers → intermediates) Data sets are specified by InputFormats Defines the input data (e.g., a directory) Identifies partitions of the data that form an InputSplit Factory for RecordReader objects to extract (k, v) records from the input source 91 FileInputFormat and Friends TextInputFormat: treats each \n-terminated line of a file as a value KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v" SequenceFileInputFormat: binary file of (k, v) pairs with some add'l metadata SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString()) FileInputFormat will read all files out of a specified directory and send them to the mapper Delegates filtering of this file list to a method subclasses may override e.g., create your own XyzFileInputFormat to read *.xyz from a directory list These classes make use of the readFields() method of the specific Writable classes used by your MapReduce pass 92

47 A simple CustomWritable
public class MyWritable implements Writable {
    private int counter;
    private long timestamp;
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
93 Record Readers Each InputFormat provides its own RecordReader implementation Provides (unused?) capability multiplexing LineRecordReader: reads a line from a text file KeyValueRecordReader: used by KeyValueTextInputFormat 94
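As a usage note (a hypothetical round trip, not from the slides): since DataOutputStream implements DataOutput and DataInputStream implements DataInput, MyWritable can be exercised outside Hadoop, e.g., in a unit test:

// requires the java.io stream classes (ByteArrayOutputStream, DataOutputStream, etc.)
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
MyWritable before = new MyWritable();
before.write(new DataOutputStream(bytes));             // serialize
MyWritable after = MyWritable.read(
    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));  // deserialize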

48 Input Split Size FileInputFormat will divide large files into chunks The exact size is controlled by mapred.min.split.size RecordReaders receive the file, offset, and length of the chunk Custom InputFormat implementations may override the split size, e.g., NeverChunkFile 95 WritableComparator Compares WritableComparable data Will call WritableComparable.compare() Can provide a fast path for serialized data JobConf.setOutputValueGroupingComparator() 96

49 Sending Data To Reducers: The Context Class Both Mapper and Reducer define an inner class called Context which implements the JobContext interface Job also implements JobContext: when you create a new Job, you also set the context for the Mapper and Reducer Some methods of Context: write: generate an output key/value pair progress and setStatus: report progress or set the status of the task getCounter: get access (read/write) to the value of a Counter getConfiguration: return the configuration for the job getCacheFiles: get cache files set in the Configuration 97 Partitioning Which reducer will receive the intermediate output keys and values? (key, value) pairs with the same key end up at the same partition The mappers partition data independently: they never exchange information with one another Hadoop uses an interface called Partitioner to determine which partition a (key, value) pair will go to A single partition refers to all (key, value) pairs which will be sent to a single reduce task: #partitions = #reduce tasks each Reducer can process multiple reduce tasks The Partitioner determines the load balancing of the reducers JobConf sets the Partitioner implementation 98
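A small illustrative sketch (ours, with a hypothetical counter enum) of using these Context methods inside a Mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum Quality { MALFORMED }  // hypothetical counter, visible in the job UI

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            context.getCounter(Quality.MALFORMED).increment(1);  // count bad records
            return;
        }
        context.write(new Text(fields[0]), new IntWritable(1));  // emit a pair
        context.setStatus("processed offset " + key.get());      // report task status
    }
}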

50 The Partitioner interface The Partitioner interface defines the getPartition() method Input: a key, a value and the number of partitions Output: a partition id for the given (key, value) pair The default Partitioner is the HashPartitioner: int getPartition(K key, V value, int numPartitions) { return key.hashCode() % numPartitions; } (Hadoop's actual HashPartitioner masks the sign bit, (key.hashCode() & Integer.MAX_VALUE) % numPartitions, so the id is never negative) (example table with numPartitions = 3: the keys hello, world, map, reduce are mapped via key.hashCode() to a partition id in 0-2) Partition and Shuffle (diagram: each Mapper's intermediates go through a Partitioner, combiners omitted here; each partition's intermediates are then collected by the corresponding Reducer) 100
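For contrast with the default, a minimal sketch of a custom Partitioner (ours; the two-bucket policy is purely illustrative), which would be registered on the job with job.setPartitionerClass(AlphabetPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions < 2 || word.isEmpty()) return 0;
        char first = Character.toLowerCase(word.charAt(0));
        // Words starting with a-m go to partition 0, everything else to partition 1
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}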

51 Reduction public void reduce(WritableComparable key, Iterator values, Context context) Keys & values sent to one partition all go to the same reduce task Calls are sorted by key: earlier keys are reduced and output before later keys 101 Finally: Writing The Output (diagram: each Reducer writes through the OutputFormat's RecordWriter to its own output file) 102

52 OutputFormat The OutputFormat and RecordWriter interfaces dictate how to write the results of a job back to the underlying permanent storage The default format (TextOutputFormat) will write (key, value) pairs as strings to individual lines of an output file Writes "key\tval\n" strings using the toString() methods of the keys and values The SequenceFileOutputFormat will keep the data in binary, so it can later be read quickly by the SequenceFileInputFormat Uses a binary format to pack (k, v) pairs NullOutputFormat Discards output These classes make use of the write() method of the specific Writable classes used by your MapReduce pass 103 Sending Data to the Client The Reporter object sent to the Mapper allows simple asynchronous feedback incrCounter(Enum key, long amount) setStatus(String msg) Allows self-identification of input InputSplit getInputSplit() 104

53 MapReduce Coding Summary Decompose the problem into an appropriate workflow of MapReduce jobs For each job, implement the following Job configuration Map function Reduce function Combiner function (optional) Partition function (optional) Might have to create custom data types as well WritableComparable for keys Writable for values 105 Local (Standalone) Mode Runs the same MapReduce user program as the cluster version, but does it sequentially Does not use any of the Hadoop daemons Works directly with the local file system No HDFS, hence no need to copy data to/from HDFS Great for development, testing, initial debugging 106

54 Pseudo-Distributed Mode Still runs on a single machine, but now simulates a real Hadoop cluster Simulates multiple nodes Runs all daemons Uses HDFS Main purpose: more advanced testing and debugging You can also set this up on your laptop 107 Running a MapReduce Workflow Linear chain of jobs To run job2 after job1, create JobConfs conf1 and conf2 in the main function Call JobClient.runJob(conf1); JobClient.runJob(conf2); Catch exceptions to re-start failed jobs in the pipeline (see the sketch below) More complex workflows Use JobControl from org.apache.hadoop.mapred.jobcontrol 108
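A minimal sketch of such a linear chain with the old mapred API (ours; class and path names are illustrative, and each job would still need its mapper/reducer classes set):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepWorkflow {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(TwoStepWorkflow.class);
        conf1.setJobName("step-1");
        FileInputFormat.setInputPaths(conf1, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf1, new Path("intermediate"));
        // ... setMapperClass/setReducerClass for step 1 go here ...
        JobClient.runJob(conf1);  // blocks; throws IOException if the job fails

        JobConf conf2 = new JobConf(TwoStepWorkflow.class);
        conf2.setJobName("step-2");
        FileInputFormat.setInputPaths(conf2, new Path("intermediate"));  // job1's output
        FileOutputFormat.setOutputPath(conf2, new Path(args[1]));
        // ... setMapperClass/setReducerClass for step 2 go here ...
        JobClient.runJob(conf2);
    }
}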

55 Lifecycle of a MapReduce Job (diagram: over time, input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2) How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined? 109 HDFS Limitations Almost GFS (Google FS) No file update options (record append, etc.); all files are write-once Does not implement demand replication Designed for sequential access Random seeks devastate performance 110

56 Comparison of MapReduce and Other Approaches (table from "MapReduce: The Programming Model and Practice", SIGMETRICS Tutorials 2009, Google) 111 Comparison of MapReduce and other approaches, continued (table from the same tutorial) 112

57 MapReduce: A Step Backwards? Don't need 1000 nodes to process petabytes: Parallel DBs do it in fewer than 100 nodes No support for schema: sharing across multiple MR programs is difficult No indexing: wasteful access to unnecessary data Non-declarative programming model: requires highly-skilled programmers No support for JOINs: requires multiple MR phases for the analysis Agrawal et al., VLDB 2010 Tutorial 113 Map Reduce vs Parallel DBMS (table):
Schema Support: Parallel DBMS yes; MapReduce not out of the box
Indexing: Parallel DBMS yes; MapReduce not out of the box
Programming Model: Parallel DBMS declarative (SQL); MapReduce imperative (C/C++, Java, ...), extensions through Pig and Hive
Optimizations (Compression, Query Optimization): Parallel DBMS yes; MapReduce not out of the box
Flexibility: Parallel DBMS not out of the box; MapReduce yes
Fault Tolerance: Parallel DBMS coarse-grained techniques; MapReduce built in
[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010] 114

58 Additional Languages & Components 115 Hadoop and C++ Hadoop Pipes Library of bindings for native C++ code Operates over a local socket connection Straight computation performance may be faster Downside: kernel involvement and context switches 116

59 Hadoop and Python Option 1: Use Jython Caveat: Jython is a subset of full Python Option 2: Hadoop Streaming 117 Hadoop Streaming Effectively allows the shell pipe operator to be used with Hadoop You specify two programs for map and reduce (+) stdin and stdout do the rest (-) Requires serialization to text, context switches (+) Reuse Linux tools: cat, grep, sort, uniq 118

60 Eclipse Plugin Support for Hadoop in the Eclipse IDE Allows MapReduce job dispatch Panel tracks live and recent jobs The Hadoop Ecosystem 120

61 Hadoop Levels of Abstraction (from less Hadoop visible / more DB view to more Hadoop visible / more MapReduce view: HBase, queries against tables; Hive, SQL-like language; Pig, query and workflow language; Java, write map-reduce functions) 121 MapReduce Cloud Service Providing MapReduce frameworks as a service in clouds is becoming an attractive usage model for enterprises A MapReduce cloud service allows users to cost-effectively access a large amount of computing resources without creating their own cluster Users are able to adjust the scale of MapReduce clusters in response to changes in the resource demand of applications 122

62 Hadoop Workflow (diagram: you and the Hadoop cluster) 1. Load data into HDFS 2. Develop code locally 3. Submit MapReduce job 3a. Go back to Step 2 4. Retrieve data from HDFS 123 On Amazon: With EC2 (diagram: you and your Hadoop cluster on EC2) 0. Allocate Hadoop cluster 1. Load data into HDFS 2. Develop code locally 3. Submit MapReduce job 3a. Go back to Step 2 4. Retrieve data from HDFS 5. Clean up! Uh oh. Where did the data go? 124

63 On Amazon: EC2 and S3 (diagram: your Hadoop cluster runs on EC2 (the cloud); data is copied from S3 (persistent store) to HDFS, and from HDFS back to S3) 125 Takeaway MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance Principal philosophies: Make it scale, so you can throw hardware at problems Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance) MapReduce is not suitable for all problems, but when it works, it may save you a lot of time Agrawal et al., VLDB 2010 Tutorial 126

64 References We rely on the new org.apache.hadoop.mapreduce API, but many existing programs might be written using the old API org.apache.hadoop.mapred, and some old libraries might only support the old API
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Usenix OSDI '04
David DeWitt and Michael Stonebraker, "MapReduce: A major step backwards", craig-henderson.blogspot.com
CS9223 Massive Data Analysis, J. Freire & J. Simeon, New York University, 2013
INFM 718G / CMSC 828G Data-Intensive Computing with MapReduce, J. Lin, University of Maryland, 2013
"The MapReduce Programming Model: Introduction and Examples", Jose Maria Alvarez-Rodriguez, FP7 RELATE-ITN, April
Hadoop v1.0 Daemons: Putting it All Together JobTracker daemon: one per cluster, usually running on the master node; communicates with the client application and controls MapReduce execution in the TaskTracker daemons TaskTracker daemon: one TaskTracker per slave node; performs the actual Map and Reduce execution; can spawn multiple JVMs to do the work (diagram: a job submission node runs the jobtracker daemon; each slave node runs a tasktracker daemon and a datanode daemon on top of the Linux file system; the namenode daemon runs on its own node) Typical setup NameNode and JobTracker run on the cluster head node DataNode and TaskTracker run on all other nodes Secondary NameNode runs on a dedicated machine or on the cluster head node (usually not a good idea, but ok for small clusters) 128

65 JobTracker UI (screenshot) 129 NameNode WebUI (screenshot) 130

66 Datanodes (screenshot) 131


More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich First, an Announcement There will be a repetition exercise group on Wednesday this week. TAs will answer your questions on SQL, relational

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ. Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ. Outline 1.Review of how map reduce works: the HDFS, Yarn sorting and shuffling advanced topics: partial sort, total sort, join, chained mapper/reducer,

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 Lecture XI: MapReduce & Hadoop The new world of Big Data (programming model) Big Data Buzzword for challenges occurring

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

Data Science in the Wild

Data Science in the Wild Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Big Data 2012 Hadoop Tutorial

Big Data 2012 Hadoop Tutorial Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Word Count Code using MR2 Classes and API

Word Count Code using MR2 Classes and API EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Hadoop Architecture and its Usage at Facebook

Hadoop Architecture and its Usage at Facebook Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information