Hadoop
INTRODUCTION TO HADOOP
Data
We live in a digital world that produces data at an impressive speed. As of 2012, 2.7 ZB of data exist (1 ZB = 10^21 bytes).
- The NYSE produces 1 TB of data per day
- The Internet Archive grows by 20 TB per month
- The LHC produces 15 PB of data per year
- AT&T has a 300 TB database
- 100 TB of data are uploaded daily on Facebook
Data
Personal data is growing, too. E.g., photos: a single photo taken with a commercial Nikon camera takes about 6 MB (default settings); a year of family photos takes about 8 GB of space, adding to the slices of personal content uploaded to social networks, video websites, blogs and more.
Machine-produced data is also growing:
- Machine logs
- Sensor networks and monitored data
Data analysis
Main problem: disk access speed has not kept pace with disk capacity. Solution: parallelize the storage and read less data from each disk. New problems: hardware replication, data aggregation.
Take, for example, an RDBMS (keeping in mind that disk seek time bounds the latency of its operations):
- Updating records is fast: a B-Tree structure is efficient
- Reading many records is slow: if access is dominated by seek time, it is faster to read the entire disk, which operates at transfer rate
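To see why seeks dominate, take some illustrative (assumed) numbers: with a 100 MB/s transfer rate, streaming a full 1 TB disk takes about 10^4 s (under three hours); fetching the same data as 10^8 scattered records at ~10 ms per seek costs on the order of 10^6 s (over eleven days) in seek time alone.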
Hadoop
- Reliable data storage: Hadoop Distributed File System (HDFS)
- Data analysis: MapReduce implementation
- Many tools for developers
- Easy cluster administration
- Query languages, some similar to SQL
- Column-oriented distributed databases on top of Hadoop
- From structured to unstructured repositories and back
Hadoop vs. The (existing) World
RDBMS:
- Disk seek time dominates
- Some types of data are not normalized (e.g., logs): MapReduce works well with unstructured data
- MapReduce scales linearly, while an RDBMS does not
Volunteer computing (e.g., SETI@home):
- Similar model, but Hadoop works in a localized cluster sharing high-performance bandwidth, while volunteer computing works over the Internet on untrusted computers that perform other operations in the meantime
Hadoop vs. The (existing) World
MPI:
- Works well for compute-intensive jobs, but the network becomes the bottleneck when hundreds of GB of data have to be analyzed
- Conversely, MapReduce does its best to exploit data locality by collocating the data with the compute node (network bandwidth is the most precious resource; it must not be wasted)
- MapReduce operates at a higher level than MPI: the data flow is already taken care of
- MapReduce implements failure recovery (in MPI the developer has to handle checkpoints and failure recovery)
HADOOP HISTORY
Brief history
- In 2002, Mike Cafarella and Doug Cutting started working on Apache Nutch, a new Web search engine
- In 2003, Google published a paper on the Google File System, a distributed filesystem, and Mike and Doug started working on a similar, open source, project
- In 2004, Google published another paper, on the MapReduce computation model, and yet again Mike and Doug implemented an open source version in Nutch
Brief history
- In 2006, these two projects separated from Nutch and became Hadoop. In the same year, Doug Cutting started working for Yahoo! and began using Hadoop there
- In 2008, Hadoop was used by Yahoo! (10,000-core cluster), Last.fm, Facebook and the NYT
- In 2009, Yahoo! broke the world record by sorting 1 TB of data in 62 seconds, using Hadoop
- Since then, Hadoop has become mainstream in industry
Examples from the Real World
Last.fm:
- Each user listening to a song (local or streamed) generates a trace
- Hadoop analyses these traces to produce charts
Facebook:
- E.g., track statistics per user and per country, weekly top tracks
- Daily and hourly summaries over user logs
- Product usage, ad campaigns
- Ad-hoc jobs over historical data
- Long-term archival store
- Integrity checks
Examples from the Real World
Nutch search engine:
- Link inversion: invert outgoing links to find, for each Web page, the links that point to it
- URL fetching
- Producing Lucene indexes (for text searches)
Infochimps: explore network graphs
- Social networks: Twitter analysis, measuring communities
- Biology: neuron connections in roundworms
- Street connections: OpenStreetMap
Hadoop umbrella
- HDFS: distributed filesystem
- MapReduce: distributed data processing model
- MRUnit: unit testing of MapReduce applications
- Pig: data flow language to explore large datasets
- Hive: distributed data warehouse
- HBase: distributed, column-oriented DB
- ZooKeeper: distributed coordination service
- Sqoop: efficient bulk transfers of data between structured datastores (e.g., relational databases) and HDFS
MAPREDUCE BY EXAMPLE
The Word Count Example
Count the number of times each word occurs in a set of documents.
Example: one document with one sentence:

    Do as I say, not as I do
A possible solution
A Multiset is a set where each element also has a count.

    define wordcount as Multiset;
    for each document in documentset {
        T = tokenize(document);
        for each token in T {
            wordcount[token]++;
        }
    }
    display(wordcount);
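In plain Java, the same idea could look like the sketch below, with a HashMap playing the role of the multiset (class and variable names are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class NaiveWordCount {
        public static void main(String[] args) {
            String[] documentset = { "Do as I say, not as I do" };
            Map<String, Integer> wordcount = new HashMap<String, Integer>();

            for (String document : documentset) {
                // tokenize: lower-case the text and split on non-letters
                for (String token : document.toLowerCase().split("[^a-z]+")) {
                    if (token.isEmpty()) continue;
                    Integer n = wordcount.get(token);
                    wordcount.put(token, n == null ? 1 : n + 1);
                }
            }
            // prints do=2, as=2, i=2, say=1, not=1 (in some order)
            System.out.println(wordcount);
        }
    }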
Problems with our solution
This program works fine until the set of documents you want to process becomes large. E.g., a spam filter needs to know the words frequently used in the millions of spam emails you receive. Looping through all the documents using a single computer would be extremely time consuming.
Possible alternative solution: speed it up by rewriting the program so that it distributes the work over several machines.
- Each machine will process a distinct fraction of the documents
- When all the machines have completed this, a second phase of processing will combine their results
Possible re-write
Phase one:

    define wordcount as Multiset;
    for each document in documentsubset {
        T = tokenize(document);
        for each token in T {
            wordcount[token]++;
        }
    }
    sendToSecondPhase(wordcount);

Phase two:

    define totalwordcount as Multiset;
    for each wordcount received from firstphase {
        multisetAdd(totalwordcount, wordcount);
    }
Possible Problems
We ignored the performance requirement of reading in the documents:
- If the documents are all stored on one central storage server, then the bottleneck is the bandwidth of that server. We need to split up the documents among the set of processing machines, such that each machine processes only those documents that are stored on it. Storage and processing have to be tightly coupled in data-intensive distributed applications.
wordcount (and totalwordcount) are stored in memory:
- When processing large document sets, the number of unique words can exceed the RAM of a machine. We need to rewrite our program to store this hash table on disk (lots of code).
Even more problems
- Phase two has only one machine, which will process the wordcounts sent by all the machines in phase one
- After we have added enough machines to phase one, the single machine in phase two will become the bottleneck
- We need to rewrite phase two in a distributed fashion so that it can scale by adding more machines
Final Solution
[Diagram: each document subset (1..n) is processed by a phase 1 machine that produces partial word counts; a reshuffle step routes each word to one of 26 phase 2 machines (A-Z, one per letter of the alphabet), each maintaining a disk-based hash table for its partition of wordcount.]
Considerations
This starts getting complex. Requirements:
- Store files over many processing machines (of phase one)
- Write a disk-based hash table, permitting processing without being limited by RAM capacity
- Partition the intermediate data (that is, wordcount) from phase one
- Shuffle the partitions to the appropriate machines in phase two
And we're still not dealing with possible failures!!!
MapReduce
A model for analyzing large amounts of data. Unstructured data is organized as key/value datasets and lists. Two phases:
- map(k1, v1) -> list(k2, v2), where the input domain is different from the output domain: filter and transform
- (shuffle: an intermediate phase that sorts the output of map and groups it by key; default implementations are provided)
- reduce(k2, list(v2)) -> list(v3), where the input and output domains are the same: aggregate
Examples of inputs <k1, v1>
- Multiple files: list(<String filename, String file_content>)
- One large log file: list(<Integer line_number, String log_event>)
Lists are broken up and each individual pair is processed by the map function. Each input becomes a list(<k2, v2>).
WordCount in MapReduce
Mapper input: <String filename, String file_content> (the filename is ignored)
Mapper output: a list of <String word, Integer count> (e.g., <"foo", 3>), or a list of <String word, Integer 1> (e.g., <"foo", 1>) with repeated entries, which is easier to program
All pairs sharing the same k2 are grouped; they form a <k2, list(v2)>
Aggregation by the Reducer: if two mappers produce <"foo", list(1,1)> and <"foo", list(1,1,1)>, the aggregated pair the reducer sees is <"foo", list(1,1,1,1,1)>, and the reducer produces <"foo", 5>
How would we write this?

    map(String filename, String document) {
        List<String> T = tokenize(document);
        for each token in T {
            emit((String) token, (Integer) 1);
        }
    }

    reduce(String token, List<Integer> values) {
        Integer sum = 0;
        for each value in values {
            sum = sum + value;
        }
        emit((String) token, (Integer) sum);
    }
MOVING TO HADOOP
Building Blocks (daemons)
On a fully configured cluster, running Hadoop means running multiple daemons:
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
NameNode
Hadoop uses a master/slave configuration both for distributed storage and for distributed computation. The distributed storage is called HDFS.
The NameNode is the master of HDFS:
- It directs the slave DataNodes to perform low-level I/O
- It keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem
The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations.
Single point of failure!!!
DataNode
Each slave machine in the cluster hosts a DataNode daemon for reading and writing HDFS blocks to actual files on the local filesystem.
- Files are broken into blocks, and the NameNode tells a client on which DataNode each block resides
- Clients communicate directly with the DataNode daemons to process the local files
- DataNodes may replicate data blocks for redundancy
HDFS Example [diagram]
Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS:
- Each cluster has one SNN, which typically resides on its own machine
- The SNN differs from the NameNode in that it doesn't receive or record any real-time changes to HDFS; instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration
It does not solve the single point of failure: human intervention is still required if the NameNode fails.
JobTracker
There is only one JobTracker daemon per Hadoop cluster; it typically runs on a server as a master node of the cluster.
Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to the different tasks, and monitors all tasks as they're running.
Should a task fail, the JobTracker automatically relaunches the task, possibly on a different node, up to a predefined limit of retries.
TaskTracker
TaskTrackers manage the execution of individual tasks on each slave node.
- Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel
- TaskTrackers constantly communicate with the JobTracker: if the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster
Job Submission [diagram]
Summary of Architecture [diagram]
USING HADOOP
HDFS
HDFS is a filesystem designed for large-scale distributed data processing:
- It is possible to store a big data set of (say) 100 TB as a single file in HDFS
- HDFS abstracts the details away and gives the illusion that we're dealing with a single file
In a typical Hadoop workflow, files are created elsewhere and copied into HDFS using command-line utilities. MapReduce programs process this data, but they don't read/write HDFS files directly.
HDFS Command-line Utilities

    hdfs dfs -mkdir /user/chuck
    hdfs dfs -ls /
    hdfs dfs -ls -R /
    hdfs dfs -put example.txt /
    hdfs dfs -get /example.txt /
    hdfs dfs -cat /example.txt
    hdfs dfs -rm /example.txt
Anatomy of a Hadoop Application [diagram]
Data Types
The MapReduce framework has a defined way of serializing the key/value pairs in order to move them across the cluster's network, so only classes that support this kind of serialization can function as keys or values in the framework:
- Classes that implement the Writable interface can be values
- Classes that implement the WritableComparable<T> interface can be either keys or values
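As a sketch, a custom key type only needs to (de)serialize its fields and define an ordering. The class below is hypothetical (a patent/year pair, anticipating the patent example later):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class PatentYear implements WritableComparable<PatentYear> {
        private long patent;
        private int year;

        public void write(DataOutput out) throws IOException {
            out.writeLong(patent);   // serialize fields in a fixed order...
            out.writeInt(year);
        }

        public void readFields(DataInput in) throws IOException {
            patent = in.readLong();  // ...and deserialize them in the same order
            year = in.readInt();
        }

        public int compareTo(PatentYear other) {
            // order by patent number, then by year
            if (patent != other.patent) return patent < other.patent ? -1 : 1;
            return year - other.year;
        }
    }

In practice, a custom key should also override hashCode(), since the default partitioner assigns reducers by hashing keys.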
Predefined Types [table]
Mapper
To serve as the mapper, a class implements the Mapper interface and extends the MapReduceBase class. The MapReduceBase class serves as the base class for both mappers and reducers. It includes two methods that effectively act as the constructor and destructor of the class:
- void configure(JobConf job): in this function you can extract the parameters set either by the configuration XML files or in the main class of your application
- void close(): as the last action before the map task terminates, this function should wrap up any loose ends
Mapper
The Mapper interface is responsible for the data processing step. It utilizes Java generics of the form Mapper<K1,V1,K2,V2>, where the key classes and value classes implement the WritableComparable and Writable interfaces. It has one method to process an individual (key/value) pair:

    void map(K1 key, V1 value,
             OutputCollector<K2,V2> output,
             Reporter reporter) throws IOException
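A minimal word-count mapper against this interface might look as follows (a sketch using the classic org.apache.hadoop.mapred API; the class name is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Called once per input record: here a line of text, keyed by its byte offset
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE); // emit <word, 1> for each token
            }
        }
    }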
Predefined Mappers [table]
Reducer

    void reduce(K2 key, Iterator<V2> values,
                OutputCollector<K3,V3> output,
                Reporter reporter) throws IOException

When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key. The reduce() function is then called, and it generates a (possibly empty) list of (K3, V3) pairs by iterating over the values associated with a given key.
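The matching word-count reducer (again a sketch, with the same illustrative imports as the mapper above plus java.util.Iterator):

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per key, with all of its grouped values
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum)); // emit <word, total count>
        }
    }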
Predefined Reducers [table]
Partitioner
With multiple reducers, we need some way to determine the appropriate one to which to send a (key/value) pair output by a mapper. The default behavior is to hash the key to determine the reducer. We can define application-specific partitioners by implementing the Partitioner interface.
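For instance, a partitioner mimicking the 26-machines-one-per-letter scheme from the word-count example could look like this sketch (hypothetical class; it assumes keys are non-empty words):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) { } // no parameters needed

        // The same key must always map to the same partition
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            int first = Character.toLowerCase(key.toString().charAt(0));
            return first % numPartitions; // chars are non-negative, so this is safe
        }
    }

It would be enabled with conf.setPartitionerClass(FirstLetterPartitioner.class).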
Combiner (or local reduce)
In many situations with MapReduce applications, we may wish to perform a local reduce before we distribute the mapper results: e.g., send one <word, 574> pair instead of 574 <word, 1> pairs.
[Diagram: the shapes represent keys, the inner patterns represent values.]
Reading and Writing to HDFS
Input data usually resides in large files, typically tens or hundreds of gigabytes or even more. One of the fundamental principles behind MapReduce's processing power is the splitting of the input data into splits.
Reads are done through FSDataInputStream:
- FSDataInputStream extends DataInputStream with random read access
- MapReduce requires this because a machine may be assigned to process a split that sits right in the middle of an input file
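A minimal sketch of random-access reading from HDFS (the URI, path and offset are illustrative; they match the pseudo-distributed configuration shown later):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

            FSDataInputStream in = fs.open(new Path("/example.txt"));
            byte[] buffer = new byte[128];
            in.seek(1024);        // jump into the middle of the file...
            in.readFully(buffer); // ...and read from there
            in.close();
        }
    }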
InputFormat
The way an input file is split up and read by Hadoop is defined by one of the implementations of the InputFormat interface. TextInputFormat is the default InputFormat implementation; the key it returns is the byte offset of each line. You can also create your own InputFormat:

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                           Reporter reporter) throws IOException;
    }
Common InputFormats [table]
Output Formats
The default OutputFormat is TextOutputFormat, which writes each record as a line of text. Each record's key and value are converted to strings through toString(), and a tab (\t) character separates them. The separator character can be changed through the mapred.textoutputformat.separator property. TextOutputFormat outputs data in a format readable by KeyValueTextInputFormat.
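For example, to emit comma-separated records instead (MyJob is a hypothetical driver class):

    JobConf conf = new JobConf(MyJob.class);
    conf.set("mapred.textoutputformat.separator", ",");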
Common Output Formats [table]
THE WORDCOUNT EXAMPLE
WordCount 2.0

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCount2 {
        public static void main(String[] args) throws Exception {
            JobClient client = new JobClient();
            JobConf conf = new JobConf(WordCount2.class);

            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(TokenCountMapper.class);
            conf.setCombinerClass(LongSumReducer.class);
            conf.setReducerClass(LongSumReducer.class);

            client.setConf(conf);
            JobClient.runJob(conf);
        }
    }
CONFIGURING HADOOP
Different Running Modes
Hadoop can be run in three different modes:
- Local (standalone) mode: the default mode for Hadoop; it runs completely on the local machine, doesn't use HDFS, and doesn't launch any of the Hadoop daemons
- Pseudo-distributed mode: runs Hadoop in a "cluster of one", with all daemons running on a single machine; they communicate through SSH
- Fully-distributed mode: deployed to multiple machines
Pseudo-distributed mode: core-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
            <description>The name of the default file system. A URI whose
            scheme and authority determine the FileSystem implementation.
            </description>
        </property>
    </configuration>
Pseudo-distributed mode: mapred-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
            <description>The host and port that the MapReduce job tracker
            runs at.</description>
        </property>
    </configuration>
Pseudo-distributed mode: hdfs-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>The actual number of replications can be specified
            when the file is created.</description>
        </property>
    </configuration>
Masters and Slaves files

    [hadoop-user@master]$ cat masters
    localhost
    [hadoop-user@master]$ cat slaves
    localhost
Pseudo-distributed mode
Check SSH:

    ssh localhost

If not configured:

    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Format HDFS:

    hdfs namenode -format
What's running?

    [hadoop-user@master]$ jps
    26893 Jps
    26832 TaskTracker
    26620 SecondaryNameNode
    26333 NameNode
    26484 DataNode
    26703 JobTracker
Fully-distributed mode: core-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000</value>
            <description>The name of the default file system. A URI whose
            scheme and authority determine the FileSystem implementation.
            </description>
        </property>
    </configuration>
Fully-distributed mode: mapred-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>master:9001</value>
            <description>The host and port that the MapReduce job tracker
            runs at.</description>
        </property>
    </configuration>
Fully-distributed mode: hdfs-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
            <description>The actual number of replications can be specified
            when the file is created.</description>
        </property>
    </configuration>
Masters and Slaves files

    [hadoop-user@master]$ cat masters
    backup
    [hadoop-user@master]$ cat slaves
    hadoop1
    hadoop2
    hadoop3
What's running?

    [hadoop-user@master]$ jps
    30879 JobTracker
    30717 NameNode
    30965 Jps

    [hadoop-user@backup]$ jps
    2099 Jps
    1679 SecondaryNameNode

    [hadoop-user@hadoop1]$ jps
    7101 TaskTracker
    7617 Jps
    6988 DataNode
PATENT EXAMPLE
Two data sources
Patent citation data:
- Contains citations from U.S. patents issued between 1975 and 1999; it has more than 16 million rows
- Format: "CITING","CITED" pairs, e.g.:

    3858241,956203
    3858241,1324234
    3858241,3398406
    3858241,3557384
    3858241,3634889
    3858242,1515701
    3858242,3319261
    3858242,3668705
    3858242,3707004

Patent description data:
- Has the patent number, the patent application year, the patent grant year, the number of claims, and other metadata
- Header: "PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"
- E.g.:

    3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
    3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
    3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
    3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
    3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
What do citations look like? [diagram]
INVERT THE DATA
For each patent, find and group the patents that cite it (see the sketch below).
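A possible map/reduce pair for this inversion (a sketch; imports as in the word-count classes above, and the header line of the input is assumed to be filtered out):

    public static class InvertMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        // "citing,cited" -> <cited, citing>
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            String[] citation = value.toString().split(",");
            output.collect(new Text(citation[1]), new Text(citation[0]));
        }
    }

    public static class InvertReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        // <cited, list(citing...)> -> <cited, "citing1,citing2,...">
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            StringBuilder csv = new StringBuilder();
            while (values.hasNext()) {
                if (csv.length() > 0) csv.append(',');
                csv.append(values.next().toString());
            }
            output.collect(key, new Text(csv.toString()));
        }
    }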
COUNT CITATIONS
Count the number of citations a patent has received (only the reducer changes; see below).
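With the same mapper as the inversion job, the reducer simply counts the grouped values instead of concatenating them (sketch):

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int count = 0;
        while (values.hasNext()) {
            values.next();
            count++;
        }
        output.collect(key, new IntWritable(count)); // <patent, citations received>
    }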
COUNT THE CITATION COUNTS
How many patents have been cited n times?
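This can be a second job chained over the output of the previous one, and it is word count in disguise: the map phase turns each <patent, count> record into <count, 1>, and the reduce phase sums the ones, yielding <count, number of patents cited that many times>.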
SENSOR DATA
HADOOP PROJECT 2013-2014