A bit about Hadoop
Luca Pireddu (luca.pireddu@crs4.it)
CRS4 Distributed Computing Group
March 9, 2012

Often seen problems

- Low parallelism
- I/O is done to/from shared storage, not locally to the computing node
  - limits scalability in number of nodes
  - load on central storage and network is higher than necessary
  - increases infrastructure cost
- High job failure rate
- No robustness to node or equipment failure
- A failed step requires human intervention to resolve
- Low automation and high operating costs

Hadoop

Important features:
- distributed
- scalable
- robust
- open source

Hadoop

To understand the advantages of Hadoop and how it works, let's briefly cover two things:
1. MapReduce
2. Hadoop MapReduce and Distributed File System

MapReduce

A programming model for large-scale distributed data processing.

Breaks algorithms into two steps:
1. Map: map a set of input key/value pairs to a set of intermediate key/value pairs
2. Reduce: apply a function to all values associated with the same intermediate key; emit output key/value pairs

- Functions don't have side effects; (k,v) pairs are the only input/output
- Functions don't share data structures

Example:

(name, age)      (name, mean age)
(luca, 27)
(luca, 31)   ->  (luca, 30.67)
(luca, 34)
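The mean-age example can be sketched in a few lines of Python. This is a toy, single-process illustration of the model, not Hadoop code; the names `map_record`, `reduce_ages` and `run_mapreduce` are made up for this sketch.

```python
from collections import defaultdict

def map_record(key, value):
    """Map: pass each (name, age) record through as an intermediate pair."""
    yield key, value

def reduce_ages(name, ages):
    """Reduce: average all ages that share the same name."""
    ages = list(ages)
    yield name, round(sum(ages) / len(ages), 2)

def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output

records = [("luca", 27), ("luca", 31), ("luca", 34)]
print(run_mapreduce(records, map_record, reduce_ages))  # [('luca', 30.67)]
```

Note that the two functions only see (k,v) pairs and touch no shared state, which is what lets a framework run many copies of them in parallel.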

MapReduce Example: Word Count

Consider a program to calculate word frequency in a document.

Input: The quick brown fox ate the lazy green fox.

Word    Count
ate     1
brown   1
fox     2
green   1
lazy    1
quick   1
the     2

MapReduce Example: Word Count

Input: The quick brown fox ate the lazy green fox.

Here's some pseudocode for a MapReduce word counting program:

    map(key, value):
        foreach word in value:
            emit(word, 1)

    reduce(key, value_list):
        int wordcount = 0
        foreach count in value_list:
            wordcount += count
        emit(key, wordcount)
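For reference, here is a runnable Python rendering of the pseudocode. It is a local sketch under a few assumptions: words are lower-cased and punctuation is dropped (so "The" and "the" count together, as in the table), and sorting plus `groupby` stands in for the framework's shuffle and sort. The driver `word_count` is a hypothetical helper, not part of Hadoop.

```python
import re
from itertools import groupby
from operator import itemgetter

def map_words(key, value):
    """Map: emit (word, 1) for every word in the line; the key is unused."""
    for word in re.findall(r"[a-z']+", value.lower()):
        yield word, 1

def reduce_counts(key, value_list):
    """Reduce: sum the counts emitted for one word."""
    yield key, sum(value_list)

def word_count(lines):
    """Local driver: map all lines, sort by key, reduce each group of equal keys."""
    pairs = [kv for offset, line in enumerate(lines)
             for kv in map_words(offset, line)]
    pairs.sort(key=itemgetter(0))  # equal keys become adjacent
    result = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, total in reduce_counts(word, (count for _, count in group)):
            result[k] = total
    return result

counts = word_count(["The quick brown fox ate the lazy green fox."])
print(counts)
# {'ate': 1, 'brown': 1, 'fox': 2, 'green': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```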

MapReduce Example: Word Count

[Diagram: data flow of the word count job]

Input:           the quick brown fox ate the lazy green fox
Map (3 Mappers): the, 1 | fox, 1 | quick, 1 | ate, 1 | brown, 1 | fox, 1 | the, 1 | lazy, 1 | green, 1
Shuffle & Sort
Reduce (2 Reducers): quick, 1 | brown, 1 | fox, 2 | ate, 1 | the, 2 | lazy, 1 | green, 1
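The shuffle step in the diagram can be sketched as follows. The assumptions here are mine: the input is split into three chunks of three words, and keys are routed to reducers with a CRC32 hash modulo the number of reducers. A real framework may partition differently; the point is only that every occurrence of a key lands on the same reducer.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key with a stable hash (CRC32, an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair to one reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# One list per mapper, for an assumed three-way split of the input.
mapper_outputs = [
    [("the", 1), ("quick", 1), ("brown", 1)],
    [("fox", 1), ("ate", 1), ("the", 1)],
    [("lazy", 1), ("green", 1), ("fox", 1)],
]

for i, groups in enumerate(shuffle(mapper_outputs, num_reducers=2)):
    # Each reducer sees its keys in sorted order and sums their values.
    print(i, [(key, sum(groups[key])) for key in sorted(groups)])
```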

MapReduce

The lack of side effects and shared data structures is the key.

- No multi-threaded programming
- No synchronization, locks, mutexes, deadlocks, etc.
- No shared data implies no central bottleneck.
- Failed functions can be retried, with their output only being committed upon successful completion.

MapReduce allows you to put much of the parallel programming into a reusable framework, outside of the application.
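The retry-and-commit idea can be illustrated with a small sketch. It is not how Hadoop implements it; it only shows the principle, assuming a task that writes to a file: each attempt writes to a temporary file, which is renamed to the final output only if the attempt succeeds. `run_task_with_retries` and `flaky_task` are hypothetical names.

```python
import os
import tempfile

def run_task_with_retries(task, output_path, max_attempts=3):
    """Run task(attempt, file) until it succeeds; commit output only on success."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    for attempt in range(1, max_attempts + 1):
        fd, tmp_path = tempfile.mkstemp(dir=out_dir)
        try:
            with os.fdopen(fd, "w") as tmp:
                task(attempt, tmp)
            os.replace(tmp_path, output_path)  # atomic commit on success
            return attempt
        except Exception:
            os.remove(tmp_path)  # discard partial output of the failed attempt
    raise RuntimeError(f"task failed {max_attempts} times")

def flaky_task(attempt, out):
    """Hypothetical task that writes partial output and fails on its first attempt."""
    out.write("record 1\n")
    if attempt == 1:
        raise IOError("simulated node failure")
    out.write("record 2\n")

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "part-00000")
    print(run_task_with_retries(flaky_task, path))  # 2
```

Because the task has no side effects beyond its own output, rerunning it is safe, and readers of the output never see the partial result of the failed attempt.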

Hadoop MapReduce

- The MapReduce model needs an implementation
- Hadoop is arguably the most popular open-source MapReduce implementation
- Born out of Yahoo!
- Currently used by many very large operations, e.g.:
  - Yahoo!
  - Facebook
  - Amazon
  - eBay
  - etc.

Hadoop DFS

A MapReduce framework goes hand-in-hand with a distributed file system.

Multiplying the number of nodes poses challenges:
- multiplied network traffic
- multiplied disk accesses
- multiplied failure rates

Hadoop DFS

Hadoop provides the Hadoop Distributed File System (HDFS).

- Stores blocks of the data on each node
  - Move computation to the data and decentralize data access
- Uses the disks on each node
  - Aggregate I/O throughput scales with the number of nodes
- Replicates data on multiple nodes
  - Resistance to node failure
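To make the block and replication idea concrete, here is a toy placement sketch. It is a deliberate simplification: the round-robin policy, the node names and the sizes are all made up for illustration and are not HDFS's actual placement logic. It only shows why a file stays readable when a single node fails.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement: a simplification, not HDFS's actual policy.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(file_size=1000, block_size=256, nodes=nodes)
print(placement[0])                          # ['node1', 'node2', 'node3']
print(surviving_blocks(placement, "node2"))  # [0, 1, 2, 3]
```

A scheduler can also use such a placement map to run each map task on a node that already holds the block, which is the "move computation to the data" point above.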

Easier MapReduce

- Implementing Hadoop MapReduce programs can be time-consuming
  - Especially true for one-off scripts/hacks
- There are other options
  - Pig
  - Hive
  - Pydoop
  - others...

Easier MapReduce

- Pig is a scripting language for Hadoop
- Hive is more of a query language
- Pydoop is a Python API for Hadoop MapReduce

Pig example

Pig example...

Pydoop

- Pydoop script page: http://pydoop.sf.net/docs/pydoop_script.html
- Pydoop API example: http://pydoop.sf.net/docs/examples/wordcount.html
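As a rough idea of what a Pydoop script looks like, here is a word count sketch. It assumes the `mapper(key, value, writer)` and `reducer(key, values, writer)` signatures with a `writer.emit` method, as I recall them from the Pydoop script page; check the linked documentation before relying on it. `FakeWriter` is a hypothetical stand-in so the functions can be tried locally without Hadoop.

```python
def mapper(_, text, writer):
    """Emit (word, "1") for each whitespace-separated word in the input line."""
    for word in text.split():
        writer.emit(word, "1")

def reducer(word, icounts, writer):
    """Sum the counts received for one word and emit the total."""
    writer.emit(word, sum(map(int, icounts)))

class FakeWriter:
    """Hypothetical local stand-in for the writer object the framework would pass in."""
    def __init__(self):
        self.pairs = []

    def emit(self, key, value):
        self.pairs.append((key, value))

w = FakeWriter()
mapper(None, "the fox the", w)
print(w.pairs)  # [('the', '1'), ('fox', '1'), ('the', '1')]
```

Compared with a full MapReduce program, the script contains only the two functions; the job setup is left to the tool.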

Other things

Lots of other tools in this ecosystem:
- Workflow management (Oozie)
- Data stores (HBase)
- Data transfer (Sqoop, Flume)
- ...

Questions?