Lecture 6: Programming Hadoop II
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University)
Yabo (Arber) Xu
Outline
- Hadoop streaming
- Side data distribution
- Hadoop Zen
- System integration
HADOOP STREAMING
Motivation
- You want to use a scripting language
  - Faster development time
  - Easier to read and debug
  - Use existing libraries
- You (still) have lots of data
Hadoop Streaming
- Interfaces Hadoop MapReduce with arbitrary program code
- Uses stdin and stdout for data flow
- You define a separate program for each of the mapper and the reducer
WordCount in Shell
- Simplified input format: one word per line
- Input files: ./input/*
- WordCount in shell:

    cat * | sort | uniq -c

- Can we leverage MR to run it on thousands of files and nodes?
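The same counting logic can be sketched in plain Python, which previews what the streaming mapper and reducer later in this lecture will compute (a minimal sketch; the input lines are invented for illustration):

```python
from collections import Counter

def word_count(lines):
    """Count occurrences, like `sort | uniq -c` on one-word-per-line input."""
    return Counter(line.strip() for line in lines if line.strip())

counts = word_count(["apple", "banana", "apple", ""])
print(counts["apple"])   # 2
print(counts["banana"])  # 1
```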
Hadoop Streaming + Shell

    hadoop jar <hadoop>/contrib/streaming/hadoop-0.20.2-streaming.jar \
      -input input \
      -output output \
      -mapper /bin/cat \
      -reducer "/bin/uniq -c"

Note: make sure those shell utilities are installed on every node.
Reusing Programs
- Identity mapper/reducer: cat
- Summing: wc

    wc -l a.txt

- Field selection: cut

    cat /etc/passwd | cut -d: -f1 > user.txt

- Filtering: awk
Data Format
- Input (key, val) pairs are sent in as lines of input:

    key (tab) val (newline)

- Data is naturally transmitted as text
- You emit lines of the same form on stdout for output (key, val) pairs
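A quick sketch of producing and consuming this line format (the helper names are my own, not part of the streaming API):

```python
def emit(key, value):
    # One (key, value) pair per line: the key, a tab, then the value.
    return "%s\t%s" % (key, value)

def parse(line):
    # Split on the first tab only, so values may themselves contain tabs.
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

record = emit("hello", 1)
print(parse(record + "\n"))  # ('hello', '1')
```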
Hadoop Streaming + Python

Map: wcmap.py

    #!/usr/bin/python
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

Reduce: wcred.py

    #!/usr/bin/python
    import sys

    w2c = {}
    for line in sys.stdin:
        if len(line.strip()) != 0:
            (k, v) = line.strip().split("\t")
            w2c[k] = w2c.get(k, 0) + int(v)
    for w, c in w2c.items():
        print("%s\t%d" % (w, c))
Hadoop Streaming + Python

Test locally:

    cat ../data/test.txt | python wcmap.py | sort | python wcred.py

Run on Hadoop:

    hadoop jar <hadoop>/contrib/streaming/hadoop-0.20.2-streaming.jar \
      -input input \
      -output output \
      -mapper wcmap.py \
      -reducer wcred.py \
      -file wcmap.py \
      -file wcred.py
Hadoop Streaming: Advanced Features
- Supports Java classes:

    -mapper org.apache.hadoop.mapred.lib.IdentityMapper

- Supports Hadoop aggregate operators (sum/min/max):

    -reducer aggregate

- Set job parameters with -jobconf:

    -jobconf mapred.reduce.tasks=12
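When `-reducer aggregate` is used, the mapper selects the aggregation function by prefixing each output key with the function's name (e.g. LongValueSum). A hedged sketch of that record format, based on the streaming aggregate package rather than anything shown above:

```python
def aggregate_record(word):
    # The aggregate reducer interprets keys of the form "<Function>:<id>";
    # LongValueSum adds up all values emitted for each id.
    return "LongValueSum:%s\t1" % word

print(aggregate_record("hadoop"))  # LongValueSum:hadoop	1
```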
Side Data Distribution
- Side data: extra read-only data needed by a job to process the main dataset
- Using the job configuration (for small metadata only)
- Using the distributed cache
Side Data via the Job Configuration
- Used for metadata of no more than a few kilobytes
- Loaded via the JobTracker/TaskTracker/JVM sub-process
- Usage:
  - Set in the job configuration:

      Configuration conf = new Configuration();
      conf.set("line-prefix", "[SYSTEM]: ");
      conf.addResource("test.xml");
      Job job = new Job(conf, "wordcount");

  - Get in the Mapper/Reducer:

      context.getConfiguration().get("line-prefix");
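Streaming scripts have no Context object; instead, Hadoop exposes job configuration properties to the mapper/reducer process as environment variables, with non-alphanumeric characters in the property name replaced by underscores. A hedged Python sketch (the `line-prefix` property and its default are carried over from the Java example; the exact character mangling is an assumption to verify against your Hadoop version):

```python
import os

def get_conf(name, default=None):
    # Assumption: streaming tasks see jobconf properties as environment
    # variables, e.g. "line-prefix" arrives as "line_prefix".
    env_name = "".join(c if c.isalnum() else "_" for c in name)
    return os.environ.get(env_name, default)

prefix = get_conf("line-prefix", "[SYSTEM]: ")
```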
Side Data Distribution: Distributed Cache
- A service that copies files and archives to the task nodes at runtime
- Files are cached in the local file system of the tasktracker, possibly shared among different tasks
- Usage:
  - hadoop jar with the -files / -archives options:

      hadoop jar -files /test/file/file.1.

  - The DistributedCache class
  - Access in the Mapper/Reducer:

      FileReader reader = new FileReader("god.txt");
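Because -files places a copy of each distributed file in the task's working directory, a streaming script can open it by bare name, just as the FileReader above does. A minimal simulation (the file name `stopwords.txt` and its contents are invented; on a real cluster the file would be shipped by -files, not written locally):

```python
import os

# Simulate the distributed cache by creating the side file locally.
with open("stopwords.txt", "w") as f:
    f.write("the\na\nan\n")

def load_side_file(path="stopwords.txt"):
    # In a real streaming task this open() would hit the cached copy
    # placed in the working directory by the -files option.
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

stopwords = load_side_file()
print(sorted(stopwords))  # ['a', 'an', 'the']
os.remove("stopwords.txt")
```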
Hadoop Zen
- Don't get frustrated (take a deep breath)
  - Remember this when you experience those W$*#T@F! moments
- This is bleeding-edge technology:
  - Lots of bugs
  - Stability issues
  - Even lost data
  - To upgrade or not to upgrade (damned either way)?
  - Poor documentation (or none)
- But Hadoop is the path to data nirvana
System Integration
- Front-end: real-time, customer-facing, well-defined workflow
- Back-end: batch, internal, ad hoc analytics
[Diagram: the server-side software stack for interactive web applications — customers' browsers send AJAX/HTTP requests to an interface backed by a web server, middleware, and a DB server, and receive HTTP responses.]
Typical Scale-Out Strategies
- LAMP stack as the standard building block
- Lots of each (load balanced, possibly virtualized):
  - Web servers
  - Application servers
  - Cache servers
  - RDBMS
- Reliability achieved through replication
- Most workloads are easily partitioned:
  - Partition by user
  - Partition by geography
- Caching servers: 15 million requests per second, 95% handled by memcache (15 TB of RAM)
- Database layer: 800 eight-core Linux servers running MySQL (40 TB of user data)
Source: Technology Review (July/August 2008)
[Diagram: the same web stack, now augmented with a back-end Hadoop cluster (MapReduce over HDFS) that handles batch processing and serves both the customer-facing application and internal analysts.]
OK, so now we have gone through the MapReduce basics.