Data-Intensive Programming. Timo Aaltonen University Lecturer Tampere University of Technology

1 Data-Intensive Programming Timo Aaltonen University Lecturer Tampere University of Technology

2 Outline
- Guest talk by Shubham Keshri
- Course work
- Data movement through MR architecture
- Developing MapReduce application
- Software Engineering Process
- Configuration
- Testing

3 IDE Question
- Hadoop is a framework that can be programmed with practically any IDE
  - Eclipse
  - Maven, pom.xml
- The coursework is more like weekly exercises, so any text editor should be fine, including emacs and vi

4 Course Work
- Current plan of tasks
  - Week #1: Group creation
  - Week #2: Install Hadoop
  - Week #3: Data to HDFS
  - Week #4: MapReduce
  - Week #5: MapReduce
  - Week #6: Higher-level tools

5 Coursework: Task
- Write a MapReduce application that calculates the average air temperature for each station
- Your data records might look like this:
    0,1014,"{u'utc': u' t11:29:00.000z', u'localtime': u' t11:29:00.000z'}",2000,12.7,24.4,41.6,48
- The mapper emits the station ID as the key and the air temperature as the value, for example (1014, 24.4)
- The reducer calculates the averages:
    tempreducer(1014, [24.4, 24.4, 26.4]) → (1014, 25.07)

6 Coursework: Task
- Task 4.1
  - Use Hadoop Streaming and write Map and Reduce with Python
- Task 4.2 (optional)
  - Use Hadoop core utilities and write Map and Reduce with Java

7 Task 4.1
- A piece of the CSV file:
    0,1014,"{u'utc': u' t11:29:00.000z', u'localtime': u' t11:29:00.000z'}",2000,12.7,24.4,41.6,48
- Fields when the line is split on every comma (the quoted timestamp field itself contains a comma, so it counts as two pieces):
  - The first field (index 0) is 0 (line number)
  - The second is 1014 (station id)
  - The third: "{u'utc': u' t11:29:00.000z'...
  - The seventh (index 6) is 24.4 (the air temperature)
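
The field layout above can be turned into a mapper function along the following lines. This is only a sketch under stated assumptions, not the course's reference solution: the names `map_line`, `STATION_FIELD` and `AIR_TEMP_FIELD` are hypothetical, the line is split on every comma exactly as `cut -d,` would do, and malformed records are silently skipped.

```python
STATION_FIELD = 1    # second field when splitting on every comma
AIR_TEMP_FIELD = 6   # seventh field when splitting on every comma


def map_line(line):
    """Return 'station<TAB>temperature' for one CSV record, or None if malformed."""
    fields = line.rstrip("\n").split(",")
    if len(fields) <= AIR_TEMP_FIELD:
        return None  # too few fields to contain an air temperature
    try:
        temperature = float(fields[AIR_TEMP_FIELD])
    except ValueError:
        return None  # the temperature field is not a number
    return f"{fields[STATION_FIELD]}\t{temperature}"


SAMPLE = ("0,1014,\"{u'utc': u' t11:29:00.000z', "
          "u'localtime': u' t11:29:00.000z'}\",2000,12.7,24.4,41.6,48")

if __name__ == "__main__":
    print(map_line(SAMPLE))
```

For the sample record the function returns the station id 1014 and the temperature 24.4 separated by a tab. A streaming mapper script would call `map_line` for every line of standard input and print each non-None result.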

8 Task 4.1
- Command-line equivalent:
    cat rw.csv | cut -d\, -f2,7 | tr -s ',' '\t' | sort | ./reduceraverage.py
- First transform it to:
    cat rw.csv | ./mappertemp.py | sort | ./reduceraverage.py
- Then run in Hadoop:
    hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-<version>.jar \
        -input inputdir/inpufile.csv -output output \
        -mapper ./mappertemp.py -reducer ./reduceraverage.py \
        -file ./mappertemp.py -file ./reduceraverage.py

9 Task 4.1
- Stages of the command-line equivalent:
  - Input: cat rw.csv
  - Map: cut -d\, -f2,7 | tr -s ',' '\t'
  - Shuffle: sort
  - Reduce: ./reduceraverage.py
- Command-line equivalent:
    cat rw.csv | cut -d\, -f2,7 | tr -s ',' '\t' | sort | ./reduceraverage.py
- First implement the mapper and the reducer to get:
    cat rw.csv | ./mappertemp.py | sort | ./reduceraverage.py
- Then run in Hadoop:
    hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-<version>.jar \
        -input inputdir/inpufile.csv -output output \
        -mapper ./mappertemp.py -reducer ./reduceraverage.py \
        -file ./mappertemp.py -file ./reduceraverage.py
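
The reduce step of the pipeline above could be sketched in Python as follows. It is a minimal sketch, not the course's reference solution. It assumes the input arrives sorted by key, one tab-separated (station, temperature) pair per line, which is what `sort` delivers on the command line. The names `parse`, `reduce_averages` and `main` are hypothetical.

```python
import sys
from itertools import groupby


def parse(line):
    """Split one 'station<TAB>temperature' line into (station, float)."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, float(value)


def reduce_averages(lines):
    """Yield (station, average) from key-sorted 'station<TAB>temperature' lines."""
    pairs = (parse(line) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, hence the sorted-input assumption
    for station, group in groupby(pairs, key=lambda pair: pair[0]):
        temps = [temp for _, temp in group]
        yield station, sum(temps) / len(temps)


def main(stream=sys.stdin, out=sys.stdout):
    """Read sorted pairs from stream and write one average per station."""
    for station, average in reduce_averages(stream):
        out.write(f"{station}\t{average:.2f}\n")
```

A streaming reducer script would simply call `main()`. Because the function relies on adjacent equal keys, feeding it unsorted input would produce several partial averages for the same station.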

10 Task 4.2
- AverageTemperature in Hadoop/Java
- Suggestion: use the canonical WordCount as a starting point
  - Transform its mapper and reducer into equivalents of the Python versions
- Hints
  - TextInputFormat slides from this slide set
  - Debugging output in the demo

11 Task 4.2
- The definition of the mapper begins like this:
    public static class AirTempMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
- The definition of the reducer:
    public static class DoubleAverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

12 Coursework?

13 Data movement
- Data is read from files into mappers
- Emitted by mappers to reducers, and
- Emitted by reducers into output files

14 Input Formats
- InputFormat defines how to read data from a file into the Mapper instances
  - Reads the input file and emits (key, value) pairs, which are fed to mappers
- Hadoop comes with several implementations of InputFormat, like SequenceFileInputFormat, FileInputFormat, TextInputFormat
- Custom formats are made by subclassing, e.g., FileInputFormat

15 Input Formats
- InputFormat divides the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks
  - a fragment is called a split
- Most files are split up on the boundaries of the underlying blocks in HDFS
  - remember non-splittable files
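
The idea of cutting a file into splits by byte range can be illustrated with a few lines of Python. This is a deliberately simplified sketch: it cuts the file into equal-sized byte ranges and ignores everything else Hadoop takes into account when it computes splits. The function name `compute_splits` and the sizes used in the example are assumptions made for illustration.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs covering file_size bytes in split_size chunks."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    splits = []
    offset = 0
    while offset < file_size:
        # the last split may be shorter than split_size
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # a hypothetical 300-byte file cut into 128-byte splits
    for offset, length in compute_splits(300, 128):
        print(offset, length)
```

Each (offset, length) pair would be the input of one map task. A non-splittable file corresponds to a single split that covers the whole file.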

16 FileInputFormat
- Used in the WordCount example
- Is actually a base class for file-based InputFormats like TextInputFormat
- By default the split size is between 1 and Long.MAX_VALUE
- Not line-oriented

17 Input Formats
- Import clause in WordCount:
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- Job configuration:
    for (int i = 0; i < otherArgs.length - 1; ++i) {
        FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }

18 TextInputFormat
- InputFormat for plain text files
- Files are broken into lines
- Either a linefeed or a carriage return signals the end of a line
- Keys are the positions in the file, and values are the lines of text

19 TextInputFormat
- TextInputFormat divides files into splits strictly by byte offsets
- It reads individual lines of the files from the split in as record inputs to the Mapper
- The key it emits for each record is the byte offset of the line read (as a LongWritable)
- The value is the contents of the line up to the terminating '\n' character (as a Text object)
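
The (byte offset, line) records described above can be imitated in plain Python to see what a mapper receives. This sketch only mimics the record shape for a whole in-memory buffer. It is not Hadoop code, it does not deal with split boundaries, and the name `text_records` is hypothetical.

```python
def text_records(data):
    """Yield (byte_offset, line_text) for each line in a bytes buffer."""
    offset = 0
    # keepends=True keeps the line terminators so offsets count every byte
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


if __name__ == "__main__":
    for key, value in text_records(b"a,1\nbb,2\nccc,3\n"):
        print(key, value)
```

The keys are byte positions rather than line numbers, so a mapper cannot derive the line number of a record from its key alone.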

20 InputFormats?

21 Developing a MR Application
- Writing a MR program
  - write the map and reduce functions
  - unit tests validate their correctness
  - write a driver program to run the job
  - use an IDE for testing and debugging
- When the program works on a small data set, run it in a cluster
  - the full data set is likely to expose more issues
  - debugging in the cluster is more challenging

22 Developing a MR Application
- When the program works
  - fine-tuning
  - profiling

23 Configuration API
- Components are configured with Hadoop's own configuration API
  - org.apache.hadoop.conf
- An instance of the Configuration class represents properties and their values
  - each property is named by a String

24 Configuration API
- Example: conf-1.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>size</name>
        <value>10</value>
        <description>size</description>
      </property>
      ...

25 Configuration API
- Now:
    Configuration conf = new Configuration();
    conf.addResource("conf-1.xml");
    assertThat(conf.getInt("size", 0), is(10));
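
The property lookup above can be mirrored in Python to make the file format concrete. The sketch below is an analogy only and does not use or reproduce Hadoop's Configuration class: it parses the same name/value layout with the standard library, and the helper names `load_properties` and `get_int` are hypothetical. The elided rest of conf-1.xml is left out, so the embedded document holds just the one property shown on the slide.

```python
import xml.etree.ElementTree as ET

# Only the property shown on the slide; the rest of the file is not known.
CONF_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>size</name>
    <value>10</value>
    <description>size</description>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a dict mapping each property name to its string value."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def get_int(properties, name, default):
    """Return the named property as an int, or default if it is missing."""
    value = properties.get(name)
    return int(value) if value is not None else default


if __name__ == "__main__":
    props = load_properties(CONF_XML)
    print(get_int(props, "size", 0))
```

Values are stored as strings in the file, which is why a typed accessor with a default value is needed to read `size` as a number.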

26 Managing Configuration
- When developing a Hadoop application, it is common to switch between running the app
  - locally
  - on a cluster (or a pseudo-distributed cluster)
- One way to make this happen is to have configuration files containing the connection settings
- Assume a directory named conf
  - hadoop-local.xml, hadoop-cluster.xml

27 hadoop-local.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>file:///</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
    </configuration>

28 hadoop-cluster.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode/</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker:8021</value>
      </property>
    </configuration>

29 Managing Configuration
- Now Hadoop can be started with different configurations:
    % hadoop ... -conf conf/hadoop-local.xml ...
- If -conf is omitted, $HADOOP_INSTALL/conf is used

30 MRUnit
- Due to their functional style, Map and Reduce are easy to test in isolation
- MRUnit is a testing library
  - Easy to pass input to the mapper and reducer and validate the output
  - Can be used in conjunction with standard test execution frameworks, like JUnit

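31 AirTempMapperTest
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class AirTempMapperTest {
        @Test
        public void processValidRecord() throws IOException, InterruptedException {
            // One input record; the mapper should emit (station id, air temperature)
            Text value = new Text("0,1014,\"{u'utc': u' t11:29:00.000z', u'localtime': u' t11:29:00.000z'}\",2000,12.7,24.4,41.6,48");
            new MapDriver<LongWritable, Text, Text, DoubleWritable>()
                .withMapper(new fi.tut.cs.airtempaverage.AirTempMapper())
                .withInputValue(value)
                .withOutput(new Text("1014"), new DoubleWritable(24.4))
                .runTest();
        }
    }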
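
The same test-in-isolation idea carries over to the Python mapper of Task 4.1, where the standard `unittest` module can play the role that MRUnit and JUnit play for Java. The sketch below is an assumption-laden illustration: `parse_record` is a hypothetical stand-in for the mapping logic, returning a (station, temperature) tuple or None, and it is not part of the course material.

```python
import unittest


def parse_record(line):
    """Return (station, temperature) for one CSV record, or None if malformed."""
    fields = line.rstrip("\n").split(",")  # naive split on every comma
    try:
        return fields[1], float(fields[6])
    except (IndexError, ValueError):
        return None


class ParseRecordTest(unittest.TestCase):
    def test_processes_valid_record(self):
        record = ("0,1014,\"{u'utc': u' t11:29:00.000z', "
                  "u'localtime': u' t11:29:00.000z'}\",2000,12.7,24.4,41.6,48")
        self.assertEqual(parse_record(record), ("1014", 24.4))

    def test_skips_malformed_record(self):
        self.assertIsNone(parse_record("not,a,valid,record"))


if __name__ == "__main__":
    unittest.main()
```

Keeping the mapping logic in a plain function, separate from the code that reads standard input, is what makes it testable without running Hadoop.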

32 Demo
- Demo is taken from …
- Unfortunately, Maven is used for building and packaging the application
