
COSC 6339 Big Data Analytics
Introduction to MapReduce (III) and 1st homework assignment
Edgar Gabriel, Spring 2015

Recommended Literature for this Lecture
Andrew Pavlo, Erik Paulson, Alexander Rasin, et al., A Comparison of Approaches to Large-Scale Data Analysis, http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf

Two Approaches to Large-Scale Data Analysis
Both MapReduce and parallel DBMSs run on shared-nothing architectures.
- MapReduce: a distributed file system; Map, Split, Copy, and Reduce phases; an MR scheduler.
- Parallel DBMS: standard relational tables; data are partitioned over the cluster nodes; SQL.

MapReduce vs. Parallel DBMS: Schema Support
- MapReduce: flexible, since programmers write code to interpret the input data (see the sketch below). Good for a single-application scenario; bad if data are shared by multiple applications, since each application must address data syntax, consistency, etc. on its own.
- Parallel DBMS: a relational schema is required. Good if data are shared by multiple applications.
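To make the "schema lives in the code" point concrete, here is a minimal, hypothetical mapper sketch. The comma-separated field layout is invented for illustration, not taken from the lecture; the point is that every assumption about the input format is embedded in the map() code itself, and nothing outside that code enforces it.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical input: comma-separated lines with a key in column 0 and an
    // integer score in column 2. Nothing outside this code knows or enforces that.
    public class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");  // the "schema" lives here
            if (fields.length < 3) return;                  // and so does error handling
            try {
                context.write(new Text(fields[0]),
                              new IntWritable(Integer.parseInt(fields[2].trim())));
            } catch (NumberFormatException e) {
                // malformed record: with no enforced schema, skipping it is this code's job
            }
        }
    }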

MapReduce vs. Parallel DBMS: Programming Model & Flexibility
- MapReduce: low level. Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets.
- Parallel DBMS: SQL; very flexible through user-defined functions, stored procedures, and user-defined aggregates.

MapReduce vs. Parallel DBMS: Indexing
- MapReduce: no native index support. Programmers can implement their own index support in the Map/Reduce code, but such customized indexes are hard to share across multiple applications.
- Parallel DBMS: hash and B-tree indexes are well supported.

MapReduce vs. Parallel DBMS: Execution Strategy & Fault Tolerance
- MapReduce: intermediate results are saved to local files. If a node fails, its task is simply re-run on another node. However, when multiple reducers read multiple local files from one mapper machine, the large number of disk seeks can lead to poor performance.
- Parallel DBMS: intermediate results are pushed across the network. If a node fails, the entire query must be re-run.

MapReduce vs. Parallel DBMS: Avoiding Data Transfers
- MapReduce: the scheduler places Map tasks close to the data, but beyond this, programmers must avoid data transfers themselves.
- Parallel DBMS: many optimizations, such as determining where to perform filtering.

Compiling Hadoop test code
- mkdir hw1
- mkdir hw1/src
- Put all .java files in hw1/src, ideally as part of the same package
- cp build.xml to hw1
- cd hw1
- Adjust build.xml, specifically the name of your jar file
- ant build
- The jar file is now in hw1/build/lib
- To verify the content of your jar file: jar tf nameofjarfile.jar

Running a Hadoop MapReduce job
- cd hw1/build/lib
- yarn jar hw1.jar example.flightclass /bigdata-common/ /bigd45/output/
  where hw1.jar is the name of the jar file, example.flightclass is the name of the package and of the class containing the main() function (a sketch of such a driver class follows below), /bigdata-common/ is the input directory (in HDFS), and /bigd45/output/ is the output directory, which must not exist before execution.
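For reference, a minimal sketch of what the class containing main() could look like, assuming the current (org.apache.hadoop.mapreduce) API. The lower-case package and class name are chosen only to match the yarn invocation above; the mapper and reducer classes are left as placeholders, since they depend on your assignment code.

    package example;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Lower-case class name only to match "yarn jar hw1.jar example.flightclass ..."
    public class flightclass {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hw1 example");
            job.setJarByClass(flightclass.class);    // tells YARN which jar to ship
            // job.setMapperClass(MyMapper.class);   // plug in your own mapper here
            // job.setReducerClass(MyReducer.class); // ... and your own reducer
            // job.setOutputKeyClass(...); job.setOutputValueClass(...);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With mapper and reducer classes filled in, this is exactly the class name passed as the second argument to yarn jar, and args[0]/args[1] are the input and output directories from the command line above.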

How to kill a job if it is hanging
List the running applications:
  yarn application -list
  14/02/05 17:01:23 INFO client.RMProxy: Connecting to ResourceManager at shark/192.168.1.170:10040
  Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
  Application-Id  Application-Name  Application-Type  User  Queue  State  Final-State  Progress  Tracking-URL
  application_1391622275771_0007  wordcount  MAPREDUCE  gabriel  default  RUNNING  UNDEFINED  5%  http://shark01:58572
Then kill the application by its id:
  yarn application -kill application_1391622275771_0007
  14/02/05 17:01:38 INFO client.RMProxy: Connecting to ResourceManager at shark/192.168.1.170:10040
  Killing application application_1391622275771_0007
  14/02/05 17:01:38 INFO impl.YarnClientImpl: Killing application application_1391622275771_0007

Using HDFS
If you want to run a MapReduce job on the cluster, the input data has to be in HDFS (it cannot be local), and the result will also be written to HDFS. The large input data set is already available with read-only permission for all students in hdfs:///bigdata-common/. Using HDFS is similar to using a local UNIX file system:
  hdfs dfs -ls /
  hdfs dfs -ls /bigd45/
  hdfs dfs -mkdir /bigd45/newdir
  hdfs dfs -rm /bigd45/file.txt
  hdfs dfs -rm -r /bigd45/

Using HDFS (II)
Copying a file into HDFS:
  hdfs dfs -put <localfilename> /bigd45/<remotefilename>
Looking at the content of a file in HDFS:
  hdfs dfs -cat /bigd45/filename.txt
Copying a file from HDFS into the local directory:
  hdfs dfs -get /bigd45/output/part-r-00000 .
Merging multiple output files (each reducer produces a separate output file!):
  hdfs dfs -getmerge /bigd/output/part-* allparts.out

1st Homework Rules
Each student should deliver:
- Source code (.java files)
- Documentation (.pdf, .doc, .tex or .txt file) with explanations of the code and answers to the questions
Deliver electronically to gabriel@cs.uh.edu. Expected by Friday, February 27, 11:59pm. In case of questions: ask, ask, ask!

1st Homework
Given a data set containing many questions asked on www.stackexchange.com and the corresponding answers:
- Questions come from different categories (e.g., academia, math, stackoverflow, etc.)
- Questions are answered by users, and answers are voted on
- An answer might be marked as Accepted
The input files are stripped-down XML files: the XML headers have been removed, so you can probably parse each row as a regular string/text (one way to do this is sketched after the field list below).

Description of the input file fields:
- Id
- PostTypeId (1: Question, 2: Answer)
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName="Jeff Atwood"
- LastEditDate="2009-03-05T22:28:34.823"
- LastActivityDate="2009-03-11T12:51:01.480"
- CommunityOwnedDate="2009-03-11T12:51:01.480"
- ClosedDate="2009-03-11T12:51:01.480"
- Title=
- Tags=
- AnswerCount
- CommentCount
- FavoriteCount
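Since the XML headers are gone, one possible way to read a row is a plain string search for name="value" pairs. The following is a hedged sketch of that idea; the class and helper names are mine, not from the slides.

    // Pull one attribute value out of a stripped-XML row by plain string search.
    public class RowParser {
        public static String extractAttribute(String row, String name) {
            // attributes look like Name="value"; the leading space avoids
            // accidental matches inside the (HTML-escaped) Body text
            String marker = " " + name + "=\"";
            int start = row.indexOf(marker);
            if (start < 0) return null;          // attribute not present in this row
            start += marker.length();
            int end = row.indexOf('"', start);
            return (end < 0) ? null : row.substring(start, end);
        }
    }

For example, RowParser.extractAttribute(line, "PostTypeId") returns "1" for a question row and "2" for an answer row, and null for attributes such as ParentID that are absent from question rows.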

Example
<row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate="2012-02-14T20:23:40.127" Score="12" ViewCount="166" Body="<p>As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan? </p>" OwnerUserId="5" LastEditorUserId="2700" LastEditDate="2013-10-30T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="<jobsearch><visa><japan>" AnswerCount="1" CommentCount="1" FavoriteCount="1" />

<row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate="2012-02-14T20:26:22.683" Score="7" ViewCount="292" Body="<p>Which online resources are available for job search at the Ph.D. level in the computational chemistry field?</p>" OwnerUserId="5" LastEditorUserId="116" LastEditDate="2012-02-18T21:06:16.473" LastActivityDate="2012-11-12T14:06:37.883" Title="Which online resources are available for Ph.D. level jobs?" Tags="<phd><jobsearch><chemistry>" AnswerCount="2" CommentCount="2" />

<row Id="6" PostTypeId="2" ParentId="3" CreationDate="2012-02-14T20:30:29.917" Score="18" Body="<p>If your institution has a subscription to Journal Citation Reports (JCR), you can check it there" OwnerUserId="18" LastActivityDate="2012-02-14T20:30:29.917" CommentCount="1" />

Part 1: Write a Hadoop MapReduce code that calculates the average number of answers per question (one possible structure is sketched below).
Part 2: Write a MapReduce code that calculates the average score per accepted answer, and the average score for answers that were not marked as accepted answers. Compare the two values.
Part 3: Write a MapReduce code that determines the number of answers per UserId.
For each of the three code versions, determine the time to perform the required analysis on the large file on the shark cluster.
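As a starting point only, here is a minimal sketch of one possible structure for Part 1, assuming the string-search parsing idea above. It reads the AnswerCount attribute of question rows; counting PostTypeId=2 rows and dividing by the number of PostTypeId=1 rows would be an equally valid strategy.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgAnswers {
        // same string-search idea as the RowParser sketch above
        static String attr(String row, String name) {
            int s = row.indexOf(" " + name + "=\"");
            if (s < 0) return null;
            s += name.length() + 3;
            int e = row.indexOf('"', s);
            return (e < 0) ? null : row.substring(s, e);
        }

        public static class QuestionMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String row = value.toString();
                if (!"1".equals(attr(row, "PostTypeId"))) return;  // questions only
                String ac = attr(row, "AnswerCount");
                if (ac == null) return;
                try {
                    ctx.write(new Text("avg"), new IntWritable(Integer.parseInt(ac)));
                } catch (NumberFormatException ignored) { /* skip malformed rows */ }
            }
        }

        public static class AvgReducer
                extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0, n = 0;
                for (IntWritable v : vals) { sum += v.get(); n++; }
                ctx.write(key, new DoubleWritable((double) sum / n)); // answers per question
            }
        }
    }

Note the design trade-off: funneling everything to the single key "avg" sends all values through one reducer; on the ~10 GB input, a combiner emitting partial (sum, count) pairs would cut down the shuffle traffic considerably.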

Input files
- A very small input file (300 lines) is available on the webpage for code development
- A small input file (~30 MB) is available in HDFS in /bigdata-hw1-small/
- A large input file (~10 GB) is available in HDFS in /bigdata-hw1-large
Only use the large input file after you have confirmed that your code runs correctly with the small input file.

Documentation
The documentation should contain:
- A (brief) problem description
- The solution strategy
- A results section, with a description of the resources used, a description of the measurements performed, and the results (graphs/tables + findings)

The document should not contain:
- A replication of the entire source code; that's why you have to deliver the sources
- Screenshots of every single measurement you made; actually, no screenshots at all
- The output files