COSC 6339 Big Data Analytics
Introduction to MapReduce (III) and 1st homework assignment
Edgar Gabriel
Spring 2015

Recommended Literature for this Lecture

- Andrew Pavlo, Erik Paulson, Alexander Rasin, A Comparison of Approaches to Large-Scale Data Analysis

Two Approaches to Large-Scale Data Analysis

Shared-nothing architectures

MapReduce
- Distributed file system
- Map, Split, Copy, Reduce
- MR scheduler

Parallel DBMS
- Standard relational tables
- Data are partitioned over cluster nodes
- SQL

MapReduce vs. Parallel DBMS: Schema Support

MapReduce
- Flexible: programmers write code to interpret the input data
- Good for a single-application scenario
- Bad if data are shared by multiple applications: each application must address data syntax, consistency, etc.

Parallel DBMS
- Relational schema required
- Good if data are shared by multiple applications

MapReduce vs. Parallel DBMS: Programming Model & Flexibility

MapReduce
- Low level
- "Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets."
- Very flexible

Parallel DBMS
- SQL
- User-defined functions, stored procedures, user-defined aggregates

MapReduce vs. Parallel DBMS: Indexing

MapReduce
- No native index support
- Programmers can implement their own index support in Map/Reduce code
- But it is hard to share customized indexes across multiple applications

Parallel DBMS
- Hash/B-tree indexes are well supported

MapReduce vs. Parallel DBMS: Execution Strategy & Fault Tolerance

MapReduce
- Intermediate results are saved to local files
- If a node fails, the node's task is run again on another node
- At a mapper machine, when multiple reducers are reading multiple local files, there can be a large number of disk seeks, leading to poor performance

Parallel DBMS
- Intermediate results are pushed across the network
- If a node fails, the entire query must be re-run

Avoiding Data Transfers

MapReduce
- Map tasks are scheduled close to the data
- Other than this, programmers must avoid data transfers themselves

Parallel DBMS
- A lot of optimizations, such as determining where to perform filtering

Compiling Hadoop test code

- Create the project directories:
    mkdir hw1
    mkdir hw1/src
- Put all Java files in hw1/src, ideally as part of the same package
- Copy build.xml to hw1, then change into that directory:
    cd hw1
- Adjust build.xml, specifically the name of your jar file
- Build the code:
    ant build
- The jar file is now in hw1/build/lib
- To verify the content of your jar file:
    jar tf nameofjarfile.jar

Running a Hadoop MapReduce job

    cd hw1/build/lib
    yarn jar hw1.jar example.flightclass /bigdata-common/ /bigd45/output/

The arguments of yarn jar are, in order:
- hw1.jar: name of the jar file
- example: name of the package
- flightclass: name of the class containing the main() function
- /bigdata-common/: name of the input directory (in HDFS)
- /bigd45/output/: name of the output directory (in HDFS); it must not exist before execution

How to kill a job if it is hanging

List the running applications:

    yarn application -list
    14/02/05 17:01:23 INFO client.RMProxy: Connecting to ResourceManager at shark/192.168.1.170:10040
    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
    Application-Id      Application-Name  Application-Type  User     Queue    State    Final-State  Progress  Tracking-URL
    application_ _0007  wordcount         MAPREDUCE         gabriel  default  RUNNING  UNDEFINED    5%

Kill the application using its Application-Id:

    yarn application -kill application_ _0007
    14/02/05 17:01:38 INFO client.RMProxy: Connecting to ResourceManager at shark/ :10040
    Killing application application_ _0007
    14/02/05 17:01:38 INFO impl.YarnClientImpl: Killing application application_ _0007

Using HDFS

- If you want to run a MapReduce job on the cluster, the input data must be in HDFS (it cannot be local), and the result will also be in HDFS
- A large input data set is already available with read-only permission for all students in hdfs:///bigdata-common/
- Using HDFS is similar to using a local UNIX file system:
    hdfs dfs -ls /
    hdfs dfs -ls /bigd45/
    hdfs dfs -mkdir /bigd45/newdir
    hdfs dfs -rm /bigd45/file.txt
    hdfs dfs -rm -r /bigd45/

Using HDFS (II)

- Copying a file into HDFS:
    hdfs dfs -put <localfilename> /bigd45/<remotefilename>
- Looking at the content of a file in HDFS:
    hdfs dfs -cat /bigd45/filename.txt
- Copying a file from HDFS into a local directory:
    hdfs dfs -get /bigd45/output/part-r
- Merging multiple output files (each reducer produces a separate output file!):
    hdfs dfs -getmerge /bigd/output/part-* allparts.out

1st Homework

Rules
- Each student should deliver
  - Source code (.java files)
  - Documentation (.pdf, .doc, .tex or .txt file)
    - explanations of the code
    - answers to questions
- Deliver electronically to [email protected]
- Expected by Friday, February 27, 11:59 pm
- In case of questions: ask, ask, ask!
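Because each reducer writes its own part file, the merge step above amounts to concatenating those files in name order. The following is a minimal local sketch of that idea in plain Java, assuming the part files have already been copied into a local directory. The class name LocalMergeSketch and the sample file contents are hypothetical, and the code does not use the Hadoop API or access HDFS.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local stand-in for a merge of reducer output files. */
final class LocalMergeSketch {

    private LocalMergeSketch() {}

    /** Concatenates all files named part-* in outputDir, in name order, into target. */
    static void merge(Path outputDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        // Directory listings are unordered, so sort to get part-r-00000, part-r-00001, ...
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data standing in for two reducer outputs.
        Path dir = Files.createTempDirectory("output");
        Files.write(dir.resolve("part-r-00000"), "18\t2\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("part-r-00001"), "7\t1\n".getBytes(StandardCharsets.UTF_8));

        Path merged = Files.createTempFile("allparts", ".out");
        merge(dir, merged);
        System.out.print(new String(Files.readAllBytes(merged), StandardCharsets.UTF_8));
    }
}
```

The target file is deliberately placed outside the part-file directory so that it cannot be picked up by the part-* pattern.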

1. Given a data set containing many questions asked on and the corresponding answers

- Different categories (e.g. academic, math, stackoverflow etc.)
- Questions are answered by users, answers are being voted on
- An answer might be marked as Accepted
- Input files are stripped-down XML files
  - XML headers have been removed, so you can probably parse each line as a regular string/text

Description of the input file
- Id
- PostTypeId (1: Question, 2: Answer)
- ParentId (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName="Jeff Atwood"
- LastEditDate=" T22:28:34.823"
- LastActivityDate=" T12:51:01.480"
- CommunityOwnedDate=" T12:51:01.480"
- ClosedDate=" T12:51:01.480"
- Title=
- Tags=
- AnswerCount
- CommentCount
- FavoriteCount
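Since each record is a single line of name="value" attributes, one way to read the fields listed above is a regular expression over the raw string. The following is a minimal sketch, not the required approach: the class name RowAttributes is hypothetical, the sample row is shortened, and the pattern assumes that attribute values never contain an unescaped double quote.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the name="value" pairs of one <row ... /> line. */
final class RowAttributes {

    // Assumption: values contain no raw double quote, so [^"]* captures a whole value.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    private RowAttributes() {}

    /** Returns the attributes of the line in the order in which they appear. */
    static Map<String, String> parse(String line) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(line);
        while (matcher.find()) {
            attributes.put(matcher.group(1), matcher.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Shortened sample row; real rows carry more attributes.
        String line = "<row Id=\"6\" PostTypeId=\"2\" ParentId=\"3\" Score=\"18\" OwnerUserId=\"18\" />";
        Map<String, String> row = parse(line);

        // PostTypeId distinguishes questions (1) from answers (2).
        boolean isAnswer = "2".equals(row.get("PostTypeId"));
        System.out.println("Id=" + row.get("Id") + " isAnswer=" + isAnswer
                + " parent=" + row.get("ParentId"));
    }
}
```

Attributes that are absent from a row, such as AcceptedAnswerId on an answer, are simply missing from the returned map, so callers should check for null before converting values to numbers.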

Example

<row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate=" T20:23:40.127" Score="12" ViewCount="166" Body="As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan? " OwnerUserId="5" LastEditorUserId="2700" LastEditDate=" T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="<jobsearch><visa><japan>" AnswerCount="1" CommentCount="1" FavoriteCount="1" />

<row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate=" T20:26:22.683" Score="7" ViewCount="292" Body="Which online resources are available for job search at the Ph.D. level in the computational chemistry field? " OwnerUserId="5" LastEditorUserId="116" LastEditDate=" T21:06:16.473" LastActivityDate=" T14:06:37.883" Title="Which online resources are available for Ph.D. level jobs?" Tags="<phd><jobsearch><chemistry>" AnswerCount="2" CommentCount="2" />

<row Id="6" PostTypeId="2" ParentId="3" CreationDate=" T20:30:29.917" Score="18" Body="If your institution has a subscription to Journal Citation Reports (JCR), you can check it there" OwnerUserId="18" LastActivityDate=" T20:30:29.917" CommentCount="1" />

- Part 1: Write a Hadoop MapReduce program that calculates the average number of answers per question.
- Part 2: Write a MapReduce program that calculates the average score of accepted answers and the average score of answers that were not marked as accepted. Compare the two values.
- Part 3: Write a MapReduce program that determines the number of answers per UserID.
- For each of the three programs, determine the time needed to perform the required analysis on the large file on the shark cluster.
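The per-user count in Part 3 follows the familiar key/value counting pattern: the map step emits (OwnerUserId, 1) for every answer row, the framework groups the values by key, and the reduce step sums them. The sketch below imitates these three phases in plain Java on a handful of in-memory lines so that the logic can be checked without a cluster. It is an illustration under assumptions, not Hadoop code: the class name AnswersPerUserSketch, its method names, and the sample rows are all hypothetical, and a real job would use Hadoop's Mapper and Reducer classes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory imitation of map, shuffle and reduce for counting answers per user. */
final class AnswersPerUserSketch {

    // Assumption: attribute values contain no raw double quote.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    private AnswersPerUserSketch() {}

    /** Returns the value of the named attribute, or null if the row does not have it. */
    static String attribute(String line, String name) {
        Matcher matcher = ATTRIBUTE.matcher(line);
        while (matcher.find()) {
            if (matcher.group(1).equals(name)) {
                return matcher.group(2);
            }
        }
        return null;
    }

    /** Map step: for an answer row (PostTypeId 2), emit the pair (OwnerUserId, 1). */
    static void map(String line, Map<String, List<Integer>> grouped) {
        String owner = attribute(line, "OwnerUserId");
        if ("2".equals(attribute(line, "PostTypeId")) && owner != null) {
            // Grouping by key here stands in for the shuffle phase of the framework.
            grouped.computeIfAbsent(owner, key -> new ArrayList<>()).add(1);
        }
    }

    /** Reduce step: sum all values emitted for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map over all lines, then reduce over every group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            map(line, grouped);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up, shortened rows: one question and three answers.
        List<String> lines = Arrays.asList(
                "<row Id=\"1\" PostTypeId=\"1\" OwnerUserId=\"5\" />",
                "<row Id=\"6\" PostTypeId=\"2\" ParentId=\"1\" OwnerUserId=\"18\" />",
                "<row Id=\"7\" PostTypeId=\"2\" ParentId=\"1\" OwnerUserId=\"18\" />",
                "<row Id=\"8\" PostTypeId=\"2\" ParentId=\"1\" OwnerUserId=\"7\" />");
        for (Map.Entry<String, Integer> entry : run(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Developing the map and reduce logic as small, separately callable methods like this makes it easy to test on the very small input file before moving to the cluster.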

Input files

- Very small input file (300 lines) available on the webpages for code development
- Small input file (~30 MB) available in HDFS in /bigdata-hw1-small/
- Large input file (~10 GB) available in HDFS in /bigdata-hw1-large
- Only use the large input file after you have confirmed that your code runs correctly with the small input file

Documentation

The documentation should contain
- (Brief) problem description
- Solution strategy
- Results section
  - Description of resources used
  - Description of measurements performed
  - Results (graphs/tables + findings)

The document should not contain
- A replication of the entire source code; that's why you have to deliver the sources
- Screen shots of every single measurement you made
  - Actually, no screen shots at all
- The output files
