
COSC 6339 Big Data Analytics
Introduction to MapReduce (III) and 1st homework assignment
Edgar Gabriel, Spring 2015

Recommended Literature for this Lecture
Andrew Pavlo, Erik Paulson, Alexander Rasin, et al., A Comparison of Approaches to Large-Scale Data Analysis, http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf

Two Approaches to Large-Scale Data Analysis
Both MapReduce and parallel DBMSs run on shared-nothing architectures.
- MapReduce: a distributed file system; Map, Split, Copy, and Reduce phases; an MR scheduler.
- Parallel DBMS: standard relational tables; data are partitioned over the cluster nodes; SQL.

MapReduce vs. Parallel DBMS: Schema Support
- MapReduce: flexible, since programmers write code to interpret the input data (see the sketch below). Good for a single-application scenario; bad if data are shared by multiple applications, since each application must address data syntax, consistency, etc. on its own.
- Parallel DBMS: a relational schema is required. Good if data are shared by multiple applications.
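To make the "schema lives in the code" point concrete, here is a minimal, hypothetical mapper sketch. The comma-separated field layout is invented for illustration, not taken from the lecture; the point is that every assumption about the input format is embedded in the map() code itself, and nothing outside that code enforces it.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical input: comma-separated lines with a key in column 0 and an
    // integer score in column 2. Nothing outside this code knows or enforces that.
    public class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");  // the "schema" lives here
            if (fields.length < 3) return;                  // and so does error handling
            try {
                context.write(new Text(fields[0]),
                              new IntWritable(Integer.parseInt(fields[2].trim())));
            } catch (NumberFormatException e) {
                // malformed record: with no enforced schema, skipping it is this code's job
            }
        }
    }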

MapReduce vs. Parallel DBMS: Programming Model & Flexibility
- MapReduce: low level. Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets.
- Parallel DBMS: SQL; very flexible through user-defined functions, stored procedures, and user-defined aggregates.

MapReduce vs. Parallel DBMS: Indexing
- MapReduce: no native index support. Programmers can implement their own index support in the Map/Reduce code, but such customized indexes are hard to share across multiple applications.
- Parallel DBMS: hash and B-tree indexes are well supported.

MapReduce vs. Parallel DBMS: Execution Strategy & Fault Tolerance
- MapReduce: intermediate results are saved to local files. If a node fails, its task is simply re-run on another node. However, when multiple reducers read multiple local files from one mapper machine, the large number of disk seeks can lead to poor performance.
- Parallel DBMS: intermediate results are pushed across the network. If a node fails, the entire query must be re-run.

MapReduce vs. Parallel DBMS: Avoiding Data Transfers
- MapReduce: the scheduler places Map tasks close to the data, but beyond this, programmers must avoid data transfers themselves.
- Parallel DBMS: many optimizations, such as determining where to perform filtering.

Compiling Hadoop test code
- mkdir hw1
- mkdir hw1/src
- Put all .java files in hw1/src, ideally as part of the same package
- cp build.xml to hw1
- cd hw1
- Adjust build.xml, specifically the name of your jar file
- ant build
- The jar file is now in hw1/build/lib
- To verify the content of your jar file: jar tf nameofjarfile.jar

Running a Hadoop MapReduce job
- cd hw1/build/lib
- yarn jar hw1.jar example.flightclass /bigdata-common/ /bigd45/output/
  where hw1.jar is the name of the jar file, example.flightclass is the name of the package and of the class containing the main() function (a sketch of such a driver class follows below), /bigdata-common/ is the input directory (in HDFS), and /bigd45/output/ is the output directory, which must not exist before execution.
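For reference, a minimal sketch of what the class containing main() could look like, assuming the current (org.apache.hadoop.mapreduce) API. The lower-case package and class name are chosen only to match the yarn invocation above; the mapper and reducer classes are left as placeholders, since they depend on your assignment code.

    package example;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Lower-case class name only to match "yarn jar hw1.jar example.flightclass ..."
    public class flightclass {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hw1 example");
            job.setJarByClass(flightclass.class);    // tells YARN which jar to ship
            // job.setMapperClass(MyMapper.class);   // plug in your own mapper here
            // job.setReducerClass(MyReducer.class); // ... and your own reducer
            // job.setOutputKeyClass(...); job.setOutputValueClass(...);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With mapper and reducer classes filled in, this is exactly the class name passed as the second argument to yarn jar, and args[0]/args[1] are the input and output directories from the command line above.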

How to kill a job if it is hanging
List the running applications:
  yarn application -list
  14/02/05 17:01:23 INFO client.RMProxy: Connecting to ResourceManager at shark/192.168.1.170:10040
  Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
  Application-Id  Application-Name  Application-Type  User  Queue  State  Final-State  Progress  Tracking-URL
  application_1391622275771_0007  wordcount  MAPREDUCE  gabriel  default  RUNNING  UNDEFINED  5%  http://shark01:58572
Then kill the application by its id:
  yarn application -kill application_1391622275771_0007
  14/02/05 17:01:38 INFO client.RMProxy: Connecting to ResourceManager at shark/192.168.1.170:10040
  Killing application application_1391622275771_0007
  14/02/05 17:01:38 INFO impl.YarnClientImpl: Killing application application_1391622275771_0007

Using HDFS
If you want to run a MapReduce job on the cluster, the input data has to be in HDFS (it cannot be local), and the result will also be written to HDFS. The large input data set is already available with read-only permission for all students in hdfs:///bigdata-common/. Using HDFS is similar to using a local UNIX file system:
  hdfs dfs -ls /
  hdfs dfs -ls /bigd45/
  hdfs dfs -mkdir /bigd45/newdir
  hdfs dfs -rm /bigd45/file.txt
  hdfs dfs -rm -r /bigd45/

Using HDFS (II)
Copying a file into HDFS:
  hdfs dfs -put <localfilename> /bigd45/<remotefilename>
Looking at the content of a file in HDFS:
  hdfs dfs -cat /bigd45/filename.txt
Copying a file from HDFS into the local directory:
  hdfs dfs -get /bigd45/output/part-r-00000 .
Merging multiple output files (each reducer produces a separate output file!):
  hdfs dfs -getmerge /bigd/output/part-* allparts.out

1st Homework Rules
Each student should deliver:
- Source code (.java files)
- Documentation (.pdf, .doc, .tex or .txt file) with explanations of the code and answers to the questions
Deliver electronically to gabriel@cs.uh.edu. Expected by Friday, February 27, 11:59pm. In case of questions: ask, ask, ask!

1st Homework
Given a data set containing many questions asked on www.stackexchange.com and the corresponding answers:
- Questions come from different categories (e.g., academia, math, stackoverflow, etc.)
- Questions are answered by users, and answers are voted on
- An answer might be marked as Accepted
The input files are stripped-down XML files: the XML headers have been removed, so you can probably parse each row as a regular string/text (one way to do this is sketched after the field list below).

Description of the input file fields:
- Id
- PostTypeId (1: Question, 2: Answer)
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName="Jeff Atwood"
- LastEditDate="2009-03-05T22:28:34.823"
- LastActivityDate="2009-03-11T12:51:01.480"
- CommunityOwnedDate="2009-03-11T12:51:01.480"
- ClosedDate="2009-03-11T12:51:01.480"
- Title=
- Tags=
- AnswerCount
- CommentCount
- FavoriteCount
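Since the XML headers are gone, one possible way to read a row is a plain string search for name="value" pairs. The following is a hedged sketch of that idea; the class and helper names are mine, not from the slides.

    // Pull one attribute value out of a stripped-XML row by plain string search.
    public class RowParser {
        public static String extractAttribute(String row, String name) {
            // attributes look like Name="value"; the leading space avoids
            // accidental matches inside the (HTML-escaped) Body text
            String marker = " " + name + "=\"";
            int start = row.indexOf(marker);
            if (start < 0) return null;          // attribute not present in this row
            start += marker.length();
            int end = row.indexOf('"', start);
            return (end < 0) ? null : row.substring(start, end);
        }
    }

For example, RowParser.extractAttribute(line, "PostTypeId") returns "1" for a question row and "2" for an answer row, and null for attributes such as ParentID that are absent from question rows.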

Example
<row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate="2012-02-14T20:23:40.127" Score="12" ViewCount="166" Body="<p>As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan? </p>" OwnerUserId="5" LastEditorUserId="2700" LastEditDate="2013-10-30T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="<jobsearch><visa><japan>" AnswerCount="1" CommentCount="1" FavoriteCount="1" />

<row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate="2012-02-14T20:26:22.683" Score="7" ViewCount="292" Body="<p>Which online resources are available for job search at the Ph.D. level in the computational chemistry field?</p>" OwnerUserId="5" LastEditorUserId="116" LastEditDate="2012-02-18T21:06:16.473" LastActivityDate="2012-11-12T14:06:37.883" Title="Which online resources are available for Ph.D. level jobs?" Tags="<phd><jobsearch><chemistry>" AnswerCount="2" CommentCount="2" />

<row Id="6" PostTypeId="2" ParentId="3" CreationDate="2012-02-14T20:30:29.917" Score="18" Body="<p>If your institution has a subscription to Journal Citation Reports (JCR), you can check it there" OwnerUserId="18" LastActivityDate="2012-02-14T20:30:29.917" CommentCount="1" />

Part 1: Write a Hadoop MapReduce code that calculates the average number of answers per question (one possible structure is sketched below).
Part 2: Write a MapReduce code that calculates the average score per accepted answer, and the average score for answers that were not marked as accepted answers. Compare the two values.
Part 3: Write a MapReduce code that determines the number of answers per UserId.
For each of the three code versions, determine the time to perform the required analysis on the large file on the shark cluster.
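As a starting point only, here is a minimal sketch of one possible structure for Part 1, assuming the string-search parsing idea above. It reads the AnswerCount attribute of question rows; counting PostTypeId=2 rows and dividing by the number of PostTypeId=1 rows would be an equally valid strategy.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgAnswers {
        // same string-search idea as the RowParser sketch above
        static String attr(String row, String name) {
            int s = row.indexOf(" " + name + "=\"");
            if (s < 0) return null;
            s += name.length() + 3;
            int e = row.indexOf('"', s);
            return (e < 0) ? null : row.substring(s, e);
        }

        public static class QuestionMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String row = value.toString();
                if (!"1".equals(attr(row, "PostTypeId"))) return;  // questions only
                String ac = attr(row, "AnswerCount");
                if (ac == null) return;
                try {
                    ctx.write(new Text("avg"), new IntWritable(Integer.parseInt(ac)));
                } catch (NumberFormatException ignored) { /* skip malformed rows */ }
            }
        }

        public static class AvgReducer
                extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0, n = 0;
                for (IntWritable v : vals) { sum += v.get(); n++; }
                ctx.write(key, new DoubleWritable((double) sum / n)); // answers per question
            }
        }
    }

Note the design trade-off: funneling everything to the single key "avg" sends all values through one reducer; on the ~10 GB input, a combiner emitting partial (sum, count) pairs would cut down the shuffle traffic considerably.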

Input files
- A very small input file (300 lines) is available on the webpage for code development
- A small input file (~30 MB) is available in HDFS in /bigdata-hw1-small/
- A large input file (~10 GB) is available in HDFS in /bigdata-hw1-large
Only use the large input file after you have confirmed that your code runs correctly with the small input file.

Documentation
The documentation should contain:
- A (brief) problem description
- The solution strategy
- A results section, with a description of the resources used, a description of the measurements performed, and the results (graphs/tables + findings)

The document should not contain:
- A replication of the entire source code; that's why you have to deliver the sources
- Screenshots of every single measurement you made; actually, no screenshots at all
- The output files