Extreme Computing Second Assignment
Michail Basios (m.basios@sms.ed.ac.uk)
Stratis Viglas (sviglas@inf.ed.ac.uk)

This assignment is divided into three parts and eight tasks. The first part deals with taking messy Web data, cleaning it up and performing simple queries over it. The second part focuses on how huge matrices can be transposed using MAPREDUCE. Finally, the last part deals with how a relational join operation can be performed using MAPREDUCE. You should use the teaching Hadoop cluster and any programming language you want.

For each part there are different data sets (on HDFS). There are two versions of each input file that should be used in your program: a small one and a larger one. The small version is for developing and testing your code; when you are happy that it works, you should use the large version. The files (on HDFS) that you will need for each part are as follows:

Part 1
    File                                 Number of lines   Number of words
    /user/s1250553/ex2/websmall.txt      10000             149552
    /user/s1250553/ex2/weblarge.txt      5000000           69760377

Part 2
    File                                 Dimensions
    /user/s1250553/ex2/matrixsmall.txt   (20, 10)
    /user/s1250553/ex2/matrixlarge.txt   (4000, 3000)

Part 3
    File                                 Number of lines   Number of words
    /user/s1250553/ex2/unismall.txt      2000              6987
    /user/s1250553/ex2/unilarge.txt      2000000           6985354

All your actual results should be produced using the larger files, and for all tasks you should use Hadoop MAPREDUCE and HDFS.
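One convenient way to run your programs on the cluster is Hadoop Streaming, which lets the mapper and reducer be ordinary programs that read standard input and write standard output. As a rough, illustrative sketch (not a required structure), a Streaming job over one of the small files might be launched as shown below; the exact path of the streaming jar depends on the cluster installation, and mapper.py and reducer.py are placeholder names for your own scripts:

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /user/s1250553/ex2/websmall.txt \
        -output websmall-out \
        -file mapper.py -mapper mapper.py \
        -file reducer.py -reducer reducer.py

The job's output can then be inspected with hadoop dfs -ls websmall-out and hadoop dfs -cat websmall-out/part-00000, or merged into a single local file with hadoop dfs -getmerge websmall-out localfile.txt.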

1 Tasks

1.1 Processing Web data

Task 1 (2 marks)
Take the file weblarge.txt and produce a version that is all UPPER-case. For example, the sentence

    John loves Mary.

would become

    JOHN LOVES MARY.

Call this the UPPER-case version. It does not matter if the output is a single file or multiple files.

Task 2 (3 marks)
Remove duplicated sentences from the produced UPPER-case version. Call this the deduplicated version. Your approach should be exact (make no mistakes). The output will be used for some of the following tasks.

Task 3 (2 marks)
Look at the Unix command wc. Implement an exact version of it (which only counts words and lines) for Hadoop, using MAPREDUCE and any appropriate Unix commands. You should assume the input data and results are stored in HDFS. The output results from HDFS should be merged into a single file (on Unix) with the total number of words and lines. For example, the following file example.txt:

    bob had a little lamb and a small cat
    alice had one tiger
    mary had some small dogs and a rabbit

should have the following result (21 is the total number of words and 3 is the total number of lines):

    21 3

Task 4 (3 marks)
Implement a randomised version of the previous task which uses probabilistic counting, as taught during the lectures.

Task 5 (3 marks)
On the deduplicated version, find all three-word sequences and their counts. For example, the two sentences:

    mary had a little lamb
    mary had a little tiger

have the following three-word sequences and counts:

    Sequence         Count
    mary had a       2
    had a little     2
    a little lamb    1
    a little tiger   1
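For illustration only, the sketch below shows the general Streaming mapper/reducer shape that tasks of this kind tend to follow, using the three-word sequences of Task 5 as the running example. It assumes Python scripts, whitespace-separated tokens and one sentence per line; mapper.py and reducer.py are placeholder names and this is not a required or complete solution.

    #!/usr/bin/env python
    # mapper.py (illustrative): emit every overlapping three-word
    # sequence in a line, with a count of 1, as key<TAB>value.
    import sys

    for line in sys.stdin:
        words = line.split()
        for i in range(len(words) - 2):
            print('%s\t1' % ' '.join(words[i:i + 3]))

    #!/usr/bin/env python
    # reducer.py (illustrative): Streaming delivers keys in sorted
    # order, so sum the counts of consecutive identical keys and
    # emit one total per key.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rsplit('\t', 1)
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

The same emit-then-sum pattern can be adapted to the word and line totals of Task 3 and to ranking the counts for Task 6.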

Task 6 (2 marks)
What are the top twenty most frequent three-word sequences?

1.2 Matrix Transpose

Task 7 (5 marks)
Create a program that uses MAPREDUCE, takes as input a large file containing a matrix (/user/s1250553/ex2/matrixlarge.txt) and returns an output file that has it transposed. The final result can be either a single file or multiple files (in which case, when concatenated, they should give the same result as a single file). For example, the input matrix:

    1  2  3  4
    5  6  7  8
    9 10 11 12

should be transposed to:

    1 5  9
    2 6 10
    3 7 11
    4 8 12
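One way to approach the transpose with MAPREDUCE is to re-key every value by its column index, so that each reducer key collects exactly one row of the transposed matrix. The sketch below is illustrative only and makes several simplifying assumptions: the matrix is stored with one whitespace-separated row per line, the whole file is processed by a single map task (so the position of a line in the mapper's input is its row index), and a single reduce task is used so that the output rows appear in order. Handling multiple map or reduce tasks would need the row index to be recovered another way and a partitioner that preserves the global key order.

    #!/usr/bin/env python
    # transpose mapper (illustrative): for row r, emit each value keyed
    # by its column index c; the column index is zero-padded so that the
    # default textual sort of keys matches numeric order.
    import sys

    for r, line in enumerate(sys.stdin):
        for c, value in enumerate(line.split()):
            print('%06d\t%d %s' % (c, r, value))

    #!/usr/bin/env python
    # transpose reducer (illustrative): gather all (row, value) pairs of
    # one column, order them by row index and print them as one row of
    # the transposed matrix.
    import sys

    def flush(entries):
        if entries:
            entries.sort()
            print(' '.join(value for _, value in entries))

    current, entries = None, []
    for line in sys.stdin:
        col, pair = line.strip().split('\t')
        row, value = pair.split()
        if col != current:
            flush(entries)
            current, entries = col, []
        entries.append((int(row), value))
    flush(entries)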

1.3 Relational Join using MAPREDUCE

In this task you will perform a join operation in Hadoop. Let us assume that we have the relations student(studentid, name) and marks(courseid, studentid, mark) as shown below:

    students
    studentid   name
    1           George
    2           Anna

    marks
    courseid   studentid   mark
    EXC        1           70
    EXC        2           65
    TTS        1           80
    ADBS       1           80

and need to join them on the studentid field. Traditionally, this is an easy task when we deal with relational databases and can be performed using the relational join operator. However, the way this join is performed changes drastically when our input is a single file that stores information from both relations. Assume the format of such a single input file storing data from the two relations is as follows:

    student 1 George
    mark EXC 1 70
    student 2 Anna
    mark ADBS 1 80
    mark EXC 2 65
    mark TTS 1 80

The first column is a tag that shows which relation the data comes from. Depending on this tag, we can assign meaning to the other columns. When the tag is mark, we know that the second column refers to the courseid, the third to the studentid and the fourth to the grade the student took in that course. On the other hand, if the tag is student, we know that there are only two other columns, one with the studentid and one with the student name.

Task 8 (5 marks)
Use the unilarge.txt file to perform a join operation on the studentid key and produce an output that lists the grades of each student as follows:

    name --> (course1, mark1) (course2, mark2) (course3, mark3) ...

For example, for the previous input file your algorithm should return:

    George --> (ADBS, 80) (EXC, 70) (TTS, 80)
    Anna --> (EXC, 65)
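A common way to express this kind of join in MAPREDUCE is a reduce-side join: the mapper re-keys every record by studentid and keeps the tag, and the reducer combines the student record with all of that student's mark records. The sketch below is illustrative only; it assumes Python Streaming scripts, single-token student names as in the example, and that all records for one studentid fit in the reducer's memory. The courses are sorted here only to make the output deterministic.

    #!/usr/bin/env python
    # join mapper (illustrative): re-key each tagged record by studentid.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == 'student':        # student <studentid> <name>
            print('%s\tstudent %s' % (fields[1], fields[2]))
        elif fields[0] == 'mark':         # mark <courseid> <studentid> <mark>
            print('%s\tmark %s %s' % (fields[2], fields[1], fields[3]))

    #!/usr/bin/env python
    # join reducer (illustrative): for each studentid, remember the name
    # and collect (course, mark) pairs, then print one line per student.
    import sys

    def flush(name, courses):
        if name is not None:
            pairs = ' '.join('(%s, %s)' % cm for cm in sorted(courses))
            print('%s --> %s' % (name, pairs))

    current, name, courses = None, None, []
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        parts = value.split()
        if key != current:
            flush(name, courses)
            current, name, courses = key, None, []
        if parts[0] == 'student':
            name = parts[1]
        else:
            courses.append((parts[1], parts[2]))
    flush(name, courses)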

2 Marking Guidelines

There are 25 marks available. Each task has marks allocated, which gives a guide to how much effort you should spend on it. For programs, marks will be awarded for making use of Hadoop, efficiency and correctness. Bonus marks will be awarded if you are creative.

3 Submission

Your work should consist of:

1. One plain text file, with sections for each part of this assignment. (This is so that it is possible to see which part of the file answers which question.) Include all programs you have written. Make sure you comment your code so that what you are doing can be understood. Do not submit a Word document or a PDF file; only submit a plain text file.

2. The output of each task stored in HDFS and named sXXXXXXX task i.out, where XXXXXXX is your matriculation number and i is the number of the task the folder corresponds to. Do not delete these files from HDFS until you are told it is OK to do so.

Your text file submission should be named exc-mr.txt. Organise your file in sections for each task. Each section should have a code block that starts with Task i code begin and finishes with Task i code end, where i is the task you are providing the solution for. The code block should contain the mapper code, the reducer code (if a reducer is used) and the command (or commands) for running it on Hadoop. It should then be followed by a result block starting with Task i results begin and finishing with Task i results end for task i. The result block should contain the first 20 lines of the first output file that you have stored in HDFS. This is because the output files will be very big. You can use the command:

    hadoop dfs -cat filename | head -20

for showing just the first 20 lines of a file. An example submission file should therefore look like:

    Task 1 code begin
    code for task 1
    Task 1 code end

    Task 1 results begin
    first 20 lines of the result of task 1
    Task 1 results end

    Task 2 code begin
    code for task 2
    Task 2 code end

    Task 2 results begin
    first 20 lines of the result of task 2
    Task 2 results end

    code and results for Tasks 3-8

Use the submit program to submit the exc-mr.txt text file; for the rest of your results you only need to make sure they follow the outlined naming scheme and are accessible on HDFS. You can submit the exc-mr.txt file from the command line as:

    submit exc 2 exc-mr.txt

4 Deadline

The submission deadline is Wednesday 4 December, 4:00 pm.