Data Intensive Computing Handout 6 Hadoop




Hadoop 1.2.1 is installed in the /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030 and the NameNode web interface at http://dlrc:50070. To conveniently access the web interfaces remotely, you can run ssh -D localhost:2020 -p 11422 ufallab.ms.mff.cuni.cz to open a SOCKS5 proxy server forwarding requests to the remote site, and then use it as a proxy server in a browser, for example as chromium --proxy-server=socks://localhost:2020.

Simple Tokenizer

We will commonly need to split a given text into words (called tokens). You can do so easily using the function wordpunct_tokenize from the nltk.tokenize package, i.e. using the following import line at the beginning of your program:

from nltk.tokenize import wordpunct_tokenize

Dumbo

Dumbo is one of several Python APIs to Hadoop. It utilizes Hadoop Streaming, the most simple method of Hadoop language bindings. Overview of features:
- always a mapper, optionally a reducer and a combiner
- mapper, reducer and combiner can each be implemented either as a function or as a class with __init__, __call__ and possibly cleanup methods. Note that both __call__ and cleanup must return either a list or a generator yielding the results. In particular, if you do not want to return anything, you have to return [] (that is a Dumbo bug, as it could be fixed easily).
- parameters through self.params and -param
- passing files using -file or -cachefile
- counters
- multiple iterations with any non-circular dependencies

Simple Grep Example

All mentioned examples are available in /home/straka/examples.

grep_ab.py

import dumbo

def mapper(key, value):
    if key.startswith("Ab"):
        yield key, value.replace("\n", " ")

dumbo.run(mapper)
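Since a Dumbo mapper is just a generator of (key, value) pairs, its logic can be exercised locally on a few hand-made records before submitting a job; the following sketch simulates the map phase in plain Python (the article names and texts are invented for illustration):

```python
def mapper(key, value):
    # Emit only articles whose name starts with "Ab", flattening newlines.
    if key.startswith("Ab"):
        yield key, value.replace("\n", " ")

# Simulate the map phase over a tiny in-memory input (hypothetical records).
records = [("Aberdeen", "a city in\nScotland"), ("Brno", "a city in\nMoravia")]
output = [pair for key, value in records for pair in mapper(key, value)]
print(output)  # [('Aberdeen', 'a city in Scotland')]
```

On the cluster, Hadoop Streaming performs the same iteration over the records of the input sequence file.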

grep_ab.sh

dumbo start grep_ab.py -input /data/wiki/cs-medium -output /users/$USER/grep_ab -outputformat text -overwrite yes

Running Dumbo

Run the above script using dumbo start script.py options. Available options:
-input input_hdfs_path: input path to use
-inputformat [auto|sequencefile|text|keyvaluetext]: input format to use, auto is the default
-output output_hdfs_path: output path to use
-outputformat [sequencefile|text]: output format to use, sequencefile is the default
-nummaptasks n: set the number of map tasks to the given number
-numreducetasks n: set the number of reduce tasks to the given number. Zero is allowed (and is the default if no reducer is specified); only mappers are executed in that case.
-file local_file: file to be put in the directory where the Python program gets executed
-cachefile HDFS_path#link_name: create a link link_name in the directory where the Python program gets executed
-param name=value: parameter available in the Python script as self.params["name"]
-hadoop hadoop_prefix: default is /HADOOP
-name hadoop_job_name: default is script.py
-mapper hadoop_mapper: Java class to use as mapper instead of mapper in script.py
-reducer hadoop_reducer: Java class to use as reducer instead of reducer in script.py

Dumbo HDFS Commands

dumbo cat HDFS_path [-ascode=yes]: convert the given file to text and print it
dumbo ls HDFS_path
dumbo exists HDFS_path
dumbo rm HDFS_path
dumbo put local_path HDFS_path
dumbo get HDFS_path local_path

Grep Example

Parameters can be passed to mappers and reducers using the -param name=value Dumbo option and accessed using the self.params dictionary. Also note the class version of the mapper, using the constructor __init__ and the mapper method __call__. Reducers can be implemented similarly.

grep.py

import re
import dumbo

class Mapper:
    def __init__(self):
        self.re = re.compile(self.params.get("pattern", ""))

    def __call__(self, key, value):
        if self.re.search(key):
            yield key, value.replace("\n", " ")

dumbo.run(Mapper)

grep.sh

dumbo start grep.py -input /data/wiki/cs-medium -output /users/$USER/grep -outputformat text -param pattern="^H" -overwrite yes

Simple Word Count

Reducers are similar to mappers, and can also be specified either using a method or a class. An optional combiner (third parameter of dumbo.run) can be specified too.

wordcount.py

import dumbo
from nltk.tokenize import wordpunct_tokenize

def mapper(key, value):
    for word in wordpunct_tokenize(value):
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

dumbo.run(mapper, reducer, reducer)

wordcount.sh

dumbo start wordcount.py -input /data/wiki/cs-medium -output /users/$USER/wordcount -outputformat text -overwrite yes

Efficient Word Count Example

A more efficient word count is obtained when the counts in the processed block are stored in an associative array (a more efficient version of a local reducer). To that end, the cleanup method can be used nicely. Note that we have to return [] in __call__. That is caused by the fact that Dumbo iterates over the results of a __call__ invocation and unfortunately does not handle a None return value.

wc_effective.py

import dumbo
from nltk.tokenize import wordpunct_tokenize

class Mapper:
    def __init__(self):
        self.counts = {}

    def __call__(self, key, value):
        for word in wordpunct_tokenize(value):
            self.counts[word] = self.counts.get(word, 0) + 1
        # Method __call__ has to return the (key, value) pairs.
        # Unfortunately, NoneType is not handled in Dumbo.
        return []

    def cleanup(self):
        for word, count in self.counts.iteritems():
            yield word, count

class Reducer:
    def __call__(self, key, values):
        yield key, sum(values)

dumbo.run(Mapper, Reducer)

wc_effective.sh

dumbo start wc_effective.py -input /data/wiki/cs-medium -output /users/$USER/wc_effective -outputformat text -overwrite yes

Word Count with Counters

User counters can be collected by Hadoop using the self.counters object.

wc_counters.py

import dumbo
from nltk.tokenize import wordpunct_tokenize

def mapper(key, value):
    for word in wordpunct_tokenize(value):
        yield word, 1

class Reducer:
    def __call__(self, key, values):
        total = sum(values)
        counter = "Key occurrences " + (str(total) if total < 10 else "10 or more")
        self.counters[counter] += 1
        yield key, total

dumbo.run(mapper, Reducer)

wc_counters.sh

dumbo start wc_counters.py -input /data/wiki/cs-medium -output /users/$USER/wc_counters -outputformat text -overwrite yes

Word Count Using a Stop List

Sometimes customization using -param is not enough; instead, a whole file should be used to customize the mapper or reducer. Consider for example the case where word count should ignore given words. This task can be solved by using -param to specify the file with the words to ignore, and by -file or -cachefile to distribute the file in question with the computation.

wc_excludes.py

import dumbo
from nltk.tokenize import wordpunct_tokenize

class Mapper:

    def __init__(self):
        file = open(self.params["excludes"], "r")
        self.excludes = set(line.strip() for line in file)
        file.close()

    def __call__(self, key, value):
        for word in wordpunct_tokenize(value):
            if not (word in self.excludes):
                yield word, 1

def reducer(key, values):
    yield key, sum(values)

dumbo.run(Mapper, reducer, reducer)

wc_excludes.sh

dumbo start wc_excludes.py -input /data/wiki/cs-medium -output /users/$USER/wc_excludes -outputformat text -param excludes=stoplist.txt -file stoplist.txt -overwrite yes

Multiple Iterations Word Count

Dumbo can execute multiple iterations of MapReduce. In the following artificial example, we first create a lowercase variant of the values, and then filter out words not matching a given pattern and count the occurrences of the remaining ones.

wc_2iterations.py

import re
import dumbo
from nltk.tokenize import wordpunct_tokenize

class LowercaseMapper:
    def __call__(self, key, value):
        yield key, value.decode("utf-8").lower().encode("utf-8")

class GrepMapper:
    def __init__(self):
        self.re = re.compile(self.params.get("pattern", ""))

    def __call__(self, key, value):
        for word in wordpunct_tokenize(value):
            if self.re.search(word):
                yield word, 1

def reducer(key, values):
    yield key, sum(values)

def runner(job):
    job.additer(LowercaseMapper)
    job.additer(GrepMapper, reducer)

dumbo.main(runner)

wc_2iterations.sh

dumbo start wc_2iterations.py -input /data/wiki/cs-medium -output /users/$USER/wc_2iterations -outputformat text -param pattern=h -overwrite yes

Non-Trivial Dependencies Between Iterations

The MapReduce iterations can depend on the output of arbitrary iterations (as long as the dependencies do not form a cycle, of course). This can be specified using the input parameter of additer as follows.

wc_dag.py

import re
import dumbo
from nltk.tokenize import wordpunct_tokenize

class LowercaseMapper:
    def __call__(self, key, value):
        yield key, value.decode("utf-8").lower().encode("utf-8")

class FilterMapper1:
    def __init__(self):
        self.re = re.compile(self.params.get("pattern1", ""))

    def __call__(self, key, value):
        for word in wordpunct_tokenize(value):
            if self.re.search(word):
                yield word, 1

class FilterMapper2:
    def __init__(self):
        self.re = re.compile(self.params.get("pattern2", ""))

    def __call__(self, key, value):
        for word in wordpunct_tokenize(value):
            if self.re.search(word):
                yield word, 1

class IdentityMapper:
    def __call__(self, key, value):
        yield key, value

def reducer(key, values):
    yield key, sum(values)

def runner(job):
    lowercased = job.additer(LowercaseMapper)  # implicit input=job.root
    filtered1 = job.additer(FilterMapper1, input=lowercased)
    filtered2 = job.additer(FilterMapper2, input=lowercased)
    job.additer(IdentityMapper, reducer, input=[filtered1, filtered2])

dumbo.main(runner)

wc_dag.sh

dumbo start wc_dag.py -input /data/wiki/cs-medium -output /users/$USER/wc_dag -outputformat text -param pattern1=h -param pattern2=i -overwrite yes
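No cluster is needed to check the logic of such an iteration graph; the following plain-Python sketch (the run_map helper and the sample record are invented for illustration) chains the same lowercase, filter and merge stages in memory:

```python
import re
from collections import Counter

def run_map(records, mapper):
    # Apply a generator-style mapper to every (key, value) record.
    return [pair for key, value in records for pair in mapper(key, value)]

def lowercase_mapper(key, value):
    yield key, value.lower()

def make_filter_mapper(pattern):
    # Stand-in for FilterMapper1/FilterMapper2; words split by whitespace
    # instead of wordpunct_tokenize to keep the sketch dependency-free.
    regex = re.compile(pattern)
    def filter_mapper(key, value):
        for word in value.split():
            if regex.search(word):
                yield word, 1
    return filter_mapper

records = [("article", "He is in Iowa")]
lowercased = run_map(records, lowercase_mapper)
filtered1 = run_map(lowercased, make_filter_mapper("h"))
filtered2 = run_map(lowercased, make_filter_mapper("i"))

# The final reduce counts the union of both filtered branches.
counts = Counter()
for word, one in filtered1 + filtered2:
    counts[word] += one
print(dict(counts))  # {'he': 1, 'is': 1, 'in': 1, 'iowa': 1}
```

The two filter branches consume the same lowercased intermediate output, exactly like the two additer calls with input=lowercased above.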

Execute Hadoop Locally

Using dumbo-local instead of dumbo, you can run the Hadoop computation locally, using one mapper and one reducer. The standard error output of the Python script is available in that case.

Wikipedia Data

The Wikipedia data available from /dlrc_share/data/wiki/ are also available in HDFS:
/data/wiki/cs: Czech Wikipedia data (Sep 2009), 195MB, 124k articles
/data/wiki/en: English Wikipedia data (Sep 2009), 4.9GB, 2.9M articles
All data is stored in record-compressed sequence files, with article names as keys and article texts as values, in UTF-8 encoding.

Tasks

Solve the following tasks. The solution for each task is a Dumbo Python source processing the Wikipedia source data and producing the required results.

Tasks From Previous Seminars

dumbo_unique_words (2 points): Create a list of unique words used in the articles using Dumbo. Convert them to lowercase to ignore case.

article_initials (2 points): Run a Dumbo job which uses counters to count the number of articles according to their first letter, ignoring the case and merging all non-Czech initials.

dumbo_inverted_index (2 points): Compute an inverted index in Dumbo: for every lowercased word from the articles, compute (article name, ascending positions of occurrences as word indices) pairs. The output should be a file with one word on a line in the following format:
word \t articlename \t spaceseparatedoccurrences...
You will get 2 additional points if the articles are numbered using consecutive integers. In that case, the output is ascending (article id, ascending positions of occurrences as word indices) pairs, together with a file containing the list of articles representing this mapping (the article on line i is the article with id i).

no_references (3 points): An article A is said to reference article B if it contains B as a token (ignoring case). Run a Dumbo job which counts, for each article, how many references to it there are (summing all references in a single article). You will get one extra point if the result is sorted by the number of references (you are allowed to use 1 reducer in the sorting phase).
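For the inverted index, the per-article part of the job is just enumerating token positions; a plain-Python sketch of that step (splitting on whitespace and the sample text are simplifications for illustration — the actual task should use wordpunct_tokenize):

```python
def word_positions(text):
    # Map each lowercased word to the ascending list of its word indices.
    positions = {}
    for index, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(index)
    return positions

print(word_positions("To be or not to be"))
# {'to': [0, 4], 'be': [1, 5], 'or': [2], 'not': [3]}
```

The reducer then only needs to group these per-article pairs by word and emit them in the required tab-separated format.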

dumbo_wordsim_index (4 points): In order to implement word similarity search, compute for each form with at least three occurrences all contexts in which it occurs, including their numbers of occurrences. List the contexts in ascending order. Given N (either 1, 2, 3 or 4), the context of a form occurrence is the N forms preceding this occurrence and the N forms following this occurrence (ignore sentence boundaries, use empty words when article boundaries are reached). The output should be a file with one form on a line in the following format:
form \t context \t counts...

dumbo_wordsim_find (4 points): Let S be a given natural number. Using the index created in dumbo_wordsim_index, find for each form the S most similar forms. The similarity of two forms A and B is computed using cosine similarity as (C_A . C_B) / (|C_A| |C_B|), where C_F is the vector of occurrence counts of the contexts of form F. The output should be a file with one form on a line in the following format:
form \t most similar form \t cosine similarity...

New Tasks

dumbo_anagrams (2 points): Two words are anagrams if one is a letter permutation of the other (ignoring case). For a given input, find all anagram classes that contain at least A words. Output each anagram class on a separate line.

dumbo_kmeans (6 points): Implement the K-means clustering algorithm as described on http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm. The user specifies the number of iterations, and the program runs the specified number of K-means clustering algorithm iterations. You can use the following training data. The key of the data is a point ID, the value of the data is a list of integral coordinates. The coordinates in one input naturally have the same dimension.

HDFS path            Points   Dimension   Clusters
/data/points/small    10000          50         50
/data/points/medium  100000         100        100
/data/points/large   500000         200        200

You will get 2 additional points if the algorithm stops when none of the centroid positions changes by more than a given ε.
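One K-means iteration consists of an assignment step and an update step; the sketch below shows both in plain Python (a Dumbo solution would assign points to centroids in the mapper and recompute the centroids in the reducer; the sample points and centroids are invented):

```python
def kmeans_step(points, centroids):
    # Assignment step: attach every point to its nearest centroid
    # (squared Euclidean distance suffices for choosing the minimum).
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                     for centroid in centroids]
        clusters[distances.index(min(distances))].append(point)
    # Update step: move each centroid to the mean of its cluster
    # (empty clusters keep their previous centroid).
    return [tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans_step(points, [(0.0, 1.0), (9.0, 1.0)]))
# [(0.0, 1.0), (10.0, 1.0)]
```

For the ε-based bonus, compare each centroid with its position from the previous iteration and stop once the maximum displacement falls below ε.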