Big Data 2012: Hadoop Tutorial
Oct 19th, 2012
Martin Kaufmann, Systems Group, ETH Zürich
Contact
Exercise session: Friday 14.15 to 15.00, CHN D 46
Your assistant: Martin Kaufmann
Office: CAB E 77.2
E-Mail: martinka@inf.ethz.ch
Download of exercises: http://www.systems.ethz.ch/courses/fall2012/bigdata
MapReduce
Parallelizable problems are distributed across huge data sets using a large number of nodes.
Two stages:
- Map step: the master node takes the input and divides it into smaller sub-problems.
- Reduce step: the master node collects the answers to all sub-problems and combines them in some way.
Condition: the reduction function is associative.
Remember: A x (B x C) = (A x B) x C
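The associativity condition is what allows partial results from different nodes to be combined in any grouping. A minimal standalone illustration with addition, the reduction used in word counting (class name is made up):

```java
public class AssocDemo {
    public static void main(String[] args) {
        // Partial counts produced by four hypothetical map tasks
        int[] counts = {3, 5, 7, 2};
        // Two different groupings of the same reduction
        int grouped = (counts[0] + counts[1]) + (counts[2] + counts[3]);
        int chained = counts[0] + (counts[1] + (counts[2] + counts[3]));
        // Associativity guarantees both orders give the same total
        System.out.println(grouped == chained); // prints "true"
    }
}
```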
MapReduce
MapReduce transforms (key, value) pairs into lists of values. The Map and Reduce functions are defined with respect to data stored in (key, value) pairs:
Map(k1, v1) -> list(k2, v2)
MapReduce then groups all pairs with the same key:
Reduce(k2, list(v2)) -> list(v3)
All functions are executed in parallel!
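The grouping step between Map and Reduce can be sketched as a standalone helper. This is a single-threaded illustration of the data shapes involved, not the framework's actual implementation; the class and method names are made up:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniShuffle {
    // Collects all (k2, v2) pairs with the same key into (k2, list(v2)),
    // which is exactly the input shape each Reduce call receives.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>("we", 1));
        pairs.add(new AbstractMap.SimpleEntry<>("are", 1));
        pairs.add(new AbstractMap.SimpleEntry<>("we", 1));
        System.out.println(groupByKey(pairs)); // {we=[1, 1], are=[1]}
    }
}
```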
Dataflow of MapReduce
- Input reader: divides the input into splits; one split is assigned to each map function.
- Map function: takes (key, value) pairs and generates one or more output (key, value) pairs.
- Partition function: assigns each map output key to a reducer; returns the index of a reducer.
- Comparison function: the input for each reduce is sorted using a comparison function.
- Reduce function: called once for each unique key in sorted order, iterating through the values and producing zero, one, or more outputs.
- Output writer: writes the output of reduce to storage.
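For the partition function, Hadoop's default behavior is to hash the key modulo the number of reducers. A minimal standalone sketch of that idea (a simplified stand-in, not the framework's HashPartitioner code itself):

```java
public class PartitionSketch {
    // Maps a key to a reducer index in [0, numReducers).
    // The sign bit is masked off so negative hash codes still yield a valid index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer, which is what
        // guarantees all values for one key meet in a single reduce call.
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // prints "true"
    }
}
```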
Overview of Hadoop
- Provides a programming model with efficient, automatic distribution of data & work across machines.
- Open-source implementation of Google's MapReduce framework on top of the Hadoop Distributed File System (HDFS).
- Large-scale distributed batch processing for vast amounts of data (multi-terabytes).
- Runs on large clusters (1000s of nodes) of commodity hardware with reliability & fault-tolerance.
- Highly scalable filesystem; computing coupled to storage.
- Provides a simplified programming model: map() & reduce(); no schema or type support.
Slides adapted from Cagri Balkesen
HDFS Architecture
Namenode: HDFS master server
- Manages the filesystem namespace (block mappings).
- Regulates access to files by clients (open, close, rename, ...).
Datanode: manages the data attached to each node
- Data is split into blocks & replicated (default block size is 64MB).
- Serves read/write requests for blocks.
Data locality: computing goes to the data, enabling effective scheduling, parallel processing, and high aggregate bandwidth.
Image Sources: [1] http://developer.yahoo.com/hadoop/tutorial/, [2] http://hadoop.apache.org/common/docs/current/hdfs_design.html
The MapReduce Paradigm
Mapping lists (MAP) -> SHUFFLE -> SORT -> Reducing lists (REDUCE)
- Maps execute in parallel over different local chunks.
- Map outputs are shuffled/copied to the reduce nodes.
- Reduce tasks begin after all local data is sorted.
WordCount example:
Mapper(filename, contents):
  for each word in contents:
    emit(word, 1)
Reducer(word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit(word, sum)
Image Sources: [1] http://developer.yahoo.com/hadoop/tutorial/
MapReduce Terminology
- Job: a "full program", an execution of a Mapper and Reducer across a data set.
- Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP).
- The master node runs a JobTracker instance, which accepts job requests from clients.
- TaskTracker instances run on slave nodes and periodically query the JobTracker for work.
- The TaskTracker forks a separate Java process for each task instance, so failures are isolated and a failed task restarts with the same input.
- All mappers are equivalent, so each maps whatever data is local to its node in HDFS.
- The TaskRunner launches the Mapper/Reducer and knows which InputSplits should be processed; it calls the Mapper/Reducer for each record of the InputSplit.
- Example: an InputSplit is a 64MB file chunk; a RecordReader yields each line in the chunk; the InputFormat (e.g. TextInputFormat) identifies the InputSplits.
- Partitioner: used in the shuffle; determines the partition number for a key.
Credits: [3] http://www.cloudera.com/wp-content/uploads/2010/01/4-programmingwithhadoop.pdf
The WordCount Example
function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit(word, sum)
The WordCount Example
Dataset: "We are not what we want to be, but at least we are not what we used to be."
The dataset is divided into InputSplits, yielding pairs (k1,v1) ... (k5,v5). InputSplits are read and processed via TextInputFormat:
- Parses the input.
- Generates key-value pairs: (key=offset, value=line-contents).
- InputSplit boundaries are expanded to the next newline \n.
The WordCount Example
Map phase, one call per pair (k1,v1) ... (k5,v5) of "We are not what we want to be, but at least we are not what we used to be.":
Map(k1,v1), Map(k2,v2): <we, 1> <are, 1> <not, 1> <what, 1> <we, 1> <want, 1> <to, 1> <be, 1>
Map(k3,v3): <but, 1> <at, 1> <least, 1>
Map(k4,v4): <we, 1> <are, 1> <not, 1> <what, 1>
Map(k5,v5): <we, 1> <used, 1> <to, 1> <be, 1>
Shuffle/Sort groups equal keys, then each Reduce(k, v[]) sums its group:
Reduce(we, [1,1,1,1]) -> <we, 4>
Reduce(are, [1,1]) -> <are, 2>
Reduce(not, [1,1]) -> <not, 2>
Reduce(what, [1,1]) -> <what, 2>
...
Setting up Hadoop
3 modes of setup:
- Standalone: a single Java process, for verification & debugging.
- Pseudo-distributed: a single machine, but JobTracker & NameNode in different processes.
- Fully-distributed: JobTracker & NameNode on different machines, together with other slave machines.
Let's try standalone:
- Download the latest stable release: http://hadoop.apache.org/core/releases.html
- Extract the files: tar xvzf hadoop-1.0.4*.tar.gz
- Set the following in conf/hadoop-env.sh: JAVA_HOME=/usr/java/default
In the Hadoop directory:
- Create an input folder: $~/hadoop> mkdir input
- Download & extract the sample: $~/hadoop/input> wget http://www.systems.ethz.ch/sites/default/files/hadoopwords.tar 0.gz
- Run the word count example: $~/hadoop> bin/hadoop jar hadoop-examples-*.jar wordcount input/ out/
- See the results in out/
Dissecting the Word Count code
The source code of Word Count is src/examples/org/apache/hadoop/examples/WordCount.java
Mapper class:
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Dissecting the Word Count code
Reducer class:
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Dissecting the Word Count code
Job setup:
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Pseudo-distributed Setup
Hadoop still runs on a single machine but simulates a distributed setup with separate processes for the JobTracker & NameNode. Change the configuration files as follows:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Setup SSH, HDFS and start Hadoop
Check whether you can connect without a passphrase:
$> ssh localhost
If not, set it up by executing the following:
$> ssh-keygen -t rsa -P ''
$> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Format and make the HDFS ready:
$> bin/hadoop namenode -format
Start the Hadoop daemons:
$> bin/start-all.sh
Browse the web interfaces of the NameNode & JobTracker:
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
Copy input files to the HDFS:
$> bin/hadoop dfs -put localinput dfsinput
Job Tracker Web-Interface
NameNode Web-Interface
HDFS Commands
$~/hadoop> bin/hadoop dfs
  [-ls <path>]
  [-du <path>]
  [-cp <src> <dst>]
  [-rm <path>]
  [-put <localsrc> <dst>]
  [-copyFromLocal <localsrc> <dst>]
  [-moveFromLocal <localsrc> <dst>]
  [-get [-crc] <src> <localdst>]
  [-cat <src>]
  [-copyToLocal [-crc] <src> <localdst>]
  [-moveToLocal [-crc] <src> <localdst>]
  [-mkdir <path>]
  [-touchz <path>]
  [-test -[ezd] <path>]
  [-stat [format] <path>]
  [-help [cmd]]
Example
Input data is a sample set of tweets from Twitter, as follows (one tweet per line):
{"text":"tweet contents #TAG1 #TAG2", ..., "hashtags": [ {"text":"TAG1", ...}, ..., {"text":"TAG2", ...} ], ... } \n
Output the tags that occur more than 10 times in the sample data set, along with their occurrence counts. Sample output:
TAG1 11
TAG2 50
TAG3 19
...
Implement it by modifying WordCount.java. Compile your source by (you might need to download Apache Commons CLI first):
> cd src/examples
> javac -cp ../../hadoop-core-1.0.4.jar:../../../lib/commons-cli-1.2.jar org/apache/hadoop/examples/HashtagFreq.java
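The core change to WordCount is the tokenization: instead of emitting every word, the mapper should emit only hashtag names. A standalone sketch of the extraction step (the class name and regex are illustrative assumptions, not the official solution; a real solution might instead parse the "hashtags" JSON array):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagExtractor {
    // Pulls the tag names (without '#') out of one tweet line;
    // inside the mapper, each extracted tag would be emitted with count 1.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(extractTags("tweet contents #TAG1 #TAG2")); // [TAG1, TAG2]
    }
}
```

In the reducer, the "more than 10 times" condition means writing the (tag, sum) pair only when sum > 10, rather than unconditionally as in WordCount.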
Example
Input data is a sampled set of trades on the stock market for a single day, 06/01/2006. The contents are as follows:
Description: @SYMBOL DATE EX TIME PRICE SIZE
IBM 06/01/2006 N 49813 84.2200 100 \n
IBM 06/01/2006 N 38634 84.0100 100 \n
SUN 06/01/2006 N 46684 85.4200 300 \n
SUN 06/01/2006 N 44686 85.6600 100 \n
Task: compute the total volume of trades for each stock ticker and return all stocks having a volume higher than a given value from the command line. In SQL:
SELECT symbol, SUM(price*size) AS volume FROM Ticks GROUP BY symbol HAVING volume > V
Example total volume for IBM: 84.22*100 + 84.01*100 = 16823
Sample output (let's assume filter = 20K):
IBM 16823
SUN 25711.66
...
Implement it by modifying WordCount.java:
- Create a directory in your $HADOOP_HOME, let's say stocks/
- Copy src/org/apache/hadoop/examples/WordCount.java to stocks/
- Modify the code & name accordingly
- Compile: javac -cp hadoop-core-0.20.203.0.jar:lib/commons-cli-1.2.jar stocks/StockVolume.java
- Copy the dataset to input/: http://www.systems.ethz.ch/education/hs11/nis/project/stock_data.tar.gz
- Run: > bin/hadoop stocks/StockVolume input/ output/
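The aggregation itself is easy to check outside Hadoop. A standalone sketch of the per-symbol volume computation with the HAVING filter (class and method names are made up for illustration; in the actual job the summing happens in the reducer, and the threshold would be passed in as a job parameter):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VolumeSketch {
    // Computes: SELECT symbol, SUM(price*size) AS volume
    //           FROM lines GROUP BY symbol HAVING volume > threshold
    static Map<String, Double> volumes(String[] lines, double threshold) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String line : lines) {
            // Fields: SYMBOL DATE EX TIME PRICE SIZE
            String[] f = line.trim().split("\\s+");
            double volume = Double.parseDouble(f[4]) * Double.parseDouble(f[5]);
            totals.merge(f[0], volume, Double::sum);
        }
        // HAVING clause: drop symbols at or below the threshold
        totals.values().removeIf(v -> v <= threshold);
        return totals;
    }

    public static void main(String[] args) {
        String[] ticks = {
            "IBM 06/01/2006 N 49813 84.2200 100",
            "IBM 06/01/2006 N 38634 84.0100 100"
        };
        // IBM's total volume: approximately 16823
        System.out.println(volumes(ticks, 10000));
    }
}
```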
Setting Job Specific Parameters
Set in main, before submitting the Job:
job.getConfiguration().setInt("filter", Integer.parseInt(otherArgs[2]));
Use inside map() or reduce():
context.getConfiguration().getInt("filter", -1);
See the Hadoop API for other details: http://hadoop.apache.org/common/docs/current/api/index.html
Solution: Mapper
Solution: Reducer
Solution: The Job Setup
References
[1] http://developer.yahoo.com/hadoop/tutorial/
[2] http://hadoop.apache.org/common/docs/current/hdfs_design.html
[3] http://www.cloudera.com/wp-content/uploads/2010/01/4-ProgrammingWithHadoop.pdf
[4] http://hadoop.apache.org/common/docs/current/api/index.html
Happy Coding