Extreme computing lab exercises Session one
Miles Osborne (original: Sasa Petrovic)
October 23, 2012

1 Getting started

First you need to access the machine where you will be doing all the work. Do this by typing:

ssh namenode

If this machine is busy, try another one (for example bw1425n13, bw1425n14 or any other number up to bw1425n24). Hadoop is installed on DICE under /opt/hadoop/hadoop-0.20.2 (this will also be referred to as the Hadoop home directory). The hadoop command can be found in the /opt/hadoop/hadoop-0.20.2/bin directory. To make sure you can run Hadoop commands without having to be in that directory, use your favorite editor (emacs) to edit the ~/.benv file and add the following line:

PATH=$PATH:/opt/hadoop/hadoop-0.20.2/bin/

Don't forget to save the changes! After that, run the following command:

source ~/.benv

to make sure the new changes take effect right away. After you do this, run

hadoop dfs -ls /

If you get a listing of directories on HDFS, you have successfully configured everything. If not, make sure you do ALL of the described steps EXACTLY as they appear in this document.

2 HDFS

Hadoop consists of two major parts: a distributed filesystem (HDFS, for Hadoop DFS) and the mechanism for running jobs. Files on HDFS are distributed across the network and replicated on different nodes for reliability and speed. When a job requires a particular file, it is fetched from the machine/rack nearest to the machines that are actually executing the job. You will now proceed to do some simple commands in HDFS. REMEMBER: all the commands mentioned here refer to HDFS shell commands, NOT to UNIX commands which may have the same name. You should keep all your source and data within a directory named /user/<your_matriculation_number>, which should already be created for you.

1. Make sure that your home directory exists:

hadoop dfs -ls /user/sxxxxxxx

where sxxxxxxx should be your matriculation number.
To create a directory in Hadoop:

hadoop dfs -mkdir dirname

Create directories /user/sxxxxxxx/data/input, /user/sxxxxxxx/source, and /user/sxxxxxxx/data/output in a similar way (these directories will NOT have been created for you, so you need to create them yourself). Confirm that you've done the right thing by typing

hadoop dfs -ls /user/sxxxxxxx

For example, if your matriculation number is s0123456, you should see something like:

Found 2 items
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:55 /user/s0123456/data
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:54 /user/s0123456/source

2. Copy the file /user/sasa/data/example1 to /user/sxxxxxxx/data/output by typing:

hadoop dfs -cp /user/sasa/data/example1 /user/sxxxxxxx/data/output

3. Obviously, example1 doesn't belong there. Move it from /user/sxxxxxxx/data/output to /user/sxxxxxxx/data/input where it belongs, and delete the /user/sxxxxxxx/data/output directory:

hadoop dfs -mv /user/sxxxxxxx/data/output/example1 /user/sxxxxxxx/data/input
hadoop dfs -rmr /user/sxxxxxxx/data/output/

4. Examine the contents of example1 using cat and then tail:

hadoop dfs -cat /user/sxxxxxxx/data/input/example1
hadoop dfs -tail /user/sxxxxxxx/data/input/example1

5. Create an empty file named example2 in /user/sxxxxxxx/data/input. Use test to check that it exists and that it is indeed zero length:

hadoop dfs -touchz /user/sxxxxxxx/data/input/example2
hadoop dfs -test -z /user/sxxxxxxx/data/input/example2

6. Remove the file example2:

hadoop dfs -rm /user/sxxxxxxx/data/input/example2

2.1 List of HDFS commands

What follows is a list of useful HDFS shell commands.

cat: copy files to stdout, similar to the UNIX cat command.

hadoop dfs -cat /user/hadoop/file4

copyFromLocal: copy a single src, or multiple srcs, from the local file system to the destination filesystem. Source has to be a local file reference.

hadoop dfs -copyFromLocal localfile /user/hadoop/file1

copyToLocal: copy files to the local file system. Files that fail the CRC check may be copied with the -ignoreCrc option. Files and CRCs may be copied using the -crc option. Destination must be a local file reference.

hadoop dfs -copyToLocal /user/hadoop/file localfile

cp: copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory. Similar to the UNIX cp command.

hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2

getmerge: take a source directory and a destination file as input and concatenate the files in src into the destination local file. Optionally, addnl can be set to enable adding a newline character at the end of each file.

hadoop dfs -getmerge /user/hadoop/mydir/ ~/result_file

ls: for a file, returns stat on the file with the format:

filename <number of replicas> size modification_date modification_time permissions userid groupid

For a directory it returns the list of its direct children, as in UNIX, with the format:

dirname <dir> modification_date modification_time permissions userid groupid

hadoop dfs -ls /user/hadoop/file1

lsr: recursive version of ls. Similar to the UNIX ls -R command.

hadoop dfs -lsr /user/hadoop/

mkdir: create a directory. Behaves like the UNIX mkdir -p command, creating parent directories along the path (for bragging rights, what is the difference?).

hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

mv: move files from source to destination, similar to the UNIX mv command. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across filesystems is not permitted.

hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2

rm: delete files, similar to the UNIX rm command. Only deletes files and empty directories.

hadoop dfs -rm /user/hadoop/file1

rmr: recursive version of rm. Same as rm -r on UNIX.

hadoop dfs -rmr /user/hadoop/dir1/

tail: display the last kilobyte of the file to stdout. Similar to the UNIX tail command. Options: -f output appended data as the file grows (follow).

hadoop dfs -tail /user/hadoop/file1

test: perform various tests. Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d return 1 if the path is a directory, else return 0.

hadoop dfs -test -e /user/hadoop/file1

touchz: create a file of zero length. Similar to the UNIX touch command.

hadoop dfs -touchz /user/hadoop/file1
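Note that test communicates through its exit status rather than its output, which makes it easy to drive from a script. A minimal sketch in Python (assuming the hadoop command is on your PATH as set up in Section 1; the helper names here are our own, not part of Hadoop):

```python
import subprocess

def hdfs_test_cmd(path, flag="-e"):
    """Build the argument list for `hadoop dfs -test`."""
    return ["hadoop", "dfs", "-test", flag, path]

def hdfs_exists(path):
    """True if `hadoop dfs -test -e <path>` exits with status 0."""
    return subprocess.call(hdfs_test_cmd(path)) == 0

def hdfs_is_empty(path):
    """True if the path exists on HDFS and has zero length (-z)."""
    return subprocess.call(hdfs_test_cmd(path, "-z")) == 0
```

For instance, hdfs_is_empty could replace the manual test -z check in step 5 above.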

3 Running jobs

3.1 Computing π

NOTE: in this example we use the hadoop-0.20.2-examples.jar file. This file can be found in /opt/hadoop/hadoop-0.20.2/, so make sure to use the full path to that file if you are not running the example from the /opt/hadoop/hadoop-0.20.2/ directory.

This example estimates the mathematical constant π to some error, by sampling points and counting what fraction of them falls inside a circle. The error depends on the number of samples we have (more samples = more accurate estimate). Run the example as follows:

hadoop jar hadoop-0.20.2-examples.jar pi <nmaps> <nsamples>

where <nmaps> is the number of mapper jobs, and <nsamples> is the number of samples.

Task: try the following combinations for <nmaps> and <nsamples> and see how the running time and precision change:

Number of maps   Number of samples   Time (s)   Estimated π
 2                10
 5                10
10                10
 2               100
10               100

Do the results match your expectations? How many samples are needed to approximate the third digit after the decimal point correctly?

3.2 Demo: Word counting

Hadoop has a number of demo applications, and here we will look at the canonical task of word counting: we will count the number of times each word appears in a document. For this purpose, we will use the /user/sasa/data/example3 file, so first copy that file to your input directory. Second, make sure you delete your output directory before running the job, or the job will fail. We run the wordcount example by typing:

hadoop jar hadoop-0.20.2-examples.jar wordcount <in> <out>

where <in> and <out> are the input and output directories, respectively. After running the example, examine (using ls) the contents of the output directory. From the output directory, copy the file part-r-00000 to a local directory (somewhere in your home directory) and examine its contents. Was the job successful?

3.3 Running streaming jobs

Hadoop streaming is a utility that allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
The way it works is very simple: input is converted into lines, which are fed to the stdin of the mapper process. The mapper processes this data and writes to stdout. Lines from the stdout of the mapper process are converted into key/value pairs by splitting them on the first tab character (this is only the default behavior and can be changed). The key/value pairs are fed to the stdin of the reducer process, which collects and processes them. Finally, the reducer writes to stdout, which is the final output of the program. Everything will become much clearer through the examples later. It is important to note that with Hadoop streaming, mappers and reducers can be any programs that read from stdin and write to stdout, so the choice of programming language is left to the programmer. Here, we will use Python.

How to actually run the job

Suppose you have your mapper, mapper.py, and your reducer, reducer.py, and the input is in /user/hadoop/input/. How do you run your streaming job? It's similar to running the DFS examples from the previous section, with minor differences:

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -input /user/hadoop/input \
 -output /user/hadoop/output \
 -mapper mapper.py \
 -reducer reducer.py

Note the differences: we always have to specify contrib/streaming/hadoop-0.20.2-streaming.jar as the jar to run (this file can be found in the /opt/hadoop/hadoop-0.20.2/ directory, so you either have to be in this directory when running the job, or specify the full path to the file), and the particular mapper and reducer we use are specified through the -mapper and -reducer options. In case the mapper and/or reducer are not already present on the remote machine (which will often be the case), we also have to package the actual files in the job submission. Assuming that neither mapper.py nor reducer.py was present on the machines in the cluster, the previous job would be run as

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -input /user/hadoop/input \
 -output /user/hadoop/output \
 -mapper mapper.py \
 -file mapper.py \
 -reducer reducer.py \
 -file reducer.py

Here, the -file option specifies that the file is to be copied to the cluster. This can be very useful for also packaging any auxiliary files your program might use (dictionaries, configuration files, etc.). Each job can have multiple -file options.

We will run a simple example to demonstrate how streaming jobs are run. Copy the file /user/sasa/source/random-mapper.py from HDFS to a local directory (a directory on the local machine, NOT on HDFS). This mapper simply generates a random number for each word in the input file, hence the input file in your input directory can be anything.
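As a concrete sketch of what such a pair of scripts might look like, here is a minimal streaming word count mirroring the demo in Section 3.2 (these files are not part of the handout, and the structure is only one reasonable way to write them). The mapper emits one word<TAB>1 line per word; because streaming sorts mapper output by key before the reducer sees it, the reducer can sum the counts for each word with a single running total:

```python
#!/usr/bin/python
import sys

def map_words(lines):
    """Mapper: emit one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_counts(pairs):
    """Reducer: input arrives sorted by key, so all counts for a word
    are consecutive and can be summed with one running total."""
    current, total = None, 0
    for pair in pairs:
        word, count = pair.rstrip("\n").split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # Used as the mapper: read stdin, write stdout.  A reducer script
    # would instead iterate over reduce_counts(sys.stdin).
    for out in map_words(sys.stdin):
        print(out)
```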
Run the job by typing:

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -input /user/sxxxxxxx/data/input \
 -output /user/sxxxxxxx/data/output \
 -mapper random-mapper.py \
 -file random-mapper.py

IMPORTANT NOTE: for Hadoop to know how to properly run your Python scripts, you MUST include the following line as the FIRST line in all your mappers/reducers:

#!/usr/bin/python

What happens when instead of using mapper.py you use /bin/cat as a mapper? What happens when you use /bin/cat as a mapper AND as a reducer?

Setting job configuration

Various job options can be specified on the command line; we will cover the most used ones in this section. The general syntax for specifying additional configuration variables is

-jobconf <name>=<value>

To avoid having your job named something like streamjob5025479419610622742.jar, you can specify an alternative name through the mapred.job.name variable. For example,

-jobconf mapred.job.name="my job"

Run the random-mapper.py example again, this time naming your job "Random job <matriculation_number>", where <matriculation_number> is your matriculation number. After you run the job (and preferably before it finishes), open the browser and go to http://hcrc1425n01.inf.ed.ac.uk:50030/. In the list of running jobs, look for the job with the name you gave it and click on it. You can see various statistics about your job; try to find the number of reducers used. How many reducers did you use? If your job finished before you had a chance to open the browser, it will be in the list of finished jobs, not the list of running jobs, but you can still see all the same information by clicking on it.
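Before submitting a streaming job, it can save time to check your scripts locally: a streaming job is essentially "mapper | sort | reducer" over lines of text, which is easy to simulate on one machine. A rough sketch (the function name is ours, and this only approximates Hadoop's shuffle, ignoring partitioning across reducers):

```python
import subprocess

def simulate_streaming(input_path, mapper, reducer):
    """Approximate what Hadoop streaming does, locally:
    input -> mapper -> sort -> reducer -> output lines."""
    with open(input_path) as f:
        mapped = subprocess.run([mapper], stdin=f,
                                capture_output=True, text=True).stdout
    # Streaming sorts mapper output by key before the reducer runs.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    reduced = subprocess.run([reducer], input=shuffled,
                             capture_output=True, text=True).stdout
    return reduced
```

Trying it with /bin/cat as both mapper and reducer shows the behavior asked about above: in this simulation the output is simply the sorted input lines.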

3.4 Secondary sorting

As was mentioned earlier, the key/value pairs are obtained by splitting the mapper output on the first tab character in the line. This can be changed using the stream.map.output.field.separator and stream.num.map.output.key.fields variables. For example, if I want the key to be everything up to the second "-" character in the line, I would add the following:

-jobconf stream.map.output.field.separator=- \
-jobconf stream.num.map.output.key.fields=2

Hadoop also comes with a partitioner class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, which is useful for cases where you want to perform a secondary sort on the keys. Imagine you have the following list of IPs:

192.168.2.1
190.191.34.38
161.53.72.111
192.168.1.1
161.53.72.23

You want to partition the data so that addresses with the same first 16 bits are processed by the same reducer. However, you also want each reducer to see the data sorted according to the first 24 bits of the address. Using the mentioned partitioner class, you can tell Hadoop how to group the data to be processed by the reducers. You do this using the following options:

-jobconf map.output.key.field.separator=.
-jobconf num.key.fields.for.partition=2

The first option tells Hadoop what character to use as a separator (just like in the previous example), and the second one tells it how many fields from the key to use for partitioning. Knowing this, here is how we would solve the IP address example (assuming that the addresses are in /user/hadoop/input):

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -input /user/hadoop/input \
 -output /user/hadoop/output \
 -mapper cat \
 -reducer cat \
 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
 -jobconf stream.map.output.field.separator=. \
 -jobconf stream.num.map.output.key.fields=3 \
 -jobconf map.output.key.field.separator=. \
 -jobconf num.key.fields.for.partition=2

The line with -jobconf num.key.fields.for.partition=2 tells Hadoop to partition IPs based on the first 16 bits (the first two numbers), and -jobconf stream.num.map.output.key.fields=3 tells it to sort the IPs according to everything before the third separator (the dot in this case); this corresponds to the first 24 bits of the address.

Task: copy the file /user/sasa/data/secondary to your input directory. Lines in this file have the following format:

LastName.FirstName.Address.PhoneNo

Using what you learned in this section, your task is to:

1. Partition the data so that all people with the same last name go to the same reducer.
2. Partition the data so that all people with the same last name go to the same reducer, and also make sure that the lines are sorted according to first name.
3. Partition the data so that all the people with the same first and last name go to the same reducer, and that they are sorted according to address.
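To make the partitioning step concrete, here is a pure-Python sketch of what "partition on the first N separator-delimited fields of the key" means (an illustration of the idea only, not Hadoop's actual KeyFieldBasedPartitioner code; the function name is ours):

```python
def partition(key, num_fields, num_reducers, separator="."):
    """Choose a reducer from the first num_fields separator-delimited
    fields of the key (the role of num.key.fields.for.partition)."""
    prefix = separator.join(key.split(separator)[:num_fields])
    return hash(prefix) % num_reducers

# Addresses sharing their first two fields (first 16 bits) are routed
# to the same reducer, whichever reducer that turns out to be.
assert partition("192.168.2.1", 2, 4) == partition("192.168.1.1", 2, 4)
assert partition("161.53.72.111", 2, 4) == partition("161.53.72.23", 2, 4)
```

The sort order each reducer then sees is governed separately by stream.num.map.output.key.fields, which is what makes the secondary sort possible.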