Hadoop Hands-On Exercises


Hadoop Hands-On Exercises Lawrence Berkeley National Lab July 2011

We will
- Training accounts/user agreement forms
- Test access to carver
- HDFS commands
- Monitoring
- Run the word count example
- Simple streaming with Unix commands
- Streaming with simple scripts
- Streaming Census example
- Pig examples
- Additional exercises

Login and Environment
$ ssh [username]@carver.nersc.gov
$ echo $SHELL
The shell should be bash.
SOCKS proxy: http://magellan.nersc.gov (go to Using Magellan -> Creating a SOCKS proxy)
The printed handouts are also online:
http://tinyurl.com/6frxxur
/global/scratch/sd/lavanya/hadooptutorial

Environment Setup
$ ssh [username]@carver.nersc.gov
$ echo $SHELL
If your shell doesn't show /bin/bash, please change your shell:
$ bash
Set up your environment to use Hadoop on the Magellan system:
$ module load tig hadoop

Hadoop Command
hadoop command [genericoptions] [commandoptions]
Examples:
command: fs, jar, job
[genericoptions]: -conf, -D, -files, -libjars, -archives
[commandoptions]: -ls, -submit

HDFS Commands [1]
$ hadoop fs -ls
If you see an error, do the following, where [username] is your training account username:
$ hadoop fs -mkdir /user/[username]
Create a small test file with the contents below (repeat for testfile2):
$ vi testfile1
This is file 1
This is to test HDFS
Copy the test files into HDFS:
$ hadoop fs -mkdir input
$ hadoop fs -put testfile* input
You can get help on commands:
$ hadoop fs -help

HDFS Commands [2]
$ hadoop fs -cat input/testfile1
$ hadoop fs -cat input/testfile*
Download the files from HDFS into a local directory called input and check that the directory exists:
$ hadoop fs -get input input
$ ls input/

Monitoring
JobTracker web UI: http://maghdp01.nersc.gov:50030/
NameNode (HDFS) web UI: http://maghdp01.nersc.gov:50070/
List jobs from the command line:
$ hadoop job -list

Wordcount Example
Input in HDFS:
$ hadoop fs -mkdir wordcount-in
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/wordcount/* wordcount-in/
Run example:
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar \
    wordcount wordcount-in wordcount-op
View output:
$ hadoop fs -ls wordcount-op
$ hadoop fs -cat wordcount-op/part-r-00000
$ hadoop fs -cat wordcount-op/p* | grep Darcy
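To get a feel for what the wordcount job computes, the same result can be approximated on a small input with a plain Unix pipeline. This is only a local sketch and not how Hadoop executes the job: the sample text is made up for illustration, tr plays the role of the map step (one word per line), sort stands in for the shuffle, and uniq -c plays the reducer.

```shell
#!/bin/sh
# Local sketch of word count; the sample text is hypothetical.
sample=$(printf 'Mr Darcy danced\nMr Darcy smiled\n')

# "map":     split on spaces so each word is on its own line
# "shuffle": sort so that identical words become adjacent
# "reduce":  count each run of identical words, then print word<TAB>count
counts=$(printf '%s\n' "$sample" \
    | tr -s ' ' '\n' \
    | sort \
    | uniq -c \
    | awk '{print $2 "\t" $1}')

printf '%s\n' "$counts"
```

The output has one line per distinct word, in the same word-then-count layout as the part-r-00000 file produced by the example job.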

Wordcount: Number of reduces
$ hadoop dfs -rmr wordcount-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar \
    wordcount -Dmapred.reduce.tasks=4 wordcount-in wordcount-op
Watch the job at http://maghdp01.nersc.gov:50030/
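With four reduce tasks, each key is sent to exactly one reducer and each reducer writes its own part file. The sketch below illustrates the idea of hashing a key modulo the number of reducers. It is an assumption-laden illustration: cksum is used here as a stand-in hash (Hadoop's default partitioner uses the key's Java hash code instead), so the file assignments it prints will not match a real run, and the sample words are hypothetical.

```shell
#!/bin/sh
# Sketch of hash partitioning; cksum is a stand-in for Hadoop's key hash.
num_reducers=4

# Print the reducer index (0 .. num_reducers-1) for a given key.
partition_for() {
    h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( h % num_reducers ))
}

# Hypothetical keys; every occurrence of a key lands in the same part file.
for word in Darcy Elizabeth Bingley Jane; do
    printf '%s -> part-r-%05d\n' "$word" "$(partition_for "$word")"
done
```

Because the assignment depends only on the key, all counts for one word end up in a single output file, which is why the results can be read with a glob such as wordcount-op/p*.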

Wordcount: GPFS
Set up permissions for the Hadoop user [ONE-TIME]:
$ mkdir /global/scratch/sd/[username]/hadoop
$ chmod -R 755 /global/scratch/sd/[username]
$ chmod -R 777 /global/scratch/sd/[username]/hadoop/
Run job:
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar \
    wordcount -Dfs.default.name=file:/// \
    /global/scratch/sd/lavanya/hadooptutorial/wordcount/ \
    /global/scratch/sd/[username]/hadoop/wordcount-gpfs/
Set permissions for yourself:
$ fixperms.sh /global/scratch/sd/[username]/hadoop/wordcount-gpfs/

Streaming with Unix Commands
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*-streaming.jar \
    -input wordcount-in -output wordcount-streaming-op \
    -mapper /bin/cat -reducer /usr/bin/wc
$ hadoop fs -cat wordcount-streaming-op/p*
GPFS:
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*-streaming.jar \
    -Dfs.default.name=file:/// \
    -input /global/scratch/sd/lavanya/hadooptutorial/wordcount/ \
    -output /global/scratch/sd/[username]/hadoop/wordcount-streaming-op \
    -mapper /bin/cat -reducer /usr/bin/wc
$ fixperms.sh /global/scratch/sd/[username]/hadoop/wordcount-streaming-op
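A streaming job feeds input lines to the mapper on standard input, sorts the mapper output, and feeds the sorted lines to the reducer. For a single reducer this can be approximated locally as mapper | sort | reducer. The sketch below does that with cat and wc on a made-up three-line input; the sample text and variable names are assumptions for illustration, and the numbers on the cluster depend on the real input.

```shell
#!/bin/sh
# Local approximation of: -mapper /bin/cat -reducer /usr/bin/wc
# The sample input is hypothetical.
sample=$(printf 'line one\nline two\nline three\n')

# mapper (cat) | shuffle (sort) | reducer (wc)
result=$(printf '%s\n' "$sample" | cat | sort | wc)

# wc prints: lines words bytes
lines=$(printf '%s\n' "$result" | awk '{print $1}')
words=$(printf '%s\n' "$result" | awk '{print $2}')

echo "lines=$lines words=$words"
```

Running a candidate mapper and reducer through such a pipe first is a cheap way to catch script errors before submitting a job.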

Streaming with Scripts
$ mkdir simple-streaming-example
$ cd simple-streaming-example
Create cat.sh with the single line shown below:
$ vi cat.sh
cat
Now let us test this:
$ hadoop fs -mkdir cat-in
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/cat/in/* cat-in/
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -mapper cat.sh -input cat-in -output cat-op -file cat.sh

Streaming with scripts: Number of reducers and mappers
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -Dmapred.reduce.tasks=0 \
    -mapper cat.sh -input cat-in -output cat-op -file cat.sh
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -Dmapred.min.split.size=91212121212 \
    -mapper cat.sh -input cat-in -output cat-op -file cat.sh

Census sample
$ mkdir census
$ cd census
$ cp /global/scratch/sd/lavanya/hadooptutorial/census/censusdata.sample .

Mapper
#The code is available in
$ vi mapper.sh
#!/bin/bash
# Emit "<state><TAB>1" for every input line that mentions the state.
while read line; do
    if [[ "$line" == *Alabama* ]]; then
        echo -e "Alabama\t1"
    fi
    if [[ "$line" == *Alaska* ]]; then
        echo -e "Alaska\t1"
    fi
done
$ chmod 755 mapper.sh
$ cat censusdata.sample | ./mapper.sh

Census Run
$ hadoop fs -mkdir census
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/census/censusdata.sample census/
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -mapper mapper.sh -input census -output census-op \
    -file mapper.sh -reducer /usr/bin/wc
$ hadoop fs -cat census-op/p*

Census Run: Mappers and Reducers
$ hadoop fs -rmr census-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=2 \
    -mapper mapper.sh -input census -output census-op/ \
    -file mapper.sh -reducer /usr/bin/wc

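Census: Custom Reducer
$ vi reducer.sh
#!/bin/bash
# Input arrives sorted by key, so identical states are adjacent.
last_key="Alabama"
count=0
while read line; do
    # Unquoted $line turns the tab separator into a space for cut.
    key=`echo $line | cut -f1 -d' '`
    val=`echo $line | cut -f2 -d' '`
    if [[ "$last_key" = "$key" ]]; then
        let "count=count+1"
    else
        echo "**" $last_key $count
        last_key=${key}
        count=1
    fi
done
echo "**" $last_key $count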
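Before submitting the job, the mapper and reducer logic can be checked locally with mapper | sort | reducer. The sketch below is self-contained, so it uses two shell functions as simplified stand-ins for mapper.sh and reducer.sh rather than the scripts themselves; the reducer stand-in also skips printing a total until it has seen a first key. The sample records are hypothetical and not taken from censusdata.sample.

```shell
#!/bin/sh
# Stand-in for mapper.sh: emit "<State> 1" for each matching line.
mapper() {
    while read -r line; do
        case "$line" in *Alabama*) echo "Alabama 1" ;; esac
        case "$line" in *Alaska*)  echo "Alaska 1"  ;; esac
    done
}

# Stand-in for reducer.sh: count runs of identical keys in sorted input.
reducer() {
    last_key=""
    count=0
    while read -r key val; do
        if [ "$key" = "$last_key" ]; then
            count=$((count + 1))
        else
            # Print the finished group, unless this is the very first key.
            if [ -n "$last_key" ]; then echo "** $last_key $count"; fi
            last_key=$key
            count=1
        fi
    done
    if [ -n "$last_key" ]; then echo "** $last_key $count"; fi
}

# Hypothetical sample records (three for Alabama, two for Alaska).
sample=$(printf 'Autauga County, Alabama\nAnchorage, Alaska\nBaldwin County, Alabama\nDenali, Alaska\nMobile County, Alabama\n')

result=$(printf '%s\n' "$sample" | mapper | sort | reducer)
printf '%s\n' "$result"
# prints:
# ** Alabama 3
# ** Alaska 2
```

The sort step matters: the reducer only compares each key with the previous one, so unsorted input would split a state into several partial counts.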

Census Run with custom reducer
$ hadoop fs -rmr census-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar \
    -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=2 \
    -mapper mapper.sh -input census -output census-op \
    -file mapper.sh -reducer reducer.sh -file reducer.sh

Pig Basic Operations
- LOAD: loads data into a relational form
- FOREACH..GENERATE: adds or removes fields (columns)
- GROUP: groups data on a field
- JOIN: joins two relations
- DUMP/STORE: dumps a query result to the terminal or to a file
There are others, but these will be used for the exercises today.

Pig Example
Find the number of gene hits for each model in an hmmsearch (>100GB of output, 3 billion lines).
In bash:
bash# cat * | cut -f 2 | sort | uniq -c
In Pig:
> hits = LOAD '/data/bio/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
> amodels = FOREACH hits GENERATE model;
> models = GROUP amodels BY model;
> counts = FOREACH models GENERATE group, COUNT(amodels) as count;
> STORE counts INTO 'tcounts' USING PigStorage();

Pig - LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
- Pig has several built-in data types (chararray, float, int).
- PigStorage can parse standard line-oriented text files.
- Pig can be extended with custom load types written in Java.
- Pig doesn't read any data until triggered by a DUMP or STORE.

Pig - FOREACH..GENERATE, GROUP
Example:
amodels = FOREACH hits GENERATE model;
models = GROUP amodels BY model;
counts = FOREACH models GENERATE group, COUNT(amodels) as count;
- Use FOREACH..GENERATE to pick out specific fields or generate new fields. This is also referred to as a projection.
- GROUP creates a new record for each group, holding the group name and a bag of the tuples in that group.
- You can reference a specific field in a bag with <bag>.field (e.g. amodels.model).
- You can use aggregate functions like COUNT, MAX, etc. on a bag.
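The projection, grouping, and counting steps can be mimicked on a tiny input with awk, which may help when checking what a Pig script should return. This is only a local analogy under stated assumptions: the three tab-separated records and the model names are made up, and awk's in-memory array stands in for the bags that Pig builds per group.

```shell
#!/bin/sh
# Hypothetical tab-separated hits: id, model, value
hits=$(printf 'g1\tPF001\t10.5\ng2\tPF002\t7.0\ng3\tPF001\t3.2\n')

# FOREACH hits GENERATE model -> use only column 2
# GROUP amodels BY model      -> one array slot per distinct model
# COUNT(amodels)              -> number of rows that fell into each slot
counts=$(printf '%s\n' "$hits" \
    | awk -F '\t' '{ n[$2]++ } END { for (m in n) print m "\t" n[m] }' \
    | sort)

printf '%s\n' "$counts"
# prints (tab-separated):
# PF001	2
# PF002	1
```

Unlike this single-process version, Pig compiles the GROUP into a MapReduce job, so the grouping runs across the reduce tasks of the cluster.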

Pig Important Points
- Nothing really happens until a DUMP or STORE is performed.
- Use FILTER and FOREACH early to remove unneeded columns or rows and reduce temporary output.
- Use the PARALLEL keyword on GROUP operations to run more reduce tasks.

Pig - Exercise
Using the census data (path), compute the number of records for each state.