Hadoop Hands-On Exercises
Lawrence Berkeley National Lab
Oct 2011

We will cover:
- Training accounts / user agreement forms
- Test access to Carver
- HDFS commands
- Monitoring
- Running the word count example
- Simple streaming with Unix commands
- Streaming with simple scripts
- Streaming Census example
- Pig examples
- Additional exercises

Instructions: http://tinyurl.com/nerschadoopoct

Login and Environment
$ ssh [username]@carver.nersc.gov
$ echo $SHELL
The shell should be bash.

Remote Participants
Visit: http://maghdp01.nersc.gov:50030/
See http://magellan.nersc.gov (go to Using Magellan -> Creating a SOCKS proxy).

Environment Setup
$ ssh [username]@carver.nersc.gov
$ echo $SHELL
If your shell doesn't show /bin/bash, please change your shell:
$ bash
Set up your environment to use Hadoop on the Magellan system:
$ module load tig hadoop

Hadoop Command
hadoop command [genericoptions] [commandoptions]
Examples:
  command: fs, jar, job
  [genericoptions]: -conf, -D, -files, -libjars, -archives
  [commandoptions]: -ls, -submit
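For instance, a minimal invocation combining all three parts (the file and directory names here are placeholders):
$ hadoop fs -D dfs.replication=2 -put localfile input/
Here fs is the command, -D dfs.replication=2 is a generic option (generic options must come before command options), and -put localfile input/ is the command option with its arguments.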

HDFS Commands [1]
$ hadoop fs -ls
If you see an error, create your HDFS home directory first, where [username] is your training account username:
$ hadoop fs -mkdir /user/[username]
Create a test file containing the two lines below (repeat for testfile2):
$ vi testfile1
This is file 1
This is to test HDFS
$ hadoop fs -mkdir input
$ hadoop fs -put testfile* input
You can get help on commands:
$ hadoop fs -help
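If you prefer not to use vi, a heredoc creates the same file; this is just an equivalent sketch:
$ cat > testfile1 <<'EOF'
This is file 1
This is to test HDFS
EOF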

HDFS Commands [2]
$ hadoop fs -cat input/testfile1
$ hadoop fs -cat input/testfile*
Download the files from HDFS into a local directory called input and check that the input directory is there:
$ hadoop fs -get input input
$ ls input/
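To confirm that the local copy matches what is stored in HDFS, you can diff the two (a quick sanity check, not part of the original exercise; it relies on bash process substitution):
$ diff <(hadoop fs -cat input/testfile1) input/testfile1
No output means the files are identical.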

Monitoring
JobTracker web UI: http://maghdp01.nersc.gov:50030/
NameNode web UI: http://maghdp01.nersc.gov:50070/
List running jobs from the command line:
$ hadoop job -list
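Once a job ID appears in the -list output, you can query that specific job (the job ID below is only a placeholder):
$ hadoop job -status job_201110030000_0001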

Wordcount Example
Put the input in HDFS:
$ hadoop fs -mkdir wordcount-in
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/wordcount/* wordcount-in/
Run the example:
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar wordcount wordcount-in wordcount-op
View the output:
$ hadoop fs -ls wordcount-op
$ hadoop fs -cat wordcount-op/part-r-00000
$ hadoop fs -cat wordcount-op/p* | grep Darcy
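You can also copy the whole output directory back to the local filesystem for inspection (the local directory name is arbitrary):
$ hadoop fs -get wordcount-op wordcount-op-local
$ cat wordcount-op-local/part-r-00000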

Wordcount: Number of reduces
$ hadoop dfs -rmr wordcount-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar wordcount -Dmapred.reduce.tasks=4 wordcount-in wordcount-op
Watch the job at http://maghdp01.nersc.gov:50030/
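With four reduces, the output directory should now contain four part files, one per reducer (following the usual part-r-NNNNN naming):
$ hadoop fs -ls wordcount-op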

Wordcount: GPFS
Set up permissions for the Hadoop user [ONE-TIME]:
$ mkdir /global/scratch/sd/[username]/hadoop
$ chmod -R 755 /global/scratch/sd/[username]
$ chmod -R 777 /global/scratch/sd/[username]/hadoop/
Run the job:
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/hadoop-0.20.2+228-examples.jar wordcount -Dfs.default.name=file:/// /global/scratch/sd/lavanya/hadooptutorial/wordcount/ /global/scratch/sd/[username]/hadoop/wordcount-gpfs/
Set permissions for yourself:
$ fixperms.sh /global/scratch/sd/[username]/hadoop/wordcount-gpfs/

Streaming with Unix Commands
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*-streaming.jar -input wordcount-in -output wordcount-streaming-op -mapper /bin/cat -reducer /usr/bin/wc
$ hadoop fs -cat wordcount-streaming-op/p*
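You can simulate the same pipeline locally on one input file; this mimics map, then reduce, in a single process (the sort step is omitted because cat's output order does not matter to wc):
$ cat testfile1 | /bin/cat | /usr/bin/wc
Each reducer output line reports the line, word, and character counts for the records that reducer received.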

Streaming with Unix Commands / GPFS
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*-streaming.jar -Dfs.default.name=file:/// -input /global/scratch/sd/lavanya/hadooptutorial/wordcount/ -output /global/scratch/sd/[username]/hadoop/wordcount-streaming-op -mapper /bin/cat -reducer /usr/bin/wc
$ fixperms.sh /global/scratch/sd/[username]/hadoop/wordcount-streaming-op

Streaming with Scripts
$ mkdir simple-streaming-example
$ cd simple-streaming-example
Create a one-line script that simply copies its input to its output, and make it executable:
$ vi cat.sh
cat
$ chmod 755 cat.sh
Now let us test this:
$ hadoop fs -mkdir cat-in
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/cat/in/* cat-in/
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -mapper cat.sh -input cat-in -output cat-op -file cat.sh
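You can also check cat.sh locally, the same way the census mapper is tested later (just a sanity check on made-up input):
$ echo "hello world" | ./cat.sh
The script should print the line back unchanged, which is exactly what it does with each map input record.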

Streaming with scripts: Number of reducers and mappers
Run a map-only job (with zero reduces, mapper output is written directly to HDFS):
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -Dmapred.reduce.tasks=0 -mapper cat.sh -input cat-in -output cat-op -file cat.sh
Raise the minimum split size so the input is divided into fewer splits, and hence fewer map tasks:
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -Dmapred.min.split.size=91212121212 -mapper cat.sh -input cat-in -output cat-op -file cat.sh

Census sample
$ mkdir census
$ cd census
$ cp /global/scratch/sd/lavanya/hadooptutorial/census/censusdata.sample .

Mapper
# The code is available in
$ vi mapper.sh
while read line; do
  if [[ "$line" == *Alabama* ]]; then
    echo -e "Alabama\t1"
  fi
  if [[ "$line" == *Alaska* ]]; then
    echo -e "Alaska\t1"
  fi
done
$ chmod 755 mapper.sh
$ cat censusdata.sample | ./mapper.sh
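The mapper only does substring matching, so you can exercise it with any lines that mention the two states (made-up input, purely for illustration):
$ printf 'x Alabama y\nfoo Alaska bar\n' | ./mapper.sh
This should print "Alabama", a tab, and "1", then the same for Alaska.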

Census Run
$ hadoop fs -mkdir census
$ hadoop fs -put /global/scratch/sd/lavanya/hadooptutorial/census/censusdata.sample census/
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -mapper mapper.sh -input census -output census-op -file mapper.sh -reducer /usr/bin/wc
$ hadoop fs -cat census-op/p*

Census Run: Mappers and Reducers
$ hadoop fs -rmr census-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=2 -mapper mapper.sh -input census -output census-op/ -file mapper.sh -reducer /usr/bin/wc

Census: Custom Reducer
The reducer counts consecutive occurrences of each key in its sorted input and prints a total whenever the key changes:
$ vi reducer.sh
last_key="Alabama"
count=0
while read line; do
  key=$(echo "$line" | cut -f1)
  val=$(echo "$line" | cut -f2)
  if [[ "$last_key" == "$key" ]]; then
    let "count=count+1"
  else
    echo "**" $last_key $count
    last_key=${key}
    count=1
  fi
done
echo "**" $last_key $count
$ chmod 755 reducer.sh
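You can rehearse the whole job locally by chaining the mapper, a sort (standing in for the shuffle), and the reducer; this is only a single-process approximation of what Hadoop does in parallel:
$ cat censusdata.sample | ./mapper.sh | sort | ./reducer.sh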

Census Run with custom reducer
$ hadoop fs -rmr census-op
$ hadoop jar /usr/common/tig/hadoop/hadoop-0.20.2+228/contrib/streaming/hadoop*streaming*.jar -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=2 -mapper mapper.sh -input census -output census-op -file mapper.sh -reducer reducer.sh -file reducer.sh
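With two reduces the job should leave two part files in the output directory; assuming the usual streaming naming, you can inspect them with:
$ hadoop fs -ls census-op
$ hadoop fs -cat census-op/part-00000 census-op/part-00001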