CS 455 Spring 2015. Word Count Example




Before starting, make sure that you have HDFS and YARN running, using sbin/start-dfs.sh and sbin/start-yarn.sh.

Download text copies of at least 3 books from Project Gutenberg (http://www.gutenberg.org/):

st-vrain> la
total 1284
-rw------- 1 class  84358 Mar 26 23:47 Indiscreet_Lettert.txt
-rw------- 1 class 792920 Mar 26 23:48 Tale_Of_Two_Cities.txt
-rw------- 1 class 421884 Mar 26 23:49 Tom_Sawyer.txt

Create a directory in your local space on HDFS to store these books:

st-vrain> $HADOOP_HOME/bin/hdfs dfs -mkdir /cs455
st-vrain> $HADOOP_HOME/bin/hdfs dfs -mkdir /cs455/books
st-vrain> $HADOOP_HOME/bin/hdfs dfs -ls /cs455/
Found 1 items
drwxr-xr-x   - cs455 supergroup          0 2015-03-26 23:51 /cs455/books

Move the books from NFS into HDFS:

st-vrain> $HADOOP_HOME/bin/hdfs dfs -put *.txt /cs455/books
st-vrain> $HADOOP_HOME/bin/hdfs dfs -ls /cs455/books
Found 3 items
-rw-r--r--   3 cs455 supergroup      84358 2015-03-26 23:55 /cs455/books/indiscreet_lettert.txt
-rw-r--r--   3 cs455 supergroup     792920 2015-03-26 23:55 /cs455/books/tale_of_two_cities.txt
-rw-r--r--   3 cs455 supergroup     421884 2015-03-26 23:55 /cs455/books/tom_sawyer.txt

You can also check that the books are there via the HDFS web portal.

Download the source code of the word count example from the CS 455 course web site:

st-vrain> wget http://www.cs.colostate.edu/~cs455/cs455-wordcount-sp15.tar.gz

Extract the tarball:

st-vrain> tar xvf cs455-wordcount-sp15.tar.gz

This includes an Ant build file called build.xml, which is used to compile the source and package it into a jar. After compiling, it will create the jar file inside the ./dist directory. You can use this build.xml file as-is for HW3-PC.

Type ant to compile the source and create the jar file:

st-vrain> ant
Buildfile: /s/bach/a/class/cs455/sp15-hadoop/word-count/build.xml

init:

compile:

dist:

BUILD SUCCESSFUL
Total time: 0 seconds
st-vrain> ls ./dist/
wordcount.jar
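The tarball contains the Java source for the word-count job that you will compile and run below. As a rough illustration of what the map and reduce phases compute (this Python version is only a sketch, not the course's Java code), the logic boils down to:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every whitespace-delimited token.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reducer: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Stand-ins for the input splits read from /cs455/books:
text = ["it was the best of times", "it was the worst of times"]
pairs = [kv for line in text for kv in map_phase(line)]
counts = reduce_phase(pairs)
print(counts["it"], counts["times"])  # 2 2
```

In the real job, the framework performs the shuffle between the two phases, grouping all (word, 1) pairs by key before they reach a reducer.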

Run the jar in YARN:

st-vrain> $HADOOP_HOME/bin/hadoop jar dist/wordcount.jar cs455.hadoop.wordcount.WordCountJob /cs455/books /cs455/wordcount-out
2015-03-27 00:36:53,833 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at st-vrain/129.82.47.128:46783
2015-03-27 00:36:54,325 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(153)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2015-03-27 00:36:54,606 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 3
2015-03-27 00:36:54,696 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(494)) - number of splits:3
2015-03-27 00:36:54,909 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(583)) - Submitting tokens for job: job_1427438142863_0001
2015-03-27 00:36:55,231 INFO [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(251)) - Submitted application application_1427438142863_0001
2015-03-27 00:36:55,270 INFO [main] mapreduce.Job (Job.java:submit(1300)) - The url to track the job: http://st-vrain:8088/proxy/application_1427438142863_0001/
2015-03-27 00:36:55,271 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) - Running job: job_1427438142863_0001
2015-03-27 00:37:01,455 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1366)) - Job job_1427438142863_0001 running in uber mode : false
2015-03-27 00:37:01,456 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) -  map 0% reduce 0%
2015-03-27 00:37:07,530 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) -  map 100% reduce 0%
2015-03-27 00:37:16,599 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) -  map 100% reduce 100%
2015-03-27 00:37:17,631 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1384)) - Job job_1427438142863_0001 completed successfully
2015-03-27 00:37:17,773 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1391)) - Counters: 49
        File System Counters
                FILE: Number of bytes read=549269
                FILE: Number of bytes written=1522267
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1299517
                HDFS: Number of bytes written=314863
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=3
                Launched reduce tasks=1
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=10788
                Total time spent by all reduces in occupied slots (ms)=6841
                Total time spent by all map tasks (ms)=10788
                Total time spent by all reduce tasks (ms)=6841
                Total vcore-seconds taken by all map tasks=10788
                Total vcore-seconds taken by all reduce tasks=6841
                Total megabyte-seconds taken by all map tasks=11046912
                Total megabyte-seconds taken by all reduce tasks=7005184
        Map-Reduce Framework
                Map input records=27088
                Map output records=226606
                Map output bytes=2171352
                Map output materialized bytes=549281
                Input split bytes=355
                Combine input records=226606
                Combine output records=38119
                Reduce input groups=29082
                Reduce shuffle bytes=549281
                Reduce input records=38119
                Reduce output records=29082
                Spilled Records=76238
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=128
                CPU time spent (ms)=8450
                Physical memory (bytes) snapshot=956612608
                Virtual memory (bytes) snapshot=3741761536
                Total committed heap usage (bytes)=805306368
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1299162
        File Output Format Counters
                Bytes Written=314863

Check output in HDFS:
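Notice how the counters above trace the combiner's effect: the mappers emit 226,606 (word, 1) pairs, the combiner collapses them to 38,119 records before the shuffle, and the reducer emits 29,082 distinct words. A minimal Python sketch (illustrative only, with made-up splits) of why a per-split combiner shrinks the shuffle without changing the final totals:

```python
from collections import Counter

# Two hypothetical input splits, each handled by one map task.
splits = [
    "the quick brown fox the fox".split(),
    "the lazy dog the dog".split(),
]

# Without a combiner: every (word, 1) pair is shuffled to the reducer.
map_output = [(w, 1) for split in splits for w in split]

# With a combiner: counts are pre-aggregated per split before the shuffle.
combine_output = []
for split in splits:
    combine_output.extend(Counter(split).items())

# The reducer produces the same totals either way.
reduce_a = Counter()
for w, n in map_output:
    reduce_a[w] += n
reduce_b = Counter()
for w, n in combine_output:
    reduce_b[w] += n

assert reduce_a == reduce_b
print(len(map_output), len(combine_output))  # 11 7
```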

st-vrain> $HADOOP_HOME/bin/hdfs dfs -ls /cs455/wordcount-out
Found 2 items
-rw-r--r--   3 cs455 supergroup          0 2015-03-27 00:37 /cs455/wordcount-out/_SUCCESS
-rw-r--r--   3 cs455 supergroup     314863 2015-03-27 00:37 /cs455/wordcount-out/part-r-00000

You can also check the output in the web portal; click on the part-r-00000 file and it will be downloaded.

NOTE: if you run this repeatedly, you will need to either modify your output folder name or delete it between runs (e.g., $HADOOP_HOME/bin/hdfs dfs -rm -r /cs455/wordcount-out).

Check the following link for the complete set of HDFS commands:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
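With Hadoop's default text output format, each line of part-r-00000 is a key and value separated by a tab, i.e. "word<TAB>count". If you want to inspect the results locally, something like the following Python sketch works (the file name and sample numbers here are just stand-ins for your downloaded output):

```python
from collections import Counter

def load_counts(path):
    # Each line of a TextOutputFormat file is "key<TAB>value".
    counts = Counter()
    with open(path) as f:
        for line in f:
            word, _, count = line.rstrip("\n").partition("\t")
            counts[word] = int(count)
    return counts

# Tiny stand-in file with made-up counts, just to demonstrate the format:
with open("part-r-00000.sample", "w") as f:
    f.write("the\t8176\nand\t4877\nof\t4000\n")

top = load_counts("part-r-00000.sample").most_common(2)
print(top)  # [('the', 8176), ('and', 4877)]
```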