Word count example
Abdalrahman Alsaedi




To run word count in AWS you have two different ways: either use the already existing WordCount sample program, or write your own.

First: Using the AWS word count program

Prerequisites
a. An S3 bucket
b. An EC2 key pair

Steps
1. Log in to your AWS account.
2. From Amazon Web Services, go to EMR.
3. Click on Create Cluster.
4. On the Create Cluster page, choose Go to advanced options.
5. Now click on Configure sample application.
6. In Select sample application, click the drop-down menu and select Word Count. In Output location, just replace <bucket-name> with one of the buckets found in your S3 account. Click Ok.
7. In the Security and Access section, choose your key pair from the drop-down menu.
8. As confirmation, you will see in the Steps section that you have one step, called Word count.
9. Click Create Cluster. The cluster creation page will appear, and the state of your cluster will be Starting. The cluster is now being created; once creation finishes, its state will change to Running.
10. Go to the S3 console and open the bucket you chose in step 6. You will find the output folder that contains the output, similar to the figure below.
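If you prefer the command line over the console, roughly the same cluster can be created with the AWS CLI. This is only a minimal sketch, not the exact console flow: it assumes the CLI is installed and configured with your credentials, and the release label, instance type, key pair, and JAR/bucket locations are placeholders you must replace with your own values.

    # create a small Hadoop cluster and attach one custom-JAR step to it
    aws emr create-cluster \
        --name "WordCountSample" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop \
        --ec2-attributes KeyName=<your-key-pair> \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --steps Type=CUSTOM_JAR,Name="Word count",ActionOnFailure=TERMINATE_CLUSTER,Jar=<jar-location>,Args=[<input>,<output>]

The console's sample application does essentially the same thing: it fills in the sample JAR and its arguments for you.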

Second: Create your own word count program using Eclipse

As another way, we can create our own word count program: use Eclipse to write it, as follows.

Prerequisites
a. Install Eclipse on your machine.
b. Install Hadoop on your machine.
Note: you will find links for installing the above packages under Topics Covered on Dr Gupta's web page.

Steps
1. Start Eclipse.
2. From File, go to New project, then select Java project.
3. Enter a name for your project, let us say wordcount, and press Finish.
4. Now we need to include some Hadoop dependencies in our project. To do so, right-click on the project name in the project browser and choose Build Path, then Configure Build Path...

5. Click on Libraries, then Add External JARs, then go to the location where you installed Hadoop on your machine and select all JARs. Click Apply, and then OK.
6. We need to add more JARs. Do the same thing we did in step 5, but this time choose the client subfolder found in the Hadoop folder. Select all JARs. Click Apply, and then OK.
7. Now our project is ready for some code. Right-click on the project name and select New, then Package; name the package wordcount and click Finish. Now right-click on the src folder and select New, then Class; name the class WordCount and click Finish.
8. Copy the following code and paste it into the class.

    package wordcount;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer: sums the counts collected for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);   // the reducer also serves as a combiner
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // args[0] = input path, args[1] = output path
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

9. Note that in some cases you will find two errors, indicated at public class WordCount { and at JobConf conf = new JobConf(WordCount.class);. This happens because your class does not have exactly the same name, so either rename your class or change this name to match your class.
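If you would rather not use Eclipse's exporter, the same class can be compiled and packaged from a terminal. A minimal sketch, assuming the hadoop command is on your PATH (hadoop classpath prints the same JARs you added through the build path above) and that WordCount.java sits in a wordcount/ directory matching its package; the file and directory names are illustrative:

    # compile against the installed Hadoop JARs, then package a runnable JAR
    mkdir -p classes
    javac -classpath "$(hadoop classpath)" -d classes wordcount/WordCount.java
    jar cf wordcount.jar -C classes .

    # optional: test locally before uploading to S3
    hadoop jar wordcount.jar wordcount.WordCount <input-dir> <output-dir>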

10. Now we need to compile the class before exporting it. First, right-click on the package name and go to Run As, then Run Configurations; in Main class, type the name of your class, then click Apply, and then Run.
11. As a last step, go to File > Export > Java > JAR file, then select your file and give it a name and a specific location. You will need this location, because you will upload this JAR to S3 later.
12. To run this file in AWS, we follow steps 1 to 4 from First: Using the AWS word count program. We also need to upload our JAR to one of our S3 buckets, along with the file whose words we want to count. Note: sometimes you will find other applications that would be installed on your cluster, listed in the Software Application / Application to be installed section. You do not need any application other than Hadoop, so remove all the other applications by pressing X.
13. In the Security and Access section, choose your key pair from the drop-down menu.
14. In the Steps section, select Custom JAR from the drop-down menu and press Configure and add. Give your step a name, let us say WordCountCS6030. In JAR location, browse to the location where you saved your JAR. In Arguments we need to specify the location of the file the word count will run on and the location of the folder that will contain the output file(s); write the following: s3n://<location_of_text_file>/your_text_file s3n://<output_bucket>/
15. In Action on failure you can specify what AWS will do if your cluster fails to run your JAR. Choose the action you want, let us say Terminate cluster. Finally, click Add.
16. Click Create Cluster. The cluster creation page will appear, and the state of your cluster will be Starting. The cluster is now being created; once creation finishes, its state will change to Running.
17. Go to the S3 console and open the bucket you chose as the output bucket. You will find the output folder that contains the output.
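For reference, the job writes one part file per reducer into the output folder (for example part-00000), and TextOutputFormat separates each key and value with a tab. The contents therefore look roughly like the lines below; the words and counts here are made up:

    counting	3
    hadoop	12
    the	41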

Third: Using Hue to run the word count program

To run the word count example in Hue, follow these steps:
1. Run your Cloudera VM.
2. Go to Hue in the task bar.
3. Go to File Browser.
4. From the folders that are displayed, click through until you reach the Oozie folder.
5. Click on this folder, then click on New and then on Directory. Give your directory a name, let us say WordCount.
6. Click on your new folder and click on Upload, next to New in the previous figure. Now upload the file whose words you want to count; let us say we will work on the same file we have from our CS6030 class.
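Hue's File Browser is a view onto HDFS, so the same directory creation and upload can also be done from a terminal inside the Cloudera VM. This is only a sketch: it assumes the Oozie folder corresponds to the /oozie path used by the Pig script in the next step, and the input file name is made up.

    # create the input directory in HDFS and upload the text file into it
    hdfs dfs -mkdir -p /oozie/wordcount
    hdfs dfs -put my_text_file.txt /oozie/wordcount/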

7. Now, go to Query Editor in the task bar at the top and select Pig from the drop-down menu.
8. In the editor, copy and paste the following Pig Latin code:

    A = LOAD '/oozie/wordcount' USING TextLoader() AS (words:chararray);  -- read each line as one chararray
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));                          -- split each line into one word per record
    C = GROUP B BY $0;                                                    -- group identical words together
    D = FOREACH C GENERATE group, COUNT(B);                               -- count the occurrences of each word
    STORE D INTO '/oozie/wcresults';                                      -- write the (word, count) pairs

9. Click on Submit on the left side of the editor.
10. Click on Save to save the script.
11. To see the result, click on File Browser and go to the Oozie folder; you will see there a folder named wcresults. Click on this folder to see the result; it will look something like the figure below.
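To inspect the result without the File Browser, you can also print the part files from a terminal in the VM. STORE uses PigStorage by default, so each output line is a tab-separated (word, count) pair; the path below simply matches the STORE location in the script above:

    hdfs dfs -cat /oozie/wcresults/part*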