To reduce or not to reduce, that is the question

1 Running jobs on the Hadoop cluster

For part 1 of assignment 8, you should have gotten the word counting example from class compiling. To start, let's walk through the process of running code you've written on the Hadoop cluster.

1. Get the code compiled

   - If you're working on a lab machine: Eclipse already compiles the code for you, so you just need to cd into the bin directory of the project directory within the Eclipse workspace folder where the WordCount code resides. If you don't have separate src and bin directories (there's an option to select this when you set up a new project; in general, it's a good idea to do so), then just cd into the base of the project directory.

   - If you're working on your own computer:

     * Go to the src directory of the project directory within the Eclipse workspace folder where the WordCount code resides. If you don't have separate src and bin directories, then just cd into the base of the project directory.

     * Use scp (or whatever approach you use to copy files to basin) to copy all of the source files, in their package directory structure (in this case, demos), to a directory on basin.
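For example, a copy command might look like the following (the username and target directory here are hypothetical; substitute your own):

    scp -r demos/ yourusername@basin:~/hadoop-lab/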

2. Create a jar file

   Log in to basin and then cd to the folder where you copied the source files. Compile your code from the command line:

       javac -cp /usr/local/hadoop/hadoop-core-1.2.1.jar:. demos/*.java

   In this case there is just one package and one file; in general, you will need to list all of the packages that have files you want to compile at the end of the javac call.

   Hadoop requires that you generate a jar file containing all of your code in order to run it. To create a jar file, make sure you're in the base directory where you just compiled your code (or where Eclipse compiled it for you) and type the following:

       jar cvf demos.jar demos/*.class

   The first argument is the name of the jar file to create; everything else that follows is what should go into the jar file. In general, you will need to list all of the packages that have files you want to include at the end of the call to jar.

3. Running your code on the cluster

   Now that you have a jar file, you can run your code on the cluster. When running Hadoop programs, you use the jar argument to specify the jar file that contains the code, and then you specify which class's main method to run (including the package structure):

       hadoop jar demos.jar demos.WordCount

   You should see the usage statement printed out:

       WordCount <input_dir> <output_dir>

   Let's run the full job with some very simple data:

       hadoop jar demos.jar demos.WordCount /user/dkauchak/lab/ output

   This uses some data in my directory (just a couple of files) as input and will write the results to your own output directory. Take a look in the output directory and you should see a single file called part-00000. If you look at the contents of this file you should see:

       !       1
       .       1
       a       1
       also    2
       and     2
       another 1
       banana  1
       but     1
       file    3

       has     2
       is      2
       more    1
       some    2
       text    2
       the     1
       this    3
       with    1
       word    1
       words   2

2 WordCount variations

Now that we can compile and run our Hadoop programs, let's try out a few variants of our WordCount program.

NoOpReducer

As I mentioned in class, one of the best ways to debug your program (and to develop it incrementally) is to use a reducer that simply copies its input to the output. In the demos directory on the course web page I have included the NoOpReducer.java class. Download this file and put it into your Eclipse project. Now, change the line in the run method that sets the reducer to be:

    conf.setReducerClass(NoOpReducer.class);

Recompile your code and then run this variant on the cluster.
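If you want a picture of what such a pass-through reducer looks like, here is a minimal sketch, assuming the old-style org.apache.hadoop.mapred API and the Text/IntWritable types of a classic WordCount; the actual NoOpReducer.java in the demos directory may differ in its details:

    package demos;

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // A reducer that performs no reduction: every (key, value) pair it
    // receives is copied directly to the output.
    public class NoOpReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }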
A few thoughts on this:

- Don't forget to delete your output directory before you run it. Otherwise, you're going to get an exception.

- Both for this lab and when you're writing your own programs, you want to make the process of compiling, copying the files, generating the jar, etc. as simple as possible. My advice is to set up a few terminal windows and designate each window for a few commands. You can then simply use the up-arrow to rerun those commands for each variation. For example, my setup is:

  1. one terminal session for copying over the .java files
  2. one terminal session for compiling and creating the jar file
  3. one terminal session for deleting the old output directory, executing the hadoop command, and peeking at the output
  4. one terminal session for doing more involved things with HDFS, etc.
It can take a minute or two to set up, but I can go from a code change on my laptop to running on the Hadoop cluster in about 10 seconds.

Assuming this runs correctly, you should see something like:

    !       1
    .       1
    a       1
    also    1
    also    1
    and     1
    and     1
    another 1
    banana  1
    but     1
    file    1
    file    1
    file    1
    has     1
    has     1
    is      1
    is      1
    more    1
    some    1
    some    1
    text    1
    text    1
    the     1
    this    1
    this    1
    this    1
    with    1
    word    1
    words   1
    words   1

Combiners

In addition to just a map and a reduce phase, Hadoop also allows for a combiner phase. A combiner implements the reducer interface and runs in between the map phase and the reduce phase. However, there's a catch! Let's run the WordCount example with a combiner and see if you can figure out what the catch is. Uncomment the line

    conf.setCombinerClass(WordCountReducer.class);

in WordCount.java, recompile, etc., and then rerun on the Hadoop cluster. What is the combiner doing?

(Really think about it for a minute! Scroll down when you're ready for the answer.)
The combiner phase is often referred to as a mini-reduce phase. It does run a full reduce (notice that above we just used our WordCountReducer); however, it runs only locally, on the same machine where the map process ran. This means that it only reduces some of the data. Why might this be useful? Why is this useful for the word count example?
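As a concrete (made-up) illustration of what running the reducer locally buys you: suppose the input is split so that one mapper emits (this, 1), (this, 1), (file, 1) and another emits (this, 1), (file, 1). With WordCountReducer as a combiner, the first machine ships only (this, 2), (file, 1) across the network and the second ships (this, 1), (file, 1); the real reduce phase still sees all the counts for each word and produces (this, 3), (file, 2), but far fewer intermediate pairs have to be transferred and sorted.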
3 Inverted Index

Also in the demos directory online, I have included another of the classic MapReduce examples, called LineIndexer. Download this file into the demos package in your Eclipse workspace and look through the code to see if you can figure out what it does.

The LineIndexer example includes one thing we haven't seen before: the use of the Reporter object. Here we're using it to get the split that we're working on, and then using that to get the particular filename; a sketch of that pattern follows below.
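Here is a rough sketch of the Reporter-to-filename pattern, again using the old mapred API; the actual LineIndexer code in the demos directory may differ, and the tokenization here is simplified:

    package demos;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: emit (word, filename) pairs so the reducer can collect,
    // for each word, the set of files it appears in.
    public class LineIndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // The Reporter hands back the input split we're processing;
            // for file input it is a FileSplit, which knows its file's path.
            FileSplit split = (FileSplit) reporter.getInputSplit();
            Text filename = new Text(split.getPath().getName());

            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.collect(new Text(word), filename);
                }
            }
        }
    }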
Once you think you've figured out how it works, set it up to run. To run this program, you'll need to change your call to hadoop to:

    hadoop jar demos.jar demos.LineIndexer

Look at the output. Did it do what you expected?

This program generates what is called an inverted index, that is, a mapping from each word to the documents it occurs in. This is probably one of the most important structures enabling web search (and many other index-based search techniques), and it was one of the early motivating examples that led to the development of the MapReduce framework at Google.

A few things to think about:

- Could we use the LineIndexerReducer as a combiner for this example? Try it out and see if you're right. If you don't think you can, you should either get the wrong answer or, more likely, an error. If you do think you can, you should get the same output as before.

- In this example I have used a HashSet to filter out repeated entries. While this works, it requires populating and clearing the hash set each time reduce is called. If the list of documents that a word occurs in gets very large, this can be problematic (think web scale). Can you think of an alternative using the MapReduce framework (hint: it involves a two-phase MapReduce)?

4 grep

grep is a great command-line utility that allows you to search for text (and even regular expressions) in files. If the text is found in a line of a file (or the regex matches), that entire line is printed out. grep is very fast; however, if you had a very large file (or a number of very large files), it could still take a long time.
In the demos directory online, I have included a basic version of grep implemented in MapReduce. Download it and take a look at it. See if you can figure out how it works. There's one new thing in this program: passing information to the map function. How is this done? The configure function is called whenever a mapper/reducer is constructed.

Run the program on the cluster. Again, to do this you'll have to call hadoop jar demos.jar and then tell it to run the grep example. Notice that this example takes three command-line arguments (try searching for "text" or "banana"). Assuming all goes well, you should get an answer back. What are the numbers that get printed before the lines?

Here are a few things to try:

- Can you use the reducer as a combiner? If so, try it out :)

- Modify the code so that it prints out both the number as well as the filename. Can you get it to group the results by filename?

- Modify the code to take a regular expression to match, not just a word. java.util.regex should be useful. To pass a regular expression as a parameter on the command line, you'll probably need to wrap it in quotes.
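For reference, the general shape of the configure-based parameter-passing pattern is sketched below, assuming the old mapred API; the property name grep.searchString and the class name are made up for illustration, and the demos code may well differ:

    package demos;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of passing a command-line argument to the mappers: the driver
    // stores it in the JobConf, and configure() reads it back on each node.
    public class GrepMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        private String searchString;

        @Override
        public void configure(JobConf job) {
            // Called once when this mapper is constructed. In the driver's
            // run method you would have set the value with something like:
            //     conf.set("grep.searchString", args[2]);
            searchString = job.get("grep.searchString");
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            // Emit each matching line with a count of 1; a reducer could
            // then sum up how many times each line matched.
            if (value.toString().contains(searchString)) {
                output.collect(new Text(value), new LongWritable(1));
            }
        }
    }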