Hadoop Installation MapReduce Examples Jake Karnes




Big Data Management: Hadoop Installation & MapReduce Examples. Jake Karnes. These slides are based on materials/slides from Cloudera.com, Amazon.com, and Prof. P. Zadrozny's slides.

Prerequisites You must have an Amazon Web Services account before you begin. You can sign up for an account at http://aws.amazon.com by clicking the Sign Up button. You will have to provide a credit card number during registration. You will likely incur some charges, but we can take steps to minimize them. Although Amazon has a Free Tier, we will require more computing power than it provides.

Prerequisites This tutorial requires accessing a remote server through SSH. UNIX-based operating systems (Mac and Linux distros) have this functionality available from their terminals; Windows does not, so you'll need to download and install an SSH client. PuTTY should work for our purposes, though I have not personally tested it. A working understanding of Java is also expected when we discuss the MapReduce code examples.

Create EC2 Instances

Your First Server First, log into your Amazon Web Console and go to EC2 (it should be in the upper left). You should see this screen.

Creating a Security Group Click Security Groups in the left menu. Click the Create Security Group button. Provide a name and description when prompted. In the bottom panel, go to the Inbound tab. Authorize all TCP communications. Authorize SSH access on port 22. Authorize ICMP (Echo Reply). Click the button underneath the rule definition to apply the changes.

Creating a Security Group

Creating a Security Group

Creating SSH Keys Click Key Pairs in the left menu. Click the Create Key Pair button. Provide a name for your key pair. Your private key <keypair-name>.pem will be downloaded automatically. AWS does not store the private keys; if you lose this file, you won't be able to SSH into instances you provision with this key pair. Copy the .pem file into your ~/.ssh directory.

Creating SSH Keys

Launch an EC2 Instance Click Instances in the left menu. Click the Launch Instance button. Choose Ubuntu 12.04 LTS 64-bit. Go to the General Purpose tab and select m1.large. In Step 3, choose to create 4 instances. In Step 4, allocate 20 GB to the root drive. Continue past Step 5. In Step 6 (Configure Security Group), choose the group that you created earlier. Ignore warnings about the security group. Choose the key pair you created earlier. Launch the instances!

Launch an EC2 Instance

Launch an EC2 Instance

Launch an EC2 Instance

Launch an EC2 Instance

Launch an EC2 Instance

Launch an EC2 Instance

Connect to your server Click Instances in the left menu. Choose one of the instances you just created and copy its public DNS. Ex: ec2-54-193-59-36.us-west-1.compute.amazonaws.com

Connect to your server Open a terminal on your local computer. Enter the following command to ensure your private key isn't publicly viewable: chmod 400 ~/.ssh/<my-key-pair>.pem Enter the following command to connect to your Amazon instance: ssh -i ~/.ssh/<my-key-pair>.pem ubuntu@<public DNS> Ex: ssh -i ~/.ssh/hadoopkey.pem ubuntu@ec2-54-193-59-36.us-west-1.compute.amazonaws.com Accept the fingerprint. You are now connected!

Install Cloudera & Hadoop

What's Cloudera Manager? Cloudera was the first, and is currently the leading, provider and supporter of Apache Hadoop for enterprise users. We will be using Cloudera Manager, an administrative tool for installing and maintaining Hadoop and many other tools in the Hadoop ecosystem. CDH is Cloudera's open source distribution of Apache Hadoop.

Installing Cloudera Manager After you've connected to your instance, enter the following command to download the Cloudera installer: wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin Execute the installer with these commands: sudo su chmod +x cloudera-manager-installer.bin ./cloudera-manager-installer.bin Accept the licenses and wait for the installer to finish.

Installing Cloudera Manager

Installing Cloudera Manager

Installing Cloudera Manager

Troubleshooting If the installation pauses at any one step for more than 5 minutes, something has gone wrong. First try to cancel the installation using CTRL+C, exit the installer, and re-execute the .bin file. If you cannot exit using CTRL+C, close the terminal window, reconnect to the server, and relaunch the installer.

Using Cloudera Manager Afterwards, point your browser to: http://<public DNS>:7180 Ex: http://ec2-54-193-92-102.us-west-1.compute.amazonaws.com:7180 Log in with user admin and password admin.

Using Cloudera Manager Select the free version and continue.

Using Cloudera Manager Launch the Classic Wizard.

Using Cloudera Manager Continue.

Using Cloudera Manager Enter the Public DNS for each of your instances. Click Search. Ensure that all instances are selected and continue.

Using Cloudera Manager Unselect Impala and Solr. We won't use them.

Using Cloudera Manager Enter ubuntu as the user. Upload the .pem file that was downloaded earlier.

Using Cloudera Manager Installing...

Using Cloudera Manager Done with this part!

Using Cloudera Manager More installing...

Using Cloudera Manager Looking good so far.

Using Cloudera Manager The system passes inspection.

Using Cloudera Manager We'll only need Core Hadoop for now.

Using Cloudera Manager Use embedded databases. Test the connection and then continue.

Using Cloudera Manager Leave the defaults here. Continue.

Using Cloudera Manager Starting services.

Using Cloudera Manager We now have Hadoop running on our cluster!

Using Hadoop and MapReduce

Getting Test Data Download the following tar.gz file to your local machine: https://drive.google.com/file/d/0b9fmxvd4btedqwzstegyaue5ctg/edit?usp=sharing Upload the file to your EC2 instance with the following command: scp -i ~/.ssh/<key file name>.pem <LOCAL PATH TO FILE>/shakespeare.tar.gz ubuntu@<public DNS>:~ Ex: scp -i ~/.ssh/hadoopkey.pem /home/jake/desktop/cs157b/shakes/output/shakespeare.tar.gz ubuntu@ec2-54-193-79-16.us-west-1.compute.amazonaws.com:~ Log into the same EC2 instance. Unzip the file with these commands: mkdir shakes gzip -dc shakespeare.tar.gz | tar -xf - -C ~/shakes

All of Shakespeare's Work You now have all of Shakespeare's written works. Typically Hadoop works better with larger files, but these will still work for our purposes.

Deploying Test Data into HDFS Run the following command to make a home directory for our user on HDFS: sudo -u hdfs hadoop fs -mkdir /user/ubuntu The next command changes the ownership of the newly created directory to our user (ubuntu): sudo -u hdfs hadoop fs -chown -R ubuntu /user/ubuntu Create an input directory: hadoop fs -mkdir /user/ubuntu/input Load our test text files into HDFS: hadoop fs -put ~/shakes/* /user/ubuntu/input Our files are now replicated and distributed across our cluster!

Word Count Let's count how many times each word is used. The data has been normalized to remove punctuation and case sensitivity. Download the WordCount.java file to your EC2 instance (renaming it so the file name matches the class name): cd ~ wget cs.cmu.edu/~abeutel/wordcount.java mv wordcount.java WordCount.java Let's compile the code into a jar with these commands: mkdir wordcount_classes javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d wordcount_classes WordCount.java jar -cvf ~/wordcount.jar -C wordcount_classes/ . Let's run it! hadoop jar ~/wordcount.jar WordCount /user/ubuntu/input /user/ubuntu/output

What Did We Just Do? We've just run our first MapReduce job! We have counted how many times each word appears. To check the output, run the following command: hadoop fs -cat /user/ubuntu/output/part-00000 On the left side we have the individual words. On the right is the number of times each appeared in all of Shakespeare's works!

Let's Look at Code (Finally) You can download the WordCount.java file by going here: cs.cmu.edu/~abeutel/wordcount.java At a high level, we'll see a class called WordCount. It contains two inner static classes, Map and Reduce, that define a single method each, plus one main method.

The Map Class LongWritable key = byte offset of the line. Text value = a single line of text. OutputCollector = a collection of KV pairs that will be sent to a Reducer once all Mappers are finished. OutputCollector Text = a single word. OutputCollector IntWritable = the integer one = 1.
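As a rough sketch, here is what a Map class of this shape looks like with the old org.apache.hadoop.mapred API. This is modeled on the classic Apache WordCount example, so the downloaded file may differ in small details; the imports shown (and the enclosing WordCount class) are assumed.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Inner class of WordCount
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key = byte offset of the line, value = one line of text
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);  // emit (word, 1) for every word on the line
        }
    }
}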

Map Method I/O Input: (LongWritable byte offset, Text line of text). Output: (Text word, IntWritable 1) for each word on the line.

The Reduce Class Text key = a single word. Iterator<IntWritable> value = an iterator over all of the 1 values associated with the given key (word). OutputCollector = a collection of KV pairs that will be written to the job's output. OutputCollector Text = the same word. OutputCollector IntWritable = the number of occurrences of that word = the sum of the ones.
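A matching sketch of the Reduce class, again following the classic WordCount example (same assumed imports, plus java.util.Iterator):

// Inner class of WordCount
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // key = a word, values = all of the 1s emitted for that word
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();  // add up the ones
        }
        output.collect(key, new IntWritable(sum));  // emit (word, total count)
    }
}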

Reduce Method I/O Input: (Text word, Iterator<IntWritable> of 1s). Output: (Text word, IntWritable total count).

Main Method
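The main method just wires the pieces together in a JobConf. A minimal sketch in the same old-API style (assuming org.apache.hadoop.fs.Path is also imported; whether the downloaded file sets a combiner is an assumption):

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // final keys are words
    conf.setOutputValueClass(IntWritable.class);  // final values are counts

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // Reduce can double as a combiner here
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. /user/ubuntu/input
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. /user/ubuntu/output

    JobClient.runJob(conf);
}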

Inverted Index Let's count how many times each word is used in total and how many times it's used per file! Download the InvertedIndex.java file to your local machine: https://drive.google.com/file/d/0b9fmxvd4btedqwszytjmmtzftxc/edit?usp=sharing Move the file to your EC2 instance: scp -i ~/.ssh/hadoopkey.pem <LOCAL PATH TO FILE>/InvertedIndex.java ubuntu@<public DNS>:~ Log into your EC2 instance. Let's compile the code into a jar with these commands: mkdir invertedindex_classes javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d invertedindex_classes InvertedIndex.java jar -cvf ~/invertedindex.jar -C invertedindex_classes/ . Let's run it! hadoop fs -rm -r /user/ubuntu/output hadoop jar ~/invertedindex.jar InvertedIndex /user/ubuntu/input /user/ubuntu/output

Results

What did we change? Only minor changes were needed to turn our WordCount program into the InvertedIndex program. You should already have the InvertedIndex.java file downloaded to your computer if you want to open it and inspect it for yourself.

The New Map Class LongWritable key = byte offset of the line. Text value = a single line of text. OutputCollector = a collection of KV pairs that will be sent to a Reducer once all Mappers are finished. OutputCollector Text = a single word. OutputCollector Text = the name of the file containing this line.
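A sketch of what such a Map class looks like. Pulling the filename out of the input split via the Reporter is one common way to do this in the old API, though the actual InvertedIndex.java may obtain it differently:

// Inner class of InvertedIndex; same assumed imports as before
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text word = new Text();
    private Text filename = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Recover the name of the file this line came from
        filename.set(((FileSplit) reporter.getInputSplit()).getPath().getName());
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, filename);  // emit (word, filename) instead of (word, 1)
        }
    }
}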

New Map Method I/O Input: (LongWritable byte offset, Text line of text). Output: (Text word, Text filename) for each word on the line.

The New Reduce Class Text key = a single word. Iterator<Text> value = an iterator over the filenames containing the given key (word), one per occurrence. OutputCollector = a collection of KV pairs that will be written to the job's output. OutputCollector Text = the same word. OutputCollector Text = the total number of occurrences of that word AND how many times each file contains it.
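A sketch of the new Reduce class: it tallies one count per file as well as a grand total. The exact output formatting below is an assumption:

// Inner class of InvertedIndex; also assumes java.util.HashMap is imported
public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        HashMap<String, Integer> perFile = new HashMap<String, Integer>();
        int total = 0;
        while (values.hasNext()) {
            String file = values.next().toString();  // one filename per occurrence
            Integer count = perFile.get(file);
            perFile.put(file, count == null ? 1 : count + 1);
            total++;
        }
        // emit (word, "total {file1=n1, file2=n2, ...}")
        output.collect(key, new Text(total + " " + perFile.toString()));
    }
}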

New Reduce Method I/O Input: (Text word, Iterator<Text> of filenames). Output: (Text word, Text total count plus per-file counts).

Main Method
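The driver barely changes from WordCount's. A sketch under the same assumptions; note the output value class is now Text, and the combiner is gone because this Reduce's input and output value types differ:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(InvertedIndex.class);
    conf.setJobName("invertedindex");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);  // was IntWritable in WordCount

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);    // no combiner this time

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}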

Retrieving Files Now that we're done with MapReduce, let's get our files from HDFS to our local machines. Begin by logging into your EC2 instance. Get the files out of HDFS: hadoop fs -get /user/ubuntu/output/part* ~ Now you have 2 new files in the home directory of the EC2 instance. Verify this by running: ls To download these to your local machine, log out of the EC2 instance by typing the SSH escape sequence: ~. Your terminal will be returned to controlling your local machine. Run this command to download the output part files: scp -i ~/.ssh/<keyfile>.pem ubuntu@<public DNS>:~/part* ~/Desktop/ You can now open the new files on your desktop in a text editor.

Terminate Your Instances After you're done using Hadoop, you want to terminate your EC2 instances. If you don't, you will continue to be charged per hour (even if you aren't actively using them)! When you terminate your instances, though, you will lose ALL data/customizations. Therefore, always download any necessary files to your local machine before terminating your instances. From the AWS console, click Instances in the left menu. Mark the checkbox for each of your instances on the left side. Click Actions, then choose Terminate. You will then see your instances shutting down. They will disappear after a few hours.