CprE 419 Lab 1: Using the Cluster, and Introduction to HDFS

Department of Electrical and Computer Engineering
Iowa State University, Spring 2014

Purpose

We have set up an environment on the CyStorm cluster for large-scale data analysis. The first goal of this lab is to get familiar with this environment, called Hadoop. The second goal is to get familiar with a distributed file system for storing large data sets, called HDFS, which is short for Hadoop Distributed File System. Specifically, by the end of this lab you will know how to:

- Log in to the CyStorm cluster with your user ID
- Use HDFS and the HDFS programming interface
- Create and read files in HDFS
- Write, compile, and execute a Hadoop program on the cluster

Submission

Create a zip archive with the following and hand it in through Blackboard:

- A write-up answering the questions for each experiment in the lab. For output, cut and paste results from your terminal and summarize when necessary.
- Commented code for your program. Include all source files needed for compilation.

Resources

You will be using CyStorm, a cluster of computers hosted at Iowa State. You will need the following before you start the lab. If you don't have these, please contact the instructors.

- Address of the CyStorm cluster: cystorm.its.iastate.edu
- PuTTY or some other ssh client. PuTTY is available at: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Follow the instructions below to get your CyStorm password.

Logging into CyStorm

To access CyStorm you need to retrieve your password, which is tied to your Iowa State account. Follow these steps:

- ssh into hpc02.ait.iastate.edu using PuTTY, with your NetID login credentials (e.g., aalleven and your account password).
- Navigate to your password file by typing: cd /tmp/new/[your netid]
- Look at the file by typing: cat [your netid]
- The file contains a temporary password for you to use to log in to CyStorm.
- Once you have logged in, change your password by typing: passwd

You should now be logged into CyStorm and have changed your password.
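As a rough sketch, the whole login sequence looks like the following, assuming a hypothetical NetID of abc123 (substitute your own):

$ ssh abc123@hpc02.ait.iastate.edu      # log in with your ISU account password
$ cd /tmp/new/abc123                    # directory holding your password file
$ cat abc123                            # prints your temporary CyStorm password
$ exit
$ ssh abc123@cystorm.its.iastate.edu    # log in with the temporary password
$ passwd                                # change the temporary password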

For Windows: Log in to CyStorm using PuTTY. Upon login you should see a screen similar to the one below:

[Figure: Login screen for CyStorm]

The next step is to ssh into the namenode. Type ssh namenode to access the Hadoop namenode for CyStorm.

HDFS Usage

HDFS is a distributed file system used by Hadoop to store large files. It automatically splits a file into blocks of no more than 64 MB and stores the blocks on separate machines, which lets the system store huge files, much larger than any single machine could hold. HDFS also creates replicas of each block (3 by default), which makes the system more fault-tolerant. HDFS is an integral part of the Hadoop ecosystem and is an open-source implementation of the Google File System.

To see the different HDFS utilities, try the command:

$ hadoop fs -help

You will notice that the HDFS utilities are very similar to the default UNIX filesystem utilities, and hence should be easy to learn.
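For example, the following commands list a directory, create one, copy a local file into HDFS, and print a file's contents; the paths and the NetID abc123 are purely illustrative:

$ hadoop fs -ls /scr
$ hadoop fs -mkdir /scr/abc123/test
$ hadoop fs -put localfile.txt /scr/abc123/test/localfile.txt
$ hadoop fs -cat /scr/abc123/test/localfile.txt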

Experiment 1 (10 points):

- Create a directory /scr/<your login id>/lab1 in HDFS. (Note: /scr already exists.)
- Create a new file called afile.txt on your local machine with some text (make me laugh). Then use WinSCP (or scp on Linux) to move the file to your home directory on CyStorm.
- Use the appropriate HDFS command to move this file into the directory you created. Show this to the instructor.

You can browse HDFS using this link: http://hpcg10.its.iastate.edu:8080/hdfs/dfshealth.jsp

Writing, Compiling, and Executing Programs using Hadoop

Next we will see how to compile a Hadoop program on the cluster. We will use the Java programming language, as that is the default language used with Hadoop.

import java.io.*;
import java.lang.*;
import java.util.*;
import java.net.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;

public class HDFSWrite {

    public static void main(String[] args) throws Exception {

        // The system configuration
        Configuration conf = new Configuration();

        // Get an instance of the filesystem
        FileSystem fs = FileSystem.get(conf);

        String path_name = "/scr/<your ID>/Lab1/newfile";
        Path path = new Path(path_name);

        // The output data stream to write into
        FSDataOutputStream file = fs.create(path);

        // Write some data
        file.writeChars("The first Hadoop Program!");

        // Close the file and the filesystem instance
        file.close();
        fs.close();
    }
}

The above Java program uses the HDFS Application Programming Interface (API) provided by Hadoop to write into an HDFS file. Paste the code into your favorite Java editor and change the line

String path_name = "/scr/<your ID>/Lab1/newfile";

to reflect your actual user ID. We will transfer this program to the cluster, compile it, and then run it to check the output. You can also compile the program on your local machine if you have the Hadoop libraries downloaded.

Transferring Files from/to the Cluster

For Windows: WinSCP is an open-source SFTP and FTP client for Microsoft Windows. It can be used to transfer files securely between a remote and a local machine. You can get the WinSCP client from http://winscp.net/eng/index.php

To connect to the Hadoop Master node:

- Start WinSCP.
- Type the IP address or host name of the Hadoop Master.
- Click Login.
- During the first login, you might get a warning message if the server's key is not cached. Click Yes to continue.

For UNIX-like systems: Issue the commands below from your local UNIX-like system to transfer files between the cluster and your local machine.

To transfer a file from your local machine to the Hadoop Master server:

$ scp <filename> <user ID>@<server_IP>:/home/<user ID>/<filename>

To transfer a file from the server to your local machine:

$ scp <user ID>@<server_IP>:/home/<user ID>/<filename> <filename>

Compiling and Running Hadoop Programs

Compile the Java program on the namenode as follows:

$ mkdir class
$ javac -classpath /usr/share/hadoop/hadoop-core-1.2.1.jar -d class/ HDFSWrite.java
$ jar -cvf HDFSWrite.jar -C class/ .
$ rm -rf class

Execute the Hadoop program on the namenode as follows:

$ hadoop jar HDFSWrite.jar HDFSWrite

Check the HDFS path /scr/<your ID>/Lab1/newfile to verify the output of the program. You should see the file created with the contents given in the program. You can also compile the program and generate the jar file on your desktop using Eclipse. For further instructions, see the manual on Compiling Hadoop with Eclipse linked from the class website.

HDFS API

As seen in the last section, HDFS can also be accessed using a Java API, which allows us to read from and write to the file system. For documentation, see: http://hadoop.apache.org/docs/r1.2.1/api/

Look at the package org.apache.hadoop.fs. The FSDataInputStream and FSDataOutputStream classes can be used to read and write files in HDFS, respectively; check out their class documentation as well. Read the source code of the program that you just compiled, which will help you better understand how the HDFS API works. Then look at the different classes that were used and learn more about the various methods they provide. All the methods can be found in the Hadoop API documentation.
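To complement the write program above, here is a minimal sketch of the read side, with a purely illustrative path and offsets: it opens an HDFS file, seeks to a byte offset, and reads a block of bytes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSRead {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path; substitute a file that exists in your HDFS
        Path path = new Path("/scr/abc123/Lab1/newfile");

        // Open the file and seek to byte offset 10
        FSDataInputStream in = fs.open(path);
        in.seek(10L);

        // Read the next 20 bytes into a buffer
        byte[] buffer = new byte[20];
        in.readFully(buffer);

        // Close the file and the filesystem instance
        in.close();
        fs.close();
    }
}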

Once you have browsed through the HDFS API, go on to the following experiment.

Experiment 2 (40 points):

Write a program using the Java HDFS API that reads the contents of the HDFS file /class/s14419x/lab1/bigdataset and computes the 8-bit XOR checksum of all bytes whose offsets range from 5000000000 to 5000000999, both endpoints inclusive. Print out the 8-bit checksum. Attach the Java code in your submission, as well as the XOR output. For instance, the XOR checksum of the bit string 00000001 11111111 00000001 11111111 is 00000000.
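The checksum arithmetic itself is just a byte-wise XOR. As a minimal standalone sketch (deliberately not the full HDFS solution), the following computes the 8-bit XOR checksum of the four example bytes above:

public class XorChecksumDemo {

    public static void main(String[] args) {
        // The four bytes from the example bit string: 0x01, 0xFF, 0x01, 0xFF
        byte[] data = { 0x01, (byte) 0xFF, 0x01, (byte) 0xFF };

        byte checksum = 0;
        for (byte b : data) {
            checksum ^= b;   // accumulate the XOR of every byte
        }

        // Print the checksum as an 8-bit binary string: "00000000" for this input
        String bits = String.format("%8s",
                Integer.toBinaryString(checksum & 0xFF)).replace(' ', '0');
        System.out.println(bits);
    }
}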