TP1: Getting Started with Hadoop




Alexandru Costan

MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify the development of web search applications on large numbers of machines. Hadoop is an open-source Java implementation of MapReduce. Its two fundamental subprojects are the Hadoop MapReduce framework and HDFS.

HDFS is a distributed file system that provides high-throughput access to application data. It is inspired by the Google File System (GFS) and has a master/slave architecture. The master server, called the NameNode, splits files into blocks and distributes them across the cluster, with replication for fault tolerance. It holds all metadata about the stored files. The HDFS slaves, called DataNodes, actually store the data blocks: they serve read/write requests from clients and carry out replication tasks as directed by the NameNode.

Hadoop MapReduce is a software framework for the distributed processing of large data sets on compute clusters. It runs on top of HDFS, so data processing is collocated with data storage. It also has a master/slave architecture. The master, called the JobTracker (JT), is responsible for: (a) querying the NameNode for block locations, (b) scheduling tasks on the slaves, called TaskTrackers (TT), according to the information retrieved from the NameNode, and (c) monitoring the success and failure of tasks.

The goal of this TP is to study the deployment and operation of the Hadoop platform. We will see how to deploy the platform and how to send data to and retrieve data from HDFS. Finally, we will run simple examples using the MapReduce paradigm.

Exercise 1: Installing the Hadoop platform on your local machine

The goal of this exercise is to learn how to set up and configure a single-node Hadoop installation, so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

For Windows users

You need to install the Cygwin environment. Cygwin is a set of Unix packages ported to Microsoft Windows. It is needed to run the scripts supplied with Hadoop, because they are all written for the Unix platform. To install the Cygwin environment, follow these steps:

- Download the Cygwin installer from http://www.cygwin.com.
- Run the downloaded file. Make sure you select "openssh" from the Net package. This package is required for the correct functioning of the Hadoop cluster and the Eclipse plug-in (Figure 1).
- Add c:\cygwin\bin;c:\cygwin\usr\bin; to your Path variable.

Setup the SSH daemon

- Open the Cygwin command prompt and execute: ssh-host-config
- When asked if privilege separation should be used, answer no.
- When asked if sshd should be installed as a service, answer yes.
- When asked about the value of the CYGWIN environment variable, enter ntsec.
- To check that everything is fine, verify the line "set CYGWIN=binmode ntsec" in Cygwin.bat, then start the CYGWIN sshd service.

Question 1.1

To avoid typing your password each time you start the Hadoop daemons, create (if not already in place) a pair of SSH keys using the ssh-keygen command and append the public key to the authorized keys:

ssh-keygen -q -N "" -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
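The key-setup step above can also be written as a small idempotent script. This is a sketch, not part of Hadoop: it assumes OpenSSH is installed, and the helper names (add_key_once, setup_local_ssh) are ours.

```shell
#!/bin/sh
# Sketch of passwordless-SSH setup (assumes OpenSSH is installed;
# add_key_once and setup_local_ssh are illustrative helper names).

# Append a public key to an authorized_keys file, but only if it is
# not already present, and keep the file's permissions strict.
add_key_once() {
  pub="$1"; auth="$2"
  touch "$auth"
  grep -qxF "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
  chmod 600 "$auth"
}

# Generate an RSA key with an empty passphrase only if none exists yet,
# then register its public half.
setup_local_ssh() {
  key="$HOME/.ssh/id_rsa"
  [ -f "$key" ] || ssh-keygen -q -N "" -f "$key"
  add_key_once "$key.pub" "$HOME/.ssh/authorized_keys"
}
```

Running setup_local_ssh twice leaves a single copy of the key in authorized_keys; verify the result with ssh localhost.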

For Windows users:

ssh-keygen
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Check the success of this step by running: ssh localhost

Question 1.2

Download the Hadoop platform from http://hadoop.apache.org/releases.html. Choose the latest stable version (currently 1.2.1). Extract the contents of hadoop-1.2.1.tar.gz in your home directory. To run, Hadoop needs a JAVA_HOME environment variable specifying the directory of the JVM. Add the following lines to your .bashrc:

export JAVA_HOME=<java path>
export HADOOP_HOME=<your Hadoop directory path>
export PATH=$PATH:$HADOOP_HOME/bin

To make these changes effective in your open terminal, type: source ~/.bashrc

For Windows users: copy the Hadoop archive into your home directory folder. To do so, open Cygwin and type "explorer .". Then extract the contents of hadoop-1.2.1.tar.gz in your home directory.

Question 1.3

Before starting the Hadoop platform, edit its configuration files, located in the conf/ folder. As you are using your own machine, you must configure Hadoop as follows:

In the masters and slaves files (used to automate SSH connections), specify the name of your machine: localhost

In conf/hadoop-env.sh, change

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

export JAVA_HOME=<java path>

For Windows users, change it instead to:

export JAVA_HOME=/cygdrive/c/java/...

In conf/core-site.xml add:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

This indicates the name of the machine and the network port on which the NameNode process will run. You can check that port 9000 on your machine is not already in use:

netstat -a | grep tcp | grep LISTEN | grep 9000

In conf/hdfs-site.xml add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In conf/mapred-site.xml add:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

This indicates the name of the machine and the network port on which the JobTracker process will run. You can check that port 9001 on your machine is not already in use:

netstat -a | grep tcp | grep LISTEN | grep 9001

Question 1.4

We will now start the platform and all its daemons.

Format a new distributed filesystem:

$ bin/hadoop namenode -format

Start the Hadoop daemons:

$ bin/start-all.sh

The Hadoop daemon log output is written to $HADOOP_HOME/logs; check it. A handy tool for checking whether the expected Hadoop processes are running is jps.

Question 1.5

The web interfaces that show the status of the NameNode and the JobTracker are available by default at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Question 1.6

To stop all the daemons running on your machine:

$ bin/stop-all.sh

Exercise 2: Using the Hadoop Distributed File System (HDFS)

The goal of this exercise is to learn how to perform simple operations using the Hadoop Distributed File System (HDFS). The FileSystem (FS) shell is invoked by hadoop fs <args>. All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child.

FS shell operations:

hadoop fs -cat URI [URI ...]   (copies source paths to stdout)
hadoop fs -copyFromLocal <localsrc> URI
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
hadoop fs -ls <args>
hadoop fs -mkdir <paths>
hadoop fs -put <localsrc> ... <dst>
hadoop fs -rmr URI [URI ...]

Question 2.1

We will now generate three files, LoremIpsum-100, LoremIpsum-75 and LoremIpsum-25, of sizes 100 MB, 75 MB and 25 MB respectively. For this, use three times the LoremIpsum and generator.sh files provided on the course homepage http://www.irisa.fr/kerdata/people/alexandru.costan/bigdata/:

./generator.sh LoremIpsum 100
./generator.sh LoremIpsum 75
./generator.sh LoremIpsum 25

Now copy the three files to HDFS (put), then show their content (cat), and finally remove them (rmr).
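The put/cat/rmr round trip of Question 2.1 can be scripted as follows. This is a sketch: the HADOOP variable (defaulting to the hadoop client on your PATH) and the hdfs_roundtrip name are ours, not Hadoop commands.

```shell
#!/bin/sh
# Sketch of the Question 2.1 workflow. HADOOP is overridable so the
# control flow can be checked without a live cluster.
HADOOP="${HADOOP:-hadoop}"

hdfs_roundtrip() {
  for f in LoremIpsum-100 LoremIpsum-75 LoremIpsum-25; do
    $HADOOP fs -put "$f" "$f"            # copy the local file into HDFS
    $HADOOP fs -cat "$f" > /dev/null     # read it back (discard output)
    $HADOOP fs -rmr "$f"                 # recursive remove (Hadoop 1.x)
  done
}
```

With a running single-node cluster, drop the > /dev/null redirection if you actually want to see the file contents scroll by.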

Exercise 3: Running your first MapReduce program

The goal of this exercise is to execute two MapReduce examples, typically used for benchmarking, which come with the default Hadoop distribution.

Question 3.1

Run the wordcount example:

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output

where input is the directory containing the input files (e.g. LoremIpsum-75) and output is the directory that will be created to store the results. Use the 75 MB data set and check the output. Then inspect the log files generated by this job.

Question 3.2

Run the Pi estimator:

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi maps samples_per_map

where maps is the number of map tasks used and samples_per_map is the number of random points generated by each map task. Run the Pi estimator with 20 maps and 10000 samples per map. Check the log files.
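To sanity-check the wordcount job's output on a small input, the same computation can be expressed as a local shell pipeline. This is a sketch: wordcount_local is our own name, and splitting purely on whitespace is a simplification of what the example's Java tokenizer does.

```shell
#!/bin/sh
# Sketch: a local equivalent of the wordcount example, for spot-checking
# small inputs. Turns every run of whitespace into a newline, then counts
# identical lines, printing "word<TAB>count" like the job's output files.
wordcount_local() {
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c \
    | awk '{print $2 "\t" $1}'
}
```

On a small test file you can compare its output against hadoop fs -cat output/part-* for the same input.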