Data Analytics
CloudSuite1.0 Benchmark Suite
Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.



The data analytics benchmark relies on the Hadoop MapReduce framework to perform machine learning analysis on large-scale datasets. Apache provides a machine learning library, Mahout, that is designed to run on top of Hadoop and perform large-scale data analytics.

Prerequisite Software Packages

1. Hadoop: We recommend version 0.20.2, which was used and tested in our environment.
2. Mahout: Scalable machine learning and data mining library
3. Maven: A project management tool (required for installing Mahout)
4. Java JDK

Download the Data Analytics benchmark

Setting up Hadoop

Hadoop contains two main components: the HDFS file system and the MapReduce infrastructure. HDFS is the distributed file system that stores the data on several nodes of the cluster, called the data nodes. MapReduce is the component in charge of executing computational tasks on top of the data stored in the distributed file system.

Installing Hadoop on a single node:

1. Create a Hadoop user (note: this requires root privileges, and the commands can differ between Linux distributions, e.g., useradd vs. adduser):
   o sudo groupadd hadoop
   o sudo useradd -g hadoop hadoop
   o sudo passwd hadoop (to set up the password)
2. Prepare SSH:
   o su - hadoop
   o ssh-keygen -t rsa -P "" (press Enter at any prompts)
   o cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   o ssh localhost (answer yes to the prompt)
3. Unpack the downloaded Hadoop. Then: chown -R hadoop:hadoop hadoop-0.20.2
4. Update the .bashrc file with the following:
   o export HADOOP_HOME=/path/to/hadoop/ (e.g., /home/username/hadoop-0.20.2/)
   o export JAVA_HOME=/path/to/jdk (e.g., /usr/lib/jvm/java-6-sun)
5. In the HADOOP_HOME directory (the main Hadoop folder), make the following modifications to the configuration files (each property goes inside the <configuration> element):
   o In conf/core-site.xml add:

     <property>
       <name>hadoop.tmp.dir</name>
       <value>/path/to/dir</value>
       <description>A base for other temporary directories. Set it to a directory
       that you have write permissions to.</description>
     </property>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:54310</value>
       <description>The name of the default file system. A URI whose scheme and
       authority determine the FileSystem implementation. The URI's scheme determines
       the config property (fs.SCHEME.impl) naming the FileSystem implementation
       class. The URI's authority is used to determine the host, port, etc. for a
       filesystem.</description>
     </property>

   o In conf/mapred-site.xml add:

     <property>
       <name>mapred.job.tracker</name>
       <value>localhost:54311</value>
       <description>The host and port that the MapReduce job tracker runs at.
       If "local", then jobs are run in-process as a single map and reduce
       task.</description>
     </property>

     Note that there are other parameters that should be tuned to fit the needs of your environment, such as mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.map.tasks, and mapred.reduce.tasks. For now we rely on the defaults.

   o In conf/hdfs-site.xml add:

     <property>
       <name>dfs.replication</name>
       <value>1</value>
       <description>Default block replication. The actual number of replications
       can be specified when the file is created. The default is used if replication
       is not specified at create time.</description>
     </property>
     <property>
       <name>dfs.name.dir</name>
       <value>/path/to/namespace/dir (dir of your choice)</value>
     </property>
     <property>
       <name>dfs.data.dir</name>
       <value>/path/to/filesystem/dir (dir of your choice)</value>
     </property>

6. Format the HDFS filesystem: $HADOOP_HOME/bin/hadoop namenode -format
7. To start Hadoop: $HADOOP_HOME/bin/start-all.sh (you can start HDFS and then MapReduce separately with start-dfs.sh and start-mapred.sh, respectively)
8. To stop Hadoop: $HADOOP_HOME/bin/stop-all.sh (or stop-dfs.sh and stop-mapred.sh)
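As an optional sanity check after step 7, you can confirm that the Hadoop daemons are up and that HDFS answers simple requests. This assumes the JDK's jps tool is on your PATH:

   jps
   (the output should include NameNode, SecondaryNameNode, DataNode, JobTracker, and
   TaskTracker, with process IDs that vary per run)
   $HADOOP_HOME/bin/hadoop dfs -ls /
   (this should list the, initially empty, HDFS root without connection errors)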

This link provides a step-by-step guide to install Hadoop on a single node. To install Hadoop on multiple nodes, you need to define the slave nodes in conf/slaves and keep the node on which you already installed Hadoop as the master. Then, start all the nodes with start-all.sh from the master node. This link provides more details on installing Hadoop on multiple nodes.

Installing Mahout

1. Download Mahout (already included in the analytics benchmark package).
2. To install Mahout, run: mvn install (from the MAHOUT_HOME folder)
   This will build the core package and the Mahout examples. You need to install Maven to be able to perform this step.
3. For further information regarding the Mahout installation, please refer to the Mahout website.

Running the classification benchmark

This benchmark runs the Wikipedia Bayes example as provided by the Mahout package:

1. This benchmark uses a Wikipedia dataset of ~30 GB. You can download it at the following address.
2. Unzip the bz2 file to get enwiki-latest-pages-articles.xml.
3. Create the directory $MAHOUT_HOME/examples/temp and copy enwiki-latest-pages-articles.xml into this directory.
4. The next step chunks the data into pieces. For this step, run the following:
   $MAHOUT_HOME/bin/mahout wikipediaxmlsplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64
5. After creating the chunks, verify that the chunks were written successfully into HDFS (look for chunk-*.xml files): hadoop dfs -ls wikipedia/chunks
6. Before running the benchmark, you need to create the country-based split of the Wikipedia dataset:
   $MAHOUT_HOME/bin/mahout wikipediadatasetcreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
   Check that the wikipediainput folder was created; you should see a part-r-00000 file inside the wikipediainput directory: hadoop fs -ls wikipediainput
7. Build the classifier model. After completion, the model will be available in HDFS, under the wikipediamodel folder:
   $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
   (The commands in steps 2-7 are also collected into a single script sketch below.)
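The following sketch chains the data-preparation commands from steps 2-7 into one shell script. It assumes the downloaded dump is named enwiki-latest-pages-articles.xml.bz2 and sits in the current directory, and that HADOOP_HOME and MAHOUT_HOME are already exported; adjust the paths to your environment.

   #!/bin/sh
   # Sketch: prepare the Wikipedia dataset and build the classifier model (steps 2-7).
   set -e
   bunzip2 -k enwiki-latest-pages-articles.xml.bz2
   mkdir -p $MAHOUT_HOME/examples/temp
   cp enwiki-latest-pages-articles.xml $MAHOUT_HOME/examples/temp/
   # Chunk the XML dump and store the chunks in HDFS under wikipedia/chunks
   $MAHOUT_HOME/bin/mahout wikipediaxmlsplitter \
       -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
       -o wikipedia/chunks -c 64
   $HADOOP_HOME/bin/hadoop dfs -ls wikipedia/chunks    # expect chunk-*.xml files
   # Create the country-based split of the dataset
   $MAHOUT_HOME/bin/mahout wikipediadatasetcreator -i wikipedia/chunks \
       -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
   $HADOOP_HOME/bin/hadoop fs -ls wikipediainput       # expect a part-r-00000 file
   # Train the classifier; the model is written to wikipediamodel in HDFS
   $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel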

8. Test the classifier model:
   $MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput
   By default, the testing phase runs as a standalone application. If you want to use MapReduce to parallelize the testing phase, use the --method option. To list all the available options, use: $MAHOUT_HOME/bin/mahout testclassifier --help
9. You can measure the time to build the classifier, as well as the time to test it, using the time command.
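   For example, both phases can be wrapped in the shell's time command (a minimal sketch, reusing the commands from steps 7 and 8):

   time $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
   time $MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput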