Tutorial for Assignment 2.0




Tutorial for Assignment 2.0
Web Science and Web Technology, Summer 2012
Slides based on last year's tutorials by Chris Körner and Philipp Singer

Agenda
- Review and Motivation
- Assignment Information
- Introduction to Hadoop and MapReduce
- Example MapReduce Application
- Setup pitfalls and hints

Review
What you should have learned so far:
- Network analysis and operations, such as
  - degree distribution
  - clustering coefficient
  - Google's PageRank
  - network evolution
- Computed for very small networks

Motivation
So far these analyses do NOT scale. What about networks with a huge number of nodes and edges, or GB/TB of data? Computation would take quite a long time. How can we process large amounts of data? Hadoop!

Provided Data
You are given a subset of a large Delicious data set gathered by Arkaitz Zubiaga:
- 2.7 GB compressed [8.3 GB uncompressed]
- 102,090,935 entries on the top 10,000 tags
- minimum of 3 tags per entry
- tab separated; an entry represents a Delicious bookmark: url tags[]
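Since each entry is just a URL followed by its tags, one tab-separated field per column, a line can be split as sketched below. The helper name parse_entry and the example URL are made up for illustration; the exact column layout is an assumption based on the "url tags[]" description above.

```python
def parse_entry(line):
    """Split one tab-separated bookmark line into (url, [tags])."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]

# Hypothetical example line in the described format:
url, tags = parse_entry("http://example.org\tpython\thadoop\tbigdata\n")
# url == "http://example.org", tags == ["python", "hadoop", "bigdata"]
```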

Co-occurrence
According to Wikipedia, co-occurrence is "the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order" [3].
Example
Given:
  [dog, cat, animal]
  [cat, lion, animal]
  [cat, lion, zoo]
Result:
  animal: cat 2, dog 1, lion 1
  cat: animal 2, dog 1, lion 2, zoo 1
  dog: animal 1, cat 1
  lion: animal 1, cat 2, zoo 1
  zoo: cat 1, lion 1
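The counting above can be sketched in plain Python (outside Hadoop) to sanity-check results on paper-sized inputs; cooccurrence is a hypothetical helper for illustration, not part of the assignment code.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(tag_lists):
    """For every tag, count how often each other tag appears alongside it."""
    counts = defaultdict(lambda: defaultdict(int))
    for tags in tag_lists:
        # permutations(..., 2) yields every ordered pair of distinct tags
        for a, b in permutations(set(tags), 2):
            counts[a][b] += 1
    return counts

entries = [["dog", "cat", "animal"],
           ["cat", "lion", "animal"],
           ["cat", "lion", "zoo"]]
result = cooccurrence(entries)
# result["cat"] == {"animal": 2, "dog": 1, "lion": 2, "zoo": 1}
```

This reproduces the adjacency list from the slide's example and is a useful reference when testing your MapReduce version on a small subset.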

The Assignment
- Team up in groups of 5 students
- Nominate a group captain
- Create a Subversion repository (ADD ALL TUTORS AS READERS)
- Implement the co-occurrence calculation for the given dataset and compute it
  - You do not have to solve it in one step; just explain your approach in the readme.txt file
- Hand in your source code and a co-occurrence adjacency list of 10 tags, and visualize them!
- See the assignment document for further details

Hand In 1/2
Create a Subversion repository on the TUG server:
- Name: WSWT12_<GROUPNAME>
- Group members as members
- Teaching assistants as readers

Hand In 2/2
Structure of the repository:
- report.pdf (short, approx. 1 page)
- bash scripts (optional)
- python/
  - mapper_1.py
  - reducer_1.py
- readme.txt
- results/
  - co-occurences.txt (adjacency list of 5 given tags + 5 of own choice)
  - tag_cloud_*.png (tag cloud visualization of the 10 tags)

IMPORTANT
- NOW: team up in groups of 5
- Assignment is due Monday, June 11, 2012: 12:00 (noon) soft deadline, 24:00 hard deadline
- Abgabegespräche (hand-in interviews) will be on Tuesday, June 12, 2012
  - Every team member has to attend
- As always: plagiarism will not be tolerated!

Apache Hadoop
- One solution to the scaling problem
- Uses the MapReduce paradigm
- Written in Java (but other programming languages are possible as well)
- Used by Yahoo, Amazon, etc.

http://www.heise.de/newsticker/meldung/studie-hadoop-wird-aehnlich-erfolgreich-wie-linux-1569837.html (de)
http://www.h-online.com/open/news/item/analyst-hadoop-ecosystem-to-be-worth-812-million-in-2016-1570381.html (en)

MapReduce 1/2
- Framework to support distributed computing of large data sets on clusters
- Developed by Google Inc.
- Used for data-intensive information processing
- Large files / lots of computation

MapReduce 2/2
Abstract view:
- The master splits the problem into smaller parts
- Mappers solve the sub-problems
- A reducer combines the results from the mappers

(Diagram: http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html)

Distributed File System (DFS)
- Hadoop comes with a distributed file system
- Highly fault tolerant
- Splits data into blocks of 64 MB (default configuration)

Example of a MapReduce Application 1/4
Word Count: counting the occurrences of words in lots of documents.
To keep things simple we will use the example from Michael G. Noll's tutorial [1], which
- uses Python
- reads from stdin
- writes to stdout

Example of a MapReduce Application 2/4
Mapper
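The mapper appears only as a screenshot in the original slides; below is a minimal sketch in the spirit of Noll's streaming tutorial. The function name wc_map is my own, not from the tutorial.

```python
#!/usr/bin/env python
import sys

def wc_map(lines):
    """Emit "word<TAB>1" for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits via stdin
    for record in wc_map(sys.stdin):
        print(record)
```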

Example of a MapReduce Application 3/4
Reducer
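Likewise, the reducer is shown as an image in the slides; here is a minimal sketch. It relies on the streaming framework (or the sort step in the local test pipeline) delivering the mapper output grouped by key. The function name wc_reduce is my own.

```python
#!/usr/bin/env python
import sys

def wc_reduce(lines):
    """Sum counts per word; input must be sorted by key (word)."""
    current_word, current_count = None, 0
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:  # flush the last run of keys
        yield "%s\t%d" % (current_word, current_count)

if __name__ == "__main__":
    for record in wc_reduce(sys.stdin):
        print(record)
```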

Example of a MapReduce Application 4/4
It is always recommended to test the code you have written on a small sample subset. Think it through with pen & paper and compare the results.
Example:
  cat subset.txt | python mapper.py | sort | python reducer.py
Run the code on the cluster by issuing:
  bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.0.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input $input -output $output

Important
The presented information has been tested on the following operating systems:
- Ubuntu Linux
- Debian Linux
Installation on Windows machines will not be supported by us in the newsgroup and is strongly discouraged. We also cannot support Mac OS X (although the setup should work fine).

Hadoop Setup 1/2
- Use a native Linux installation (~100-120 GB free space)
- Do the multi-node cluster setup (for runtime reasons)
- As described in the tutorial by Michael G. Noll [1]:
  - Create a new user hadoop on your system
  - Use functioning DNS or an /etc/hosts file for client/master lookup
  - Download the current Hadoop distribution from http://hadoop.apache.org
  - Unpack the distribution in a directory (e.g. /usr/local/hadoop)

Hadoop Setup 2/2
- conf/hadoop-env.sh - holds environment variables and the Java installation path
- conf/core-site.xml - names the default file system host & temp data directory
- conf/mapred-site.xml - specifies the job tracker
- conf/masters - names the masters
- conf/slaves - names the slaves (only necessary on the master)
- conf/hdfs-site.xml - specifies the replication value
Format the DFS:
  bin/hadoop namenode -format
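A minimal sketch of the three XML files listed above, using the Hadoop 1.x property names from Noll's tutorial; the host name master, the ports, and the paths are placeholders you must adapt to your own cluster.

```xml
<!-- conf/core-site.xml: default file system and temp directory -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: job tracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: replication value -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```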

Starting the Hadoop Cluster
- bin/start-dfs.sh - starts the HDFS daemons
- bin/start-mapred.sh - starts the Map/Reduce daemons
- Alternative: bin/start-all.sh
- Stop scripts are also available

Pitfalls for the Setup of Hadoop
- Use machines of approximately the same speed / setup
- Use the same directory structure for all installations on your machines
- Ensure that password-less SSH login is possible between all machines
- Avoid the term localhost and the IP 127.0.0.1 at all costs; use fixed IPs or functioning DNS for your experiments
- Read the log files of the Hadoop installation
- Use the web interface of your cluster

Further hints
- Check that enough free space is available on your hard disk partition (~100-120 GB are needed)
- Virtual machines (DO NOT DO IT!)
  - Same as above: give the machine enough space
  - Give the machine a good amount of memory (~1024 MB)
  - For local networks: use bridging (no NAT!)
- Read the tutorials carefully! [1]
- Post your problems to the newsgroup
- Read the log files (hadoop/logs/)
- Often Google offers you a solution/hint to your problem

Thanks for your attention! Are there any questions?

References
[1] Michael G. Noll's Hadoop tutorials:
- Single Node Cluster: http://www.michael-noll.com/wiki/running_hadoop_on_ubuntu_linux_%28single-Node_Cluster%29
- Multi Node Cluster: http://www.michael-noll.com/wiki/running_hadoop_on_ubuntu_linux_%28multi-node_cluster%29
- Writing a Map/Reduce program in Python: http://www.michael-noll.com/wiki/writing_an_hadoop_mapreduce_program_in_python
[2] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 591-600, New York, NY, USA, 2010. ACM.
[3] http://en.wikipedia.org/wiki/Co-occurrence