Tutorial for Assignment 2.0

Florian Klien & Christian Körner

IMPORTANT
The presented information has been tested on the following operating systems:
- Mac OS X 10.6
- Ubuntu Linux
Installation on Windows machines will not be supported by us in the newsgroup.

Today's Agenda
- Motivation
- Quick introduction to Map/Reduce and Hadoop
- The assignment
- Pitfalls during the setup

What you should have learned so far
- Network analysis and operations such as
  o Degree distribution
  o Clustering coefficient
  o Google's PageRank
  o Network evolution
- Computed for very small networks

Motivation
- So far these analyses do NOT scale. What about networks that contain millions of nodes and edges, or GB/TB of data?
- Computation would take quite a long time
- How can we process large amounts of data?
- Apache Hadoop: one solution to the scaling problem
  o Uses the Map/Reduce paradigm
  o Written in Java, but other programming languages can also be used
  o Is used by Yahoo, Amazon, etc.

What is Map/Reduce? / 1
- Framework to support distributed computing on large data sets on clusters
- Used for data-intensive information processing
- Large files / lots of computation

What is Map/Reduce? / 2
- Abstract view:
  o Master splits the problem into smaller parts
  o Mappers solve the sub-problems
  o Reducer combines the results from the Mappers
- Examples:
  o WordCount
  o Inverted Index
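The abstract view above can be sketched in plain Python without any cluster. The following is only an in-memory illustration of the inverted index example; the function names (`map_document`, `shuffle`, `reduce_postings`, `inverted_index`) are made up for this sketch and are not part of Hadoop.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Mapper: emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(word, doc_ids):
    """Reducer: combine all document ids seen for one word into a sorted list."""
    return word, sorted(set(doc_ids))


def inverted_index(documents):
    """Run map, shuffle and reduce in memory over a {doc_id: text} dictionary."""
    pairs = []
    for doc_id, text in documents.items():  # each document is one sub-problem
        pairs.extend(map_document(doc_id, text))
    return dict(reduce_postings(word, ids) for word, ids in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    docs = {"d1": "hadoop runs map reduce", "d2": "map tasks feed reduce tasks"}
    for word, postings in sorted(inverted_index(docs).items()):
        print(word, postings)
```

On a real cluster the mapper calls run in parallel on different machines and the grouping step is done by the framework; only the map and reduce functions are written by the user.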

Distributed File System (DFS)
- Hadoop comes with a distributed file system (HDFS)
- Highly fault tolerant
- Splits data into blocks of 64 MB (default configuration)

Example of a Map/Reduce Application / 1
- Word Count: counting occurrences of words in lots of documents
- To keep things simple we will use the example from [1], which uses Python, reads from StdIn and writes to StdOut

Example of a Map/Reduce Application / 2
Example Code - Mapper

Example of a Map/Reduce Application / 3
Example Code - Reducer
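As a rough illustration of what a streaming mapper and reducer for Word Count might look like, here is a sketch in the spirit of [1]. It is not the code shown in the lecture. The mapper and reducer are written as functions over an iterable of lines so they can be tried in memory; in a streaming job each would live in its own script (for example mapper.py and reducer.py) and loop over `sys.stdin`. The reducer assumes its input is sorted by word, which Hadoop streaming provides between the two phases; a local test without Hadoop would need a sort step in between.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word. Assumes the lines are sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # In-memory stand-in for: mapper -> sort -> reducer.
    # In a streaming script you would instead write, e.g.:
    #     for out in mapper(sys.stdin): print(out)
    sample = ["to be or not", "to be"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```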

Example of a Map/Reduce Application / 4
- Testing the code you have written on a small subset is always recommended! Example:

    cat subset.txt | python mapper.py | python reducer.py

- Run the code on the cluster by issuing:

    bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar \
        -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py \
        -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py \
        -input $input -output $output

The Assignment
- Team up in groups of 5 students
- Create a Subversion repository
- Implement TunkRank and compute it on the provided data (one iteration is sufficient)
- Hand in the source code you used and the top 10,000 Twitter users in descending order
- See the assignment document for submission details!

Provided Data
- You are given a subset of a large Twitter data set which was gathered for a scientific paper [2]
  o 530 MB compressed
- Tab separated:
  o First column: User
  o Second column: Follower (user who follows the user from the first column)

TunkRank
- Tool to measure influence on Twitter
- The higher the TunkRank score, the more influential a Twitter user is
- Twitterers with high TunkRank:
  o Barack Obama
  o Kevin Rose
  o Stephen Colbert
- See http://www.tunkrank.com or [3] for details
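To make the data format and the score concrete, here is a small in-memory sketch of one TunkRank iteration. It assumes the formulation described in [3], Influence(X) = sum over followers Y of (1 + p * Influence(Y)) / |Following(Y)|; please verify this against [3] and the assignment document. The retweet probability `p = 0.05`, the initial influence of 0 for every user, and all function names are assumptions made for this sketch. It also assumes the edge list contains no duplicate lines, and it is not a Map/Reduce implementation.

```python
from collections import defaultdict


def parse_edges(lines):
    """Parse 'user<TAB>follower' lines into (user, follower) tuples."""
    edges = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        user, follower = line.split("\t")
        edges.append((user, follower))
    return edges


def tunkrank_iteration(edges, influence=None, p=0.05):
    """One iteration of the assumed TunkRank update.

    `p` (retweet probability) and the default influence of 0.0 are assumptions.
    Users without any follower do not appear in the result.
    """
    influence = influence or {}
    following_count = defaultdict(int)  # how many users each follower follows
    for _user, follower in edges:
        following_count[follower] += 1
    scores = defaultdict(float)
    for user, follower in edges:
        scores[user] += (1.0 + p * influence.get(follower, 0.0)) / following_count[follower]
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical toy graph: b and c follow a, c also follows b.
    sample = ["a\tb", "a\tc", "b\tc"]
    print(tunkrank_iteration(parse_edges(sample)))
```

In a Map/Reduce setting, counting how many users each follower follows and summing the contributions per user would typically become separate map and reduce steps.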

Hand In / 1
- Create a Subversion repository on the TUG server
  o Name: WSWT10_<GROUPNAME>
  o Group members as members
  o Teaching assistants as readers

Hand In / 2
Structure of the repository:
- report.pdf (short! approx. 1 page)
- bash scripts (optional)
- python/
  o mapper_1.py
  o mapper_2.py
  o ...
  o readme.txt
- results/
  o tunkrank_run_1.txt (top 10,000 twitterers in descending order + their TunkRank score)
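Producing the results file amounts to selecting the highest-scoring users and writing them in descending order. A minimal sketch, assuming the scores are already available as a Python dictionary and that a tab-separated "user, score" line is an acceptable output format (both are assumptions; the assignment document is authoritative):

```python
import heapq


def top_users(scores, n=10000):
    """Return the n (user, score) pairs with the highest score, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


def format_results(ranked):
    """Render ranked (user, score) pairs as 'user<TAB>score' lines."""
    return [f"{user}\t{score}" for user, score in ranked]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    demo = {"alice": 1.5, "bob": 0.5, "carol": 3.0}
    for line in format_results(top_users(demo, n=2)):
        print(line)
```

`heapq.nlargest` avoids sorting the full user list when only the top entries are needed.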

Important Dates
- NOW: Team up in groups of 5
- Assignment is due: Friday, JUNE 18th
  o 12:00 (noon): soft deadline
  o 24:00: hard deadline
- Submission interviews (Abgabegespräche) will be on JUNE 22nd
  o Every team member has to participate

Hadoop Setup / 1
- Create a new user hadoop on your system
- Use functioning DNS or the /etc/hosts file for client/master lookup
- Download the current Hadoop distribution from http://hadoop.apache.org/
- Unpack the distribution in a directory (e.g. /usr/local/hadoop/)
- Create a temp directory (e.g. /usr/local/hadoop-datastore)

Hadoop Setup / 2
- conf/hadoop-env.sh: holds environment variables and the Java installation
- conf/core-site.xml: names the host of the default file system and the temp data directory
- conf/mapred-site.xml: specifies the job tracker
- conf/masters: names the masters
- conf/slaves (only necessary on the master): names the slaves
- conf/hdfs-site.xml: specifies the replication value

Hadoop Setup / 3
- Format the DFS!

    bin/hadoop namenode -format
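The three *-site.xml files listed above share the same simple layout: a `configuration` element containing `property` entries with a `name` and a `value`. The sketch below renders such files from Python dictionaries. The property names are those used in the Hadoop 0.20.x line as far as I am aware; check the documentation of your Hadoop version. The host name `master`, the ports, and the replication value are placeholders chosen for this sketch, not values from the lecture.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a {name: value} dictionary as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values: host "master", ports and replication are assumptions.
CORE_SITE = {
    "fs.default.name": "hdfs://master:54310",         # host of the default file system
    "hadoop.tmp.dir": "/usr/local/hadoop-datastore",  # temp data directory
}
MAPRED_SITE = {"mapred.job.tracker": "master:54311"}  # job tracker address
HDFS_SITE = {"dfs.replication": 2}                    # replication value

if __name__ == "__main__":
    for filename, props in [
        ("conf/core-site.xml", CORE_SITE),
        ("conf/mapred-site.xml", MAPRED_SITE),
        ("conf/hdfs-site.xml", HDFS_SITE),
    ]:
        print(filename)
        print(render_site_xml(props))
```

The script only prints the documents; writing them into the conf/ directory by hand keeps the step explicit and avoids overwriting an existing configuration.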

Starting the Hadoop Cluster
- bin/start-dfs.sh: starts the HDFS daemons
- bin/start-mapred.sh: starts the Map/Reduce daemons
- Alternative: start-all.sh
- Stopper scripts are also available

Pitfalls for the Setup of Hadoop
- Use machines of approximately the same speed / setup
- Use the same directory structure for all installations on your machines
- Ensure that password-less ssh login is possible for all machines
- Avoid the term localhost and the IP 127.0.0.1 at all cost: use fixed IPs or functioning DNS for your experiments
- Read the log files of the Hadoop installation
- Use the web interface of your cluster
- If there are problems, use the newsgroup
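The localhost pitfall can be checked before starting the daemons. The following sketch resolves each host name given on the command line and flags names that resolve to a loopback address (anything in 127.0.0.0/8, which also covers the 127.0.1.1 entry that some Ubuntu installations put into /etc/hosts). The function names and the example host names in the usage note are made up for this sketch.

```python
import ipaddress
import socket
import sys


def is_loopback(address):
    """True if the IP address lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def check_host(hostname):
    """Resolve a host name and report (address, ok); ok is False for loopback."""
    address = socket.gethostbyname(hostname)
    return address, not is_loopback(address)


if __name__ == "__main__":
    # Host names are taken from the command line, e.g. the cluster's node names.
    for host in sys.argv[1:]:
        try:
            address, ok = check_host(host)
        except OSError as error:  # resolution failed
            print(f"{host}: cannot resolve ({error})")
            continue
        status = "ok" if ok else "resolves to loopback, fix DNS or /etc/hosts"
        print(f"{host}: {address} {status}")
```

Saved under a name of your choice, it could be run on every machine as, for example, `python check_hosts.py master slave1`, where `master` and `slave1` stand for your own node names.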

Thanks for your attention! Are there any questions?

References
[1] Michael G. Noll's Hadoop Tutorial:
  o Single Node Cluster: http://www.michael-noll.com/wiki/running_hadoop_on_ubuntu_linux_%28single-node_cluster%29
  o Multi Node Cluster: http://www.michael-noll.com/wiki/running_hadoop_on_ubuntu_linux_%28multi-node_cluster%29
  o Writing a Map/Reduce Program in Python: http://www.michael-noll.com/wiki/writing_an_hadoop_mapreduce_program_in_python
[2] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 591-600, New York, NY, USA, 2010. ACM.
[3] http://thenoisychannel.com/2009/01/13/a-twitteranalog-to-pagerank/