Distributed Recommenders Fall 2010
Distributed Recommenders
Distributed approaches are needed when:
o The dataset does not fit into memory
o The need for processing exceeds what can be provided by a sequential algorithm
Traditionally, distributed data mining algorithms were very time-consuming to implement
The MapReduce framework can reduce this complexity
Mahout uses the Hadoop MapReduce framework
o Slope One (already implemented in Mahout)
o Distributed nearest neighbor: user-based and item-based
Overview of Hadoop
Hadoop is an Apache project consisting of a distributed filesystem (HDFS) and a MapReduce [1] engine
Hadoop enables applications to more easily distribute large computations across a cluster of Hadoop nodes
HDFS breaks files into chunks, which are stored across nodes. This is needed since a typical data file used on a Hadoop cluster is larger than a single disk on the cluster.
Generally, multiple copies of each chunk are kept on different nodes for redundancy and efficiency
A job tracker executes MapReduce jobs against data stored in HDFS
MapReduce
MapReduce is a software framework for distributing computations across a cluster of computers
Mappers and reducers are processes on the nodes that perform the map and reduce steps, respectively
Both steps take key-value pairs as input:
Map: (K1, V1) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
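As a concrete illustration of these signatures, here is a minimal sketch using Hadoop's older org.apache.hadoop.mapred API (the same style used by the map and reduce examples later in these slides). The class names LineLengthMapper and SumReducer and the line-length computation are purely illustrative assumptions, not part of any real library:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper<K1, V1, K2, V2>: consumes one (K1, V1) pair, emits zero or more (K2, V2) pairs
class LineLengthMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Illustrative only: emit ("length", byte length of this line)
    output.collect(new Text("length"), new IntWritable(line.getLength()));
  }
}

// Reducer<K2, V2, K3, V3>: consumes (K2, list(V2)), emits (K3, V3) pairs
class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up all values that share this key
    }
    output.collect(key, new IntWritable(sum));
  }
}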
MapReduce
Map step
o A key and value are given to the mapper
o The mapper performs an operation on its input and returns a key and value in a different domain
The mappers' output is then sorted and the keys combined, so that each key is unique and paired with a list of the values output for that key
Reduce step
o A key and list of values are given to the reducer
o The reducer transforms its input into a key and final value
MapReduce
Map and reduce tasks consist of multiple steps
[Figure: anatomy of a map task and a reduce task — on the map side, input split → map → in-memory buffer → sort → merge on disk; on the reduce side, merge of map outputs (including output from other maps) → reduce → output]
Components
Hadoop consists of several processes:
o Job Tracker
o Task Tracker(s)
o Name Node, Secondary Name Node
o Data Node(s)
The Job and Task Trackers handle MapReduce jobs
The Name Nodes and Data Nodes provide the HDFS file system used by MapReduce jobs
Components
The Name Node keeps track of file locations in the cluster
The Job Tracker accepts MapReduce jobs and gives map and reduce tasks to nodes running a Task Tracker
Keeping multiple copies of files in the cluster allows the Job Tracker to better schedule tasks on nodes that already contain the data needed for their map and reduce tasks
Hadoop can be made aware of node location, so that tasks can be run on nodes close to the data required by the task
HDFS Example
An HDFS read on a cluster with 3 DataNodes: the client's DistributedFileSystem asks the NameNode for block locations, and the blocks are then read from the DataNodes through an FSDataInputStream
[Figure: HDFS client → open → DistributedFileSystem → get block locations → NameNode; FSDataInputStream reading from the three DataNodes]
MapReduce Example
Given temperature readings from multiple weather stations over a 100-year period, with one file per station per year
Problem: find the highest temperature for each year across all stations
This problem can be solved on a cluster with MapReduce using two steps, with a shuffle and sort in between:
o Map: read a record and output the year and the temperature
o Shuffle and sort (done for you by the MapReduce framework): groups all readings for the same year together
o Reduce: read all the temperatures for a year and output the highest
MapReduce Example
Mappers take the position of each line in the data file paired with the line itself
Each mapper will parse the lines it receives and return a year-temperature pair indicating the temperature encountered
Mapper algorithm
o Read the input lines
o Find and store the year and temperature
o Output a year-temperature pair
Input: xxxxxxxxxxmm/dd/yyyyxxxxxxxxxxxtttt
Output: YYYY TTTT
MapReduce Example
The Hadoop framework will collect the output of the various mappers, combining common keys and forming a list from their values
This allows each reducer to work on all entries for a key
Input:
1900 28.3
1900 29.1
1901 29
1901 29.4
Output:
1900 28.3, 29.1
1901 29, 29.4
MapReduce Example
Reducers receive year-temperature list pairs as input
Each reducer computes the highest temperature in its input list, paired with its year
Different reducers can work on different years at the same time
Reducer algorithm
o Read the input temperature list
o Find the maximum temperature in the list
o Return a year-maximum temperature pair
Each reducer finds the highest temperature for a given year
The maximum useful number of reducers is limited by the number of years
Input: 1900 28.3, 29.1
Output: 1900 29.1
Map Example
private static final int MISSING = 9999; // temperature value indicating "no reading"

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  String year = line.substring(15, 19);                          // format-specific offsets
  int airTemperature = Integer.parseInt(line.substring(87, 92)); // format-specific offsets
  if (airTemperature != MISSING) {
    output.collect(new Text(year), new IntWritable(airTemperature));
  }
}
Reduce Example
public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  int maxValue = Integer.MIN_VALUE;
  while (values.hasNext()) {
    maxValue = Math.max(maxValue, values.next().get());
  }
  output.collect(key, new IntWritable(maxValue));
}
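To run the two methods above as a job, they would be wrapped in Mapper and Reducer classes and submitted by a small driver. The following is only a sketch, assuming the old org.apache.hadoop.mapred API used above and hypothetical class names MaxTemperatureMapper and MaxTemperatureReducer for the classes containing the map and reduce methods:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("max temperature");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // weather records in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // directory for reducer output
    conf.setMapperClass(MaxTemperatureMapper.class);         // class holding map() above
    conf.setReducerClass(MaxTemperatureReducer.class);       // class holding reduce() above
    conf.setOutputKeyClass(Text.class);                      // year
    conf.setOutputValueClass(IntWritable.class);             // temperature
    JobClient.runJob(conf);                                  // submit the job and wait
  }
}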
Mahout MapReduce Slope One
Mahout provides a parallelized implementation of Slope One's preprocessing step using Hadoop and MapReduce
This requires two separate MapReduce jobs:
o Transform preferences into item-item pair differences
o Transform lists of differences into average differences
Slope One preprocessing algorithm:
for every item i
  for every other item j
    for every user u expressing a preference for both i and j
      add the difference in u's preference for i and j to an average
Mahout MapReduce Slope One Distributed Algorithm
Sequential Slope One Preprocessing Algorithm
Step 1: Compute all preferences for a user. Read the input file, let's call it File A, of the form [user, item, preference] with n entries. Build a map of items for each user.
Step 2: For each user, compute the difference for each item pair and store the difference associated with each pair. Compile lists of the form <item1, item2> (d1, d2, ..., dj), where j is the number of times item1 has been rated together with item2.
Step 3: For each item pair, compute its average difference from its difference list.
Step 4: Output the results into a final result file D of the form <item1, item2, avg diff>
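A minimal sequential sketch of steps 1 through 4 above in Java. This is illustrative only, not Mahout's implementation; all class and method names are hypothetical, and File B is assumed to already be loaded into a per-user preference map:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequentialSlopeOnePreprocess {
  // Input: user -> (item -> preference), i.e. the per-user map built in step 1
  // Output: "item1|item2" -> average preference difference, i.e. the contents of file D
  public static Map<String, Double> averageDiffs(Map<Long, Map<String, Float>> prefsByUser) {
    Map<String, double[]> acc = new HashMap<String, double[]>();  // pair -> {sum of diffs, count}
    for (Map<String, Float> userPrefs : prefsByUser.values()) {
      List<String> items = new ArrayList<String>(userPrefs.keySet());
      Collections.sort(items);                                    // fixed order so each pair is always (smaller, larger)
      for (int a = 0; a < items.size(); a++) {
        for (int b = a + 1; b < items.size(); b++) {              // every item pair this user rated (step 2)
          String pair = items.get(a) + "|" + items.get(b);
          double diff = userPrefs.get(items.get(a)) - userPrefs.get(items.get(b));
          double[] sumAndCount = acc.get(pair);
          if (sumAndCount == null) { sumAndCount = new double[2]; acc.put(pair, sumAndCount); }
          sumAndCount[0] += diff;                                 // running sum of differences
          sumAndCount[1] += 1;                                    // number of co-rating users
        }
      }
    }
    Map<String, Double> averages = new HashMap<String, Double>(); // step 3: divide sum by count
    for (Map.Entry<String, double[]> e : acc.entrySet()) {
      averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
    }
    return averages;                                              // step 4: the final <item1, item2, avg diff> results
  }
}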
Mahout MapReduce Slope One Distributed Algorithm
Distributed Slope One Preprocessing Algorithm
Step 1: Compute all preferences for a user. Read a portion of the input file, let's call it File A, of the form [user, item, preference] with n entries. Send n/p entries to each processor and build a map of items for each user. Merge all p lists of user items into a single File B of the form: user ((item1, pref1), (item2, pref2), ..., (itemk, prefk))
Step 2: Read File B in parallel; each processor computes File C, which is of the form <item1, item2> (d1, d2, ..., dj), where j is the number of times item1 has been rated together with item2.
Step 3: Read File C in parallel and have each processor compute the average difference for c/p item pairs, where c is the total number of item pairs.
Step 4: Merge the results into a final result file D of the form <item1, item2, avg diff>
Slope One Performance
Step 1: Compute all preferences for a user
o With N users and I items, this will take O(NI); distributed over p processors this is O(NI/p)
o In the worst case every user has a preference for every item
Step 2: Find pairs for all N users
o Each user contributes up to I² item pairs, and we do this for each user, so it will take O(NI²); with p processors this takes O(NI²/p)
o Sequentially we can compute the running averages online
Step 3: Combine pairs and compute averages (distributed only)
o This is the same as step 2, because we use every pair: O(NI²) or, when distributed, O(NI²/p)
Mahout MapReduce Slope One
The map and reduce steps can be distributed across multiple mappers and reducers
Each box in the figure represents a separate process, each of which can run on the same or separate Hadoop nodes
[Figure: data file (101,A,4 / 101,B,3 / 102,A,2 / 102,B,4) → mappers → sort & combine by user (101, 102) → reducers emitting (A,B) differences → sort & combine by item pair → reducers → output]
Mahout MapReduce Slope One
Mappers read lines from the input data file and return userID-(itemID, preference value) pairs
This output is collected by Hadoop, sorted by user ID, and the values of common keys combined

Data File (user, item, pref):
101, A, 4
101, B, 3
102, A, 2
102, B, 4

Mapper output (key = user ID, value = (item, pref)):
101 (A,4)
101 (B,3)
102 (A,2)
102 (B,4)

Hadoop output (key = user ID, value = list of (item, pref)):
101 (A,4), (B,3)
102 (A,2), (B,4)
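A sketch of such a mapper follows. It is illustrative only, not Mahout's actual class; it reuses the old-style Hadoop API from the earlier weather example and assumes one comma-separated "user, item, pref" record per line:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (user ID, "item,pref") for each input line "user, item, pref"
class ToUserPrefsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split(",");   // user, item, pref
    long userID = Long.parseLong(parts[0].trim());
    output.collect(new LongWritable(userID),
                   new Text(parts[1].trim() + "," + parts[2].trim()));
  }
}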
Mahout MapReduce Slope One
Reducers are passed a user ID and that user's items and preferences from the mapper
The reducers find the difference between each pair of the user's items and return an (itemA, itemB)-difference pair for each

Reducer input (key = user ID, value = list of (item, pref)):
101 (A,4), (B,3)
102 (A,2), (B,4)

Reducer output (key = (itemA, itemB), value = difference):
(A,B) 4-3 = 1
(A,B) 2-4 = -2
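A sketch of this first-pass reducer (again illustrative, not Mahout's implementation): it collects one user's (item, pref) pairs and emits one difference per item pair, matching the table above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits ("itemA,itemB", prefA - prefB) for every item pair rated by one user
class ItemPairDiffReducer extends MapReduceBase
    implements Reducer<LongWritable, Text, Text, FloatWritable> {
  public void reduce(LongWritable userID, Iterator<Text> itemPrefs,
                     OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    Map<String, Float> prefs = new TreeMap<String, Float>();    // sorted so pairs are consistent
    while (itemPrefs.hasNext()) {
      String[] parts = itemPrefs.next().toString().split(",");  // "item,pref" from the mapper
      prefs.put(parts[0], Float.parseFloat(parts[1]));
    }
    List<String> items = new ArrayList<String>(prefs.keySet());
    for (int a = 0; a < items.size(); a++) {
      for (int b = a + 1; b < items.size(); b++) {
        float diff = prefs.get(items.get(a)) - prefs.get(items.get(b));
        output.collect(new Text(items.get(a) + "," + items.get(b)), new FloatWritable(diff));
      }
    }
  }
}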
Mahout MapReduce Slope One
A new MapReduce job is run on the first job's output
An identity mapper is used, effectively skipping the map step
This output is collected and sorted, so that each (itemA, itemB) pair is associated with a list of differences between the items
Reducers take this as input and return (itemA, itemB)-average difference pairs

Reducer input (key = (itemA, itemB), value = list of differences):
(A,B) 1, -2

Reducer output (key = (itemA, itemB), value = average difference):
(A,B) (1 + -2)/2 = -0.5
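A sketch of the averaging reducer for this second job (illustrative only; Hadoop's org.apache.hadoop.mapred.lib.IdentityMapper can serve as the identity mapper mentioned above):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Turns ("itemA,itemB", list of differences) into ("itemA,itemB", average difference)
class AverageDiffReducer extends MapReduceBase
    implements Reducer<Text, FloatWritable, Text, FloatWritable> {
  public void reduce(Text itemPair, Iterator<FloatWritable> diffs,
                     OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    float sum = 0.0f;
    int count = 0;
    while (diffs.hasNext()) {
      sum += diffs.next().get();
      count++;
    }
    output.collect(itemPair, new FloatWritable(sum / count));  // e.g. (1 + -2) / 2 = -0.5
  }
}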
References
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI '04), 2004.
Configuring Hadoop
Hadoop can be downloaded as Hadoop Common from its website, http://hadoop.apache.org
Hadoop v0.21 is the preferred version
Extract hadoop-0.21.0.tar.gz after downloading
The resulting hadoop-0.21.0 directory is your Hadoop installation and will be referred to as HADOOP_INSTALL by the Hadoop documentation
Install the Java 6 JDK if needed and note its installation directory for use when configuring Hadoop
o Something like C:\jdk1.6_02 on Windows
o /usr/lib/jvm/java-6-sun on Ubuntu Linux
Configuring Hadoop
Some files in HADOOP_INSTALL/conf need to be edited to configure Hadoop to run as a single-node cluster
In conf/hadoop-env.sh, JAVA_HOME should be set to the JDK's installation directory
Two variables need to be set to specify a hostname:port for the name node and the job tracker; with a single node, the hostname can be localhost
o In conf/mapred-site.xml, mapreduce.jobtracker.address should be set to hostname:9002
o In conf/core-site.xml, fs.default.name should be set to hostname:9001, for example:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9001</value>
</property>
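For reference, the corresponding conf/mapred-site.xml entry might look like the sketch below, assuming the localhost:9002 job tracker address suggested above:

<property>
  <name>mapreduce.jobtracker.address</name>
  <value>localhost:9002</value>
</property>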
Configuring Hadoop
Passwordless SSH
Hadoop uses SSH to start Hadoop components on the various nodes in the cluster
We only have one node, but the startup scripts will still SSH to localhost in order to run the component startup commands
OpenSSH's sshd needs to be installed
o If using Cygwin, make sure the openssh packages are installed and run ssh-host-config
o If using Ubuntu Linux, install openssh-server
Now run ssh-keygen to generate a public and private key for use with SSH's public key authentication; when prompted, use a blank passphrase
Add the new public key to your authorized keys (with the default key location: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)
Now you should be able to SSH to your local machine with no password using the command: ssh localhost
Starting Hadoop
After Hadoop is configured, it can be started by running bin/start-all.sh from the HADOOP_INSTALL directory; it can be stopped with bin/stop-all.sh
A HADOOP_INSTALL/logs directory will be created; try looking at the .log files inside if something goes wrong
Try writing a file to HDFS as /test:
bin/hadoop dfs -put <filename> /test
You should be able to view it in HDFS along with other files already created by Hadoop:
bin/hadoop dfs -ls /
You can also copy it from HDFS back to the local filesystem:
bin/hadoop dfs -get /test test-file-from-hdfs