COSC 6397 Big Data Analytics: Mahout and 3rd Homework Assignment. Edgar Gabriel, Spring 2014.




COSC 6397 Big Data Analytics Mahout and 3 rd homework assignment Edgar Gabriel Spring 2014 Mahout Scalable machine learning library Built with MapReduce and Hadoop in mind Written in Java Focusing on three application scenarios Recommendation Systems Clustering Classifiers Multiple ways for utilizing Mahout Java Interfaces Command line interfaces 1

Classification
- Currently supported algorithms:
  - Naïve Bayesian classifier
  - Hidden Markov models
  - Logistic regression
  - Random forest

Clustering
- Currently supported algorithms:
  - Canopy clustering
  - K-means clustering
  - Fuzzy k-means clustering
  - Spectral clustering
- Multiple tools available to support clustering:
  - clusterdump: utility to write the results of a clustering run to a text file
  - Cluster visualization

Mahout input arguments
- Input data has to be provided as sequence files and sequence vectors.
- Sequence file: a binary file containing a list of key/value pairs, together with the classes used for the key and the value. This is a generic Hadoop concept.
- Sequence vector: a binary file containing a list of key/(array of values) pairs.
- For using Mahout algorithms, the key has to be of type Text and the value of type VectorWritable (which is a Mahout class, not a Hadoop class).

Sequence files
- Creating a sequence file using the command line:

  gabriel@shark> mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles

- Looking at the contents of a sequence file:

  gabriel@shark> mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
  Input Path: file:/lastfm/seqfiles/control-data.seq
  Key class: class org.apache.hadoop.io.Text
  Value Class: class org.apache.mahout.math.VectorWritable
  Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
  Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}

Sequence file from Java
- Required if the original input file is not already structured in a manner that can be interpreted as key/value pairs.

  import java.io.BufferedReader;
  import java.io.FileNotFoundException;
  import java.io.FileReader;
  import java.io.IOException;
  import java.io.StringReader;
  import java.util.Scanner;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.VectorWritable;

  public class CreateSequenceFile {
      public static void main(String[] args) throws FileNotFoundException, IOException {
          String filename =
              "/home/gabriel/mahouttest/synthetic-control-data/input/synthetic-control.data";
          String outputfilename =
              "/home/gabriel/mahouttest/synthetic-control-data/seqfile/synthetic-control.seq";
          Path path = new Path(outputfilename);
          BufferedReader br = new BufferedReader(new FileReader(filename));
          String line;
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Key class is Text, value class is VectorWritable (the Mahout class)
          SequenceFile.Writer writer =
              new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
          Text key = new Text();
          long tempkey = 0;
          while ((line = br.readLine()) != null) {
              // Each input line holds up to 64 whitespace-separated doubles
              Scanner scanner = new Scanner(new StringReader(line));
              double[] values = new double[64];
              int i = 0;
              while (scanner.hasNextDouble() && i < 64) {
                  values[i] = scanner.nextDouble();
                  i++;
              }
              DenseVector val = new DenseVector(values);
              VectorWritable vec = new VectorWritable(val);
              key = new Text(String.format("%d", tempkey));
              writer.append(key, vec);
              tempkey++;
          }
          writer.close();
      }
  }

Using Mahout clustering
Required inputs:
- The SequenceFile containing the input vectors
- The SequenceFile containing the initial cluster centers
- The similarity measure to be used
- The convergence threshold
- The number of iterations to be performed
- The Vector implementation used in the input files
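To make clear what the convergence threshold and iteration count control, here is a self-contained toy k-means in plain Java. This is an illustrative sketch only, not Mahout's clustering API; the class and method names (ToyKMeans, iterate, cluster) are invented for this example.

```java
public class ToyKMeans {
    // One Lloyd iteration: assign each point to the nearest center
    // (squared Euclidean distance), then recompute each center as the
    // mean of its assigned points.
    static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length, dim = points[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++)
                    d += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            counts[best]++;
            for (int j = 0; j < dim; j++) sums[best][j] += p[j];
        }
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                next[c][j] = counts[c] > 0 ? sums[c][j] / counts[c] : centers[c][j];
        return next;
    }

    // Iterate until no center coordinate moves by more than
    // convergenceDelta, or until maxIterations is reached.
    static double[][] cluster(double[][] points, double[][] centers,
                              double convergenceDelta, int maxIterations) {
        for (int it = 0; it < maxIterations; it++) {
            double[][] next = iterate(points, centers);
            double maxMove = 0;
            for (int c = 0; c < centers.length; c++)
                for (int j = 0; j < centers[c].length; j++)
                    maxMove = Math.max(maxMove, Math.abs(next[c][j] - centers[c][j]));
            centers = next;
            if (maxMove <= convergenceDelta) break;
        }
        return centers;
    }
}
```

A tighter convergence delta or a larger iteration limit trades running time for better-settled centers; Mahout exposes exactly this trade-off through its -cd and -x options.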

Distance measures
- Euclidean distance measure
- Squared Euclidean distance measure
- Manhattan distance measure
- Cosine distance measure
- Tanimoto distance measure
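The differences between these measures are easiest to see side by side in plain Java. The Distances class below is an illustrative sketch, not Mahout's DistanceMeasure implementations.

```java
public class Distances {
    // Squared Euclidean: sum of squared per-coordinate differences.
    // Preserves the ordering of Euclidean distance but avoids the sqrt.
    static double squaredEuclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Euclidean: straight-line distance.
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    // Manhattan: sum of absolute per-coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Cosine distance: 1 minus the cosine of the angle between the
    // vectors, so it depends on orientation, not magnitude.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Tanimoto distance: 1 minus dot/(|a|^2 + |b|^2 - dot);
    // zero for identical vectors.
    static double tanimoto(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (na + nb - dot);
    }
}
```

Note that cosine distance treats the vectors (2, 0) and (5, 0) as identical (distance 0), while Euclidean and Manhattan distance do not; this is why the choice of measure can change the resulting clustering.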

Running Mahout clustering algorithms

  bin/mahout kmeans -i <input vectors directory> \
                    -c <input clusters directory> \
                    -o <output working directory> \
                    -k <optional no. of initial clusters> \
                    -dm <DistanceMeasure> \
                    -x <maximum number of iterations> \
                    -cd <optional convergence delta, default 0.5> \
                    -ow <overwrite output directory if present> \
                    -cl <run input vector clustering after computing Canopies> \
                    -xm <execution method: sequential or mapreduce>

  mahout clusterdump -i /gabriel/clustering/canopy/clusters-0-final \
      --pointsDir /gabriel/clustering/canopy/clusteredPoints \
      -o /home/gabriel/mahouttest/synthetic-control-data/canopy.out

3rd homework assignment: rules
- Each student should deliver:
  - Source code (.java files)
  - Documentation (.pdf, .doc, .tex, or .txt file) containing explanations of the code and answers to the questions
- Deliver electronically to gabriel@cs.uh.edu
- Expected by Saturday, May 3, 11:59 pm. No extension possible!
- In case of questions: ask, ask, ask!

Given an input file, the structure of the input file is

  Site id1 {feature vector}
  Site id2 {feature vector}
  Site id3 {feature vector}

- Each feature vector has 2 + 365 values: latitude, longitude, and the daily maximum O3 value for that site.
- The input file will be given both as a sequence file and as a plain text file:
  /bigdata-hw3/input/dailymax.txt
  /bigdata-hw3/input/dailymax.seq

Tasks:
- Compare the clustering quality obtained with the k-means algorithm and with the fuzzy k-means algorithm.
- Calculate the compactness of the clusters using

  c = Σ_{all clusters C} [ Σ_{x_i ∈ C} (x_i − μ_C)² / Σ_{x_j ∉ C} (x_j − μ_C)² ]

- Compare the performance of the k-means clustering algorithm using different distance metrics:
  - Squared Euclidean distance
  - Manhattan distance
  - Cosine distance
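Under one reading of the compactness formula above (for each cluster, the squared distances of its members to the centroid divided by the squared distances of all non-members to that same centroid, summed over clusters), the computation can be sketched in plain Java. The class name Compactness and the argument layout (points, an assignment array, and centroids) are choices made for this example, not part of the assignment's specification.

```java
public class Compactness {
    // c = sum over clusters C of
    //       (sum of squared distances of members of C to centroid mu_C)
    //     / (sum of squared distances of non-members to mu_C)
    // Smaller values indicate tighter, better separated clusters.
    static double compactness(double[][] points, int[] assignment, double[][] centroids) {
        double c = 0;
        for (int k = 0; k < centroids.length; k++) {
            double intra = 0, inter = 0;
            for (int i = 0; i < points.length; i++) {
                // Squared Euclidean distance from point i to centroid k
                double d = 0;
                for (int j = 0; j < points[i].length; j++)
                    d += (points[i][j] - centroids[k][j]) * (points[i][j] - centroids[k][j]);
                if (assignment[i] == k) intra += d;
                else inter += d;
            }
            c += intra / inter;
        }
        return c;
    }
}
```

The point-to-cluster assignment needed here is exactly what the -cl option writes to the clusteredPoints directory, and the centroids can be read from the clusters-*-final output.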

Typical steps for executing your job
- Run your job:

  gabriel@shark> mahout kmeans -i /bigdata-hw3/input/dailymax.seq -o /bigd48/kmeans-out -c /bigdata-hw3/centers/centers.seq -x 10 -k 5 -cl

- Important: -c centers.seq has to be provided, but will be ignored if the -k option is set.
- The -cl option ensures that the list of which point belongs to which cluster is written out (it will be in the clusteredPoints directory).

Setting different distance measures
- You can enforce the usage of a distance measure using the -dm argument.
- The full class name of the distance measure needs to be provided, e.g.:

  org.apache.mahout.common.distance.CosineDistanceMeasure
  org.apache.mahout.common.distance.ManhattanDistanceMeasure
  org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure