COSC 6397 Big Data Analytics. Mahout and 3rd homework assignment. Edgar Gabriel, Spring 2014.

Mahout

- Scalable machine learning library
- Built with MapReduce and Hadoop in mind
- Written in Java
- Focuses on three application scenarios:
  - Recommendation systems
  - Clustering
  - Classifiers
- Multiple ways of using Mahout:
  - Java interfaces
  - Command-line interfaces

Classification

Currently supported algorithms:
- Naïve Bayes classifier
- Hidden Markov models
- Logistic regression
- Random forest

Clustering

Currently supported algorithms:
- Canopy clustering
- K-means clustering
- Fuzzy k-means clustering
- Spectral clustering

Multiple tools are available to support clustering:
- clusterdump: utility that writes the results of a clustering to a text file
- cluster visualization

Mahout input arguments

Input data has to be provided as sequence files and sequence vectors.
- Sequence file: binary file containing a list of key/value pairs
  - The file records the classes used for the key and the value
  - Generic Hadoop concept
- Sequence vector: binary file containing a list of key/(array of values) pairs
- To use Mahout algorithms, the key has to be of type Text and the value of type VectorWritable (which is a Mahout class, not a Hadoop class)

Sequence files

Creating a sequence file from the command line:

    gabriel@shark> mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles

Looking at the contents of a sequence file:

    gabriel@shark> mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
    Input Path: file:/lastfm/seqfiles/control-data.seq
    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
    Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
    Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}

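Sequence file from Java

Required if the original input file is not already structured in a way that can be interpreted as key/value pairs.

    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Scanner;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    public class CreateSequenceFile {
        public static void main(String[] argsx) throws FileNotFoundException, IOException {
            String filename = "/home/gabriel/mahouttest/synthetic-control-data/input/synthetic-control.data";
            String outputfilename = "/home/gabriel/mahouttest/synthetic-control-data/seqfile/synthetic-control.seq";
            Path path = new Path(outputfilename);
            BufferedReader br = new BufferedReader(new FileReader(filename));
            String line;

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Key type is Text, value type is VectorWritable, as required by Mahout
            SequenceFile.Writer writer =
                new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

            Text key = new Text();
            long tempkey = 0;
            // Each input line becomes one key/vector pair; the key is the line number
            while ((line = br.readLine()) != null) {
                Scanner scanner = new Scanner(new StringReader(line));
                double[] values = new double[64];   // at most 64 values per line are read
                int i = 0;
                while (scanner.hasNextDouble() && i < 64) {
                    values[i] = scanner.nextDouble();
                    i++;
                }
                DenseVector val = new DenseVector(values);
                VectorWritable vec = new VectorWritable(val);
                key = new Text(String.format("%d", tempkey));
                writer.append(key, vec);
                tempkey++;
            }
            writer.close();
            br.close();
        }
    }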

Using Mahout clustering

- The SequenceFile containing the input vectors.
- The SequenceFile containing the initial cluster centers.
- The similarity measure to be used.
- The convergence threshold.
- The number of iterations to be done.
- The Vector implementation used in the input files.
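How these inputs interact can be illustrated with a minimal, single-machine k-means sketch in plain Java. It is not Mahout's implementation: the class name KMeansSketch, the use of squared Euclidean distance for the assignment step, and the choice to measure convergence as the largest Euclidean shift of any center are all assumptions made for illustration.

```java
/** Illustrative k-means loop (hypothetical helper, not part of Mahout). */
final class KMeansSketch {

    /**
     * Assigns each point to its nearest center and moves the centers until no
     * center shifts by more than convergenceDelta or maxIterations is reached.
     * The centers array is updated in place; the returned array holds the
     * cluster index assigned to each point.
     */
    static int[] cluster(double[][] points, double[][] centers,
                         int maxIterations, double convergenceDelta) {
        int dim = points[0].length;
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: pick the nearest center for every point.
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double dist = 0.0;
                    for (int d = 0; d < dim; d++) {
                        double diff = points[p][d] - centers[c][d];
                        dist += diff * diff;
                    }
                    if (dist < best) {
                        best = dist;
                        assignment[p] = c;
                    }
                }
            }

            // Update step: each center becomes the mean of its points.
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }

            double maxShift = 0.0;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double shift = 0.0;
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    double diff = updated - centers[c][d];
                    shift += diff * diff;
                    centers[c][d] = updated;
                }
                maxShift = Math.max(maxShift, Math.sqrt(shift));
            }

            // Convergence test: stop once no center moved more than the threshold.
            if (maxShift <= convergenceDelta) {
                break;
            }
        }
        return assignment;
    }
}
```

A smaller convergence threshold generally means more iterations before the loop stops on its own, while the iteration limit caps the cost when the centers keep moving.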

Distance measures

- Euclidean distance measure
- Squared Euclidean distance measure
- Manhattan distance measure
- Cosine distance measure
- Tanimoto distance measure
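For reference, the five measures can be written out on plain double arrays as follows. This is a sketch using the usual textbook definitions (cosine distance as one minus the cosine similarity, Tanimoto distance as one minus the Tanimoto coefficient); the class DistanceSketch is hypothetical and independent of Mahout's own DistanceMeasure classes, whose handling of edge cases such as zero vectors may differ.

```java
/** Textbook distance measures on dense vectors (hypothetical helper, not Mahout code). */
final class DistanceSketch {

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Sum of squared component differences.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Square root of the squared Euclidean distance.
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    // Sum of absolute component differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // 1 - cos(angle between a and b); undefined for zero vectors.
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // 1 - (a.b) / (|a|^2 + |b|^2 - a.b); undefined if both vectors are zero.
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }
}
```

Squared Euclidean distance produces the same nearest-center ordering as Euclidean distance but avoids the square root, whereas cosine distance ignores vector length and only compares direction.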

Running Mahout clustering algorithms

    bin/mahout kmeans -i <input vectors directory> \
        -c <input clusters directory> \
        -o <output working directory> \
        -k <optional no. of initial clusters> \
        -dm <DistanceMeasure> \
        -x <maximum number of iterations> \
        -cd <optional convergence delta. Default is 0.5> \
        -ow <overwrite output directory if present> \
        -cl <run input vector clustering after computing Canopies> \
        -xm <execution method: sequential or mapreduce>

    mahout clusterdump -i /gabriel/clustering/canopy/clusters-0-final \
        --pointsDir /gabriel/clustering/canopy/clusteredPoints \
        -o /home/gabriel/mahouttest/synthetic-control-data/canopy.out

3rd homework assignment

Rules
- Each student should deliver:
  - Source code (.java files)
  - Documentation (.pdf, .doc, .tex or .txt file) containing
    - explanations of the code
    - answers to the questions
- Deliver electronically to gabriel@cs.uh.edu
- Expected by Saturday, May 3, 11:59 pm
- No extension possible!
- In case of questions: ask, ask, ask!

Given an input file, the structure of the input file is:

    Site id 1 {feature vector}
    Site id 2 {feature vector}
    Site id 3 {feature vector}

Each feature vector has 2+365 values: latitude, longitude, and the daily maximum O3 values for that site.

The input file will be given as a sequence file and as a plain text file:

    /bigdata-hw3/input/dailymax.txt
    /bigdata-hw3/input/dailymax.seq

Tasks:
- Compare the clustering quality obtained with the k-means algorithm and with the fuzzy k-means algorithm.
- Calculate the compactness of the clusters using

    c = \sum_{\text{all clusters } C} \frac{\sum_{i \in C} (x_i - \mu_c)^2}{\sum_{j \notin C} (x_j - \mu_c)^2}

- Compare the performance of the k-means clustering algorithm using different distance metrics:
  - Squared Euclidean distance
  - Manhattan distance
  - Cosine distance
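One possible way to evaluate the compactness measure is sketched below in plain Java. It assumes that the formula is a sum over all clusters of the ratio between the squared distances of the cluster's own points to its center and the squared distances of all other points to that same center, and that μ_c is the mean of the points in cluster C. The class name CompactnessSketch and the array-based inputs are hypothetical; reading the cluster assignments from Mahout's output is not shown.

```java
/** Compactness of a clustering under the stated assumptions (hypothetical helper). */
final class CompactnessSketch {

    /**
     * points[p] is the feature vector of point p, assignment[p] its cluster
     * index in the range 0..k-1. Returns the sum over all non-empty clusters of
     * (squared distances of members to the centroid) divided by
     * (squared distances of non-members to the same centroid).
     */
    static double compactness(double[][] points, int[] assignment, int k) {
        int dim = points[0].length;

        // Centroid of each cluster = mean of its member points.
        double[][] centroid = new double[k][dim];
        int[] count = new int[k];
        for (int p = 0; p < points.length; p++) {
            count[assignment[p]]++;
            for (int d = 0; d < dim; d++) {
                centroid[assignment[p]][d] += points[p][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim && count[c] > 0; d++) {
                centroid[c][d] /= count[c];
            }
        }

        double total = 0.0;
        for (int c = 0; c < k; c++) {
            if (count[c] == 0) {
                continue; // skip empty clusters, which have no centroid
            }
            double inside = 0.0;   // numerator: points belonging to cluster c
            double outside = 0.0;  // denominator: all other points
            for (int p = 0; p < points.length; p++) {
                double dist = 0.0;
                for (int d = 0; d < dim; d++) {
                    double diff = points[p][d] - centroid[c][d];
                    dist += diff * diff;
                }
                if (assignment[p] == c) {
                    inside += dist;
                } else {
                    outside += dist;
                }
            }
            total += inside / outside;
        }
        return total;
    }
}
```

Under this reading, smaller values indicate tighter clusters that are well separated from the remaining points, so the value can be compared across k-means and fuzzy k-means runs on the same data.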

Typical steps for executing your job

Run your job:

    gabriel@shark> mahout kmeans -i /bigdata-hw3/input/dailymax.seq -o /bigd48/kmeans-out -c /bigdata-hw3/centers/centers.seq -x 10 -k 5 -cl

Important:
- -c centers.seq has to be provided, but will be ignored if the -k option is set
- The -cl option ensures that the list of which point belongs to which cluster is written (it will be in the clusteredPoints directory)

Setting different distance measures

- You can enforce the usage of a distance measure with the -dm argument
- The full class name of the distance measure needs to be provided, e.g.

    org.apache.mahout.common.distance.CosineDistanceMeasure
    org.apache.mahout.common.distance.ManhattanDistanceMeasure
    org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure