K-means Implementation


Bethanie Johnston
7 years ago

COSC 6397 Big Data Analytics
Introduction to MapReduce (II)
Edgar Gabriel, Spring 2014

K-means Implementation

Simplifying assumptions:
- 1 iteration
- 2-D points, floating-point coordinates
- One data point per line of the input file
- 2 clusters
- Initial cluster centroids provided by coordinates, one centroid per line

Challenges:
- Need a data structure to abstract a 2-D point
- Need to have access to the cluster centroids on all mapper tasks
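To make the target of the MapReduce job concrete, here is a minimal in-memory sketch of the single iteration described above, written in plain Java without Hadoop. The class name KMeansOneIteration, the sample points, and the rule that an empty cluster keeps its old centroid are assumptions made for illustration; they do not come from the slides.

```java
import java.util.Arrays;

class KMeansOneIteration {
    /** Index of the centroid closest to (x, y), using squared Euclidean distance. */
    static int nearest(float x, float y, float[][] centroids) {
        int winner = -1;
        float min = Float.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            float dx = x - centroids[i][0];
            float dy = y - centroids[i][1];
            float d = dx * dx + dy * dy;
            if (d < min) {
                min = d;
                winner = i;
            }
        }
        return winner;
    }

    /** One assignment step followed by one centroid update. */
    static float[][] iterate(float[][] points, float[][] centroids) {
        float[][] sum = new float[centroids.length][2];
        int[] count = new int[centroids.length];
        for (float[] p : points) {
            int c = nearest(p[0], p[1], centroids);
            sum[c][0] += p[0];
            sum[c][1] += p[1];
            count[c]++;
        }
        float[][] next = new float[centroids.length][2];
        for (int c = 0; c < centroids.length; c++) {
            if (count[c] == 0) {
                next[c] = centroids[c].clone(); // assumption: empty cluster keeps its centroid
            } else {
                next[c][0] = sum[c][0] / count[c];
                next[c][1] = sum[c][1] / count[c];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        float[][] points = {{1f, 1f}, {2f, 1f}, {8f, 9f}, {9f, 8f}};
        float[][] centroids = {{0f, 0f}, {10f, 10f}};
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

In the MapReduce version, the body of nearest corresponds to the map step and the averaging loop corresponds to the reduce step.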

Creating your own Writable Datatype

- Writable is Hadoop's serialization mechanism for writing data in and out of the network, databases, or files
  - Optimized for network serialization
  - A set of basic types is provided
  - Easy to implement your own
- A custom type implements the Writable interface (org.apache.hadoop.io package)
  - Plugs into the framework's serialization mechanisms
  - Defines how to read and write fields
- To define a writable type to be used as a key
  - Keys must implement the WritableComparable interface
  - WritableComparable<T> extends Writable and java.lang.Comparable<T>
  - Required because keys are sorted prior to the reduce phase
- Hadoop is shipped with many default implementations of WritableComparable<T>
  - Wrappers for primitives (String, Integer, etc.)
  - In fact, all primitive writables mentioned in the last lecture are WritableComparable
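The contract described above (write the fields, read them back in the same order, and provide an ordering for keys) can be tried out without a Hadoop installation. The following is only a sketch: PointKey is a hypothetical class that mirrors the shape of WritableComparable using java.io alone, and the choice to order by x and then by y is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical key type; same method shapes as WritableComparable, no Hadoop classes. */
class PointKey implements Comparable<PointKey> {
    float x, y;

    PointKey() { } // no-arg constructor: an empty object is filled in by readFields

    PointKey(float x, float y) {
        this.x = x;
        this.y = y;
    }

    void write(DataOutput out) throws IOException {
        out.writeFloat(x);
        out.writeFloat(y);
    }

    void readFields(DataInput in) throws IOException {
        x = in.readFloat(); // fields must be read in the order they were written
        y = in.readFloat();
    }

    @Override
    public int compareTo(PointKey other) {
        int c = Float.compare(x, other.x); // assumed ordering: by x, then by y
        return c != 0 ? c : Float.compare(y, other.y);
    }

    /** Serializes p to a byte array and deserializes it into a fresh object. */
    static PointKey roundTrip(PointKey p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointKey copy = new PointKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real Hadoop key would declare "implements WritableComparable<PointKey>" and make write and readFields public; the method bodies would look the same.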

Implement your own WritableComparable

- Implement 3 methods
  - write(DataOutput): serialize the content
  - readFields(DataInput): deserialize the content
  - compareTo(T): has to return a negative number, zero, or a positive number when comparing two elements to each other
- If your custom object is used as the key, it will be sorted prior to the reduce phase
- compareTo is not necessary for plain Writable values

    package kmeansdemo;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Writable;

    public class TwoDPointWritable implements Writable {
        private FloatWritable x, y;

        public TwoDPointWritable() {
            this.x = new FloatWritable();
            this.y = new FloatWritable();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            x.write(out);
            y.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x.readFields(in);
            y.readFields(in);
        }

        // class continues on the next slide

        // TwoDPointWritable, continued
        public void set(float a, float b) {
            this.x.set(a);
            this.y.set(b);
        }

        public FloatWritable getx() {
            return x;
        }

        public FloatWritable gety() {
            return y;
        }
    }

InputFormat

- Specification for reading data
- Creates input splits
  - Breaks up work into chunks by calling InputFormat.getSplits
- Specifies how to read each split
  - For each Mapper instance a reader is retrieved by InputFormat.createRecordReader
  - Takes an InputSplit instance as a parameter
  - The RecordReader generates key-value pairs
  - The map() method is called for each key-value pair
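As an illustration of "RecordReader generates key-value pairs", the sketch below turns the content of a text file into (byte offset, line) records, which is the key/value shape Hadoop's line-based reader produces. It is plain Java, not Hadoop code; the class name LineRecords is hypothetical, and the sketch assumes UTF-8 text with '\n' line endings and ignores split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LineRecords {
    /** Returns one (byte offset of line start, line text) record per input line. */
    static List<Map.Entry<Long, String>> records(String fileContent) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        if (fileContent.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add(new SimpleEntry<Long, String>(offset, line));
            // advance past the line's bytes plus the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // In a real job, the framework would call map(key, value, context) once per record.
        for (Map.Entry<Long, String> r : records("1.0 1.0\n2.0 1.0\n8.0 9.0\n")) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
    }
}
```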

Predefined FileInputFormats

- The Hadoop ecosystem is packaged with many InputFormats
  - TextInputFormat
  - NLineInputFormat
  - DBInputFormat
  - TableInputFormat (HBase)
  - StreamInputFormat
  - SequenceFileInputFormat
- Configure on a Job object

    job.setInputFormatClass(XXXInputFormat.class);

If you want to use your own writable as input to the mapper

    public class TwoDPointFileInputFormat
            extends FileInputFormat<LongWritable, TwoDPointWritable> {

        @Override
        public RecordReader<LongWritable, TwoDPointWritable> createRecordReader(
                InputSplit arg0, TaskAttemptContext arg1)
                throws IOException, InterruptedException {
            return new TwoDPointFileRecordReader();
        }
    }

- Implements the createRecordReader method
- Returns a RecordReader that produces key/value pairs (here LongWritable, TwoDPointWritable)

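    public class TwoDPointFileRecordReader
            extends RecordReader<LongWritable, TwoDPointWritable> {

        LineRecordReader lineReader;
        TwoDPointWritable value;

        @Override
        public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
                throws IOException, InterruptedException {
            lineReader = new LineRecordReader();
            lineReader.initialize(inputSplit, attempt);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // Each input line holds "x y"
            Scanner reader = new Scanner(
                    new StringReader(lineReader.getCurrentValue().toString()));
            float x = reader.nextFloat();
            float y = reader.nextFloat();
            value = new TwoDPointWritable();
            value.set(x, y);
            return true;
        }

        // The remaining abstract methods of RecordReader are not shown on the
        // slide; a straightforward version delegates to the line reader.
        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return lineReader.getCurrentKey();
        }

        @Override
        public TwoDPointWritable getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }

Mapper

    public static class KmeansMapper
            extends Mapper<LongWritable, TwoDPointWritable, IntWritable, TwoDPointWritable> {

        // float[][] centroids is loaded in setup(); see the Distributed Cache slides

        @Override
        public void map(LongWritable key, TwoDPointWritable value, Context context)
                throws IOException, InterruptedException {
            float distance = 0.0f;
            // The initial value was lost in the source; Float.MAX_VALUE is assumed here
            float mindistance = Float.MAX_VALUE;
            int winnercentroid = -1;
            float x = value.getx().get();
            float y = value.gety().get();

            // Squared Euclidean distance to each of the 2 centroids
            for (int i = 0; i < 2; i++) {
                distance = (x - centroids[i][0]) * (x - centroids[i][0])
                         + (y - centroids[i][1]) * (y - centroids[i][1]);
                if (distance < mindistance) {
                    mindistance = distance;
                    winnercentroid = i;
                }
            }

            // Emit (index of closest centroid, point)
            context.write(new IntWritable(winnercentroid), value);
            System.out.printf("Map: Centroid = %d distance = %f\n",
                    winnercentroid, mindistance);
        }
    }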
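Between the mapper and the reducer, the framework groups the emitted pairs by key and sorts the keys, which is why keys must be comparable. The following plain-Java sketch imitates that grouping for (cluster index, point) pairs. ShuffleSketch is a hypothetical name, and a TreeMap stands in for the framework's distributed sort; this is not how Hadoop implements the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    /**
     * Groups points by the cluster index emitted for them.
     * clusterIds[i] is the key emitted for points[i]; the TreeMap keeps keys sorted.
     */
    static Map<Integer, List<float[]>> group(int[] clusterIds, float[][] points) {
        Map<Integer, List<float[]>> grouped = new TreeMap<>();
        for (int i = 0; i < points.length; i++) {
            grouped.computeIfAbsent(clusterIds[i], k -> new ArrayList<>()).add(points[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        int[] ids = {1, 0, 1};
        float[][] pts = {{8f, 9f}, {1f, 1f}, {9f, 8f}};
        // Each map entry corresponds to one reduce(key, values, context) call.
        group(ids, pts).forEach((id, list) ->
                System.out.println("cluster " + id + ": " + list.size() + " point(s)"));
    }
}
```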

Reducer

    public static class KmeansReducer
            extends Reducer<IntWritable, TwoDPointWritable, IntWritable, Text> {

        @Override
        public void reduce(IntWritable clusterid, Iterable<TwoDPointWritable> points,
                           Context context) throws IOException, InterruptedException {
            int num = 0;
            float centerx = 0.0f, centery = 0.0f;

            // Sum up the coordinates of all points assigned to this cluster
            for (TwoDPointWritable point : points) {
                num++;
                centerx += point.getx().get();
                centery += point.gety().get();
            }

            // The new centroid is the mean of the assigned points
            centerx = centerx / num;
            centery = centery / num;

            String preres = String.format("%f %f", centerx, centery);
            Text result = new Text(preres);
            context.write(clusterid, result);
        }
    }

Output Value type

- In our Reducer: Text
- Could be a variant of TwoDPointWritable as well
  - Would have to implement an OutputFormat class
  - Would have to provide a RecordWriter, similarly to the input side
- Could we have used Text for input as well?
  - Yes, as long as the split is done on a per-line basis; the mapper then parses the text itself, as is done for the cluster centroids
  - We would still need the TwoDPointWritable abstraction for the intermediate output
  - Otherwise floats are constantly written to and parsed from strings, which is slow
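The cost of passing points around as text is not only speed. The small sketch below formats a point the way the reducer does and parses it back; with "%f" only six decimal places survive, so the round trip can change the value. TextRoundTrip is a hypothetical helper, and pinning Locale.US is an assumption added here so that the decimal separator does not depend on the machine's default locale.

```java
import java.util.Locale;

class TextRoundTrip {
    /** Formats a point as "x y" with six decimal places, independent of the default locale. */
    static String format(float x, float y) {
        return String.format(Locale.US, "%f %f", x, y);
    }

    /** Parses a string produced by format back into {x, y}. */
    static float[] parse(String s) {
        String[] parts = s.trim().split("\\s+");
        return new float[] {Float.parseFloat(parts[0]), Float.parseFloat(parts[1])};
    }

    public static void main(String[] args) {
        float original = 0.1234567f;
        float restored = parse(format(original, 0f))[0];
        System.out.println(format(original, 0f));
        System.out.println("unchanged after round trip: " + (original == restored));
    }
}
```

A binary Writable such as TwoDPointWritable avoids both the parsing work and this loss of precision for intermediate data.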

Distributed Cache

- A mechanism to distribute files
- Makes them available to MapReduce task code
- Files have to be in HDFS; it does not work for local file systems
- The yarn command provides several options to add distributed files
- Can also use the Java API directly
- Supports
  - Simple text files
  - Jars
  - Archives: zip, tar, tgz/tar.gz

Distributed Cache

- Prior to task execution these files are copied locally from HDFS
- Files now reside on a local disk (local cache)
- Locally cached files become eligible for deletion after all tasks using the cache complete
- Files in the local cache are deleted after a 10 GB threshold is reached

Simple life-cycle of Map and Reduce

- The framework first calls setup(Context)
- For each key/value pair in the split: map(Key, Value, Context)
- Finally cleanup(Context) is called
- Distributed Cache operations are implemented in the setup method, e.g. for our kmeans Mapper

    public static class KmeansMapper
            extends Mapper<LongWritable, TwoDPointWritable, IntWritable, TwoDPointWritable> {

        public final static String centerfile = "centers.txt";
        public float[][] centroids = new float[2][2];

        @Override
        public void setup(Context context) throws IOException {
            // Each line of centers.txt: <centroid index> <x> <y>
            try (Scanner reader = new Scanner(new FileReader(centerfile))) {
                for (int i = 0; i < 2; i++) {
                    int pos = reader.nextInt();
                    centroids[pos][0] = reader.nextFloat();
                    centroids[pos][1] = reader.nextFloat();
                }
            }
        }

        @Override
        public void map(LongWritable key, TwoDPointWritable value, Context context)
                throws IOException, InterruptedException {
            // ...
        }
    }
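The parsing done in setup can be checked in isolation. The sketch below reads the same "index x y" line format from any Reader, so it runs against a string instead of the cached file. CentersLoader is a hypothetical name, and calling useLocale(Locale.US) is an assumption added here because Scanner.nextFloat otherwise follows the default locale's decimal separator.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Locale;
import java.util.Scanner;

class CentersLoader {
    /** Reads k lines of the form "<centroid index> <x> <y>" into a k-by-2 array. */
    static float[][] load(Reader source, int k) {
        float[][] centroids = new float[k][2];
        try (Scanner in = new Scanner(source)) {
            in.useLocale(Locale.US); // assumption: coordinates use '.' as decimal separator
            for (int i = 0; i < k; i++) {
                int pos = in.nextInt();
                centroids[pos][0] = in.nextFloat();
                centroids[pos][1] = in.nextFloat();
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        float[][] c = load(new StringReader("0 1.0 1.0\n1 5.0 5.5\n"), 2);
        System.out.println(c[1][0] + " " + c[1][1]);
    }
}
```

Inside the mapper, the same method could be called with a FileReader on the symlinked centers.txt.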

Putting our main file together

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // new Job(conf, name) is deprecated in Hadoop 2; Job.getInstance(conf, "kmeans") replaces it
        Job job = new Job(conf, "kmeans");

        // Distribute the centroid file (must be in HDFS) to all tasks
        Path toCache = new Path("/centers/centers.txt");
        job.addCacheFile(toCache.toUri());
        job.createSymlink();

        job.setJarByClass(kmeans.class); // driver class, name as given in the source
        job.setMapperClass(KmeansMapper.class);
        job.setReducerClass(KmeansReducer.class);

        job.setInputFormatClass(TwoDPointFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(TwoDPointWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

What else is there

- Context provides the possibility to store and retrieve counters
  - Could be used to store the iteration id for the kmeans algorithm and read e.g. centers.txt.<iteration id>
- Job dependencies using the JobControl class
  - Create simple workflows
  - Represents a graph of Jobs to run
  - Specify dependencies in code
- Each map step should really handle more than one data point
  - Would need to create a Vector of TwoDPointWritables

Workflow with JobControl

- Create JobControl
  - Will need to execute within a Thread
- For each Job in the workflow, construct a ControlledJob
  - Wrapper for a Job instance
  - Constructor takes in dependent jobs
- Add each ControlledJob to JobControl
- Execute JobControl in a Thread
  - Recall JobControl implements Runnable
- Wait for JobControl to complete and report results
  - Clean up in case of a failure

Summary

- Resource for all Hadoop APIs and Classes: reduce/
