Hadoop Design and k-means Clustering




Hadoop Design and k-means Clustering
Kenneth Heafield, Google Inc
January 15, 2008

Example code from Hadoop 0.13.1, used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Outline

1 Fault Tolerance
2 Data Flow
    Input
    Output
3 MapTask
    Map
    Partition
4 ReduceTask
    Fetch and Sort
    Reduce

Later in this talk: Performance and k-means Clustering

Fault Tolerance: Managing Tasks

The JobTracker hands work to TaskTrackers; each TaskTracker runs MapTasks and ReduceTasks.

Design
    TaskTracker reports status or requests work every 10 seconds
    MapTask and ReduceTask report progress every 10 seconds

Issues
    + Detects failures and slow workers quickly
    - JobTracker is a single point of failure

Fault Tolerance: Coping With Failure

Failed Tasks
    Rerun map and reduce tasks as necessary.

Slow Tasks
    Start a second, backup instance of the same task.

Consistency
    Any MapTask or ReduceTask might be run multiple times, so Map and Reduce should be functional (deterministic and free of side effects).

Fault Tolerance: Use of Random Numbers

Purpose
    Support randomized algorithms while remaining consistent across reruns.

Sampling Mapper

    private Random rand = new Random();

    public void configure(JobConf conf) {
      // Seed from the task partition so a rerun of the same task
      // draws the same samples.
      rand.setSeed((long) conf.getInt("mapred.task.partition", 0));
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      if (rand.nextFloat() < 0.1f) {  // keep roughly 10% of records
        output.collect(key, value);
      }
    }

Data Flow

HDFS Input
    InputFormat splits and reads files
Mapper
Local Output
    SequenceFileOutputFormat writes serialized values
HTTP Input
    Map outputs are retrieved over HTTP and merged
Reduce
HDFS Output
    OutputFormat writes a SequenceFile or text

Data Flow / Input: InputSplit

Purpose
    Locate a single map task's input.

Important Functions
    Path FileSplit.getPath();

Implementations
    MultiFileSplit is a list of small files to be concatenated.
    FileSplit is a file path, offset, and length.
    TableSplit is a table name, start row, and end row.

Data Flow / Input: RecordReader

Purpose
    Parse the input specified by an InputSplit into keys and values. Handle records that cross split boundaries.

Important Functions
    boolean next(Writable key, Writable value);

Implementations
    LineRecordReader reads lines. Key is an offset, value is the text.
    KeyValueLineRecordReader reads delimited key-value pairs.
    SequenceFileRecordReader reads a SequenceFile, Hadoop's binary representation of key-value pairs.

Data Flow / Input: InputFormat

Purpose
    Specifies the input file format by constructing the InputSplit and RecordReader.

Important Functions
    RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter);
    InputSplit[] getSplits(JobConf job, int numSplits);

Implementations
    TextInputFormat reads text files.
    TableInputFormat reads from a table.

Data Flow / Output: OutputFormat

Purpose
    Machine- or human-readable output. Makes a RecordWriter, which is analogous to RecordReader.

Important Functions
    RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress);

Formats
    SequenceFileOutputFormat writes a binary SequenceFile.
    TextOutputFormat writes text files.

MapTask: Default Setup

    InputFormat       Splits files and reads records
    MapRunnable       Maps all records in the task
    Mapper            Maps a single record
    OutputCollector   Consults the Partitioner and saves files
    Partitioner       Assigns key-value pairs to reducers

Reducers retrieve their files over HTTP.

MapTask / Map: MapRunnable

Purpose
    Run the sequence of map operations.

Default Implementation

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        WritableComparable key = input.createKey();
        Writable value = input.createValue();
        while (input.next(key, value)) {
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }

MapTask / Map: Mapper

Purpose
    A single map operation.

Important Functions
    void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter);

Pre-defined Mappers
    IdentityMapper
    InverseMapper flips key and value.
    RegexMapper matches regular expressions set in the job.
    TokenCountMapper implements the word count map.

MapTask / Partition: Partitioner

Purpose
    Decide which reducer handles each map output.

Important Functions
    int getPartition(WritableComparable key, Writable value, int numReduceTasks);

Implementations
    HashPartitioner uses key.hashCode() % numReduceTasks.
    KeyFieldBasedPartitioner hashes only part of the key.
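The hashing scheme above can be sketched in plain Java. This is a minimal sketch, not Hadoop's source; one detail worth noting is that hashCode() can be negative, so masking the sign bit keeps the partition index in range.

```java
// Minimal sketch of hash partitioning (illustrative, not Hadoop's source).
// The sign-bit mask keeps the result in [0, numReduceTasks) even when
// key.hashCode() is negative.
public class HashPartitionSketch {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, every occurrence of a key lands on the same reducer, which is what makes the later sort-and-reduce step correct.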

ReduceTask: Fetch and Sort

Fetch
    The TaskTracker tells the Reducer where mappers are.
    The Reducer requests input files from mappers via HTTP.

Merge Sort
    Recursively merges 10 files at a time.
    100 MB in-memory sort buffer.
    Calls the key's Comparator, which defaults to key.compareTo.

Important Functions
    int WritableComparable.compareTo(Object o);
    int WritableComparator.compare(WritableComparable a, WritableComparable b);
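The merge step above can be sketched with a priority queue. This is an illustrative, in-memory version over sorted lists; the real ReduceTask merges on-disk spill files 10 at a time, but the heap-based logic is the same.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a k-way merge of sorted runs (illustrative; the real
// ReduceTask merges spill files, not in-memory lists).
public class MergeSketch {
    public static List<String> merge(List<List<String>> sortedRuns) {
        // Heap entry: { next value from a run, that run's iterator }
        PriorityQueue<Object[]> heap = new PriorityQueue<>(
            (a, b) -> ((String) a[0]).compareTo((String) b[0]));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new Object[] { it.next(), it });
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] top = heap.poll();
            out.add((String) top[0]);  // smallest remaining value
            @SuppressWarnings("unchecked")
            Iterator<String> it = (Iterator<String>) top[1];
            if (it.hasNext()) heap.add(new Object[] { it.next(), it });
        }
        return out;
    }
}
```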

ReduceTask: Reduce

Important Functions
    void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter);

Pre-defined Reducers
    IdentityReducer
    LongSumReducer sums LongWritable values.

Behavior
    Reduce cannot start until all Mappers finish and their output is merged.

Using Hadoop

5 Performance
    Combiners
6 k-means Clustering
    Algorithm
    Implementation

Performance: Why We Care

    10,000 distinct programs
    Average of 100,000 jobs/day
    20 petabytes processed per day

Source: Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51 (2008), 107-113.

Performance: Barriers

Concept
    A barrier waits for N things to happen.

Examples
    Reduce waits for all Mappers to finish.
    The job waits for all Reducers to finish.
    A search engine assembles the pieces of its results.

Moral
    Worry about the maximum time. This implies balance.

Performance / Combiners: Combiner

Purpose
    Lessen network traffic by combining repeated keys within a MapTask.

Important Functions
    void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter);

Example Implementation
    LongSumReducer adds LongWritable values.

Behavior
    The framework decides when to call it. Uses the Reducer interface, but is called with a partial list of values.
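What a combiner buys can be seen with word count: instead of shipping one (word, 1) pair per occurrence, the map side sums repeated keys locally and ships one (word, count) pair. A minimal sketch of that local combining step, outside of Hadoop:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of map-side combining for word count (illustrative only):
// repeated keys are summed locally before anything crosses the network.
public class CombineSketch {
    public static Map<String, Long> combine(String[] words) {
        Map<String, Long> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1L, Long::sum);  // (w, 1) pairs collapse to (w, n)
        }
        return counts;
    }
}
```

For a map task that sees the word "the" 10,000 times, the combiner reduces 10,000 shuffled pairs to one, which is exactly the traffic saving the slide describes.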

Performance / Combiners: Extended Combining

Problem
    1000 map outputs are buffered before combining.
    Keys can still be repeated enough to unbalance a reduce.

Two Phase Reduce
    1 Run a MapReduce to combine values
        Use the Partitioner to balance a key over Reducers
        Run the Combiner in both Mapper and Reducer
    2 Run a MapReduce to reduce values
        Map with IdentityMapper
        Partition normally
        Reduce normally

Performance: General Advice

Small Work Units
    More inputs than Mappers.
    Ideally, more reduce tasks than Reducers.
    Too many tasks increases overhead.
    Aim for constant-memory Mappers and Reducers.

Map Only
    Skip the IdentityReducer by setting numReduceTasks to -1.

Outside Tables
    Increase HDFS replication before launching.
    Keep random-access tables in memory.
    Use multithreading to share memory.

k-means Clustering: Data

Netflix data

Goal
    Find similar movies from ratings provided by users.

Vector Model
    Give each movie a vector.
    Make one dimension per user.
    Put the origin at the average rating (so a poor rating is negative).
    Normalize all vectors to unit length.
    This is often called cosine similarity.

Issues
    - Users are biased in the movies they rate
    + Addresses different numbers of raters
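The vector model above can be sketched in a few lines: center each rating at the average, normalize to unit length, and similarity between two movies is then just a dot product. This is a minimal illustration of the math, not the talk's Hadoop code.

```java
// Sketch of the cosine-similarity vector model (illustrative only).
public class CosineSketch {
    // Center ratings at the average and scale the vector to unit length.
    public static double[] normalize(double[] ratings, double average) {
        double[] v = new double[ratings.length];
        double norm = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            v[i] = ratings[i] - average;  // poor ratings become negative
            norm += v[i] * v[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < v.length; i++) v[i] /= norm;
        return v;
    }

    // Dot product of two unit vectors = cosine of the angle between them.
    public static double similarity(double[] a, double[] b) {
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return dot;
    }
}
```

Because every vector has unit length, a movie rated by few users and one rated by millions are directly comparable, which is the "addresses different numbers of raters" point above.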

k-means Clustering: Algorithm

Goal
    Cluster similar data points (shown in the talk as two-dimensional clusters).

Approach
    Given data points x[i] and a distance d:
        Select k centers c.
        Assign each x[i] to its closest center c[i].
        Minimize the total cost, the sum over i of d(x[i], c[i]), where d is the sum of squared differences.

k-means Clustering: Lloyd's Algorithm

Algorithm
    1 Randomly pick centers, possibly from the data points.
    2 Assign each point to its closest center.
    3 Average the assigned points to obtain new centers.
    4 Repeat steps 2 and 3 until nothing changes.

Issues
    - Takes superpolynomial time on some inputs
    - Not guaranteed to find an optimal solution
    + Converges quickly in practice
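Steps 2 and 3 above are the two halves of each Lloyd iteration, and they are exactly what the map and reduce phases will compute later in the talk. A minimal sketch on one-dimensional points (illustrative; the real input is high-dimensional movie vectors):

```java
// One Lloyd iteration on 1-D points (illustrative sketch; the talk runs
// these two steps as the map and reduce phases of a MapReduce job).
public class LloydSketch {
    // Step 2 (the map side): index of the closest center for each point.
    public static int[] assign(double[] points, double[] centers) {
        int[] owner = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centers.length; c++) {
                double diff = points[i] - centers[c];
                double d = diff * diff;  // squared distance
                if (d < best) { best = d; owner[i] = c; }
            }
        }
        return owner;
    }

    // Step 3 (the reduce side): each center moves to the mean of its points.
    public static double[] recenter(double[] points, int[] owner, int k) {
        double[] sum = new double[k];
        int[] count = new int[k];
        for (int i = 0; i < points.length; i++) {
            sum[owner[i]] += points[i];
            count[owner[i]]++;
        }
        for (int c = 0; c < k; c++) {
            if (count[c] > 0) sum[c] /= count[c];  // empty clusters stay at 0
        }
        return sum;
    }
}
```

Iterating assign/recenter until the owners stop changing is the whole algorithm; each pass can only lower the total squared distance, which is why it converges.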

k-means Clustering / Implementation: Lloyd's Algorithm in MapReduce

Reformatting Data
    Create a SequenceFile for fast reading. Partition as you see fit.
Initialization
    Use a seeded random number generator to pick the initial centers.
Iteration
    Load the centers table in MapRunnable or Mapper.
Termination
    Use TextOutputFormat to list the movies in each cluster.

k-means Clustering / Implementation: Iterative MapReduce

Each iteration is one MapReduce job over the points and centers version i: the Mapper finds the nearest center for each movie (key is the center, value is the movie), and the Reducer averages the ratings to produce centers version i + 1, which feed the next iteration.

k-means Clustering / Implementation: Direct Implementation

Mapper
    Loads all centers into RAM from HDFS.
    For each movie, measures the distance to each center.
    Outputs a key identifying the closest center.
Reducer
    Outputs the average ratings of the assigned movies.
Issues
    - Brute-force distance computation, with all centers in memory
    - Unbalanced reduce, possibly even for large k

k-means Clustering / Implementation: Two Phase Reduce

Implementation
    1 Combine
        Mapper key identifies the closest center; value is the point.
        Partitioner balances centers over reducers.
        Combiner and Reducer add and count points.
    2 Recenter
        IdentityMapper.
        Reducer averages the values.
Issues
    + Balanced reduce
    - Two phases
    - Mapper still has all k centers in memory

k-means Clustering / Implementation: Large k

Implementation
    Each map task is responsible for part of the movies and part of the k centers.
    For each movie, it finds the closest of the centers it knows.
    Output key is the point; value identifies the center and its distance.
    The Reducer takes the minimum-distance center; its output key identifies the center, value is the movie.
    A second phase averages the points in each center.
Issues
    + Handles large k while still fitting in RAM
    - Reads the data points multiple times
    - Startup and intermediate storage costs

Exercises

k-means
    Run on part of the Netflix data to cluster movies.
Canopies
    Read about and implement canopy clustering: http://www.kamalnigam.com/papers/canopy-kdd00.pdf