Naïve Bayes and Hadoop. Shannon Quinn





http://xkcd.com/ngram-charts/

Coupled Temporal Scoping of Relational Facts. P.P. Talukdar, D.T. Wijaya and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2012.
Understanding Semantic Change of Words Over Centuries. D.T. Wijaya and R. Yeniterzi. In Workshop on Detecting and Exploiting Cultural Diversity on the Social Web (DETECT) at CIKM, 2011.

Some of A Joint Distribution (columns A B C D E are the words of a sequence, p is its probability):
  is the effect of the              0.00036
  is the effect of a                0.00034
  The effect of this                0.00034
  to this effect :                  0.00034
  be the effect of the
  not the effect of any             0.00024
  does not affect the general       0.00020
  does not affect the question      0.00020
  any manner affect the principle   0.00018

Flashback. Size of the joint table: 2^15 = 32,768 rows (if all attributes are binary), so an average of only 1.5 examples per row. Actual m = 1,974,927,360 (if continuous attributes are binarized). P(E) = Σ over rows matching E of P(row). Dataset abstract: predict whether income exceeds $50K/yr based on census data; also known as the "Census Income" dataset [Kohavi, 1996]. Number of instances: 48,842. Number of attributes: 14 (in UCI's copy of the dataset) + 1; 3 (here).
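To see the P(E) formula in action on the sample rows above (only an illustration, since the table shows a handful of rows): the visible rows matching the event A=does are "does not affect the general" and "does not affect the question", so summing those rows gives P(A=does) ≥ 0.00020 + 0.00020 = 0.00040, a lower bound because unlisted rows may also match.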

Naïve Density Estimation. The problem with the Joint Estimator is that it just mirrors the training data (and is *hard* to compute). We need something which generalizes more usefully. The naïve model generalizes strongly: assume that each attribute is distributed independently of any of the other attributes. (Copyright Andrew W. Moore)

Using the Naïve Distribution. Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? (Copyright Andrew W. Moore)

Using the Naïve Distribution. Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? It is P(A) P(~B) P(C) P(~D). (Copyright Andrew W. Moore)
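A quick worked example with made-up numbers: if P(A) = 0.3, P(B) = 0.4, P(C) = 0.7 and P(D) = 0.1 (hypothetical values), then P(A ^ ~B ^ C ^ ~D) = P(A) P(~B) P(C) P(~D) = 0.3 × 0.6 × 0.7 × 0.9 = 0.1134.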

Naïve Distribution General Case. Suppose X_1, X_2, ..., X_d are independently distributed. Then Pr(X_1 = x_1, ..., X_d = x_d) = Pr(X_1 = x_1) × ... × Pr(X_d = x_d). So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this? (Copyright Andrew W. Moore)

Learning a Naïve Density Estimator.
MLE: P(X_i = x_i) = (#records with X_i = x_i) / (#records)
Dirichlet (MAP): P(X_i = x_i) = (#records with X_i = x_i + mq) / (#records + m)
Another trivial learning algorithm! (Copyright Andrew W. Moore)
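A minimal sketch of both estimators in Python (not from the slides); it assumes each record is a dict mapping attribute name to a discrete value, and it uses the prior weight m and a uniform q = 1/|dom(X_i)| as later in the lecture:

```python
from collections import Counter

def naive_density_estimator(records, m=1.0):
    """Estimate P(X_i = x_i) independently for each attribute.

    Returns a function prob(attr, value, map_estimate) giving the MLE
    (map_estimate=False) or the Dirichlet-smoothed MAP estimate.
    """
    n = len(records)
    counts = {}  # counts[attr][value] = #records with X_i = x_i
    for r in records:
        for attr, value in r.items():
            counts.setdefault(attr, Counter())[value] += 1

    def prob(attr, value, map_estimate=False):
        c = counts[attr][value]
        if not map_estimate:
            return c / n                      # MLE: #records with X_i = x_i / #records
        q = 1.0 / len(counts[attr])           # uniform prior over the observed values
        return (c + m * q) / (n + m)          # Dirichlet (MAP)

    return prob

# Hypothetical usage: construct any row of the implied joint on demand.
# p = naive_density_estimator([{"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 1}])
# row_prob = p("A", 1) * p("B", 0)
```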

Independently Distributed Data. Review: A and B are independent if Pr(A,B) = Pr(A) Pr(B). Sometimes written A ⊥ B. A and B are conditionally independent given C if Pr(A,B | C) = Pr(A | C) * Pr(B | C). Written A ⊥ B | C. (Copyright Andrew W. Moore)

Bayes Classifiers. If we can do inference over Pr(X,Y), in particular compute Pr(X | Y) and Pr(Y), then we can compute Pr(Y | X_1, ..., X_d) = Pr(X_1, ..., X_d | Y) Pr(Y) / Pr(X_1, ..., X_d).

Can we make this interesting? Yes! Key ideas: Pick the class variable Y. Instead of estimating P(X_1, ..., X_n, Y) = P(X_1) * ... * P(X_n) * P(Y), estimate P(X_1, ..., X_n | Y) = P(X_1 | Y) * ... * P(X_n | Y). Or, assume P(X_i | Y) = P(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_n, Y). Or, that X_i is conditionally independent of every X_j, j ≠ i, given Y. How to estimate? MLE.

The Naïve Bayes classifier v1. Dataset: each example has a unique id (why? for debugging the feature extractor), d attributes X_1, ..., X_d, where each X_i takes a discrete value in dom(X_i), and one class label Y in dom(Y). You have a train dataset and a test dataset.

The Naïve Bayes classifier v1. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X_j=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute Pr(y', x_1, ..., x_d) = [ Π_{j=1..d} Pr(X_j=x_j | Y=y') ] × Pr(Y=y')
      = [ Π_{j=1..d} Pr(X_j=x_j, Y=y') / Pr(Y=y') ] × Pr(Y=y')
  Return the best y'

The Naïve Bayes classifier v1. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X_j=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute Pr(y', x_1, ..., x_d) = [ Π_{j=1..d} C(X_j=x_j ^ Y=y') / C(Y=y') ] × C(Y=y') / C(Y=ANY)
  Return the best y'
This will overfit, so ...

The Naïve Bayes classifier v1. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X_j=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute Pr(y', x_1, ..., x_d) = [ Π_{j=1..d} (C(X_j=x_j ^ Y=y') + mq_j) / (C(Y=y') + m) ] × (C(Y=y') + mq_y) / (C(Y=ANY) + m)
  Return the best y'
where q_j = 1/|dom(X_j)|, q_y = 1/|dom(Y)|, m = 1.
This will underflow, so ...

The Naïve Bayes classifier v1. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X_j=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute log Pr(y', x_1, ..., x_d) = Σ_{j=1..d} log[ (C(X_j=x_j ^ Y=y') + mq_j) / (C(Y=y') + m) ] + log[ (C(Y=y') + mq_y) / (C(Y=ANY) + m) ]
  Return the best y'
where q_j = 1/|dom(X_j)|, q_y = 1/|dom(Y)|, m = 1.
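A compact in-memory sketch of v1 in Python (an illustration, not the course's reference code); examples are assumed to be (id, y, [x_1, ..., x_d]) tuples, and the function names are mine:

```python
import math
from collections import defaultdict

def train_nb(train):
    """Event counting as on the slide: C('Y=ANY'), C('Y=y'), C('Y=y ^ Xj=xj')."""
    C = defaultdict(int)
    dom_x = defaultdict(set)          # observed dom(X_j) for each attribute j
    dom_y = set()
    for _id, y, xs in train:
        C[("Y", "ANY")] += 1
        C[("Y", y)] += 1
        dom_y.add(y)
        for j, xj in enumerate(xs):
            C[("Y", y, "X", j, xj)] += 1
            dom_x[j].add(xj)
    return C, dom_x, dom_y

def classify_nb(C, dom_x, dom_y, xs, m=1.0):
    """Return the y' maximizing the smoothed log Pr(y', x_1, ..., x_d)."""
    q_y = 1.0 / len(dom_y)
    best, best_score = None, -math.inf
    for y in dom_y:
        score = math.log((C[("Y", y)] + m * q_y) / (C[("Y", "ANY")] + m))
        for j, xj in enumerate(xs):
            q_j = 1.0 / len(dom_x[j])
            score += math.log((C[("Y", y, "X", j, xj)] + m * q_j) /
                              (C[("Y", y)] + m))
        if score > best_score:
            best, best_score = y, score
    return best
```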

The Naïve Bayes classifier v2. For text documents, what features do you use? One common choice: X_1 = first word in the document, X_2 = second word in the document, X_3 = third, X_4 = ... But Pr(X_13=hockey | Y=sports) is probably not that different from Pr(X_11=hockey | Y=sports), so instead of treating them as different variables, treat them as different copies of the same variable.


The Naïve Bayes classifier v2. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X_j=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute Pr(y', x_1, ..., x_d) = [ Π_{j=1..d} Pr(X_j=x_j | Y=y') ] × Pr(Y=y')
      = [ Π_{j=1..d} Pr(X_j=x_j, Y=y') / Pr(Y=y') ] × Pr(Y=y')
  Return the best y'

The Naïve Bayes classifier v2. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute Pr(y', x_1, ..., x_d) = [ Π_{j=1..d} Pr(X=x_j | Y=y') ] × Pr(Y=y')
      = [ Π_{j=1..d} Pr(X=x_j, Y=y') / Pr(Y=y') ] × Pr(Y=y')
  Return the best y'

The Naïve Bayes classifier v2. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X=x_j')++
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute log Pr(y', x_1, ..., x_d) = Σ_{j=1..d} log[ (C(X=x_j ^ Y=y') + mq_x) / (C(X=ANY ^ Y=y') + m) ] + log[ (C(Y=y') + mq_y) / (C(Y=ANY) + m) ]
  Return the best y'
where q_x = 1/|V|, q_y = 1/|dom(Y)|, m = 1.
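A short Python sketch of v2 for text (again only an illustration); documents are assumed to be (id, y, [word_1, ..., word_d]) tuples, and V is the vocabulary observed in training:

```python
import math
from collections import defaultdict

def train_nb_text(train):
    """Counts for v2: C(X=w ^ Y=y), C(X=ANY ^ Y=y), C(Y=y), and the vocabulary V."""
    word_counts = defaultdict(int)    # (y, w) -> C(X=w ^ Y=y)
    word_totals = defaultdict(int)    # y -> C(X=ANY ^ Y=y)
    class_counts = defaultdict(int)   # y -> C(Y=y)
    vocab = set()
    for _id, y, words in train:
        class_counts[y] += 1
        for w in words:
            word_counts[(y, w)] += 1
            word_totals[y] += 1
            vocab.add(w)
    return word_counts, word_totals, class_counts, vocab

def classify_nb_text(word_counts, word_totals, class_counts, vocab, words, m=1.0):
    """Score each class with the smoothed log Pr(y', words) and return the best y'."""
    n = sum(class_counts.values())            # C(Y=ANY)
    q_x, q_y = 1.0 / len(vocab), 1.0 / len(class_counts)
    best, best_score = None, -math.inf
    for y in class_counts:
        score = math.log((class_counts[y] + m * q_y) / (n + m))
        for w in words:
            score += math.log((word_counts[(y, w)] + m * q_x) /
                              (word_totals[y] + m))
        if score > best_score:
            best, best_score = y, score
    return best
```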

Complexity of Naïve Bayes. You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example (id, y, x_1, ..., x_d) in train:
  C('Y=ANY')++; C('Y=y')++
  For j in 1..d: C('Y=y ^ X=x_j')++
Sequential reads. Complexity: O(n), n = size of train.
For each example (id, y, x_1, ..., x_d) in test:
  For each y' in dom(Y):
    Compute log Pr(y', x_1, ..., x_d) = Σ_{j=1..d} log[ (C(X=x_j ^ Y=y') + mq_x) / (C(X=ANY ^ Y=y') + m) ] + log[ (C(Y=y') + mq_y) / (C(Y=ANY) + m) ]
  Return the best y'
where q_x = 1/|V|, q_y = 1/|dom(Y)|, m = 1.
Sequential reads. Complexity: O(|dom(Y)| × n'), n' = size of test.

MapReduce!

Inspiration, not Plagiarism. This is not the first lecture ever on MapReduce. I borrowed from William Cohen, who borrowed from Alona Fyshe, and she borrowed from:
Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-computing/sigir-2009/lin-mapreduce-sigir2009.pdf
Google: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html and http://code.google.com/edu/submissions/mapreduce/listing.html
Cloudera: http://vimeo.com/3584536

Introduction to MapReduce. 3 main phases:
Map (send each input record to a key)
Sort (put all of one key in the same place); handled behind the scenes in Hadoop
Reduce (operate on each key and its set of values)
The terms come from functional programming:
map(lambda x: x.upper(), ["shannon", "p", "quinn"]) → ['SHANNON', 'P', 'QUINN']
reduce(lambda x, y: x + "-" + y, ["shannon", "p", "quinn"]) → "shannon-p-quinn"
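The same two calls in Python 3, where map returns an iterator and reduce has moved into functools; a quick runnable sketch:

```python
from functools import reduce   # in Python 3, reduce lives in functools

words = ["shannon", "p", "quinn"]
print(list(map(lambda x: x.upper(), words)))    # ['SHANNON', 'P', 'QUINN']
print(reduce(lambda x, y: x + "-" + y, words))  # shannon-p-quinn
```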

MapReduce overview: Map → Shuffle/Sort → Reduce.

MapReduce in slow motion. Canonical example: Word Count, run over a small example corpus.
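To make the word-count example concrete, here is a minimal mapper/reducer pair in the Hadoop Streaming style (Python reading stdin and writing tab-separated key/value lines); the file names mapper.py and reducer.py and the sample pipeline below are illustrative, not part of the lecture:

```python
# mapper.py: emit ("word", 1) for every token in the input split
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py: input arrives sorted by key, so counts for one word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, the shuffle can be simulated with: cat corpus.txt | python mapper.py | sort | python reducer.py (corpus.txt being any text file); on a cluster the same two scripts would be handed to Hadoop Streaming.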

Other input formats: KeyValueInputFormat, SequenceFileInputFormat.

Is any part of this wasteful? Remember: moving data around and writing to/reading from disk are very expensive operations. No reducer can start until all mappers are done and the data in its partition has been sorted.

Common pitfalls: You have no control over the order in which reduces are performed. You have no control over the order in which you encounter reduce values (sort of; more on this later). The only ordering you should assume is that reducers always start after mappers.

Common pitfalls: You should assume your Maps and Reduces will be taking place on different machines with different memory spaces. Don't make a static variable and assume that other processes can read it. They can't. It may appear that they can when run locally, but they can't. No really, don't do this.

Common pitfalls: Do not communicate between mappers or between reducers. The overhead is high, you don't know which mappers/reducers are actually running at any given point, and there's no easy way to find out what machine they're running on, because you shouldn't be looking for them anyway.

Assignment 1: Naïve Bayes on Hadoop. Document classification. Due Thursday, Jan 29 by 11:59:59pm.

Other reminders: If you haven't yet signed up for a student presentation, do so! I will be assigning slots starting Friday at 5pm, so sign up before then. You need to be on BitBucket to do this. Mailing list for questions. Office hours: Mondays 9-10:30am.