Random Forests and Ferns David Capel

The Multi-class Classification Problem
Training: labelled exemplars representing multiple classes (handwritten digits, planes, faces, cars, cats).
Classifying: to which class does this new example belong?
[Figure from Fergus, Zisserman and Perona: some sample images from the datasets. Note the large variation in scale in, for example, the cars (rear) database. Images are available from http://www.vision.caltech.edu/html-files/archive.html and http://www.robots.ox.ac.uk/~vgg/data/.]

The Multi-class Classification Problem
A classifier is a mapping H from feature vectors f to discrete class labels C:
f = (f_1, f_2, ..., f_N),  C ∈ {c_1, c_2, ..., c_K},  H : f → C
Numerous choices of feature space F are possible, e.g. raw pixel values, texton histograms, color histograms, oriented filter banks.

The Multi-class Classification Problem
We have a (large) database of labelled exemplars D_m = (f_m, c_m) for m = 1...M.
Problem: given such training data, learn the mapping H : f → C.
Even better: learn the posterior distribution over the class label conditioned on the features, P(C = c_k | f_1, f_2, ..., f_N), and obtain the classifier H as the mode of the posterior:
H(f) = argmax_k P(C = c_k | f_1, f_2, ..., f_N)

How do we represent and learn the mapping?
In Bishop's book, we saw numerous ways to build multi-class classifiers, e.g. direct, non-parametric learning of the class posterior (histograms), K-Nearest Neighbours, Fisher Linear Discriminants, Relevance Vector Machines, multi-class SVMs.
[Figure: a two-class toy problem (Class A vs. Class B) in a 2-d feature space (f_1, f_2), illustrating direct non-parametric and k-nearest neighbour classification.]

Binary Decision Trees
Decision trees classify features by a series of Yes/No questions.
At each node, the feature space is split according to the outcome of some binary decision criterion.
Each leaf is labelled with the class C corresponding to the features that reach it via that path through the tree.
[Figure: example tree. The root asks B_1(f)=true?; its No branch leads to B_2(f)=true? and its Yes branch to B_3(f)=true?; the four leaves are labelled c_1, c_2, c_1, c_1.]
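
The following is a minimal Python sketch (not the author's code) of this classification procedure: a hand-built tree of Yes/No tests is walked from the root to a leaf. The Node layout, the tests B1-B3 and the thresholds are hypothetical, chosen only to mirror the example tree above.

```python
class Node:
    """A tree node: internal nodes hold a Yes/No test, leaves hold a class label."""
    def __init__(self, test=None, no=None, yes=None, label=None):
        self.test = test        # callable f -> bool; None at a leaf
        self.no, self.yes = no, yes
        self.label = label      # class label; set only at leaves

def classify(node, f):
    """Follow the Yes/No answers down the tree until a leaf is reached."""
    while node.test is not None:
        node = node.yes if node.test(f) else node.no
    return node.label

# Hypothetical tree matching the figure: B1 at the root, B2 on its No branch,
# B3 on its Yes branch, leaves labelled c1, c2, c1, c1.
tree = Node(test=lambda f: f[0] > 0.5,                          # B1
            no=Node(test=lambda f: f[1] > 0.3,                  # B2
                    no=Node(label="c1"), yes=Node(label="c2")),
            yes=Node(test=lambda f: f[1] > 0.7,                 # B3
                     no=Node(label="c1"), yes=Node(label="c1")))

print(classify(tree, [0.2, 0.6]))   # B1: No, B2: Yes -> "c2"
```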


Binary Decision Trees: Training
To train, recursively partition the training data into subsets according to some Yes/No tests on the feature vectors.
Partitioning continues until each subset contains features of a single class.
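
Below is a small sketch of this recursive partitioning, assuming a hypothetical choose_test() helper that picks a Yes/No test for the current subset (split selection is discussed on the following slides); it illustrates the idea rather than the author's implementation.

```python
from collections import Counter

def build_tree(features, labels, choose_test):
    """Recursively partition (features, labels) until each subset is pure."""
    if len(set(labels)) == 1:
        return {"label": labels[0]}                     # pure subset: make a leaf
    test = choose_test(features, labels)                # pick a Yes/No test B(f)
    yes = [i for i, f in enumerate(features) if test(f)]
    no = [i for i, f in enumerate(features) if not test(f)]
    if not yes or not no:                               # degenerate split: stop early
        return {"label": Counter(labels).most_common(1)[0][0]}
    return {"test": test,
            "yes": build_tree([features[i] for i in yes],
                              [labels[i] for i in yes], choose_test),
            "no":  build_tree([features[i] for i in no],
                              [labels[i] for i in no], choose_test)}
```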


Common decision rules
Axis-aligned splitting: threshold on a single feature at each node (f_n > T). Very fast to evaluate.
General plane splitting: threshold on a linear combination of features (w · f > T). More expensive, but produces smaller trees.
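
As a quick illustration, the two node tests look like this in Python (the feature values, weights and thresholds below are made up for the example):

```python
import numpy as np

f = np.array([0.2, 1.3, 0.7])      # a feature vector

# Axis-aligned split: threshold a single feature, f_n > T.
axis_aligned = f[1] > 1.0

# General plane split: threshold a linear combination of features, w . f > T.
w = np.array([0.5, -1.0, 2.0])
general_plane = w @ f > 0.3

print(axis_aligned, general_plane)   # True False
```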

Choosing the right split
The split location may be chosen to optimize some measure of the classification performance of the child subsets, encouraging child subsets with lower entropy (confusion). Let p_k^left and p_k^right be the fraction of each class k on either side of a candidate split f_n > T.
Gini impurity: I_G = Σ_k p_k^left (1 - p_k^left) + Σ_k p_k^right (1 - p_k^right)
Information gain (entropy reduction): I_E = -Σ_k p_k^left log p_k^left - Σ_k p_k^right log p_k^right
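
A small sketch of these two split-quality measures, following the slide's unweighted form (each measure is evaluated on the labels falling on either side of a candidate threshold T on feature f_n and summed over the two children); the function names are my own.

```python
import numpy as np

def gini(labels):
    """Gini impurity sum_k p_k (1 - p_k) of one child subset."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def entropy(labels):
    """Entropy -sum_k p_k log p_k of one child subset."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_scores(features, labels, n, T):
    """I_G and I_E of the two child subsets induced by the split f_n > T."""
    features, labels = np.asarray(features), np.asarray(labels)
    right = features[:, n] > T
    I_G = gini(labels[~right]) + gini(labels[right])
    I_E = entropy(labels[~right]) + entropy(labels[right])
    return I_G, I_E
```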

Pros and Cons of Decision Trees
Advantages: training can be fast and easy to implement, and trees easily handle a large number of input variables.
Drawbacks: clearly, it is always possible to construct a decision tree that scores 100% classification accuracy on the training set, but such trees tend to over-fit and do not generalize very well. Hence, performance on the testing set will be far less impressive.
So, what can be done to overcome these problems?

Random Forests
Ho, Tin (1995), "Random Decision Forests", Int'l Conf. on Document Analysis and Recognition.
Breiman, Leo (2001), "Random Forests", Machine Learning.
Basic idea: somehow introduce randomness into the tree-learning process. Build multiple, independent trees based on the training set. When classifying an input, each tree votes for a class label, and the forest output is the consensus of all the tree votes. If the trees really are independent, the performance should improve with more trees.
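
The forest-level consensus might look like the following sketch (assumed, not from the slides), where trees is any collection of trained trees exposing a hypothetical predict(f) method:

```python
from collections import Counter

def forest_predict(trees, f):
    """Each tree votes for a class label; the forest returns the majority vote."""
    votes = Counter(tree.predict(f) for tree in trees)
    return votes.most_common(1)[0][0]
```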

How to introduce randomness?
Bagging: generate randomized training sets by sampling with replacement from the full training set (bootstrap sampling). For example, the full training set D_1, D_2, ..., D_12 might yield the random bag D_4, D_9, D_3, D_4, D_12, D_10, D_10, D_7, D_3, D_1, D_6, D_1.
Feature subset selection: choose different random subsets of the full feature vector to generate each tree. For example, the full feature vector f_1, f_2, ..., f_10 might yield the subset f_4, f_6, f_7, f_9, f_10.
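
Here is a brief numpy sketch of both sources of randomness; the data sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 12, 10                               # 12 examples, 10 features (illustrative)
D = rng.normal(size=(M, N))                 # feature vectors
y = rng.integers(0, 2, size=M)              # class labels

# Bagging: sample M examples with replacement (a bootstrap sample).
bag_idx = rng.integers(0, M, size=M)
D_bag, y_bag = D[bag_idx], y[bag_idx]

# Feature subset selection: each tree sees a random subset of the features.
subset = rng.choice(N, size=5, replace=False)
D_subset = D_bag[:, subset]
```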

Random Forests
[Figure: an input feature vector f is passed down each of the L independently trained trees (Tree 1, Tree 2, ..., Tree L), each recursively splitting its own randomized training data. Each tree outputs a class label (c_1 or c_2), and the votes are tallied over Class 1 and Class 2.]

Performance comparison
Results from Ho 1995: handwritten digits (10 classes), 20x20 pixel binary images, 60000 training and 10000 testing samples.
Feature set f1: raw pixel values (400 features). Feature set f2: f1 plus gradients (852 features in total).
Uses the full training set for every tree (no bagging). Compares different feature subset sizes (100 and 200 respectively).

Performance comparison
[Plot from Ho 1995: percentage correct (60-100%) versus number of trees in the forest (1 to 20), with four curves: cap (f2) 200d, cap (f1) 200d, cap (f2) 100d, cap (f1) 100d.]

Revision: Naive Bayes Classifiers
We would like to evaluate the posterior probability over the class label, argmax_k P(C_k | f_1, f_2, ..., f_N).
Bayes' rule tells us that this is equivalent to maximizing likelihood × prior: argmax_k P(f_1, f_2, ..., f_N | C_k) P(C_k).
But learning the joint likelihood distribution over all features is most likely intractable!
Naive Bayes makes the simplifying assumption that the features are conditionally independent given the class label:
P(f_1, f_2, ..., f_N | C_k) = Π_{i=1..N} P(f_i | C_k)

Revision: Naive Bayes Classifiers
Class(f) = argmax_k P(C_k) Π_{n=1..N} P(f_n | C_k)
This independence assumption is usually false! The resulting approximation tends to grossly underestimate the true posterior probabilities.
However... it is usually easy to learn the 1-d conditional densities P(f_i | C_k), and it often works rather well in practice!
Can we do better without sacrificing simplicity?
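
To make the argmax concrete, here is a minimal Naive Bayes sketch in which the priors and the 1-d class-conditional densities are hypothetical Gaussians (any learned 1-d densities could be substituted):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-d Gaussian density, used here as a stand-in for learned P(f_n | C_k)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.5, 0.5])                  # P(C_k) for two classes (assumed)
mus    = np.array([[0.0, 1.0], [2.0, 3.0]])    # mean of f_n under class k (assumed)
sigmas = np.ones((2, 2))                       # std of f_n under class k (assumed)

def naive_bayes_classify(f):
    """argmax_k [ log P(C_k) + sum_n log P(f_n | C_k) ]."""
    log_post = np.log(priors) + np.sum(
        np.log(gauss_pdf(np.asarray(f), mus, sigmas)), axis=1)
    return int(np.argmax(log_post))

print(naive_bayes_classify([0.1, 0.9]))   # -> 0 (closer to class 0's means)
```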

Ferns: Semi-Naive Bayes
M. Özuysal et al., "Fast Keypoint Recognition in 10 Lines of Code", CVPR 2007.
Group the features into L small sets of size S (called ferns): F_l = {f_l,1, f_l,2, ..., f_l,S}, where each feature f_n is the outcome of a binary test on the input vector, so that f_n ∈ {0, 1}.
Assume the groups are conditionally independent, hence P(f_1, f_2, ..., f_N | C_k) = Π_{l=1..L} P(F_l | C_k).
Learn the class-conditional distributions for each group and apply Bayes' rule to obtain the posterior:
Class(f) = argmax_k P(C_k) Π_{l=1..L} P(F_l | C_k)

Naive vs Semi-Naive Bayes
Full joint class-conditional distribution P(f_1, f_2, ..., f_N | C_k): intractable to estimate.
Naive approximation P(f_1, f_2, ..., f_N | C_k) = Π_{i=1..N} P(f_i | C_k): too simplistic, a poor approximation to the true posterior.
Semi-naive approximation P(f_1, f_2, ..., f_N | C_k) = Π_{l=1..L} P(F_l | C_k): a balance of complexity and tractability. Trade complexity against performance by the choice of fern size (S) and number of ferns (L).

So how does a single Fern work?
A fern applies a series of S binary tests to the input vector I, e.g. relative intensities of a pair of pixels:
f_1(I) = I(x_a, y_a) > I(x_b, y_b)  ->  true
f_2(I) = I(x_c, y_c) > I(x_d, y_d)  ->  false
This gives an S-digit binary code for the feature, which may be interpreted as an integer in the range [0 ... 2^S - 1] (e.g. binary code 01011 = decimal 11).
It's really just a simple hashing scheme that drops input vectors into one of 2^S buckets.
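
A sketch of this hashing step (assumed implementation): S pairwise pixel comparisons are applied to a patch and the outcomes are read off as an S-digit binary code. The pixel coordinates and the toy patch below are arbitrary.

```python
import numpy as np

def fern_index(I, pixel_pairs):
    """Apply the binary tests and pack the outcomes into an integer in [0, 2^S - 1]."""
    idx = 0
    for (xa, ya), (xb, yb) in pixel_pairs:
        bit = int(I[ya, xa] > I[yb, xb])   # binary test f_n(I)
        idx = (idx << 1) | bit             # append the bit to the code
    return idx

I = np.arange(64, dtype=float).reshape(8, 8)                      # toy 8x8 "patch"
pairs = [((1, 2), (3, 4)), ((5, 5), (0, 0)), ((2, 7), (6, 1))]    # S = 3 tests
print(fern_index(I, pairs))    # one of the 2^3 = 8 buckets (here: 3)
```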

So how does a single Fern work?
The output of a fern, when applied to a large number of input vectors of the same class, follows a multinomial distribution p(F | C).
[Figure: histogram of fern outputs F = 0, 1, 2, 3, ..., 2^S - 1 for a single class.]

Training a Fern
Apply the fern to each labelled training example D_m = (I_m, c_m) and compute its output F(D_m).
Learn the multinomial densities p(F | C_k) as histograms of the fern output for each class.
[Figure: per-class histograms p(F | C_0), p(F | C_1), p(F | C_2), ..., p(F | C_K), each over the output range [0, 2^S).]
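
Training then amounts to accumulating one histogram of fern outputs per class, as in this sketch (it reuses the hypothetical fern_index() above):

```python
import numpy as np

def train_fern(patches, labels, pixel_pairs, num_classes):
    """Histogram the fern output for each class: hist[k, z] = N(F = z | C_k)."""
    S = len(pixel_pairs)
    hist = np.zeros((num_classes, 2 ** S))
    for I, c in zip(patches, labels):
        hist[c, fern_index(I, pixel_pairs)] += 1
    return hist
```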

Classifying using a single Fern
Given a test input I, simply apply the fern and look up the posterior distribution over the class label.
Apply the fern, e.g. F(I) = [00011] = 3, read off p(F = 3 | C_k) from each class histogram, and normalize the distribution:
p(C_k | F) = p(F | C_k) / Σ_k p(F | C_k)
[Figure: the test input is hashed to F(I) = 3; the value of each per-class histogram at that bin is read off and normalized into a class posterior over C_1, C_2, C_3, ..., C_K.]
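
In code, single-fern classification is just the look-up and normalization described above (a sketch, again reusing fern_index(); equal class priors are assumed, as the slide implies, and zero counts are handled by the smoothing introduced two slides below):

```python
def fern_posterior(I, pixel_pairs, hist):
    """Look up p(F = z | C_k) for the fern output z and normalize over classes."""
    z = fern_index(I, pixel_pairs)
    likelihood = hist[:, z] / hist.sum(axis=1)   # p(F = z | C_k)
    return likelihood / likelihood.sum()         # p(C_k | F = z), equal priors
```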

Adding randomness: an ensemble of ferns
A single fern does not give great classification performance. But we can build an ensemble of independent ferns by randomly choosing different subsets of features, e.g.
F_1 = {f_2, f_7, f_22, f_5, f_9}
F_2 = {f_4, f_1, f_11, f_8, f_3}
F_3 = {f_6, f_31, f_28, f_11, f_2}
Finally, combine their outputs using Semi-Naive Bayes:
Class(f) = argmax_k P(C_k) Π_{l=1..L} P(F_l | C_k)
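
Combining the ferns is then a sum of log class-conditionals plus a log prior, as in this sketch (assumed data layout: one (num_classes x 2^S) table of log p(F_l | C_k) per fern, plus the integer output of each fern on the test input):

```python
import numpy as np

def classify_with_ferns(fern_log_probs, fern_outputs, log_prior):
    """argmax_k [ log P(C_k) + sum_l log P(F_l = z_l | C_k) ]."""
    log_post = np.array(log_prior, dtype=float)
    for table, z in zip(fern_log_probs, fern_outputs):
        log_post += table[:, z]
    return int(np.argmax(log_post))
```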

Classifying using Random Ferns
Having randomly selected and trained a collection of ferns, classifying new inputs involves only simple look-up operations.
[Figure: the input is passed through Fern 1, Fern 2, ..., Fern L, and their outputs are combined into a posterior over the class label.]

One small subtlety...
Even for moderate size ferns, the output range can be large, e.g. for fern size S=10 the output range is [0 ... 2^10 - 1] = [0 ... 1023].
Even with a large training set, many buckets may see no samples. These zero probabilities p(F | C_k) will be problematic for our Semi-Naive Bayes!
So... assume a Dirichlet prior on the distributions P(F | C_k), which assigns a baseline low probability to all possible fern outputs:
p(F = z | C_k) = (N(F = z | C_k) + 1) / Σ_{z=0..2^S-1} (N(F = z | C_k) + 1)
where N(F = z | C_k) is the number of times we observe fern output equal to z in the training set for class C_k.
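
The smoothing formula above is simple to apply to the per-class count histograms; a sketch:

```python
import numpy as np

def smoothed_fern_probs(hist):
    """Add-one (Dirichlet prior) smoothing of hist[k, z] = N(F = z | C_k)."""
    counts = hist + 1.0
    return counts / counts.sum(axis=1, keepdims=True)   # p(F = z | C_k)
```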

Comparing Forests and Ferns
Forests: decision trees directly learn the posterior P(C_k | F); they apply a different sequence of tests in each child node; training time grows exponentially with tree depth; tree hypotheses are combined by averaging.
Ferns: learn the class-conditional distributions P(F | C_k); they apply the same sequence of tests to every input vector; training time grows linearly with fern size S; hypotheses are combined using Bayes' rule (multiplication).

Application: Fast Keypoint Matching
Ozuysal et al. use Random Ferns for keypoint recognition, similar to the typical application of SIFT descriptors. The method is very robust to affine viewing deformations and very fast to evaluate (13.5 microseconds per keypoint).

Training
Each keypoint to be recognized is a separate class. Training sets are generated by synthesizing random affine deformations of the image patch (10000 samples of synthesized viewpoint variation).
Features are pairwise intensity comparisons: f_n(I) = I(x_a, y_a) > I(x_b, y_b).
Fern size S=10 (randomly selected pixel pairs). Ensembles of 5 to 50 ferns are tested.

Classification Rate vs Number of Ferns
[Plot: average classification rate (0-100%) versus number of structures (up to 50, depth/size 10), shown against different classifier parameters for the fern- and tree-based methods; 300 classes were used.]
Also compares to Random Trees of equivalent size.

Classification Rate vs Number of Classes (Keypoints)
[Plot: average classification rate versus number of classes (0 to 2000), comparing Ferns (30 ferns of size 10) with Random Trees (30 trees of depth 10); 30 random ferns of size 10 were used.]

Drawback: Memory requirements
Fern classifiers can be very memory hungry, e.g. fern size = 11, number of ferns = 50, number of classes = 1000:
RAM = 2^(fern size) * sizeof(float) * NumFerns * NumClasses = 2048 * 4 * 50 * 1000 ≈ 400 MBytes!
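
For reference, the estimate above can be reproduced with a couple of lines (sizes taken from the slide):

```python
fern_size, num_ferns, num_classes, bytes_per_float = 11, 50, 1000, 4
ram_bytes = 2 ** fern_size * bytes_per_float * num_ferns * num_classes
print(ram_bytes / 1e6, "MB")   # 409.6 MB, i.e. roughly 400 MBytes
```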

Conclusions?
Random Ferns are easy to understand and easy to train, and very fast to perform classification once trained. They provide a probabilistic output (how accurate, though?). They appear to outperform Random Forests, but can be very memory hungry!