Machine Learning Algorithms for Classification
Rob Schapire, Princeton University
Machine Learning
- studies how to automatically learn to make accurate predictions based on past observations
- classification problems: classify examples into a given set of categories
[diagram: labeled training examples → machine learning algorithm → classification rule; new example → classification rule → predicted classification]
Examples of Classification Problems
- text categorization (e.g., spam filtering)
- fraud detection
- optical character recognition
- machine vision (e.g., face detection)
- natural-language processing (e.g., spoken language understanding)
- market segmentation (e.g., predict if customer will respond to promotion)
- bioinformatics (e.g., classify proteins according to their function)
Characteristics of Modern Machine Learning
- primary goal: highly accurate predictions on test data
- goal is not to uncover underlying "truth"
- methods should be general purpose, fully automatic and "off-the-shelf"
- however, in practice, incorporation of prior, human knowledge is crucial
- rich interplay between theory and practice
- emphasis on methods that can handle large datasets
Why Use Machine Learning?
advantages:
- often much more accurate than human-crafted rules (since data driven)
- humans are often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples
- don't need a human expert or programmer
- automatic method to search for hypotheses explaining data
- cheap and flexible; can apply to any learning task
disadvantages:
- need a lot of labeled data
- error prone; usually impossible to get perfect accuracy
This Talk
machine learning algorithms:
- decision trees
- conditions for successful learning
- boosting
- support-vector machines
others not covered: neural networks, nearest neighbor algorithms, Naive Bayes, bagging, random forests
practicalities of using machine learning algorithms
Decision Trees
Example: Good versus Evil
problem: identify people as good or bad from their appearance

training data:
name      sex     mask  cape  tie  ears  smokes  class
batman    male    yes   yes   no   yes   no      Good
robin     male    yes   yes   no   no    no      Good
alfred    male    no    no    yes  no    no      Good
penguin   male    no    no    yes  no    yes     Bad
catwoman  female  yes   no    no   yes   no      Bad
joker     male    no    no    no   no    no      Bad

test data:
batgirl   female  yes   yes   no   yes   no      ??
riddler   male    yes   no    no   no    no      ??
A Decision Tree Classifier
[tree:]
tie?
- no → cape?
    - no → bad
    - yes → good
- yes → smokes?
    - no → good
    - yes → bad
How to Build Decision Trees
- choose rule to split on
- divide data using splitting rule into disjoint subsets
- repeat recursively for each subset
- stop when leaves are (almost) pure
[example: splitting on "tie" divides {batman, robin, alfred, penguin, catwoman, joker} into {batman, robin, catwoman, joker} (tie = no) and {alfred, penguin} (tie = yes)]
How to Choose the Splitting Rule
key problem: choosing the best rule to split on
[example: splitting on "tie" gives subsets {batman, robin, catwoman, joker} and {alfred, penguin}; splitting on "cape" gives {alfred, penguin, catwoman, joker} and {batman, robin}]
idea: choose the rule that leads to the greatest increase in purity
How to Measure Purity
want (im)purity function to look like this (p = fraction of positive examples):
[plot: impurity is 0 at p = 0 and p = 1, and maximal at p = 1/2]
commonly used impurity measures:
- entropy: −p ln p − (1 − p) ln(1 − p)
- Gini index: p(1 − p)
(see the split-selection sketch below)
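As an illustration (not from the slides), here is a minimal Python sketch of impurity-based split selection on the good-versus-evil data; the function names and the (n_positive, n_total) encoding of subsets are my own.

    import math

    def entropy(p):
        # entropy impurity of a node whose fraction of positive examples is p
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log(p) - (1 - p) * math.log(1 - p)

    def gini(p):
        # Gini index impurity
        return p * (1 - p)

    def impurity_after_split(groups, impurity=gini):
        # weighted impurity of the subsets produced by a candidate split;
        # groups is a list of (n_positive, n_total) pairs, one per subset
        total = sum(n for _, n in groups)
        return sum((n / total) * impurity(pos / n) for pos, n in groups if n > 0)

    # splitting on "tie" gives subsets with 2/4 and 1/2 good examples;
    # splitting on "cape" gives 1/4 and 2/2; the lower (purer) score wins
    print(impurity_after_split([(2, 4), (1, 2)]))   # tie:  0.25
    print(impurity_after_split([(1, 4), (2, 2)]))   # cape: 0.125

Splitting on "cape" has the lower weighted impurity, matching the intuition of choosing the rule that most increases purity.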
Kinds of Error Rates
- training error = fraction of training examples misclassified
- test error = fraction of test examples misclassified
- generalization error = probability of misclassifying a new random example
A Possible Classifier
[figure: a large decision tree splitting on mask, smokes, cape, ears, sex, tie, ...]
- perfectly classifies training data
- BUT: intuitively, overly complex
Another Possible Classifier
[tree: mask? no → bad, yes → good]
- overly simple
- doesn't even fit the available data
Tree Size versus Accuracy
[plot: accuracy on training data keeps increasing with tree size, while accuracy on test data peaks and then declines]
- trees must be big enough to fit training data (so that true patterns are fully captured)
- BUT: trees that are too big may overfit (capture noise or spurious patterns in the data)
- significant problem: can't tell best tree size from training error
Overfitting Example
[figure: fitting points with a polynomial: underfit (degree = 1), ideal fit (degree = 3), overfit (degree = 20)]
Building an Accurate Classifier
for good test performance, need:
- enough training examples
- good performance on training set
- classifier that is not too complex ("Occam's razor")
classifiers should be as simple as possible, but no simpler
simplicity closely related to prior expectations
measure complexity by:
- number of bits needed to write down
- number of parameters
- VC-dimension
Example
[figure: training data]
Good and Bad Classifiers
Good: sufficient data, low training error, simple classifier
Bad: insufficient data, training error too high, classifier too complex
Theory
can prove: with high probability,
(generalization error) ≤ (training error) + Õ(√(d/m))
where d = VC-dimension and m = number of training examples
Controlling Tree Size
typical approach: build very large tree that fully fits training data, then prune back
pruning strategies:
- grow on just part of training data, then find pruning with minimum error on the held-out part
- find pruning that minimizes (training error) + constant · (tree size)
Decision Trees
- best known: C4.5 (Quinlan), CART (Breiman, Friedman, Olshen & Stone)
- very fast to train and evaluate
- relatively easy to interpret
- but: accuracy often not state-of-the-art
Boosting
Example: Spam Filtering
problem: filter out spam (junk email)
gather large collection of examples of spam and non-spam:
  From: yoav@att.com  "Rob, can you review a paper..."  → non-spam
  From: xa412@hotmail.com  "Earn money without working!!!!..."  → spam
goal: have computer learn from examples to distinguish spam from non-spam
main observation:
- easy to find "rules of thumb" that are often correct (e.g., if "v1agr@" occurs in the message, then predict spam)
- hard to find a single rule that is very highly accurate
The Boosting Approach
- devise computer program for deriving rough rules of thumb
- apply procedure to subset of emails; obtain rule of thumb
- apply to 2nd subset of emails; obtain 2nd rule of thumb
- repeat T times
Details
- how to choose examples on each round? concentrate on "hardest" examples (those most often misclassified by previous rules of thumb)
- how to combine rules of thumb into a single prediction rule? take (weighted) majority vote of rules of thumb
Boosting
boosting = general method of converting rough rules of thumb into a highly accurate prediction rule
technically:
- assume given "weak" learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random, say, accuracy 55%
- given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%
AdaBoost
given training examples (x_i, y_i) where y_i ∈ {−1, +1}
initialize D_1 = uniform distribution on training examples
for t = 1, ..., T:
- train weak classifier ("rule of thumb") h_t on D_t
- choose α_t > 0
- compute new distribution D_{t+1}: for each example i, multiply D_t(x_i) by e^{−α_t} (< 1) if y_i = h_t(x_i), or by e^{α_t} (> 1) if y_i ≠ h_t(x_i); then renormalize
output final classifier H_final(x) = sign(Σ_t α_t h_t(x))
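A minimal Python sketch of this loop, using threshold "decision stumps" as the weak classifiers; the slide leaves α_t unspecified, so the standard choice α_t = ½ ln((1 − ε_t)/ε_t) is assumed here, and the helper names are my own.

    import numpy as np

    def train_stump(X, y, D):
        # weak learner: the threshold stump (feature, threshold, sign) with the
        # smallest weighted error under distribution D
        best = None
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] <= thresh, sign, -sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[3]:
                        best = (j, thresh, sign, err)
        return best

    def adaboost(X, y, T):
        # X: (n, d) array of features, y: labels in {-1, +1}
        n = len(y)
        D = np.full(n, 1.0 / n)                    # D_1 = uniform distribution
        stumps, alphas = [], []
        for _ in range(T):
            j, thresh, sign, err = train_stump(X, y, D)
            err = max(err, 1e-12)                  # guard against a perfect stump
            alpha = 0.5 * np.log((1 - err) / err)  # assumed choice of alpha_t > 0
            pred = np.where(X[:, j] <= thresh, sign, -sign)
            D = D * np.exp(-alpha * y * pred)      # shrink correct, grow incorrect
            D = D / D.sum()                        # renormalize
            stumps.append((j, thresh, sign))
            alphas.append(alpha)
        def H_final(Xnew):                         # weighted majority vote
            votes = sum(a * np.where(Xnew[:, j] <= t, s, -s)
                        for a, (j, t, s) in zip(alphas, stumps))
            return np.sign(votes)
        return H_final, stumps, alphas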
Toy Example
[figure: initial distribution D_1 over the training points]
weak classifiers = vertical or horizontal half-planes
Round 1
[figure: weak classifier h_1 and updated distribution D_2]
ε_1 = 0.30, α_1 = 0.42
Round 2
[figure: weak classifier h_2 and updated distribution D_3]
ε_2 = 0.21, α_2 = 0.65
Round 3
[figure: weak classifier h_3]
ε_3 = 0.14, α_3 = 0.92
Final Classifier
[figure: H_final = sign(0.42 h_1 + 0.65 h_2 + 0.92 h_3), the weighted combination of the three half-planes]
Theory: Training Error
weak learning assumption: each weak classifier is at least slightly better than random, i.e., (error of h_t on D_t) ≤ 1/2 − γ for some γ > 0
given this assumption, can prove: training error(H_final) ≤ e^{−2γ²T}
(for example, with γ = 0.05, i.e., 55% accuracy, the bound is e^{−T/200}, which drops below 1% once T exceeds about 920)
How Will Test Error Behave? (A First Guess)
[hypothetical plot: training error falls steadily with the number of rounds T while test error eventually rises]
expect:
- training error to continue to drop (or reach zero)
- test error to increase when H_final becomes too complex ("Occam's razor": overfitting)
- hard to know when to stop training
Actual Typical Run
[plot: boosting C4.5 on the "letter" dataset; training and test error both keep falling over 1000 rounds, with C4.5's test error shown as a horizontal baseline]
- test error does not increase, even after 1000 rounds (total size > 2,000,000 nodes)
- test error continues to drop even after training error is zero!

# rounds      5     100    1000
train error   0.0   0.0    0.0
test error    8.4   3.3    3.1

- Occam's razor wrongly predicts "simpler rule is better"
The Margins Explanation
key idea:
- training error only measures whether classifications are right or wrong
- should also consider confidence of classifications
recall: H_final is a weighted majority vote of weak classifiers
measure confidence by margin = strength of the vote
empirical evidence and mathematical proof that:
- large margins ⇒ better generalization error (regardless of number of rounds)
- boosting tends to increase margins of training examples (given weak learning assumption)
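Concretely, the normalized margin of a training example is y · Σ_t α_t h_t(x) / Σ_t α_t, a number in [−1, +1] that is positive exactly when the vote is correct. A small sketch, reusing the stumps and weights returned by the earlier AdaBoost sketch (an assumption of mine, not anything shown on the slides):

    import numpy as np

    def margins(X, y, stumps, alphas):
        # normalized vote margin of each example:
        # y * sum_t alpha_t h_t(x) / sum_t alpha_t, in [-1, +1]
        votes = sum(a * np.where(X[:, j] <= t, s, -s)
                    for a, (j, t, s) in zip(alphas, stumps))
        return y * votes / sum(alphas)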
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
- application: automatic "store front" or "help desk" for AT&T Labs Natural Voices business
- caller can request demo, pricing information, technical support, sales agent, etc.
- interactive dialogue
How It Works
[diagram: human speech → automatic speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → text-to-speech → computer speech → human]
- NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
- weak classifiers: test for presence of word or phrase
Application: Detecting Faces
[Viola & Jones]
- problem: find faces in photograph or movie
- weak classifiers: detect light/dark rectangles in image
- many clever tricks to make extremely fast and accurate
Boosting
- fast (but not quite as fast as other methods)
- simple and easy to program
- flexible: can combine with any learning algorithm, e.g., C4.5 or very simple rules of thumb
- provable guarantees
- state-of-the-art accuracy
- tends not to overfit (but occasionally does)
- many applications
Support-Vector Machines
Geometry of SVM's
[figure: linearly separable data with the maximum-margin separating hyperplane and margin δ on either side]
- given linearly separable data
- margin = distance to separating hyperplane
- choose hyperplane that maximizes minimum margin
- intuitively: want to separate +'s from −'s as much as possible
- margin = measure of confidence
Theoretical Justification
let δ = minimum margin, R = radius of enclosing sphere
then VC-dim ≤ (R/δ)²
so larger margins ⇒ lower complexity, independent of number of dimensions
in contrast, unconstrained hyperplanes in R^n have VC-dim = (# parameters) = n + 1
Finding the Maximum Margin Hyperplane
examples (x_i, y_i) where x_i ∈ R^n, y_i ∈ {−1, +1}
find hyperplane v · x = 0 with ‖v‖ = 1
margin = y(v · x)
maximize: δ   subject to: y_i(v · x_i) ≥ δ and ‖v‖ = 1
setting w = v/δ (so that ‖w‖ = 1/δ), this is equivalent to:
minimize: ½‖w‖²   subject to: y_i(w · x_i) ≥ 1
Convex Dual
form Lagrangian, set ∂/∂w = 0: w = Σ_i α_i y_i x_i
get quadratic program:
maximize Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)   subject to: α_i ≥ 0
α_i = Lagrange multiplier; α_i > 0 ⇒ x_i is a "support vector"
key points:
- optimal w is a linear combination of support vectors
- dependence on x_i's only through inner products
- maximization problem is convex with no local maxima
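As an illustration (not part of the slides), this dual can be handed to an off-the-shelf QP solver; a minimal sketch using cvxopt, with the hyperplane constrained through the origin as in the formulation above:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual(X, y):
        # maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j), a_i >= 0;
        # cvxopt minimizes 1/2 a'Pa + q'a subject to Ga <= h, so negate the objective
        n = len(y)
        K = X @ X.T                              # matrix of inner products x_i . x_j
        P = matrix(np.outer(y, y) * K)
        q = matrix(-np.ones(n))
        G = matrix(-np.eye(n))                   # encodes -a_i <= 0
        h = matrix(np.zeros(n))
        a = np.array(solvers.qp(P, q, G, h)['x']).ravel()
        w = (a * y) @ X                          # optimal w = sum_i a_i y_i x_i
        support = a > 1e-6                       # support vectors have a_i > 0
        return w, a, support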
What If Not Linearly Separable?
answer #1: penalize each point by its distance from margin 1, i.e.,
minimize: ½‖w‖² + constant · Σ_i max{0, 1 − y_i(w · x_i)}
answer #2: map into a higher-dimensional space in which the data becomes linearly separable
Example
not linearly separable
map x = (x_1, x_2) ↦ Φ(x) = (1, x_1, x_2, x_1x_2, x_1², x_2²)
hyperplane in mapped space has form a + bx_1 + cx_2 + dx_1x_2 + ex_1² + fx_2² = 0, which is a conic in the original space
linearly separable in mapped space
Why Mapping to High Dimensions Is Dumb
can carry the idea further, e.g., add all terms up to degree d; then n dimensions are mapped to O(n^d) dimensions: a huge blow-up in dimensionality
- statistical problem: amount of data needed often proportional to number of dimensions ("curse of dimensionality")
- computational problem: very expensive in time and memory to work in high dimensions
How SVM's Avoid Both Problems
- statistically, may not hurt, since VC-dimension is independent of the number of dimensions ((R/δ)²)
- computationally, only need to be able to compute inner products Φ(x) · Φ(z), which can sometimes be done very efficiently using kernels
Example (continued)
modify Φ slightly: x = (x_1, x_2) ↦ Φ(x) = (1, √2 x_1, √2 x_2, √2 x_1x_2, x_1², x_2²)
then Φ(x) · Φ(z) = 1 + 2x_1z_1 + 2x_2z_2 + 2x_1x_2z_1z_2 + x_1²z_1² + x_2²z_2² = (1 + x_1z_1 + x_2z_2)² = (1 + x · z)²
simply use (1 + x · z)² in place of the usual inner product
in general, for a polynomial of degree d, use (1 + x · z)^d
very efficient, even though finding a hyperplane in O(n^d) dimensions
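A quick numeric check of this identity (not from the slides); the feature map below is the one written above, and the sample points are arbitrary:

    import numpy as np

    def phi(x):
        # explicit degree-2 feature map from the slide
        x1, x2 = x
        return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                         np.sqrt(2)*x1*x2, x1**2, x2**2])

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(z))       # inner product in the 6-dimensional mapped space: 4.0
    print((1 + x @ z) ** 2)      # kernel computed directly in the original space: 4.0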
Kernels
kernel = function K for computing K(x, z) = Φ(x) · Φ(z)
permits efficient computation of SVM's in very high dimensions
K can be any symmetric, positive semi-definite function (Mercer's theorem)
some kernels:
- polynomials
- Gaussian: exp(−‖x − z‖²/2σ)
- kernels defined over structures (trees, strings, sequences, etc.)
evaluation: w · Φ(x) = Σ_i α_i y_i Φ(x_i) · Φ(x) = Σ_i α_i y_i K(x_i, x); time depends on # support vectors
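A short sketch of this kernelized evaluation (the helper names are mine, and the Gaussian bandwidth term is written as on the slide):

    import numpy as np

    def poly_kernel(x, z, d=3):
        return (1.0 + x @ z) ** d

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma))

    def svm_predict(x, support_X, support_y, support_alpha, kernel):
        # sign(w . Phi(x)) = sign(sum over support vectors of alpha_i y_i K(x_i, x))
        score = sum(a * yi * kernel(xi, x)
                    for a, yi, xi in zip(support_alpha, support_y, support_X))
        return np.sign(score)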
SVM's versus Boosting
- both are large-margin classifiers (although with slightly different definitions of margin)
- both work in very high dimensional spaces (in boosting, dimensions correspond to weak classifiers)
- but different tricks are used: SVM's use the kernel trick; boosting relies on the weak learner to select one dimension (i.e., weak classifier) to add to the combined classifier
Application: Text Categorization
[Joachims]
goal: classify text documents, e.g., spam filtering, or categorizing news articles by topic
need to represent text documents as vectors in R^n:
- one dimension for each word in the vocabulary
- value = # times the word occurred in the particular document
- (many variations)
kernels don't help much; performance is state of the art
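For concreteness, a tiny sketch of this bag-of-words representation (the vocabulary and sample text are made up, echoing the earlier spam example):

    from collections import Counter

    def bag_of_words(document, vocabulary):
        # one dimension per vocabulary word; value = # occurrences in the document
        counts = Counter(document.lower().split())
        return [counts[word] for word in vocabulary]

    vocab = ["money", "review", "paper", "earn"]
    print(bag_of_words("Earn money without working", vocab))   # [1, 0, 0, 1]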
Application: Recognizing Handwritten Characters
[Cortes & Vapnik]
examples are 16×16 pixel images, viewed as vectors in R^256
[figure: sample digit images 7 7 4 8 0 1 4]
kernels help:

degree   error   dimensions
1        12.0    256
2        4.7     33000
3        4.4     10^6
4        4.3     10^9
5        4.3     10^12
6        4.2     10^14
7        4.3     10^16
human    2.5

to choose the best degree: train an SVM for each degree, then choose the one with minimum VC-dimension (R/δ)²
SVM's
- fast algorithms now available, but not so simple to program (good packages available, though)
- state-of-the-art accuracy
- power and flexibility from kernels
- theoretical justification
- many applications
Other Machine Learning Problem Areas
- supervised learning: classification; regression (predict real-valued labels); rare class / cost-sensitive learning
- unsupervised (no labels): clustering; density estimation
- semi-supervised: in practice, unlabeled examples are much cheaper than labeled examples; how to take advantage of both labeled and unlabeled examples
- active learning: how to carefully select which unlabeled examples to have labeled
- on-line learning: getting one example at a time
Practicalities
Getting Data
- more is more
- want training data to be like test data
- use your knowledge of the problem to know where to get training data, and what to expect test data to be like
Choosing Features
- use your knowledge to know what features would be helpful for learning
- redundancy in features is okay, and often helpful
- most modern algorithms do not require independent features
- too many features? could use feature selection methods, but usually preferable to use an algorithm designed to handle large feature sets
Choosing an Algorithm
first step: identify the appropriate learning paradigm:
- classification? regression?
- labeled, unlabeled or a mix?
- class proportions heavily skewed?
- goal to predict probabilities? rank instances?
- is interpretability of the results important?
(keep in mind, no guarantees)
in general, no learning algorithm dominates all others on all problems
- SVM's and boosting decision trees (as well as other tree ensemble methods) seem to be the best off-the-shelf algorithms
- even so, for some problems, the difference in performance among these can be large, and sometimes much simpler methods do better
Choosing an Algorithm (cont.)
- sometimes, one particular algorithm seems to naturally fit the problem, but often, the best approach is to try many algorithms
- use knowledge of the problem and the algorithms to guide decisions (e.g., in choice of weak learner, kernel, etc.)
- usually, you don't know what will work until you try
- be sure to try simple stuff!
- some packages (e.g., weka) make it easy to try many algorithms, though implementations are not always optimal
Testing Performance
does it work? which algorithm is best?
- train on part of the available data, and test on the rest
- if the dataset is large (say, in the 1000's), can simply set aside 1000 random examples as a test set
- otherwise, use 10-fold cross validation: break the dataset randomly into 10 parts; in turn, use each block as a test set, training on the other 9 blocks; repeat many times (see the sketch below)
- use the same train/test splits for each algorithm
- there might be a natural split (e.g., train on data from 2004-06, test on data from 2007); however, this can confound results: is bad performance because of the algorithm, or a change of distribution?
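A minimal sketch of 10-fold cross validation (the helper names are mine; `train_and_test` stands for whatever learning algorithm is being evaluated and is assumed to return the test error of one run):

    import random

    def cross_validation_error(examples, train_and_test, folds=10, seed=0):
        # break the dataset randomly into `folds` blocks; each block serves once
        # as the test set while the remaining blocks are used for training
        data = list(examples)
        random.Random(seed).shuffle(data)
        blocks = [data[i::folds] for i in range(folds)]
        errors = []
        for i in range(folds):
            test = blocks[i]
            train = [ex for j, block in enumerate(blocks) if j != i for ex in block]
            errors.append(train_and_test(train, test))
        return sum(errors) / folds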
Selecting Parameters
- sometimes, theory can guide the setting of parameters, possibly based on statistics measurable on the training set
- other times, need to use trial and test, as before
- danger: trying too many combinations can lead to overfitting the test set
- instead: break data into train, validation and test sets; set parameters using the validation set; measure performance on the test set only for the selected parameter settings (or do cross-validation within cross-validation)
- trying many parameter settings is also very computationally expensive
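A small sketch of the train/validation/test recipe (the split fractions, function names and the `fit`/`error` helpers are all assumptions for illustration):

    import random

    def split_three_ways(examples, frac_train=0.6, frac_valid=0.2, seed=0):
        # break data into train, validation and test sets
        data = list(examples)
        random.Random(seed).shuffle(data)
        n_tr = int(frac_train * len(data))
        n_va = int(frac_valid * len(data))
        return data[:n_tr], data[n_tr:n_tr + n_va], data[n_tr + n_va:]

    def select_parameter(candidates, train, validation, fit, error):
        # fit one model per candidate setting on the training set and keep the
        # setting with the lowest error on the validation set; final performance
        # is then measured once, on the untouched test set
        return min(candidates, key=lambda c: error(fit(train, c), validation))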
Running Experiments
automate everything! write one script that does everything at the push of a single button
- fewer errors
- easy to re-run (for instance, if the computer crashes in the middle of an experiment)
- have an explicit, scientific record in the script of the exact experiments that were executed
if running many experiments:
- put the result of each experiment in a separate file
- use a script to scan for the next experiment to run based on which files have or have not already been created
- makes it very easy to re-start if the computer crashes
- easy to run many experiments in parallel if you have multiple processors/computers
- also need a script to automatically gather and compile results
(see the sketch below)
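One way such a driver script might look (a sketch only; the directory layout, file format and function names are assumptions):

    import json
    import os

    def run_all(experiments, run_one, results_dir="results"):
        # `experiments` maps a name to a configuration; each finished experiment
        # leaves a result file, so the loop skips work that is already done and
        # can safely be re-started after a crash or run on several machines
        os.makedirs(results_dir, exist_ok=True)
        for name, config in experiments.items():
            path = os.path.join(results_dir, name + ".json")
            if os.path.exists(path):
                continue                       # already run (or running elsewhere)
            result = run_one(config)
            with open(path, "w") as f:
                json.dump(result, f)

    def gather_results(results_dir="results"):
        # compile all finished results into one table
        table = {}
        for fname in sorted(os.listdir(results_dir)):
            if fname.endswith(".json"):
                with open(os.path.join(results_dir, fname)) as f:
                    table[fname[:-5]] = json.load(f)
        return table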
If Writing Your Own Code
- R and MATLAB are great for easy coding, but for speed, you may need C or Java
- debugging machine learning algorithms is very tricky! hard to tell if it's working, since you don't know what to expect
- run on small cases where you can figure out the answer by hand
- test each module/subroutine separately
- compare to other implementations (written by others, or written in a different language)
- compare to theory or published results
Summary
central issues in machine learning:
- avoidance of overfitting
- balance between simplicity and fit to data
machine learning algorithms covered: decision trees, boosting, SVM's (many not covered)
looked at practicalities of using machine learning methods (will see more in lab)
Further reading

on machine learning in general:
- Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification (2nd ed.). Wiley, 2000.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
- Tom M. Mitchell. Machine Learning. McGraw Hill, 1997.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Decision trees:
- Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
- J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Boosting:
- Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. www-ee.technion.ac.il/~rmeir/publications/meirae03.pdf
- Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, Springer, 2003. www.cs.princeton.edu/~schapire/boost.html

Support-vector machines:
- Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. research.microsoft.com/~cburges/papers/svmtutorial.pdf
- Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. www.support-vector.net