Lecture Slides for INTRODUCTION TO Machine Learning
By: alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml
Postedited by: R. Basili

Learning a Class from Examples
Class C of a family car.
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: Positive (+) and negative (−) examples.
Input representation: x1: price, x2: engine power.

Training set X
X = { x^t, r^t }, t = 1, ..., N
x = [x1, x2]^T
r = 1 if x is positive, r = 0 if x is negative

Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
In general we do not know C(x).

Hypothesis class H
h(x) = 1 if h classifies x as positive, 0 if h classifies x as negative.
Empirical error: E(h | X) = Σ_{t=1}^N 1( h(x^t) ≠ r^t )
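
As a concrete illustration, here is a minimal sketch of the rectangle hypothesis and the empirical error E(h | X). The data points and the rectangle bounds (p1, p2, e1, e2) are made-up values for illustration, not from the lecture.

```python
# Sketch of the axis-aligned rectangle hypothesis class and the
# empirical error E(h | X). All numbers below are illustrative only.

def h(x, p1, p2, e1, e2):
    """Rectangle hypothesis: 1 if (price, engine power) lies inside
    [p1, p2] x [e1, e2], else 0."""
    price, power = x
    return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0

def empirical_error(X, r, p1, p2, e1, e2):
    """E(h | X): number of training examples that h mislabels."""
    return sum(1 for x, label in zip(X, r)
               if h(x, p1, p2, e1, e2) != label)

# Toy training set: (price, engine power) pairs with labels r^t.
X = [(15, 120), (22, 150), (40, 300), (8, 60)]
r = [1, 1, 0, 0]

print(empirical_error(X, r, p1=10, p2=30, e1=100, e2=200))  # -> 0
```

Shrinking the rectangle (e.g. p2 = 20) would exclude the positive example (22, 150) and raise the empirical error to 1.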

S, G, and the Version Space
Most specific hypothesis: S. Most general hypothesis: G.
Any h ∈ H between S and G is consistent, and together these hypotheses make up the version space (Mitchell, 1997).

Probably Approximately Correct (PAC) Learning
How many training examples are needed so that the tightest rectangle S, which will constitute our hypothesis, is probably approximately correct?
We want to be confident (above a level 1 − δ) that the error probability is bounded by some value ε.

Probably Approximately Correct (PAC) Learning
In PAC learning, given a class C and examples drawn from some unknown but fixed distribution p(x), we want to find the number of examples N such that, with probability at least 1 − δ, h has error at most ε (Blumer et al., 1989):
P( C Δ h ≤ ε ) ≥ 1 − δ
where C Δ h is the region of difference between C and h.

Probably Approximately Correct (PAC) Learning
How many training examples m should we have so that, with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
Let the probability of a positive example falling in each strip be at most ε/4.
Pr that a random example misses a strip: 1 − ε/4
Pr that m random instances all miss a strip: (1 − ε/4)^m
Pr that m random instances miss one of the 4 strips: at most 4(1 − ε/4)^m
We want 1 − 4(1 − ε/4)^m ≥ 1 − δ, i.e. 4(1 − ε/4)^m ≤ δ.
Using (1 − x) ≤ exp(−x), it follows that 4 exp(−εm/4) ≤ δ, and hence m ≥ (4/ε) ln(4/δ).

Probably Approximately Correct (PAC) Learning
How many training examples m should we have so that, with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
m ≥ (4/ε) ln(4/δ)
m increases slowly with 1/ε and 1/δ:
For ε = 1% with confidence 95%, pick m ≥ 1753.
For ε = 10% with confidence 95%, pick m ≥ 176.
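
The bound above is easy to evaluate directly; a minimal sketch, rounding the bound up to the next whole training example:

```python
import math

# Sketch: evaluate the PAC sample-size bound m >= (4/eps) * ln(4/delta)
# for the tightest-rectangle hypothesis (Blumer et al., 1989).

def pac_sample_size(eps, delta):
    """Smallest integer m satisfying m >= (4/eps) * ln(4/delta)."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

print(pac_sample_size(0.01, 0.05))  # eps = 1%, confidence 95% -> 1753
print(pac_sample_size(0.10, 0.05))  # eps = 10%, confidence 95% -> 176
```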

VC (Vapnik-Chervonenkis) Dimension
N points can be labeled in 2^N ways as +/−.
H shatters N points if there exists a set of N points such that, for every such labeling, some h ∈ H is consistent with it. The largest such N is denoted VC(H) = N.
VC(H) measures the capacity of H: any learning problem definable by N examples can be learned with no error by a hypothesis drawn from H.
What is the VC dimension of axis-aligned rectangles?

Formal Definition

VC (Vapnik-Chervonenkis) Dimension
H shatters a set of N points if, for every labeling of those points, there exists an h ∈ H consistent with it.
VC(axis-aligned rectangles) = 4

VC (Vapnik-Chervonenkis) Dimension
What does this say about using rectangles as our hypothesis class?
The VC dimension is pessimistic: in general we do not need to worry about all possible labelings.
It is important to remember that one can choose the arrangement of the points in the space, but the hypothesis must then be consistent with all possible labelings of those fixed points.

Examples
[Figure: input x and parameters α fed into machine f, producing y^est; regions labeled +1 and −1.]
f(x, w) = sign(x · w)

Examples
[Figure: input x and parameters α fed into machine f, producing y^est; regions labeled +1 and −1.]
f(x, w, b) = sign(x · w + b)

Shattering
Question: Can the following f shatter the following points? f(x, w) = sign(x · w)
Answer: Yes. There are four possible labelings to consider, realized for example by:
w = (0, 1), w = (−2, 3), w = (2, −3), w = (0, −1)

VC dim of linear classifiers in m dimensions
If the input space is m-dimensional and f is sign(w · x − b), what is the VC dimension?
h = m + 1:
Lines in 2D can shatter 3 points.
Planes in 3D can shatter 4 points.
...
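
The claim that lines in 2D shatter 3 points can be checked by brute force. This sketch enumerates all 2^3 labelings of three non-collinear points and searches a small, arbitrarily chosen grid of (w1, w2, b) values for a consistent classifier; the grid is an assumption that happens to suffice for these points.

```python
from itertools import product

# Brute-force shattering check for 2D linear classifiers
# f(x) = sign(w . x + b), with sign(z) = +1 if z > 0 else -1.

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # non-collinear
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]              # candidate w1, w2, b values

def predict(w1, w2, b, x):
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else -1

def shatters(pts):
    """True if every labeling of pts is realized by some (w1, w2, b)."""
    for labels in product([-1, 1], repeat=len(pts)):
        if not any(all(predict(w1, w2, b, p) == l
                       for p, l in zip(pts, labels))
                   for w1, w2, b in product(grid, repeat=3)):
            return False
    return True

print(shatters(points))  # -> True
```

Three collinear points, by contrast, cannot be shattered (the labeling +, −, + along a line is unachievable), consistent with h = m + 1 = 3 in 2D.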

Examples
[Figure: input x and parameters α fed into machine f, producing y^est; regions labeled +1 and −1.]
f(x, b) = sign(x · x − b)

Shattering
Question: Can the following f shatter the following points? f(x, b) = sign(x · x − b)

Shattering
Question: Can the following f shatter the following points? f(x, b) = sign(x · x − b)
Answer: Yes. Hence, the VC dimension of circles centered on the origin is at least 2.

Note that if we pick two points at the same distance from the origin, they cannot be shattered. But shattering only asks whether some arrangement of n points can be shattered, i.e., whether all possible labelings of those n points can be realized. What about 3 points for circles centered on the origin? Can you find 3 points such that all possible labelings can be realized?
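
A brute-force check of the circle examples. One assumption beyond the slide's f(x, b) = sign(x · x − b) is made explicit here: to realize all four labelings of two points, the hypothesis must be allowed to label either the inside or the outside of the circle positive, so this sketch adds an orientation parameter s ∈ {−1, +1}.

```python
from itertools import product

# Shattering check for origin-centered circles. Assumption: an
# orientation s lets either the inside or the outside be positive.

def predict(b, s, x):
    val = x[0] ** 2 + x[1] ** 2 - b   # signed distance from the circle r^2 = b
    return s if val > 0 else -s

def can_shatter(points, bs=(0.5, 1.5, 2.5)):
    """True if every labeling of points is realized by some (b, s).
    bs is an arbitrary grid of thresholds chosen to bracket the radii."""
    for labels in product([-1, 1], repeat=len(points)):
        if not any(all(predict(b, s, p) == l
                       for p, l in zip(points, labels))
                   for b in bs for s in (-1, 1)):
            return False
    return True

# Two points at distinct radii (r^2 = 1 and r^2 = 4): shattered.
print(can_shatter([(1.0, 0.0), (0.0, 2.0)]))   # -> True
# Two points at the same radius always get the same label: not shattered.
print(can_shatter([(1.0, 0.0), (0.0, 1.0)]))   # -> False
```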

Model Selection & Generalization
Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution.
Hence the need for an inductive bias: assumptions about H.
Generalization: how well a model performs on new data.
Different machines have different amounts of power, and there is a trade-off:
More power: can model more complex classifiers, but might overfit.
Less power: will not overfit, but is restricted in what it can model.
Overfitting: H more complex than C or f.
Underfitting: H less complex than C or f.

Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.


Complexity
Complexity is a measure of a set of classifiers, not of any specific (fixed) classifier.
Many possible measures:
degrees of freedom
description length
Vapnik-Chervonenkis (VC) dimension
etc.

Expected and Empirical error


A learning machine
A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y^est = ±1, where α is some vector of adjustable parameters.

Some definitions
Given some machine f, define:
R(α) = TESTERR(α) = E[ (1/2) |y − f(x, α)| ] = probability of misclassification
R_emp(α) = TRAINERR(α) = (1/R) Σ_{k=1}^R (1/2) |y_k − f(x_k, α)| = fraction of the training set misclassified
R = number of training set data points

Vapnik-Chervonenkis dimension
TESTERR(α) = E[ (1/2) |y − f(x, α)| ]
TRAINERR(α) = (1/R) Σ_{k=1}^R (1/2) |y_k − f(x_k, α)|
Given some machine f, let h be its VC dimension (h does not depend on the choice of training set), and let R be the number of training examples. Vapnik showed that, with probability 1 − η,
TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )
This gives us a way to bound the error on future data based only on the training error and the VC dimension of f.

VC-dimension as a measure of complexity
TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )

i | f_i | TRAINERR | VC-Conf | Probable upper bound on TESTERR | Choice
1 | f_1 |          |         |                                 |
2 | f_2 |          |         |                                 |
3 | f_3 |          |         |                                 |
4 | f_4 |          |         |                                 |
5 | f_5 |          |         |                                 |
6 | f_6 |          |         |                                 |

Using VC-dimensionality
People have worked hard to find the VC dimension for:
Decision Trees
Perceptrons
Neural Nets
Decision Lists
Support Vector Machines
and many, many more,
all with the goals of:
1. understanding which learning machines are more or less powerful under which circumstances, and
2. using Structural Risk Minimization to choose the best learning machine.

Alternatives to VC-dim-based model selection
Cross-validation: to estimate generalization error, we need data unseen during training. We split the data into:
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling is used when data are scarce.

Alternatives to VC-dim-based model selection
What could we do instead of the scheme below?
1. Cross-validation

i | f_i | TRAINERR | 10-FOLD-CV-ERR | Choice
1 | f_1 |          |                |
2 | f_2 |          |                |
3 | f_3 |          |                |
4 | f_4 |          |                |
5 | f_5 |          |                |
6 | f_6 |          |                |

Extra Comments
An excellent tutorial on VC dimension and Support Vector Machines:
C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

What you should know
The definition of a learning machine: f(x, α)
The definition of shattering
Be able to work through simple examples of shattering
The definition of VC dimension
Be able to work through simple examples of VC dimension
Structural Risk Minimization for model selection
Awareness of other model selection methods