An introduction to statistical learning theory. Fabrice Rossi TELECOM ParisTech November 2008


About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods 2 / 154 Fabrice Rossi Outline

About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods Outline 1. introduction to the formal model of statistical learning 2. empirical risk and capacity measures 3. capacity control 4. beyond empirical risk minimization 2 / 154 Fabrice Rossi Outline

Part I Introduction and formalization 3 / 154 Fabrice Rossi

Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 4 / 154 Fabrice Rossi

Statistical Learning Statistical Learning 5 / 154 Fabrice Rossi Machine learning in a nutshell

Statistical Learning Statistical Learning = Machine learning + Statistics 5 / 154 Fabrice Rossi Machine learning in a nutshell

Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon 5 / 154 Fabrice Rossi Machine learning in a nutshell

Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) 5 / 154 Fabrice Rossi Machine learning in a nutshell

Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools 5 / 154 Fabrice Rossi Machine learning in a nutshell

Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools Nothing is more practical than a good theory Vladimir Vapnik (1998) 5 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning a phenomenon is recorded via observations (z_i)_{1≤i≤n} with z_i ∈ Z 6 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning a phenomenon is recorded via observations (z_i)_{1≤i≤n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z; the goal is then to find some structures: clusters, association rules, distributions, etc. 6 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning a phenomenon is recorded via observations (z_i)_{1≤i≤n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z; the goal is then to find some structures: clusters, association rules, distributions, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y; modelling: finding how x and y are related; the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon 6 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning a phenomenon is recorded via observations (z_i)_{1≤i≤n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z; the goal is then to find some structures: clusters, association rules, distributions, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y; modelling: finding how x and y are related; the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon this course focuses on supervised learning 6 / 154 Fabrice Rossi Machine learning in a nutshell

Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures; Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x; Y = something complex but structured, e.g., construct a parse tree for a natural language sentence 7 / 154 Fabrice Rossi Machine learning in a nutshell

Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures; Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x; Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y 7 / 154 Fabrice Rossi Machine learning in a nutshell

Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures; Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x; Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y Vapnik's message: don't build a regression model to do classification 7 / 154 Fabrice Rossi Machine learning in a nutshell

Model quality given a dataset (x_i, y_i)_{1≤i≤n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i 8 / 154 Fabrice Rossi Machine learning in a nutshell

Model quality given a dataset (x_i, y_i)_{1≤i≤n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i or not? 8 / 154 Fabrice Rossi Machine learning in a nutshell

Model quality given a dataset (x_i, y_i)_{1≤i≤n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i or not? perfect interpolation is always possible but is that what we are looking for? 8 / 154 Fabrice Rossi Machine learning in a nutshell

Model quality given a dataset (x_i, y_i)_{1≤i≤n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i or not? perfect interpolation is always possible but is that what we are looking for? No! We want generalization 8 / 154 Fabrice Rossi Machine learning in a nutshell

Model quality given a dataset (x_i, y_i)_{1≤i≤n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i or not? perfect interpolation is always possible but is that what we are looking for? No! We want generalization: a good model must learn something smart from the data, not simply memorize them; on new data, it should be such that g(x_i) ≈ y_i (for i > n) 8 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. 9 / 154 Fabrice Rossi Machine learning in a nutshell

Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. the statistical learning framework addresses some of those goals: it provides asymptotic guarantees about learnability it gives non asymptotic confidence bounds around performance estimates it suggests/validates best practices (hold out estimates, regularization, etc.) 9 / 154 Fabrice Rossi Machine learning in a nutshell

Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 10 / 154 Fabrice Rossi Formalization

Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent 11 / 154 Fabrice Rossi Formalization

Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent remarks: the phenomenon is stationary, i.e., P does not change: new data will be distributed as learning data; extensions to non-stationary situations exist, see e.g. covariate shift and concept drift; the independence assumption says that each observation brings new information; extensions to dependent observations exist, e.g., Markovian models for time series analysis 11 / 154 Fabrice Rossi Formalization

Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) 12 / 154 Fabrice Rossi Formalization

Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g); this is a measure of generalization capabilities (if g has been built from a dataset) 12 / 154 Fabrice Rossi Formalization

Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g); this is a measure of generalization capabilities (if g has been built from a dataset) remark: we need a probability space, c and g have to be measurable functions, etc. 12 / 154 Fabrice Rossi Formalization

We are frequentists 13 / 154 Fabrice Rossi Formalization

Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L_1 (Laplacian) cost: c(g(x), y) = |g(x) − y|; ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε); Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise 14 / 154 Fabrice Rossi Formalization

Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L_1 (Laplacian) cost: c(g(x), y) = |g(x) − y|; ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε); Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification 14 / 154 Fabrice Rossi Formalization

Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L_1 (Laplacian) cost: c(g(x), y) = |g(x) − y|; ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε); Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification Remark: the cost is used to measure the quality of the model; the machine learning algorithm may use a different loss to build the model, e.g.: the hinge loss for support vector machines; more on this in part IV 14 / 154 Fabrice Rossi Formalization
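
As a concrete illustration of the costs listed above, here is a minimal sketch in Python/NumPy (not part of the original slides; the function names and default parameter values are chosen purely for illustration):

```python
import numpy as np

def quadratic_cost(pred, y):
    """Standard regression cost: squared error ||g(x) - y||^2."""
    return np.sum((pred - y) ** 2)

def l1_cost(pred, y):
    """Laplacian cost |g(x) - y| (for Y = R)."""
    return np.abs(pred - y)

def eps_insensitive_cost(pred, y, eps=0.1):
    """epsilon-insensitive cost: errors smaller than eps are ignored."""
    return np.maximum(0.0, np.abs(pred - y) - eps)

def huber_cost(pred, y, delta=1.0):
    """Huber's loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(pred - y)
    return np.where(r < delta, r ** 2, 2 * delta * (r - delta / 2))

def zero_one_cost(pred, y):
    """0/1 cost for classification: 1 if misclassified, 0 otherwise."""
    return (np.asarray(pred) != np.asarray(y)).astype(float)
```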

Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases 15 / 154 Fabrice Rossi Formalization

Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases: keep in mind that L(g_n) is a random variable; if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(g_n)} 15 / 154 Fabrice Rossi Formalization

Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y 16 / 154 Fabrice Rossi Formalization

Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y a machine learning method is: strongly consistent: if L(g_n) → L* a.s.; consistent: if E{L(g_n)} → L*; universally (strongly) consistent: if the convergence holds for any distribution of the data P remark: if c is bounded then lim_n E{L(g_n)} = L* is equivalent to L(g_n) → L* in probability 16 / 154 Fabrice Rossi Formalization

Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set 17 / 154 Fabrice Rossi Formalization

Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any sequence of observations (x_i, y_i)_{i≥1} and for any ε > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ε, where g_n is built on (x_i, y_i)_{1≤i≤n}; N depends on ε and on the dataset (x_i, y_i)_{i≥1} 17 / 154 Fabrice Rossi Formalization

Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any sequence of observations (x_i, y_i)_{i≥1} and for any ε > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ε, where g_n is built on (x_i, y_i)_{1≤i≤n}; N depends on ε and on the dataset (x_i, y_i)_{i≥1} weak case: for any ε > 0 there is N such that for n ≥ N, E{L(g_n)} ≤ L* + ε, where g_n is built on (x_i, y_i)_{1≤i≤n} 17 / 154 Fabrice Rossi Formalization

Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε, but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ε, but what about L(g_n) for a particular dataset? see also the strong case! 18 / 154 Fabrice Rossi Formalization

Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε, but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ε, but what about L(g_n) for a particular dataset? see also the strong case! Stronger result, Probably Approximately Optimal algorithm: for all ε, δ, there is n(ε, δ) such that for all n ≥ n(ε, δ), P{L(g_n) > L* + ε} < δ Gives some convergence speed: especially interesting when n(ε, δ) is independent of P (distribution free) 18 / 154 Fabrice Rossi Formalization

From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent 19 / 154 Fabrice Rossi Formalization

From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} → 0 as n → ∞; by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability; then, if c is bounded, E{L(g_n)} → L* 19 / 154 Fabrice Rossi Formalization

From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} → 0 as n → ∞; by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability; then, if c is bounded, E{L(g_n)} → L* if convergence happens faster, then the method might be strongly consistent; we need a very sharp decrease of P{L(g_n) > L* + ε} with n, i.e., a slow increase of n(ε, δ) when δ decreases 19 / 154 Fabrice Rossi Formalization

From PAO to strong consistency this is based on the Borel-Cantelli Lemma: let (A_i)_{i≥1} be a sequence of events; define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often); the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 20 / 154 Fabrice Rossi Formalization

From PAO to strong consistency this is based on the Borel-Cantelli Lemma: let (A_i)_{i≥1} be a sequence of events; define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often); the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ε} < ∞, then P{[{L(g_n) > L* + ε} i.o.]} = 0 20 / 154 Fabrice Rossi Formalization

From PAO to strong consistency this is based on the Borel-Cantelli Lemma: let (A_i)_{i≥1} be a sequence of events; define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often); the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ε} < ∞, then P{[{L(g_n) > L* + ε} i.o.]} = 0 we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j} 20 / 154 Fabrice Rossi Formalization

From PAO to strong consistency this is based on the Borel-Cantelli Lemma: let (A_i)_{i≥1} be a sequence of events; define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often); the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ε} < ∞, then P{[{L(g_n) > L* + ε} i.o.]} = 0 we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j} and therefore, L(g_n) → L* a.s. 20 / 154 Fabrice Rossi Formalization

Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n its risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies L(g_n) − L* and E{L(g_n)} − L* the preferred approach is to show a fast decrease of P{L(g_n) > L* + ε} with n no hypothesis on the data (i.e., on P), a.k.a. distribution-free results 21 / 154 Fabrice Rossi Formalization

Classical statistics Even though we are frequentists... this program is quite different from classical statistics: no optimal parameters, no identifiability: intrinsically non-parametric; no optimal model: only optimal performances; no asymptotic distribution: focus on finite distance inequalities; no (or minimal) assumption on the data distribution; minimal hypotheses are justified by our lack of knowledge on the studied phenomenon, but have strong and annoying consequences linked to the worst case scenario they imply 22 / 154 Fabrice Rossi Formalization

Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖², the minimum risk L* is reached by g(x) = E{Y | X = x} 23 / 154 Fabrice Rossi Formalization

Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖², the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x}; the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class, arg max_k P{Y = k | X = x} 23 / 154 Fabrice Rossi Formalization

Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖², the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x}; the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class, arg max_k P{Y = k | X = x} we only need to mimic the prediction of the model, not the model itself: for optimal classification, we don't need to compute the posterior probabilities; for the binary case, the decision is based on P{Y = 1 | X = x} ≥ P{Y = −1 | X = x}; we only need to agree with the sign of E{Y | X = x} 23 / 154 Fabrice Rossi Formalization
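
To make the Bayes classifier concrete, here is a small sketch for a toy binary problem where the posterior can be computed exactly. The specific model (class prior 0.3, Gaussian class-conditional densities) is an illustrative assumption, not something taken from the slides:

```python
import numpy as np

# Hypothetical toy model: P{Y=1} = 0.3, X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1).
prior = np.array([0.7, 0.3])
means = np.array([0.0, 2.0])

def gaussian_pdf(x, mean):
    """Density of N(mean, 1) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    """Assign x to the most probable class: arg max_k P{Y = k | X = x}."""
    # The posterior is proportional to prior[k] * density of X given Y = k.
    return int(np.argmax(prior * gaussian_pdf(x, means)))

# The risk of this rule is the Bayes risk L*; estimate it by Monte Carlo.
rng = np.random.default_rng(0)
y = (rng.random(50_000) < prior[1]).astype(int)
x = rng.normal(means[y], 1.0)
preds = np.array([bayes_classifier(xi) for xi in x])
print("estimated Bayes risk L* ≈", np.mean(preds != y))
```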

No free lunch unfortunately, making no hypothesis on P costs a lot 24 / 154 Fabrice Rossi Formalization

No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost 24 / 154 Fabrice Rossi Formalization

No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method, for all ε > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ε 24 / 154 Fabrice Rossi Formalization

No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method, for all ε > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ε arbitrarily slow convergence always happens: given a fixed machine learning method and a decreasing sequence (a_n) with limit 0 (and a_1 ≤ 1/16), there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ a_n 24 / 154 Fabrice Rossi Formalization

Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n, for all ε > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ε 25 / 154 Fabrice Rossi Formalization

Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n, for all ε > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ε regression is more difficult than classification: if η_n is an estimator of η(x) = E{Y | X = x} in a weak sense, i.e., lim_n E{(η_n(X) − η(X))²} = 0, then for g_n(x) = I{η_n(x) > 1/2}, we have lim_n (E{L(g_n)} − L*) / √(E{(η_n(X) − η(X))²}) = 0 25 / 154 Fabrice Rossi Formalization

Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P 26 / 154 Fabrice Rossi Formalization

Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient; if P is arbitrarily complex, then no generalization can occur from a finite learning set; e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses 26 / 154 Fabrice Rossi Formalization

Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient; if P is arbitrarily complex, then no generalization can occur from a finite learning set; e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses but we don't want hypotheses on P... 26 / 154 Fabrice Rossi Formalization

Estimation and approximation solution: split the problem in two parts, estimation and approximation 27 / 154 Fabrice Rossi Formalization

Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y; study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G; statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! 27 / 154 Fabrice Rossi Formalization

Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y; study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G; statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! approximation: compare G to the set of all possible models; in other words, study inf_{g∈G} L(g) − L*; function approximation results (density), no statistics! 27 / 154 Fabrice Rossi Formalization

Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost 28 / 154 Fabrice Rossi Formalization

Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose g_n in G by optimizing a quality measure defined via D_n 28 / 154 Fabrice Rossi Formalization

Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose g_n in G by optimizing a quality measure defined via D_n examples: linear regression (X is a Hilbert space, Y = R): G = {x ↦ ⟨w, x⟩}; find w_n by minimizing the mean squared error, w_n = arg min_w (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)²; linear classification (X = R^p, Y = {1,..., c}): G = {x ↦ arg max_k (Wx)_k}; find W e.g., by minimizing a mean squared error for a disjunctive coding of the classes 28 / 154 Fabrice Rossi Formalization
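
A minimal sketch of the linear regression example above: empirical risk minimization over the linear class, solved by ordinary least squares. The synthetic data, dimensions and noise level are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset D_n = (x_i, y_i), i = 1..n (assumption: X = R^3, Y = R).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# ERM over the linear class: w_n = arg min_w (1/n) sum_i (y_i - <w, x_i>)^2.
# The minimizer solves the normal equations X^T X w = X^T y.
w_n = np.linalg.lstsq(X, y, rcond=None)[0]

empirical_risk = np.mean((y - X @ w_n) ** 2)
print("w_n =", w_n, " L_n(g_n) =", empirical_risk)
```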

Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) 29 / 154 Fabrice Rossi Formalization

Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) 29 / 154 Fabrice Rossi Formalization

Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_n L_n(g) = L(g) 29 / 154 Fabrice Rossi Formalization

Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_n L_n(g) = L(g) the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV) 29 / 154 Fabrice Rossi Formalization

Empirical risk minimization generic machine learning method: fix a model class G; choose g_n in G by g_n = arg min_{g∈G} L_n(g); denote L*_G = inf_{g∈G} L(g) 30 / 154 Fabrice Rossi Formalization

Empirical risk minimization generic machine learning method: fix a model class G; choose g_n in G by g_n = arg min_{g∈G} L_n(g); denote L*_G = inf_{g∈G} L(g) does that work? E{L(g_n)} → L*_G? E{L(g_n)} → L*? the SLLN does not help as g_n depends on n 30 / 154 Fabrice Rossi Formalization

Empirical risk minimization generic machine learning method: fix a model class G; choose g_n in G by g_n = arg min_{g∈G} L_n(g); denote L*_G = inf_{g∈G} L(g) does that work? E{L(g_n)} → L*_G? E{L(g_n)} → L*? the SLLN does not help as g_n depends on n we are looking for bounds: L(g_n) < L_n(g_n) + B (bound the risk using the empirical risk); L(g_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class) 30 / 154 Fabrice Rossi Formalization

Empirical risk minimization generic machine learning method: fix a model class G; choose g_n in G by g_n = arg min_{g∈G} L_n(g); denote L*_G = inf_{g∈G} L(g) does that work? E{L(g_n)} → L*_G? E{L(g_n)} → L*? the SLLN does not help as g_n depends on n we are looking for bounds: L(g_n) < L_n(g_n) + B (bound the risk using the empirical risk); L(g_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class) L*_G − L* is handled differently (approximation vs estimation) 30 / 154 Fabrice Rossi Formalization

Complexity control The overfitting issue controlling L_n(g_n) − L(g_n) is not possible in arbitrary classes G 31 / 154 Fabrice Rossi Formalization

Complexity control The overfitting issue controlling L_n(g_n) − L(g_n) is not possible in arbitrary classes G a simple example: binary classification; G: all classifiers that are piecewise constant on arbitrary partitions of X; obviously L_n(g_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule); but L(g_n) > 0 when L* > 0! the 1-nn rule is overfitting 31 / 154 Fabrice Rossi Formalization

Complexity control The overfitting issue controlling L_n(g_n) − L(g_n) is not possible in arbitrary classes G a simple example: binary classification; G: all classifiers that are piecewise constant on arbitrary partitions of X; obviously L_n(g_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule); but L(g_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II); this implies L*_G > L* (for arbitrary P) 31 / 154 Fabrice Rossi Formalization

Complexity control The overfitting issue controlling L_n(g_n) − L(g_n) is not possible in arbitrary classes G a simple example: binary classification; G: all classifiers that are piecewise constant on arbitrary partitions of X; obviously L_n(g_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule); but L(g_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II); this implies L*_G > L* (for arbitrary P) how to reach L*? (will be studied in part III) 31 / 154 Fabrice Rossi Formalization

Complexity control The overfitting issue controlling L_n(g_n) − L(g_n) is not possible in arbitrary classes G a simple example: binary classification; G: all classifiers that are piecewise constant on arbitrary partitions of X; obviously L_n(g_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule); but L(g_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II); this implies L*_G > L* (for arbitrary P) how to reach L*? (will be studied in part III) in other words, how to balance estimation error and approximation error 31 / 154 Fabrice Rossi Formalization
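
The overfitting of the 1-nn rule described above is easy to reproduce numerically. This is a sketch under an assumed toy distribution (label noise 0.2, so L* = 0.2, and Gaussian inputs), none of which comes from the slides: the empirical risk is 0 by construction, while the risk stays well above L*:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_labels(x, rng):
    """Y = 1 with probability 0.8 when x > 0, 0.2 otherwise, so L* = 0.2."""
    p = np.where(x > 0, 0.8, 0.2)
    return (rng.random(x.shape) < p).astype(int)

def one_nn_predict(x_train, y_train, x_new):
    """1-nearest-neighbour classifier on the real line."""
    idx = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

n = 500
x_train = rng.normal(size=n)
y_train = noisy_labels(x_train, rng)

# Empirical risk: 0, since each training point is its own nearest neighbour.
train_err = np.mean(one_nn_predict(x_train, y_train, x_train) != y_train)

# Risk estimated on fresh data: close to 2 L*(1 - L*) = 0.32, not to L* = 0.2.
x_test = rng.normal(size=10_000)
y_test = noisy_labels(x_test, rng)
test_err = np.mean(one_nn_predict(x_train, y_train, x_test) != y_test)
print(f"L_n(g_n) = {train_err:.3f}, estimated L(g_n) = {test_err:.3f}")
```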

Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size 32 / 154 Fabrice Rossi Formalization

Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics; make sure that L_n(g_n) − L(g_n) is under control when g_n = arg min_{g∈G_n} L_n(g) 32 / 154 Fabrice Rossi Formalization

Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics; make sure that L_n(g_n) − L(g_n) is under control when g_n = arg min_{g∈G_n} L_n(g) approximation part: functional analysis; make sure that lim_n L*_{G_n} = L*; related to density arguments (universal approximation) 32 / 154 Fabrice Rossi Formalization

Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics; make sure that L_n(g_n) − L(g_n) is under control when g_n = arg min_{g∈G_n} L_n(g) approximation part: functional analysis; make sure that lim_n L*_{G_n} = L*; related to density arguments (universal approximation) this is the simplest approach but also the least realistic: G_n depends only on the data size 32 / 154 Fabrice Rossi Formalization

Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes 33 / 154 Fabrice Rossi Formalization

Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes more precisely: compute g_{n,d} by empirical risk minimization in G_d; choose g_n by minimizing over d: L_n(g_{n,d}) + r(n, d) 33 / 154 Fabrice Rossi Formalization

Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes more precisely: compute g_{n,d} by empirical risk minimization in G_d; choose g_n by minimizing over d: L_n(g_{n,d}) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G_d, to favor models for which the risk is correctly estimated by the empirical risk; it corresponds to a complexity measure for G_d 33 / 154 Fabrice Rossi Formalization

Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes more precisely: compute g_{n,d} by empirical risk minimization in G_d; choose g_n by minimizing over d: L_n(g_{n,d}) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G_d, to favor models for which the risk is correctly estimated by the empirical risk; it corresponds to a complexity measure for G_d more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost 33 / 154 Fabrice Rossi Formalization
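
To make the procedure concrete, here is a schematic sketch of structural risk minimization. The ingredients are assumptions chosen for illustration only: nested classes G_d of polynomials of degree d, and a penalty r(n, d) of the form √((d + 1) log n / n); the slides do not prescribe this particular penalty, whose correct form depends on the capacity measure discussed in part II.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)

def erm_in_Gd(x, y, d):
    """Empirical risk minimization in G_d = polynomials of degree d."""
    coeffs = np.polyfit(x, y, deg=d)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, np.mean(residuals ** 2)

def penalty(n, d):
    """Assumed penalty r(n, d): decreases with n, increases with d."""
    return np.sqrt((d + 1) * np.log(n) / n)

# Structural risk minimization: pick d minimizing L_n(g_{n,d}) + r(n, d).
scores = []
for d in range(0, 11):
    _, emp_risk = erm_in_Gd(x, y, d)
    scores.append(emp_risk + penalty(n, d))
best_d = int(np.argmin(scores))
print("selected degree:", best_d)
```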

Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) 34 / 154 Fabrice Rossi Formalization

Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L*_G = L*; define g_n by g_n = arg min_{g∈G} (L_n(g) + λ_n R(g)) 34 / 154 Fabrice Rossi Formalization

Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L*_G = L*; define g_n by g_n = arg min_{g∈G} (L_n(g) + λ_n R(g)) R(g) measures the complexity of the model; the trade-off between the risk and the complexity, λ_n, must be carefully chosen 34 / 154 Fabrice Rossi Formalization

Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L*_G = L*; define g_n by g_n = arg min_{g∈G} (L_n(g) + λ_n R(g)) R(g) measures the complexity of the model; the trade-off between the risk and the complexity, λ_n, must be carefully chosen this is (one of) the best solution(s) 34 / 154 Fabrice Rossi Formalization
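
One standard instance of this penalized scheme is ridge regression: G is the linear class, R(g) = ‖w‖², and λ_n trades the empirical risk against the complexity. The sketch below uses a fixed, arbitrarily chosen λ (in practice λ_n must be tuned carefully, as the slide notes); the data are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

def regularized_erm(X, y, lam):
    """g_n = arg min_w  L_n(w) + lam * ||w||^2  (ridge closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w_n = regularized_erm(X, y, lam=0.1)
penalized_risk = np.mean((y - X @ w_n) ** 2) + 0.1 * np.sum(w_n ** 2)
print("penalized empirical risk:", penalized_risk)
```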

Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n a good ML algorithm builds a universally strongly consistent model, i.e. for all P, L(g_n) → L* a.s. a very general class of ML algorithms is based on empirical risk minimization, i.e. g_n = arg min_{g∈G} (1/n) Σ_{i=1}^n c(g(X_i), Y_i) consistency is obtained by controlling the estimation error and the approximation error 35 / 154 Fabrice Rossi Formalization

Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization 36 / 154 Fabrice Rossi Formalization

Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization not in this lecture (see the references): ad hoc methods such as k nearest neighbours and trees advanced topics such as noise conditions and data dependent bounds clustering Bayesian point of view and many other things... 36 / 154 Fabrice Rossi Formalization

Part II Empirical risk and capacity measures 37 / 154 Fabrice Rossi

Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 38 / 154 Fabrice Rossi

Empirical risk Empirical risk minimisation (ERM) relies on the link between L_n(g) and L(g) the law of large numbers is too limited: asymptotic only, g must be fixed we need: 1. quantitative finite distance results, i.e., PAO-like bounds on P{|L_n(g) − L(g)| > ε} 2. uniform results, i.e. bounds on P{sup_{g∈G} |L_n(g) − L(g)| > ε} 39 / 154 Fabrice Rossi Concentration

Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} 40 / 154 Fabrice Rossi Concentration

Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} some standard bounds: Markov: P{|X| ≥ t} ≤ E{|X|}/t; Bienaymé-Chebyshev: P{|X − E{X}| ≥ t} ≤ Var(X)/t²; BC gives, for n independent real valued random variables X_i, denoting S_n = Σ_{i=1}^n X_i: P{|S_n − E{S_n}| ≥ t} ≤ Σ_{i=1}^n Var(X_i)/t² 40 / 154 Fabrice Rossi Concentration

Hoeffding's inequality (Hoeffding, 1963) hypothesis: X_1,..., X_n, n independent r.v., X_i ∈ [a_i, b_i], S_n = Σ_{i=1}^n X_i result: P{S_n − E{S_n} ≥ ε} ≤ e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²} and P{S_n − E{S_n} ≤ −ε} ≤ e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²} A quantitative, distribution-free, finite distance law of large numbers! 41 / 154 Fabrice Rossi Concentration
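
The inequality is easy to check by simulation: the empirical tail probability of a sum of bounded i.i.d. variables sits below the Hoeffding bound. A small sketch (the Bernoulli variables and the particular values of n and ε are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps, reps = 100, 0.3, 5.0, 200_000

# X_i ~ Bernoulli(p), so X_i is in [0, 1]; S_n = sum of the X_i, E{S_n} = n p.
S = rng.binomial(1, p, size=(reps, n)).sum(axis=1)
empirical_tail = np.mean(S - n * p >= eps)

# Hoeffding: P{S_n - E{S_n} >= eps} <= exp(-2 eps^2 / sum_i (b_i - a_i)^2) = exp(-2 eps^2 / n).
hoeffding_bound = np.exp(-2 * eps ** 2 / n)
print(f"empirical tail: {empirical_tail:.4f}   Hoeffding bound: {hoeffding_bound:.4f}")
```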

Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression, or the misclassification cost in binary classification ([a, b] = [0, 1]) 42 / 154 Fabrice Rossi Concentration

Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression, or the misclassification cost in binary classification ([a, b] = [0, 1]) U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) 42 / 154 Fabrice Rossi Concentration

Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression, or the misclassification cost in binary classification ([a, b] = [0, 1]) U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) Σ_{i=1}^n (b_i − a_i)² = (b − a)²/n and therefore P{|L_n(g) − L(g)| ≥ ε} ≤ 2e^{−2nε²/(b − a)²} 42 / 154 Fabrice Rossi Concentration

Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression, or the misclassification cost in binary classification ([a, b] = [0, 1]) U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) Σ_{i=1}^n (b_i − a_i)² = (b − a)²/n and therefore P{|L_n(g) − L(g)| ≥ ε} ≤ 2e^{−2nε²/(b − a)²} then L_n(g) → L(g) in probability and L_n(g) → L(g) a.s. (Borel-Cantelli) 42 / 154 Fabrice Rossi Concentration

Limitations g cannot depend on the (X_i, Y_i) (independent test set) intuitively: set δ = 2e^{−2nε²/(b − a)²}; for each g, the probability to draw from P a D_n on which |L_n(g) − L(g)| ≤ (b − a)√(log(2/δ)/(2n)) is at least 1 − δ; but the sets of correct D_n differ for each g; conversely, for a fixed D_n, the bound is valid only for some of the models; this cannot be used to study g_n = arg min_{g∈G} L_n(g) 43 / 154 Fabrice Rossi Concentration

The test set Hoeffding's inequality justifies the test set approach (hold-out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m); use a ML algorithm to build g_n using only D_n; claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^m c(g_n(X_i), Y_i) 44 / 154 Fabrice Rossi Concentration

The test set Hoeffding's inequality justifies the test set approach (hold-out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m); use a ML algorithm to build g_n using only D_n; claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^m c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a)√(log(1/δ)/(2p)) 44 / 154 Fabrice Rossi Concentration

The test set Hoeffding's inequality justifies the test set approach (hold-out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m); use a ML algorithm to build g_n using only D_n; claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^m c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a)√(log(1/δ)/(2p)) Example: classification (a = 0, b = 1), with a good classifier L_p(g_n) = 0.05; to be 99% sure that L(g_n) ≤ 0.06, we need p ≈ 23000 (!!!) 44 / 154 Fabrice Rossi Concentration
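
The figure of roughly 23000 test points comes from inverting the hold-out bound: requiring (b − a)√(log(1/δ)/(2p)) ≤ ε gives p ≥ (b − a)² log(1/δ)/(2ε²). A quick sketch reproducing the computation:

```python
import math

def required_test_size(eps, delta, a=0.0, b=1.0):
    """Smallest p such that (b - a) * sqrt(log(1/delta) / (2p)) <= eps."""
    return math.ceil((b - a) ** 2 * math.log(1 / delta) / (2 * eps ** 2))

# 99% confidence that the risk exceeds the hold-out estimate by at most 0.01.
print(required_test_size(eps=0.01, delta=0.01))  # 23026, i.e. about 23000
```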

Proof of Hoeffding's inequality Chernoff's bounding technique (an application of Markov's inequality): P{X ≥ t} = P{e^{sX} ≥ e^{st}} ≤ E{e^{sX}} e^{−st}; the bound is controlled via s; lemma: if E{X} = 0 and X ∈ [a, b], then for all s, E{e^{sX}} ≤ e^{s²(b − a)²/8}; as x ↦ e^{sx} is convex, E{e^{sX}} ≤ e^{φ(s(b − a))}, then we bound φ; this is used for X = S_n − E{S_n} 45 / 154 Fabrice Rossi Concentration

Proof of Hoeffding's inequality P{S_n − E{S_n} ≥ ε} ≤ e^{−sε} E{e^{s Σ_{i=1}^n (X_i − E{X_i})}} = e^{−sε} Π_{i=1}^n E{e^{s(X_i − E{X_i})}} (independence) ≤ e^{−sε} Π_{i=1}^n e^{s²(b_i − a_i)²/8} (lemma) = e^{−sε} e^{s² Σ_{i=1}^n (b_i − a_i)²/8} = e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²} (minimization, with s = 4ε/Σ_{i=1}^n (b_i − a_i)²) 46 / 154 Fabrice Rossi Concentration

Remarks Hoeffding's inequality is useful in many other contexts: Large-scale machine learning (e.g., [Domingos and Hulten, 2001]): enormous dataset (e.g., cannot be loaded in memory); process growing subsets until the model quality is known with sufficient confidence; monitor drifting via confidence bounds Regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]): define strategies by minimizing a regret measure; get bounds on the regret estimations (e.g., trade-off between exploration and exploitation in multi-armed bandits) Many improvements/complements are available, such as: Bernstein's inequality, McDiarmid's bounded difference inequality 47 / 154 Fabrice Rossi Concentration

Uniform bounds we deal with the estimation error: given the ERM choice g_n = arg min_{g∈G} L_n(g), what about L(g_n) − inf_{g∈G} L(g)? 48 / 154 Fabrice Rossi Concentration

Uniform bounds we deal with the estimation error: given the ERM choice g_n = arg min_{g∈G} L_n(g), what about L(g_n) − inf_{g∈G} L(g)? we need uniform bounds: |L_n(g_n) − L(g_n)| ≤ sup_{g∈G} |L_n(g) − L(g)|; L(g_n) − inf_{g∈G} L(g) = L(g_n) − L_n(g_n) + L_n(g_n) − inf_{g∈G} L(g) ≤ |L(g_n) − L_n(g_n)| + sup_{g∈G} |L_n(g) − L(g)| ≤ 2 sup_{g∈G} |L_n(g) − L(g)| 48 / 154 Fabrice Rossi Concentration

Uniform bounds we deal with the estimation error: given the ERM choice g_n = arg min_{g∈G} L_n(g), what about L(g_n) − inf_{g∈G} L(g)? we need uniform bounds: |L_n(g_n) − L(g_n)| ≤ sup_{g∈G} |L_n(g) − L(g)|; L(g_n) − inf_{g∈G} L(g) = L(g_n) − L_n(g_n) + L_n(g_n) − inf_{g∈G} L(g) ≤ |L(g_n) − L_n(g_n)| + sup_{g∈G} |L_n(g) − L(g)| ≤ 2 sup_{g∈G} |L_n(g) − L(g)| therefore uniform bounds give an answer to both questions: the quality of the empirical risk as an estimate of the risk and the quality of the model selected by ERM 48 / 154 Fabrice Rossi Concentration

Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1≤i≤m} U_i ≥ ε} ≤ Σ_{i=1}^m P{U_i ≥ ε} 49 / 154 Fabrice Rossi Concentration

Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1≤i≤m} U_i ≥ ε} ≤ Σ_{i=1}^m P{U_i ≥ ε} then when G is finite, P{sup_{g∈G} |L_n(g) − L(g)| ≥ ε} ≤ 2|G| e^{−2nε²/(b − a)²} so L(g_n) → inf_{g∈G} L(g) a.s. (but in general inf_{g∈G} L(g) > L*) 49 / 154 Fabrice Rossi Concentration

Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1≤i≤m} U_i ≥ ε} ≤ Σ_{i=1}^m P{U_i ≥ ε} then when G is finite, P{sup_{g∈G} |L_n(g) − L(g)| ≥ ε} ≤ 2|G| e^{−2nε²/(b − a)²} so L(g_n) → inf_{g∈G} L(g) a.s. (but in general inf_{g∈G} L(g) > L*) a quantitative, distribution-free, uniform, finite distance law of large numbers! 49 / 154 Fabrice Rossi Concentration

Bound versus size with probability at least 1 − δ, L(g_n) ≤ inf_{g∈G} L(g) + 2(b − a)√((log|G| + log(2/δ))/(2n)) inf_{g∈G} L(g) decreases with |G| (more models) but the bound increases with |G|: trade-off between flexibility and estimation quality remark: log₂|G| is the number of bits needed to encode the model choice; this is a capacity measure 50 / 154 Fabrice Rossi Concentration
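
To get a feel for how slowly the estimation term grows with the class size, the bound can be evaluated numerically. A small helper (the values of n, δ and |G| are arbitrary and purely illustrative):

```python
import math

def finite_class_bound(card_G, n, delta, a=0.0, b=1.0):
    """Estimation term 2 (b - a) sqrt((log|G| + log(2/delta)) / (2n))."""
    return 2 * (b - a) * math.sqrt((math.log(card_G) + math.log(2 / delta)) / (2 * n))

# The term grows only logarithmically with |G| and shrinks like 1/sqrt(n).
for card_G in (10, 10_000, 10_000_000):
    print(card_G, round(finite_class_bound(card_G, n=10_000, delta=0.01), 4))
```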

Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 51 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite; in addition, inf_{g∈G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite; in addition, inf_{g∈G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite; in addition, inf_{g∈G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size first step: binary classification (the simplest case); a model is then a measurable set A ⊂ X; the capacity of G measures the variability of the available shapes; it is evaluated with respect to the data: if A ∩ D_n = B ∩ D_n, then A and B are identical 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) given z_1 ∈ R^d,..., z_n ∈ R^d, we define F_{z_1,...,z_n} = {u ∈ {0, 1}^n | there is f ∈ F with u = (f(z_1),..., f(z_n))} interpretation: each u describes a binary partition of z_1,..., z_n and F_{z_1,...,z_n} is then the set of partitions implemented by F 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) given z_1 ∈ R^d,..., z_n ∈ R^d, we define F_{z_1,...,z_n} = {u ∈ {0, 1}^n | there is f ∈ F with u = (f(z_1),..., f(z_n))} interpretation: each u describes a binary partition of z_1,..., z_n and F_{z_1,...,z_n} is then the set of partitions implemented by F Growth function S_F(n) = sup_{z_1∈R^d,...,z_n∈R^d} |F_{z_1,...,z_n}| interpretation: maximal number of binary partitions implementable via F (distribution-free worst case analysis) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F; if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F; if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n Vapnik-Chervonenkis dimension VCdim(F) = sup{n ∈ N_+ | S_F(n) = 2^n} 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F; if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n Vapnik-Chervonenkis dimension VCdim(F) = sup{n ∈ N_+ | S_F(n) = 2^n} interpretation: if S_F(n) < 2^n: for all z_1,..., z_n, there is a partition u ∈ {0, 1}^n such that u ∉ F_{z_1,...,z_n}; for any dataset of size n, F can fail to reach a zero empirical risk; if S_F(n) = 2^n: there is at least one set z_1,..., z_n shattered by F 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Example points from R², F = {I_{ax+by+c≥0}} 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Example points from R², F = {I_{ax+by+c≥0}} S_F(3) = 2³ 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Example points from R², F = {I_{ax+by+c≥0}} S_F(3) = 2³, even if we have sometimes |F_{z_1,z_2,z_3}| < 2³ 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Example points from R², F = {I_{ax+by+c≥0}} S_F(3) = 2³, even if we have sometimes |F_{z_1,z_2,z_3}| < 2³ S_F(4) < 2⁴ 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Example points from R², F = {I_{ax+by+c≥0}} S_F(3) = 2³, even if we have sometimes |F_{z_1,z_2,z_3}| < 2³ S_F(4) < 2⁴ VCdim(F) = 3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
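
The claim VCdim(F) = 3 for half-planes of R² can be checked by brute force on small point sets: enumerate all labelings and test whether some half-plane realizes each of them. The sketch below uses an LP feasibility test for linear separability (the particular point sets are illustrative; brute force on one set of four points only shows that this set is not shattered, which is consistent with, but does not by itself prove, the general statement):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def realizable(points, labels):
    """Is there (a, b, c) with a*x + b*y + c >= 0 exactly on the label-1 points?
    For finite sets this is equivalent to strict linear separability, tested as an LP."""
    signs = np.where(np.array(labels) == 1, 1.0, -1.0)
    A_ub = -signs[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

def shatter_count(points):
    """Number of labelings of the points realized by half-planes."""
    n = len(points)
    return sum(realizable(points, labels)
               for labels in itertools.product([0, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shatter_count(three))  # 8 = 2^3: these three points are shattered
print(shatter_count(four))   # 14 < 16: the XOR labelings are not realizable
```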

Linear models result if G is a p-dimensional vector space of real valued functions defined on R^d, then VCdim({f : R^d → {0, 1} | there is g ∈ G with f(z) = I_{g(z)≥0}}) ≤ p 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Linear models result if G is a p-dimensional vector space of real valued functions defined on R^d, then VCdim({f : R^d → {0, 1} | there is g ∈ G with f(z) = I_{g(z)≥0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Linear models result if G is a p-dimensional vector space of real valued functions defined on R^d, then VCdim({f : R^d → {0, 1} | there is g ∈ G with f(z) = I_{g(z)≥0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) as dim(F(G)) ≤ p, there is a non zero γ = (γ_1,..., γ_{p+1}) such that Σ_{i=1}^{p+1} γ_i g(z_i) = 0 for all g ∈ G 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

Linear models result if G is a p-dimensional vector space of real valued functions defined on R^d, then VCdim({f : R^d → {0, 1} | there is g ∈ G with f(z) = I_{g(z)≥0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) as dim(F(G)) ≤ p, there is a non zero γ = (γ_1,..., γ_{p+1}) such that Σ_{i=1}^{p+1} γ_i g(z_i) = 0 for all g ∈ G if z_1,..., z_{p+1} is shattered, there is also g_j such that g_j(z_i) = δ_{ij} and therefore γ_j = 0 for all j, contradicting γ ≠ 0 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension