An introduction to statistical learning theory. Fabrice Rossi TELECOM ParisTech November 2008
1 An introduction to statistical learning theory Fabrice Rossi TELECOM ParisTech November 2008
2 About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods 2 / 154 Fabrice Rossi Outline
3 About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods Outline 1. introduction to the formal model of statistical learning 2. empirical risk and capacity measures 3. capacity control 4. beyond empirical risk minimization 2 / 154 Fabrice Rossi Outline
4 Part I Introduction and formalization 3 / 154 Fabrice Rossi
5 Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 4 / 154 Fabrice Rossi
6 Statistical Learning Statistical Learning 5 / 154 Fabrice Rossi Machine learning in a nutshell
7 Statistical Learning Statistical Learning = Machine learning + Statistics 5 / 154 Fabrice Rossi Machine learning in a nutshell
8 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon 5 / 154 Fabrice Rossi Machine learning in a nutshell
9 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) 5 / 154 Fabrice Rossi Machine learning in a nutshell
10 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools 5 / 154 Fabrice Rossi Machine learning in a nutshell
11 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools Nothing is more practical than a good theory Vladimir Vapnik (1998) 5 / 154 Fabrice Rossi Machine learning in a nutshell
12 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z 6 / 154 Fabrice Rossi Machine learning in a nutshell
13 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distribution, etc. 6 / 154 Fabrice Rossi Machine learning in a nutshell
14 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distribution, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y modelling: finding how x and y are related the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon 6 / 154 Fabrice Rossi Machine learning in a nutshell
15 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distribution, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y modelling: finding how x and y are related the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon this course focuses on supervised learning 6 / 154 Fabrice Rossi Machine learning in a nutshell
16 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R q : regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence 7 / 154 Fabrice Rossi Machine learning in a nutshell
17 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R q : regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y 7 / 154 Fabrice Rossi Machine learning in a nutshell
18 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y Vapnik's message: don't build a regression model to do classification 7 / 154 Fabrice Rossi Machine learning in a nutshell
19 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i 8 / 154 Fabrice Rossi Machine learning in a nutshell
20 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i... or not? 8 / 154 Fabrice Rossi Machine learning in a nutshell
21 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i... or not? perfect interpolation is always possible but is that what we are looking for? 8 / 154 Fabrice Rossi Machine learning in a nutshell
22 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i... or not? perfect interpolation is always possible but is that what we are looking for? No! We want generalization 8 / 154 Fabrice Rossi Machine learning in a nutshell
23 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that for all 1 ≤ i ≤ n, g(x_i) ≈ y_i... or not? perfect interpolation is always possible but is that what we are looking for? No! We want generalization: a good model must learn something smart from the data, not simply memorize them on new data, it should be such that g(x_i) ≈ y_i (for i > n) 8 / 154 Fabrice Rossi Machine learning in a nutshell
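A small Python illustration of this point (not part of the original slides; the synthetic data and degrees are my own choice): a high-degree polynomial can interpolate the training set almost perfectly yet generalize badly, while a lower-degree fit does better on new data.
```python
# Sketch: perfect interpolation vs. generalization on data from y = sin(x) + noise.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 3, size=15))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)
x_new = np.sort(rng.uniform(0, 3, size=200))          # "future" observations
y_new = np.sin(x_new) + 0.1 * rng.normal(size=x_new.size)

for degree in (3, 14):                                 # degree 14 can interpolate the 15 points
    g = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_err = np.mean((g(x_train) - y_train) ** 2)
    new_err = np.mean((g(x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, new-data MSE = {new_err:.4f}")
```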
24 Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. 9 / 154 Fabrice Rossi Machine learning in a nutshell
25 Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. the statistical learning framework addresses some of those goals: it provides asymptotic guarantees about learnability it gives non asymptotic confidence bounds around performance estimates it suggests/validates best practices (hold out estimates, regularization, etc.) 9 / 154 Fabrice Rossi Machine learning in a nutshell
26 Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 10 / 154 Fabrice Rossi Formalization
27 Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent 11 / 154 Fabrice Rossi Formalization
28 Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent remarks: the phenomenon is stationary, i.e., P does not change: new data will be distributed as the learning data extensions to non-stationary situations exist, see e.g. covariate shift and concept drift the independence assumption says that each observation brings new information extensions to dependent observations exist, e.g., Markovian models for time series analysis 11 / 154 Fabrice Rossi Formalization
29 Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) 12 / 154 Fabrice Rossi Formalization
30 Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g) this is a measure of generalization capabilities (if g has been built from a dataset) 12 / 154 Fabrice Rossi Formalization
31 Model quality (risk) quality is measured via a cost function c: Y × Y → R_+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g) this is a measure of generalization capabilities (if g has been built from a dataset) remark: we need a probability space, c and g have to be measurable functions, etc. 12 / 154 Fabrice Rossi Formalization
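A minimal sketch of the formal model (not from the slides; the toy distribution P and the function names are my own): an unknown distribution P over X × Y, an i.i.d. sample D_n, and a Monte Carlo approximation of the risk L(g) = E{c(g(X), Y)} of a fixed model g under the quadratic cost.
```python
import numpy as np

rng = np.random.default_rng(42)

def draw_from_P(n):
    """Draw n i.i.d. pairs (X_i, Y_i); here Y = 2X + 1 + Gaussian noise."""
    X = rng.uniform(-1, 1, size=n)
    Y = 2 * X + 1 + 0.3 * rng.normal(size=n)
    return X, Y

def cost(pred, y):
    return (pred - y) ** 2                     # quadratic cost c(g(x), y)

g = lambda x: 2 * x + 0.8                      # a fixed model, chosen independently of the data

X_big, Y_big = draw_from_P(1_000_000)          # large fresh sample approximates the expectation
risk = np.mean(cost(g(X_big), Y_big))          # Monte Carlo estimate of L(g)
print(f"estimated risk L(g) ~ {risk:.4f}")     # close to 0.2**2 + 0.3**2 = 0.13 for this toy P
```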
32 We are frequentists 13 / 154 Fabrice Rossi Formalization
33 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ɛ-insensitive: c_ɛ(g(x), y) = max(0, |g(x) − y| − ɛ) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise 14 / 154 Fabrice Rossi Formalization
34 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ɛ-insensitive: c_ɛ(g(x), y) = max(0, |g(x) − y| − ɛ) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification 14 / 154 Fabrice Rossi Formalization
35 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ɛ-insensitive: c_ɛ(g(x), y) = max(0, |g(x) − y| − ɛ) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification Remark: the cost is used to measure the quality of the model the machine learning algorithm may use a loss ≠ cost to build the model e.g.: the hinge loss for support vector machines more on this in part IV 14 / 154 Fabrice Rossi Formalization
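A small Python sketch of the cost functions listed above (not from the slides; conventions follow the formulas as reconstructed here, with scalar predictions for the regression costs).
```python
import numpy as np

def quadratic(pred, y):
    return (pred - y) ** 2

def laplacian(pred, y):                      # L1 cost
    return np.abs(pred - y)

def eps_insensitive(pred, y, eps=0.1):
    return np.maximum(0.0, np.abs(pred - y) - eps)

def huber(pred, y, delta=1.0):
    r = np.abs(pred - y)
    return np.where(r < delta, r ** 2, 2 * delta * (r - delta / 2))

def zero_one(pred, y):                       # classification: I{g(x) != y}
    return (pred != y).astype(float)

# quick check on a grid of residuals
preds = np.linspace(-2, 2, 5)
print(quadratic(preds, 0.0))
print(huber(preds, 0.0, delta=1.0))
```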
36 Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases 15 / 154 Fabrice Rossi Formalization
37 Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases: keep in mind that L(g_n) is a random variable if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(g_n)} 15 / 154 Fabrice Rossi Formalization
38 Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y 16 / 154 Fabrice Rossi Formalization
39 Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y a machine learning method is: strongly consistent: if L(g_n) → L* almost surely consistent: if E{L(g_n)} → L* universally (strongly) consistent: if the convergence holds for any distribution of the data P remark: if c is bounded then lim_n E{L(g_n)} = L* ⟺ L(g_n) → L* in probability 16 / 154 Fabrice Rossi Formalization
40 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set 17 / 154 Fabrice Rossi Formalization
41 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any series of observations (x_i, y_i)_{i ≥ 1} and for any ɛ > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ɛ, where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n}; N depends on ɛ and on the dataset (x_i, y_i)_{i ≥ 1} 17 / 154 Fabrice Rossi Formalization
42 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any series of observations (x_i, y_i)_{i ≥ 1} and for any ɛ > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ɛ, where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n}; N depends on ɛ and on the dataset (x_i, y_i)_{i ≥ 1} weak case: for any ɛ > 0 there is N such that for n ≥ N, E{L(g_n)} ≤ L* + ɛ, where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n} 17 / 154 Fabrice Rossi Formalization
43 Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ɛ, but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ɛ, but what about L(g_n) for a particular dataset? see also the strong case! 18 / 154 Fabrice Rossi Formalization
44 Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ɛ, but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ɛ, but what about L(g_n) for a particular dataset? see also the strong case! Stronger result, Probably Approximately Optimal algorithm: ∀ɛ, δ, ∃n(ɛ, δ) such that n ≥ n(ɛ, δ) ⟹ P{L(g_n) > L* + ɛ} < δ Gives some convergence speed: especially interesting when n(ɛ, δ) is independent of P (distribution free) 18 / 154 Fabrice Rossi Formalization
45 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent 19 / 154 Fabrice Rossi Formalization
46 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ɛ, P{L(g_n) > L* + ɛ} → 0 as n → ∞ by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability then, if c is bounded, E{L(g_n)} → L* as n → ∞ 19 / 154 Fabrice Rossi Formalization
47 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ɛ, P{L(g_n) > L* + ɛ} → 0 as n → ∞ by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability then, if c is bounded, E{L(g_n)} → L* as n → ∞ if convergence happens faster, then the method might be strongly consistent we need a very sharp decrease of P{L(g_n) > L* + ɛ} with n, i.e., a slow increase of n(ɛ, δ) when δ decreases 19 / 154 Fabrice Rossi Formalization
48 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (A_i)_{i ≥ 1} be a series of events define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often) the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 20 / 154 Fabrice Rossi Formalization
49 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (A_i)_{i ≥ 1} be a series of events define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often) the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ɛ} < ∞, then P{[{L(g_n) > L* + ɛ} i.o.]} = 0 20 / 154 Fabrice Rossi Formalization
50 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (A_i)_{i ≥ 1} be a series of events define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often) the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ɛ} < ∞, then P{[{L(g_n) > L* + ɛ} i.o.]} = 0 we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j} 20 / 154 Fabrice Rossi Formalization
51 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (A_i)_{i ≥ 1} be a series of events define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often) the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ɛ} < ∞, then P{[{L(g_n) > L* + ɛ} i.o.]} = 0 we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j} and therefore, L(g_n) → L* almost surely 20 / 154 Fabrice Rossi Formalization
52 Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n its risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies L(g_n) − L* and E{L(g_n)} − L* the preferred approach is to show a fast decrease of P{L(g_n) > L* + ɛ} with n no hypothesis on the data (i.e., on P), a.k.a. distribution free results 21 / 154 Fabrice Rossi Formalization
53 Classical statistics Even though we are frequentists... this program is quite different from classical statistics: no optimal parameters, no identifiability: intrinsically non parametric no optimal model: only optimal performances no asymptotic distribution: focus on finite distance inequalities no (or minimal) assumption on the data distribution minimal hypotheses are justified by our lack of knowledge on the studied phenomenon have strong and annoying consequences linked to the worst case scenario they imply 22 / 154 Fabrice Rossi Formalization
54 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} 23 / 154 Fabrice Rossi Formalization
55 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x} the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class arg max_k P{Y = k | X = x} 23 / 154 Fabrice Rossi Formalization
56 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x} the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class arg max_k P{Y = k | X = x} we only need to mimic the prediction of the model, not the model itself: for optimal classification, we don't need to compute the posterior probabilities for the binary case, the decision is based on P{Y = 1 | X = x} ≥ P{Y = −1 | X = x} we only need to agree with the sign of E{Y | X = x} 23 / 154 Fabrice Rossi Formalization
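A small Python sketch of the Bayes classifier (synthetic setup of my own, not from the slides): when the posterior η(x) = P{Y = 1 | X = x} is known, the rule arg max_k P{Y = k | X = x} can be written down explicitly, and a Monte Carlo estimate shows that other rules do at least as badly.
```python
import numpy as np

rng = np.random.default_rng(1)

def eta(x):
    """Posterior probability P{Y = 1 | X = x} for a toy binary problem."""
    return 1.0 / (1.0 + np.exp(-4 * x))

def draw(n):
    X = rng.uniform(-2, 2, size=n)
    Y = (rng.uniform(size=n) < eta(X)).astype(int)
    return X, Y

def bayes_classifier(x):
    return (eta(x) > 0.5).astype(int)            # most probable class

X, Y = draw(500_000)
L_star = np.mean(bayes_classifier(X) != Y)       # Monte Carlo estimate of the Bayes risk L*
L_other = np.mean((X > 0.3).astype(int) != Y)    # any other rule is at least as bad
print(f"Bayes risk ~ {L_star:.4f}, suboptimal rule ~ {L_other:.4f}")
```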
57 No free lunch unfortunately, making no hypothesis on P costs a lot 24 / 154 Fabrice Rossi Formalization
58 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost 24 / 154 Fabrice Rossi Formalization
59 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method for all ɛ > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ɛ 24 / 154 Fabrice Rossi Formalization
60 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method for all ɛ > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ɛ arbitrarily slow convergence always happens: given a fixed machine learning method and a decreasing series (a_n) with limit 0 (and a_1 ≤ 1/16) there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ a_n 24 / 154 Fabrice Rossi Formalization
61 Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n for all ɛ > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ɛ 25 / 154 Fabrice Rossi Formalization
62 Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n for all ɛ > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ɛ regression is more difficult than classification: let η_n be an estimator of η = E{Y | X} in a weak sense: lim_n E{(η_n(X) − η(X))²} = 0 then for g_n(x) = I{η_n(x) > 1/2}, we have lim_n (E{L(g_n)} − L*) / √(E{(η_n(X) − η(X))²}) = 0 25 / 154 Fabrice Rossi Formalization
63 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P 26 / 154 Fabrice Rossi Formalization
64 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient if P is arbitrarily complex, then no generalization can occur from a finite learning set e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses 26 / 154 Fabrice Rossi Formalization
65 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient if P is arbitrarily complex, then no generalization can occur from a finite learning set e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses but we don't want hypotheses on P 26 / 154 Fabrice Rossi Formalization
66 Estimation and approximation solution: split the problem in two parts, estimation and approximation 27 / 154 Fabrice Rossi Formalization
67 Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y study L(g_n) − inf_{g ∈ G} L(g) when the ML algorithm must choose g_n in G statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! 27 / 154 Fabrice Rossi Formalization
68 Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y study L(g_n) − inf_{g ∈ G} L(g) when the ML algorithm must choose g_n in G statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! approximation: compare G to the set of all possible models in other words, study inf_{g ∈ G} L(g) − L* function approximation results (density), no statistics! 27 / 154 Fabrice Rossi Formalization
69 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost 28 / 154 Fabrice Rossi Formalization
70 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose gn in G by optimizing a quality measure defined via D n 28 / 154 Fabrice Rossi Formalization
71 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose g_n in G by optimizing a quality measure defined via D_n examples: linear regression (X is a Hilbert space, Y = R): G = {x ↦ ⟨w, x⟩} find w_n by minimizing the mean squared error: w_n = arg min_w (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² linear classification (X = R^p, Y = {1,..., c}): G = {x ↦ arg max_k (Wx)_k} find W e.g., by minimizing a mean squared error for a disjunctive coding of the classes 28 / 154 Fabrice Rossi Formalization
72 Criterion the best solution would be to choose g n by minimizing L(g n ) over G, but: we cannot compute L(gn ) (P is unknown!) even if we knew L(g n ), optimization might be intractable (L has no reason to be convex, for instance) 29 / 154 Fabrice Rossi Formalization
73 Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) 29 / 154 Fabrice Rossi Formalization
74 Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_n L_n(g) = L(g) 29 / 154 Fabrice Rossi Formalization
75 Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_n L_n(g) = L(g) the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV) 29 / 154 Fabrice Rossi Formalization
76 Empirical risk minimization generic machine learning method: fix a model class G choose g*_n in G by g*_n = arg min_{g ∈ G} L_n(g) denote L*_G = inf_{g ∈ G} L(g) 30 / 154 Fabrice Rossi Formalization
77 Empirical risk minimization generic machine learning method: fix a model class G choose g*_n in G by g*_n = arg min_{g ∈ G} L_n(g) denote L*_G = inf_{g ∈ G} L(g) does that work? E{L(g*_n)} → L*_G? E{L(g*_n)} → L*? the SLLN does not help as g*_n depends on n 30 / 154 Fabrice Rossi Formalization
78 Empirical risk minimization generic machine learning method: fix a model class G choose g*_n in G by g*_n = arg min_{g ∈ G} L_n(g) denote L*_G = inf_{g ∈ G} L(g) does that work? E{L(g*_n)} → L*_G? E{L(g*_n)} → L*? the SLLN does not help as g*_n depends on n we are looking for bounds: L(g*_n) < L_n(g*_n) + B (bound the risk using the empirical risk) L(g*_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class) 30 / 154 Fabrice Rossi Formalization
79 Empirical risk minimization generic machine learning method: fix a model class G choose g*_n in G by g*_n = arg min_{g ∈ G} L_n(g) denote L*_G = inf_{g ∈ G} L(g) does that work? E{L(g*_n)} → L*_G? E{L(g*_n)} → L*? the SLLN does not help as g*_n depends on n we are looking for bounds: L(g*_n) < L_n(g*_n) + B (bound the risk using the empirical risk) L(g*_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class) L*_G − L* is handled differently (approximation vs estimation) 30 / 154 Fabrice Rossi Formalization
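A small Python sketch of empirical risk minimization over a finite class (my own illustrative choice, not from the slides): G is a grid of threshold classifiers g_t(x) = I{x > t}, g*_n minimizes the empirical 0/1 risk, and a large fresh sample approximates its true risk.
```python
import numpy as np

rng = np.random.default_rng(2)

def draw(n):
    X = rng.normal(size=n)
    Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-3 * X))).astype(int)
    return X, Y

thresholds = np.linspace(-2, 2, 41)                      # G is finite: |G| = 41
X_train, Y_train = draw(200)
X_test, Y_test = draw(100_000)                           # proxy for the true risk

emp_risks = [np.mean((X_train > t).astype(int) != Y_train) for t in thresholds]
t_star = thresholds[int(np.argmin(emp_risks))]           # g*_n = arg min_{g in G} L_n(g)

L_n = np.min(emp_risks)                                  # typically optimistic
L_true = np.mean((X_test > t_star).astype(int) != Y_test)
print(f"selected t = {t_star:.2f}, L_n(g*_n) = {L_n:.3f}, L(g*_n) ~ {L_true:.3f}")
```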
80 Complexity control The overfitting issue controlling |L_n(g*_n) − L(g*_n)| is not possible in arbitrary classes G 31 / 154 Fabrice Rossi Formalization
81 Complexity control The overfitting issue controlling |L_n(g*_n) − L(g*_n)| is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule) but L(g*_n) > 0 when L* > 0! the 1-nn rule is overfitting 31 / 154 Fabrice Rossi Formalization
82 Complexity control The overfitting issue controlling |L_n(g*_n) − L(g*_n)| is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule) but L(g*_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L*_G > L* (for arbitrary P) 31 / 154 Fabrice Rossi Formalization
83 Complexity control The overfitting issue controlling |L_n(g*_n) − L(g*_n)| is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule) but L(g*_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L*_G > L* (for arbitrary P) how to reach L*? (will be studied in part III) 31 / 154 Fabrice Rossi Formalization
84 Complexity control The overfitting issue controlling |L_n(g*_n) − L(g*_n)| is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule) but L(g*_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L*_G > L* (for arbitrary P) how to reach L*? (will be studied in part III) in other words, how to balance estimation error and approximation error 31 / 154 Fabrice Rossi Formalization
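A small Python sketch of the overfitting issue just described (synthetic data of my own; requires scikit-learn): with noisy labels (L* > 0), the 1-nearest-neighbour rule reaches zero empirical risk on D_n while its true risk stays well above L*.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

def draw(n, flip=0.2):
    X = rng.uniform(0, 1, size=(n, 1))
    Y = (X[:, 0] > 0.5).astype(int)
    noise = rng.uniform(size=n) < flip            # labels flipped with prob. 0.2, so L* = 0.2
    return X, np.where(noise, 1 - Y, Y)

X_train, Y_train = draw(500)
X_test, Y_test = draw(100_000)

nn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, Y_train)
print("empirical risk:", np.mean(nn1.predict(X_train) != Y_train))   # 0.0: D_n is memorized
print("true risk  ~  :", np.mean(nn1.predict(X_test) != Y_test))     # roughly 2 L*(1 - L*) > L*
```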
85 Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size 32 / 154 Fabrice Rossi Formalization
86 Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics make sure that |L_n(g*_n) − L(g*_n)| is under control when g*_n = arg min_{g ∈ G_n} L_n(g) 32 / 154 Fabrice Rossi Formalization
87 Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics make sure that |L_n(g*_n) − L(g*_n)| is under control when g*_n = arg min_{g ∈ G_n} L_n(g) approximation part: functional analysis make sure that lim_n L*_{G_n} = L* related to density arguments (universal approximation) 32 / 154 Fabrice Rossi Formalization
88 Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics make sure that |L_n(g*_n) − L(g*_n)| is under control when g*_n = arg min_{g ∈ G_n} L_n(g) approximation part: functional analysis make sure that lim_n L*_{G_n} = L* related to density arguments (universal approximation) this is the simplest approach but also the least realistic: G_n depends only on the data size 32 / 154 Fabrice Rossi Formalization
89 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes 33 / 154 Fabrice Rossi Formalization
90 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes more precisely: compute g n,d by empirical risk minimization in G d choose gn by minimizing over d L n (g n,d) + r(n, d) 33 / 154 Fabrice Rossi Formalization
91 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes more precisely: compute g n,d by empirical risk minimization in G d choose gn by minimizing over d L n (g n,d) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G d to favor models for which the risk is correctly estimated by the empirical risk it corresponds to a complexity measure for Gd 33 / 154 Fabrice Rossi Formalization
92 Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes more precisely: compute g*_{n,d} by empirical risk minimization in G_d choose g*_n by minimizing over d: L_n(g*_{n,d}) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G_d to favor models for which the risk is correctly estimated by the empirical risk it corresponds to a complexity measure for G_d more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost 33 / 154 Fabrice Rossi Formalization
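A small Python sketch of the structural risk minimization scheme (my own toy setup: nested polynomial classes G_d for regression with the quadratic cost, and an illustrative penalty r(n, d), not the one advocated in the lecture).
```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0, 3, size=n))
y = np.sin(x) + 0.2 * rng.normal(size=n)

def r(n, d):
    return 2.0 * (d + 1) / n          # penalty increasing with the class G_d, decreasing with n

best = None
for d in range(0, 12):
    g_nd = np.poly1d(np.polyfit(x, y, deg=d))          # ERM inside G_d
    L_n = np.mean((g_nd(x) - y) ** 2)
    score = L_n + r(n, d)                               # penalized empirical risk
    if best is None or score < best[0]:
        best = (score, d, L_n)
print(f"selected degree d = {best[1]} (penalized score {best[0]:.3f}, empirical risk {best[2]:.3f})")
```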
93 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) 34 / 154 Fabrice Rossi Formalization
94 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L G = L define g n by g n = arg min g G (L n(g) + λ n R(g)) 34 / 154 Fabrice Rossi Formalization
95 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L G = L define g n by g n = arg min g G (L n(g) + λ n R(g)) R(g) measures the complexity of the model the trade-off between the risk and the complexity, λ n, must be carefully chosen 34 / 154 Fabrice Rossi Formalization
96 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L*_G = L* define g_n by g_n = arg min_{g ∈ G} (L_n(g) + λ_n R(g)) R(g) measures the complexity of the model the trade-off between the risk and the complexity, λ_n, must be carefully chosen this is the (one of the) best solution(s) 34 / 154 Fabrice Rossi Formalization
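A small Python sketch of penalized empirical risk minimization (an illustrative instance of my own, not the slides' example): ridge regression, where R(g) is the squared norm of the weight vector and the minimizer has a closed form.
```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 20
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.5 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form minimizer of (1/n)||y - Xw||^2 + lam * ||w||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(p), X.T @ y / len(y))

for lam in (0.0, 0.1, 10.0):
    w = ridge(X, y, lam)
    print(f"lambda = {lam:5.1f}  ||w - w_true|| = {np.linalg.norm(w - w_true):.3f}")
```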
97 Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n a good ML algorithm builds a universally strongly consistent model, i.e. for all P, L(g_n) → L* almost surely a very general class of ML algorithms is based on empirical risk minimization, i.e. g*_n = arg min_{g ∈ G} (1/n) Σ_{i=1}^n c(g(X_i), Y_i) consistency is obtained by controlling the estimation error and the approximation error 35 / 154 Fabrice Rossi Formalization
98 Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization 36 / 154 Fabrice Rossi Formalization
99 Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization not in this lecture (see the references): ad hoc methods such as k nearest neighbours and trees advanced topics such as noise conditions and data dependent bounds clustering Bayesian point of view and many other things / 154 Fabrice Rossi Formalization
100 Part II Empirical risk and capacity measures 37 / 154 Fabrice Rossi
101 Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 38 / 154 Fabrice Rossi
102 Empirical risk Empirical risk minimisation (ERM) relies on the link between L_n(g) and L(g) the law of large numbers is too limited: asymptotic only g must be fixed we need: 1. quantitative finite distance results, i.e., PAO like bounds on P{|L_n(g) − L(g)| > ɛ} 2. uniform results, i.e. bounds on P{sup_{g ∈ G} |L_n(g) − L(g)| > ɛ} 39 / 154 Fabrice Rossi Concentration
103 Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} 40 / 154 Fabrice Rossi Concentration
104 Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} some standard bounds: Markov: P{|X| ≥ t} ≤ E{|X|}/t Bienaymé-Chebyshev: P{|X − E{X}| ≥ t} ≤ Var(X)/t² BC gives for n independent real valued random variables X_i, denoting S_n = Σ_{i=1}^n X_i: P{|S_n − E{S_n}| ≥ t} ≤ Σ_{i=1}^n Var(X_i)/t² 40 / 154 Fabrice Rossi Concentration
105 Hoeffding's inequality Hoeffding, 1963 hypothesis X_1,..., X_n, n independent r.v. X_i ∈ [a_i, b_i] S_n = Σ_{i=1}^n X_i result P{S_n − E{S_n} ≥ ɛ} ≤ e^{−2ɛ²/Σ_{i=1}^n (b_i − a_i)²} P{S_n − E{S_n} ≤ −ɛ} ≤ e^{−2ɛ²/Σ_{i=1}^n (b_i − a_i)²} Quantitative distribution free finite distance law of large numbers! 41 / 154 Fabrice Rossi Concentration
106 Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] 42 / 154 Fabrice Rossi Concentration
107 Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) 42 / 154 Fabrice Rossi Concentration
108 Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) Σ_{i=1}^n (b_i − a_i)² = (b − a)²/n and therefore P{|L_n(g) − L(g)| ≥ ɛ} ≤ 2e^{−2nɛ²/(b−a)²} 42 / 154 Fabrice Rossi Concentration
109 Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) Σ_{i=1}^n (b_i − a_i)² = (b − a)²/n and therefore P{|L_n(g) − L(g)| ≥ ɛ} ≤ 2e^{−2nɛ²/(b−a)²} then L_n(g) → L(g) in probability and L_n(g) → L(g) almost surely (Borel-Cantelli) 42 / 154 Fabrice Rossi Concentration
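A small Python check of this application by simulation (my own toy numbers): for a fixed classifier g with the 0/1 cost (so b − a = 1), the observed deviation probability P{|L_n(g) − L(g)| ≥ ɛ} indeed sits below 2 exp(−2nɛ²).
```python
import numpy as np

rng = np.random.default_rng(6)
n, eps, true_error = 200, 0.05, 0.3          # L(g) = 0.3 for this toy g
n_repeats = 20_000

# each repetition draws a fresh D_n and computes the empirical risk of the fixed g
losses = rng.uniform(size=(n_repeats, n)) < true_error      # c(g(X_i), Y_i) in {0, 1}
L_n = losses.mean(axis=1)
observed = np.mean(np.abs(L_n - true_error) >= eps)
bound = 2 * np.exp(-2 * n * eps ** 2)
print(f"observed deviation prob. = {observed:.4f}  <=  Hoeffding bound = {bound:.4f}")
```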
110 Limitations g cannot depend on the (X_i, Y_i) (independent test set) intuitively: δ = 2e^{−2nɛ²/(b−a)²} for each g, the probability to draw from P a D_n on which |L_n(g) − L(g)| ≤ (b − a)√(log(2/δ)/(2n)) is at least 1 − δ but the sets of correct D_n differ for each g conversely for a fixed D_n, the bound is valid only for some of the models this cannot be used to study g*_n = arg min_{g ∈ G} L_n(g) 43 / 154 Fabrice Rossi Concentration
111 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) 44 / 154 Fabrice Rossi Concentration
112 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a)√(log(1/δ)/(2p)) 44 / 154 Fabrice Rossi Concentration
113 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a)√(log(1/δ)/(2p)) Example: classification (a = 0, b = 1), with a good classifier L_p(g_n) = 0.05 to be 99% sure that L(g_n) ≤ 0.06, we need p = (!!!) 44 / 154 Fabrice Rossi Concentration
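The required test-set size in the example above (the exact figure is missing from the transcription) can be recovered from the bound itself. A small sketch: solve (b − a)√(log(1/δ)/(2p)) ≤ ɛ for p, i.e. p ≥ (b − a)² log(1/δ) / (2ɛ²).
```python
import math

def required_test_size(eps, delta, b_minus_a=1.0):
    """Smallest p making the hold-out Hoeffding bound tighter than eps at confidence 1 - delta."""
    return math.ceil((b_minus_a ** 2) * math.log(1.0 / delta) / (2 * eps ** 2))

# the slide's example: 0/1 cost, 99% confidence that the true risk exceeds L_p by at most 0.01
print(required_test_size(eps=0.01, delta=0.01))   # on the order of 23,000 test examples
```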
114 Proof of Hoeffding's inequality Chernoff's bounding technique (an application of Markov's inequality): P{X ≥ t} = P{e^{sX} ≥ e^{st}} ≤ E{e^{sX}} e^{−st} the bound is controlled via s lemma: if E{X} = 0 and X ∈ [a, b], then for all s, E{e^{sX}} ≤ e^{s²(b−a)²/8} e^x is convex ⟹ E{e^{sX}} ≤ e^{φ(s(b−a))}, then we bound φ this is used for X = S_n − E{S_n} 45 / 154 Fabrice Rossi Concentration
115 Proof of Hoeffding's inequality P{S_n − E{S_n} ≥ ɛ} ≤ e^{−sɛ} E{e^{s Σ_{i=1}^n (X_i − E{X_i})}} = e^{−sɛ} Π_{i=1}^n E{e^{s(X_i − E{X_i})}} (independence) ≤ e^{−sɛ} Π_{i=1}^n e^{s²(b_i − a_i)²/8} (lemma) = e^{−sɛ} e^{s² Σ_{i=1}^n (b_i − a_i)²/8} = e^{−2ɛ²/Σ_{i=1}^n (b_i − a_i)²} (minimization, with s = 4ɛ/Σ_{i=1}^n (b_i − a_i)²) 46 / 154 Fabrice Rossi Concentration
116 Remarks Hoeffding's inequality is useful in many other contexts: Large-scale machine learning (e.g., [Domingos and Hulten, 2001]): enormous dataset (e.g., cannot be loaded in memory) process growing subsets until the model quality is known with sufficient confidence monitor drifting via confidence bounds Regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]): define strategies by minimizing a regret measure get bounds on the regret estimations (e.g., trade-off between exploration and exploitation in multi-armed bandits) Many improvements/complements are available, such as: Bernstein's inequality McDiarmid's bounded difference inequality 47 / 154 Fabrice Rossi Concentration
117 Uniform bounds we deal with the estimation error: given the ERM choice: g*_n = arg min_{g ∈ G} L_n(g) what about L(g*_n) − inf_{g ∈ G} L(g)? 48 / 154 Fabrice Rossi Concentration
118 Uniform bounds we deal with the estimation error: given the ERM choice: g*_n = arg min_{g ∈ G} L_n(g) what about L(g*_n) − inf_{g ∈ G} L(g)? we need uniform bounds: |L_n(g*_n) − L(g*_n)| ≤ sup_{g ∈ G} |L_n(g) − L(g)| L(g*_n) − inf_{g ∈ G} L(g) = L(g*_n) − L_n(g*_n) + L_n(g*_n) − inf_{g ∈ G} L(g) ≤ |L(g*_n) − L_n(g*_n)| + sup_{g ∈ G} |L_n(g) − L(g)| ≤ 2 sup_{g ∈ G} |L_n(g) − L(g)| 48 / 154 Fabrice Rossi Concentration
119 Uniform bounds we deal with the estimation error: given the ERM choice: g*_n = arg min_{g ∈ G} L_n(g) what about L(g*_n) − inf_{g ∈ G} L(g)? we need uniform bounds: |L_n(g*_n) − L(g*_n)| ≤ sup_{g ∈ G} |L_n(g) − L(g)| L(g*_n) − inf_{g ∈ G} L(g) = L(g*_n) − L_n(g*_n) + L_n(g*_n) − inf_{g ∈ G} L(g) ≤ |L(g*_n) − L_n(g*_n)| + sup_{g ∈ G} |L_n(g) − L(g)| ≤ 2 sup_{g ∈ G} |L_n(g) − L(g)| therefore uniform bounds give an answer to both questions: the quality of the empirical risk as an estimate of the risk and the quality of the model selected by ERM 48 / 154 Fabrice Rossi Concentration
120 Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1 ≤ i ≤ m} U_i ≥ ɛ} ≤ Σ_{i=1}^m P{U_i ≥ ɛ} 49 / 154 Fabrice Rossi Concentration
121 Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1 ≤ i ≤ m} U_i ≥ ɛ} ≤ Σ_{i=1}^m P{U_i ≥ ɛ} then when G is finite P{sup_{g ∈ G} |L_n(g) − L(g)| ≥ ɛ} ≤ 2|G| e^{−2nɛ²/(b−a)²} so L(g*_n) → inf_{g ∈ G} L(g) almost surely (but in general inf_{g ∈ G} L(g) > L*) 49 / 154 Fabrice Rossi Concentration
122 Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1 ≤ i ≤ m} U_i ≥ ɛ} ≤ Σ_{i=1}^m P{U_i ≥ ɛ} then when G is finite P{sup_{g ∈ G} |L_n(g) − L(g)| ≥ ɛ} ≤ 2|G| e^{−2nɛ²/(b−a)²} so L(g*_n) → inf_{g ∈ G} L(g) almost surely (but in general inf_{g ∈ G} L(g) > L*) quantitative distribution free uniform finite distance law of large numbers! 49 / 154 Fabrice Rossi Concentration
123 Bound versus size with probability at least 1 − δ, L(g*_n) ≤ inf_{g ∈ G} L(g) + 2(b − a)√((log|G| + log(2/δ))/(2n)) inf_{g ∈ G} L(g) decreases with |G| (more models) but the bound increases with |G|: trade-off between flexibility and estimation quality remark: log|G| is the number of bits needed to encode the model choice; this is a capacity measure 50 / 154 Fabrice Rossi Concentration
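A small Python sketch evaluating this finite-class bound (toy numbers of my own) to show how slowly the excess-risk term grows with |G|, since it only enters through a logarithm.
```python
import math

def erm_excess_risk_bound(n, card_G, delta, b_minus_a=1.0):
    """Excess-risk term 2 (b-a) sqrt((log|G| + log(2/delta)) / (2n)) valid w.p. >= 1 - delta."""
    return 2 * b_minus_a * math.sqrt((math.log(card_G) + math.log(2 / delta)) / (2 * n))

for card_G in (10, 1000, 10 ** 6):
    print(card_G, round(erm_excess_risk_bound(n=10_000, card_G=card_G, delta=0.05), 4))
```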
124 Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 51 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
125 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite in addition inf_{g ∈ G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
126 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite in addition inf_{g ∈ G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
127 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite in addition inf_{g ∈ G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size first step: binary classification (the simplest case) a model is then a measurable set A ⊂ X the capacity of G measures the variability of the available shapes it is evaluated with respect to the data: if A ∩ D_n = B ∩ D_n, then A and B are identical 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
128 Growth function abstract setting: F is a set of measurable functions from R d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R d ) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
129 Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) given z_1 ∈ R^d,..., z_n ∈ R^d, we define F_{z_1,...,z_n} = {u ∈ {0, 1}^n : ∃f ∈ F, u = (f(z_1),..., f(z_n))} interpretation: each u describes a binary partition of z_1,..., z_n and F_{z_1,...,z_n} is then the set of partitions implemented by F 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
130 Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) given z_1 ∈ R^d,..., z_n ∈ R^d, we define F_{z_1,...,z_n} = {u ∈ {0, 1}^n : ∃f ∈ F, u = (f(z_1),..., f(z_n))} interpretation: each u describes a binary partition of z_1,..., z_n and F_{z_1,...,z_n} is then the set of partitions implemented by F Growth function S_F(n) = sup_{z_1 ∈ R^d,...,z_n ∈ R^d} |F_{z_1,...,z_n}| interpretation: maximal number of binary partitions implementable via F (distribution free worst case analysis) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
131 Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
132 Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n Vapnik-Chervonenkis dimension VCdim(F) = sup{n ∈ N_+ : S_F(n) = 2^n} 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
133 Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n Vapnik-Chervonenkis dimension VCdim(F) = sup{n ∈ N_+ : S_F(n) = 2^n} interpretation: if S_F(n) < 2^n: for all z_1,..., z_n, there is a partition u ∈ {0, 1}^n such that u ∉ F_{z_1,...,z_n} for any dataset, F can fail to reach L_n = 0 if S_F(n) = 2^n: there is at least one set z_1,..., z_n shattered by F 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
134-138 Example points from R^2, F = {I_{ax+by+c ≥ 0}} [figure slides: three points and the half-planes separating every labelling] S_F(3) = 2^3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
139 Example points from R^2, F = {I_{ax+by+c ≥ 0}} S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
140 Example points from R^2, F = {I_{ax+by+c ≥ 0}} S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 S_F(4) < 2^4 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
141 Example points from R^2, F = {I_{ax+by+c ≥ 0}} S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 S_F(4) < 2^4 VCdim(F) = 3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
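A small brute-force Python check of this example (my own sketch, assuming scipy is available): half-planes shatter a point set iff every labelling is linearly separable, and separability of one labelling can be tested as an LP feasibility problem: find (w, c) with y_i(⟨w, z_i⟩ + c) ≥ 1 for all i. Three points in general position are shattered, four points never are.
```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, c) satisfies labels_i * (<w, points_i> + c) >= 1 for all i."""
    signs = 2 * np.asarray(labels) - 1                        # {0,1} -> {-1,+1}
    A_ub = -signs[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")  # pure feasibility problem
    return res.success

def shattered(points):
    return all(separable(points, labels)
               for labels in itertools.product([0, 1], repeat=len(points)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points in general position shattered:", shattered(triangle))   # True
print("4 points shattered:", shattered(square))                          # False (XOR labelling)
```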
142 Linear models result if G is a p dimensional vector space of real valued functions defined on R^d then VCdim({f : R^d → {0, 1} : ∃g ∈ G, f(z) = I_{g(z) ≥ 0}}) ≤ p 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
143 Linear models result if G is a p dimensional vector space of real valued functions defined on R^d then VCdim({f : R^d → {0, 1} : ∃g ∈ G, f(z) = I_{g(z) ≥ 0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
144 Linear models result if G is a p dimensional vector space of real valued functions defined on R^d then VCdim({f : R^d → {0, 1} : ∃g ∈ G, f(z) = I_{g(z) ≥ 0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) as dim(F(G)) ≤ p, there is a non zero γ = (γ_1,..., γ_{p+1}) such that Σ_{i=1}^{p+1} γ_i g(z_i) = 0 for all g ∈ G 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
145 Linear models result if G is a p dimensional vector space of real valued functions defined on R^d then VCdim({f : R^d → {0, 1} : ∃g ∈ G, f(z) = I_{g(z) ≥ 0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) as dim(F(G)) ≤ p, there is a non zero γ = (γ_1,..., γ_{p+1}) such that Σ_{i=1}^{p+1} γ_i g(z_i) = 0 for all g ∈ G if z_1,..., z_{p+1} is shattered, there is also g_j such that g_j(z_i) = δ_ij and therefore γ_j = 0 for all j, which contradicts γ ≠ 0 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationSupport Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
More informationUniversal Algorithm for Trading in Stock Market Based on the Method of Calibration
Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.
More informationLarge-Scale Similarity and Distance Metric Learning
Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationMachine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationCourse: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.
More informationEMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA
EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle
More informationDate: April 12, 2001. Contents
2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationTrading regret rate for computational efficiency in online learning with limited feedback
Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai
More informationE3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
More informationDiscussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.
Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationLecture 6 Online and streaming algorithms for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
More informationTail inequalities for order statistics of log-concave vectors and applications
Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More information(Quasi-)Newton methods
(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting
More informationLanguage Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
More informationLocal classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
More informationNeural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
More informationFuzzy Probability Distributions in Bayesian Analysis
Fuzzy Probability Distributions in Bayesian Analysis Reinhard Viertl and Owat Sunanta Department of Statistics and Probability Theory Vienna University of Technology, Vienna, Austria Corresponding author:
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
More informationChapter 6: Episode discovery process
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
More informationMessage-passing sequential detection of multiple change points in networks
Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal
More information1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More information1. Prove that the empty set is a subset of every set.
1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since
More informationVirtual Landmarks for the Internet
Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser
More informationCSC2420 Fall 2012: Algorithm Design, Analysis and Theory
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms
More informationINDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)
INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its
More informationRandom graphs with a given degree sequence
Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationIN most approaches to binary classification, classifiers are
3806 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 11, NOVEMBER 2005 A Neyman Pearson Approach to Statistical Learning Clayton Scott, Member, IEEE, Robert Nowak, Senior Member, IEEE Abstract The
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationPolarization codes and the rate of polarization
Polarization codes and the rate of polarization Erdal Arıkan, Emre Telatar Bilkent U., EPFL Sept 10, 2008 Channel Polarization Given a binary input DMC W, i.i.d. uniformly distributed inputs (X 1,...,
More informationPenalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood
More informationA Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
More informationThe Ergodic Theorem and randomness
The Ergodic Theorem and randomness Peter Gács Department of Computer Science Boston University March 19, 2008 Peter Gács (Boston University) Ergodic theorem March 19, 2008 1 / 27 Introduction Introduction
More informationStatistical Machine Translation: IBM Models 1 and 2
Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation
More information1 Another method of estimation: least squares
1 Another method of estimation: least squares erm: -estim.tex, Dec8, 009: 6 p.m. (draft - typos/writos likely exist) Corrections, comments, suggestions welcome. 1.1 Least squares in general Assume Y i
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationBayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationHypothesis Testing for Beginners
Hypothesis Testing for Beginners Michele Piffer LSE August, 2011 Michele Piffer (LSE) Hypothesis Testing for Beginners August, 2011 1 / 53 One year ago a friend asked me to put down some easy-to-read notes
More informationALMOST COMMON PRIORS 1. INTRODUCTION
ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type
More informationNo: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics
No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results
More informationA Stochastic 3MG Algorithm with Application to 2D Filter Identification
A Stochastic 3MG Algorithm with Application to 2D Filter Identification Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2 1 Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est,
More informationWhat s New in Econometrics? Lecture 8 Cluster and Stratified Sampling
What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationComparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationPart III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing
CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More information