1 An introduction to statistical learning theory Fabrice Rossi TELECOM ParisTech November 2008

2 About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods 2 / 154 Fabrice Rossi Outline

3 About this lecture Main goal Show how statistics enable a rigorous analysis of machine learning methods which leads to a better understanding and to possible improvements of those methods Outline 1. introduction to the formal model of statistical learning 2. empirical risk and capacity measures 3. capacity control 4. beyond empirical risk minimization 2 / 154 Fabrice Rossi Outline

4 Part I Introduction and formalization 3 / 154 Fabrice Rossi

5 Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 4 / 154 Fabrice Rossi

6 Statistical Learning Statistical Learning 5 / 154 Fabrice Rossi Machine learning in a nutshell

7 Statistical Learning Statistical Learning = Machine learning + Statistics 5 / 154 Fabrice Rossi Machine learning in a nutshell

8 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon 5 / 154 Fabrice Rossi Machine learning in a nutshell

9 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) 5 / 154 Fabrice Rossi Machine learning in a nutshell

10 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools 5 / 154 Fabrice Rossi Machine learning in a nutshell

11 Statistical Learning Statistical Learning = Machine learning + Statistics Machine learning in three steps: 1. record observations about a phenomenon 2. build a model of this phenomenon 3. predict the future behavior of the phenomenon... all of this is done by the computer (no human intervention) Statistics gives: a formal definition of machine learning some guarantees on its expected results some suggestions for new or improved modelling tools Nothing is more practical than a good theory Vladimir Vapnik (1998) 5 / 154 Fabrice Rossi Machine learning in a nutshell

12 Machine learning a phenomenon is recorded via observations (z i ) 1 i n with z i Z 6 / 154 Fabrice Rossi Machine learning in a nutshell

13 Machine learning a phenomenon is recorded via observations (z i ) 1 i n with z i Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distribution, etc. 6 / 154 Fabrice Rossi Machine learning in a nutshell

14 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distributions, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y modelling: finding how x and y are related the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon 6 / 154 Fabrice Rossi Machine learning in a nutshell

15 Machine learning a phenomenon is recorded via observations (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z two generic situations: 1. unsupervised learning: no predefined structure in Z then the goal is to find some structures: clusters, association rules, distributions, etc. 2. supervised learning: two non-symmetric components in Z = X × Y, z = (x, y) ∈ X × Y modelling: finding how x and y are related the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon this course focuses on supervised learning 6 / 154 Fabrice Rossi Machine learning in a nutshell

16 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R q : regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence 7 / 154 Fabrice Rossi Machine learning in a nutshell

17 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R q : regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y 7 / 154 Fabrice Rossi Machine learning in a nutshell

18 Supervised learning several types of Y can be considered: Y = {1,..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x Y = something complex but structured, e.g., construct a parse tree for a natural language sentence modelling difficulty increases with the complexity of Y Vapnik's message: don't build a regression model to do classification 7 / 154 Fabrice Rossi Machine learning in a nutshell

19 Model quality given a dataset (x i, y i ) 1 i n, a machine learning method builds a model g from X to Y a good model should be such that 1 i n, g(x i ) y i 8 / 154 Fabrice Rossi Machine learning in a nutshell

20 Model quality given a dataset (x i, y i ) 1 i n, a machine learning method builds a model g from X to Y a good model should be such that or not? 1 i n, g(x i ) y i 8 / 154 Fabrice Rossi Machine learning in a nutshell

21 Model quality given a dataset (x i, y i ) 1 i n, a machine learning method builds a model g from X to Y a good model should be such that 1 i n, g(x i ) y i or not? perfect interpolation is always possible but is that what we are looking for? 8 / 154 Fabrice Rossi Machine learning in a nutshell

22 Model quality given a dataset (x i, y i ) 1 i n, a machine learning method builds a model g from X to Y a good model should be such that or not? 1 i n, g(x i ) y i perfect interpolation is always possible but is that what we are looking for? No! We want generalization 8 / 154 Fabrice Rossi Machine learning in a nutshell

23 Model quality given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y a good model should be such that g(x_i) ≈ y_i for all 1 ≤ i ≤ n or not? perfect interpolation is always possible but is that what we are looking for? No! We want generalization: a good model must learn something smart from the data, not simply memorize them on new data, it should be such that g(x_i) ≈ y_i (for i > n) 8 / 154 Fabrice Rossi Machine learning in a nutshell

24 Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. 9 / 154 Fabrice Rossi Machine learning in a nutshell

25 Machine learning objectives main goal: from a dataset (x i, y i ) 1 i n, build a model g which generalizes in a satisfactory way on new data but also: request only as many data as needed do this efficiently (quickly with a low memory usage) gain knowledge on the underlying phenomenon (e.g., discard useless parts of x) etc. the statistical learning framework addresses some of those goals: it provides asymptotic guarantees about learnability it gives non asymptotic confidence bounds around performance estimates it suggests/validates best practices (hold out estimates, regularization, etc.) 9 / 154 Fabrice Rossi Machine learning in a nutshell

26 Outline Machine learning in a nutshell Formalization Data and algorithms Consistency Risk/Loss optimization 10 / 154 Fabrice Rossi Formalization

27 Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent 11 / 154 Fabrice Rossi Formalization

28 Data the phenomenon is fully described by an unknown probability distribution P on X × Y we are given n observations: D_n = (X_i, Y_i)_{i=1}^n each pair (X_i, Y_i) is distributed according to P pairs are independent remarks: the phenomenon is stationary, i.e., P does not change: new data will be distributed as learning data extensions to non-stationary situations exist, see e.g. covariate shift and concept drift the independence assumption says that each observation brings new information extensions to dependent observations exist, e.g., Markovian models for time series analysis 11 / 154 Fabrice Rossi Formalization

29 Model quality (risk) quality is measured via a cost function c: Y × Y → R^+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) 12 / 154 Fabrice Rossi Formalization

30 Model quality (risk) quality is measured via a cost function c: Y × Y → R^+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g) this is a measure of generalization capabilities (if g has been built from a dataset) 12 / 154 Fabrice Rossi Formalization

31 Model quality (risk) quality is measured via a cost function c: Y × Y → R^+ (a low cost is good!) the risk of a model g: X → Y is L(g) = E{c(g(X), Y)} This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!) interpretation: the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g) this is a measure of generalization capabilities (if g has been built from a dataset) remark: we need a probability space, c and g have to be measurable functions, etc. 12 / 154 Fabrice Rossi Formalization
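
[Editor's illustration, not part of the slides] A minimal Python sketch of the risk as an expectation: it approximates L(g) = E{c(g(X), Y)} by a Monte Carlo average over draws from P. The toy distribution P, the model g and the cost are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Toy phenomenon P (assumed): X uniform on [0, 1], Y = sin(2*pi*X) + Gaussian noise."""
    X = rng.uniform(0.0, 1.0, size=n)
    Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=n)
    return X, Y

def quadratic_cost(pred, y):
    return (pred - y) ** 2

def risk(g, cost, n_mc=200_000):
    """Monte Carlo approximation of L(g) = E{c(g(X), Y)}."""
    X, Y = sample_P(n_mc)
    return cost(g(X), Y).mean()

# a deliberately crude model g(x) = 1 - 2x, just to obtain a number
print(risk(lambda x: 1.0 - 2.0 * x, quadratic_cost))
```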

32 We are frequentists 13 / 154 Fabrice Rossi Formalization

33 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise 14 / 154 Fabrice Rossi Formalization

34 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification 14 / 154 Fabrice Rossi Formalization

35 Cost functions Regression (Y = R^q): the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖² other costs (Y = R): L1 (Laplacian) cost: c(g(x), y) = |g(x) − y| ε-insensitive: c_ε(g(x), y) = max(0, |g(x) − y| − ε) Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise Classification (Y = {1,..., q}): 0/1 cost: c(g(x), y) = I{g(x) ≠ y} then the risk is the probability of misclassification Remark: the cost is used to measure the quality of the model the machine learning algorithm may use a loss ≠ cost to build the model e.g.: the hinge loss for support vector machines more on this in part IV 14 / 154 Fabrice Rossi Formalization
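
[Editor's illustration, not part of the slides] The scalar regression costs listed above and the 0/1 classification cost, written out in Python. The Huber parameterization follows the slide's formula; the default values of ε and δ are arbitrary choices of mine.

```python
import numpy as np

def quadratic_cost(pred, y):
    return (pred - y) ** 2

def l1_cost(pred, y):
    return np.abs(pred - y)

def eps_insensitive_cost(pred, y, eps=0.1):
    return np.maximum(0.0, np.abs(pred - y) - eps)

def huber_cost(pred, y, delta=1.0):
    r = np.abs(pred - y)
    # (g(x) - y)^2 when |g(x) - y| < delta, 2*delta*(|g(x) - y| - delta/2) otherwise
    return np.where(r < delta, r ** 2, 2 * delta * (r - delta / 2))

def zero_one_cost(pred, y):
    # classification: the risk then becomes the probability of misclassification
    return (pred != y).astype(float)
```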

36 Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases 15 / 154 Fabrice Rossi Formalization

37 Machine learning algorithm given D_n, a machine learning algorithm/method builds a model g_{D_n} (= g_n for simplicity) g_n is a random variable with values in the set of measurable functions from X to Y the associated (random) risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases: keep in mind that L(g_n) is a random variable if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(g_n)} 15 / 154 Fabrice Rossi Formalization

38 Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y 16 / 154 Fabrice Rossi Formalization

39 Consistency best risk: L* = inf_g L(g) the infimum is taken over all measurable functions from X to Y a machine learning method is: strongly consistent: if L(g_n) → L* a.s. consistent: if E{L(g_n)} → L* universally (strongly) consistent: if the convergence holds for any distribution of the data P remark: if c is bounded then lim_n E{L(g_n)} = L* ⟺ L(g_n) → L* in probability 16 / 154 Fabrice Rossi Formalization

40 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set 17 / 154 Fabrice Rossi Formalization

41 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any series of observations (x_i, y_i)_{i ≥ 1} and for any ε > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ε where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n} N depends on ε and on the dataset (x_i, y_i)_{i ≥ 1} 17 / 154 Fabrice Rossi Formalization

42 Interpretation universality: no assumption on the data distribution (realistic) consistency: perfect generalization when given an infinite learning set strong case: for (almost) any series of observations (x_i, y_i)_{i ≥ 1} and for any ε > 0 there is N such that for n ≥ N, L(g_n) ≤ L* + ε where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n} N depends on ε and on the dataset (x_i, y_i)_{i ≥ 1} weak case: for any ε > 0 there is N such that for n ≥ N, E{L(g_n)} ≤ L* + ε where g_n is built on (x_i, y_i)_{1 ≤ i ≤ n} 17 / 154 Fabrice Rossi Formalization

43 Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ε but what about L(g_n) for a particular dataset? see also the strong case! 18 / 154 Fabrice Rossi Formalization

44 Interpretation (continued) We can be unlucky: strong case: L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε but what about (1/p) Σ_{i=n+1}^{n+p} c(g_n(x_i), y_i)? weak case: E{L(g_n)} ≤ L* + ε but what about L(g_n) for a particular dataset? see also the strong case! Stronger result, Probably Approximately Optimal algorithm: ∀ε, δ, ∃n(ε, δ) such that ∀n ≥ n(ε, δ), P{L(g_n) > L* + ε} < δ Gives some convergence speed: especially interesting when n(ε, δ) is independent of P (distribution free) 18 / 154 Fabrice Rossi Formalization

45 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent 19 / 154 Fabrice Rossi Formalization

46 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} →_n 0 by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability then, if c is bounded, E{L(g_n)} →_n L* 19 / 154 Fabrice Rossi Formalization

47 From PAO to consistency if convergence happens fast enough, then the method is (strongly) consistent weak consistency is easy (for c bounded): if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} →_n 0 by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability then, if c is bounded, E{L(g_n)} →_n L* if convergence happens faster, then the method might be strongly consistent we need a very sharp decrease of P{L(g_n) > L* + ε} with n, i.e., a slow increase of n(ε, δ) when δ decreases 19 / 154 Fabrice Rossi Formalization

48 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (Ai ) i 1 be a series of events define [Ai i.o.] = i=1 j=i A i (infinitely often) the lemma states that if i P {A i } < then P {[A i i.o.]} = 0 20 / 154 Fabrice Rossi Formalization

49 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (Ai ) i 1 be a series of events define [Ai i.o.] = i=1 j=i A i (infinitely often) the lemma states that if i P {A i } < then P {[A i i.o.]} = 0 if n P {L (g n) > L + ɛ} <, then P {[{L (g n ) > L + ɛ} i.o.]} = 0 20 / 154 Fabrice Rossi Formalization

50 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (Ai ) i 1 be a series of events define [Ai i.o.] = i=1 j=i A i (infinitely often) the lemma states that if i P {A i } < then P {[A i i.o.]} = 0 if n P {L (g n) > L + ɛ} <, then P {[{L (g n ) > L + ɛ} i.o.]} = 0 we have {L(g n ) L } = j=1 i=1 n=i {L (g n) L + 1 j } 20 / 154 Fabrice Rossi Formalization

51 From PAO to strong consistency this is based on the Borel-Cantelli Lemma let (A_i)_{i ≥ 1} be a series of events define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often) the lemma states that if Σ_i P{A_i} < ∞ then P{[A_i i.o.]} = 0 if Σ_n P{L(g_n) > L* + ε} < ∞, then P{[{L(g_n) > L* + ε} i.o.]} = 0 we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j} and therefore, L(g_n) → L* a.s. 20 / 154 Fabrice Rossi Formalization

52 Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n its risk is L(g_n) = E{c(g_n(X), Y) | D_n} statistical learning studies L(g_n) − L* and E{L(g_n)} − L* the preferred approach is to show a fast decrease of P{L(g_n) > L* + ε} with n no hypothesis on the data (i.e., on P), a.k.a. distribution free results 21 / 154 Fabrice Rossi Formalization

53 Classical statistics Even though we are frequentists... this program is quite different from classical statistics: no optimal parameters, no identifiability: intrinsically non-parametric no optimal model: only optimal performances no asymptotic distribution: focus on finite distance inequalities no (or minimal) assumption on the data distribution minimal hypotheses are justified by our lack of knowledge on the studied phenomenon but have strong and annoying consequences linked to the worst case scenario they imply 22 / 154 Fabrice Rossi Formalization

54 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} 23 / 154 Fabrice Rossi Formalization

55 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x} the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class arg max_k P{Y = k | X = x} 23 / 154 Fabrice Rossi Formalization

56 Optimal model there is nevertheless an optimal model regression: for the quadratic cost c(g(x), y) = ‖g(x) − y‖² the minimum risk L* is reached by g(x) = E{Y | X = x} classification: we need the posterior probabilities: P{Y = k | X = x} the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class arg max_k P{Y = k | X = x} we only need to mimic the prediction of the model, not the model itself: for optimal classification, we don't need to compute the posterior probabilities for the binary case (Y = {−1, 1}), the decision is based on P{Y = 1 | X = x} ≥ P{Y = −1 | X = x} we only need to agree with the sign of E{Y | X = x} 23 / 154 Fabrice Rossi Formalization
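
[Editor's illustration, not part of the slides] A minimal sketch of the Bayes classifier for a toy binary problem in which P is assumed known (a two-component Gaussian mixture of my choosing): the posterior P{Y = 1 | X = x} is computable, the Bayes rule assigns x to the most probable class, and the Bayes risk L* is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed toy P: P{Y=1} = 0.4, X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1)
p1 = 0.4

def gauss(x, m):
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

def posterior_1(x):
    """P{Y = 1 | X = x}, computable here only because P is known."""
    a, b = p1 * gauss(x, +1.0), (1 - p1) * gauss(x, -1.0)
    return a / (a + b)

def bayes_classifier(x):
    # assign x to the most probable class
    return (posterior_1(x) >= 0.5).astype(int)

# Monte Carlo estimate of the Bayes risk L* (probability of misclassification)
n = 500_000
Y = (rng.uniform(size=n) < p1).astype(int)
X = rng.normal(loc=np.where(Y == 1, 1.0, -1.0), scale=1.0)
print((bayes_classifier(X) != Y).mean())  # roughly 0.15 for this choice of P
```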

57 No free lunch unfortunately, making no hypothesis on P costs a lot 24 / 154 Fabrice Rossi Formalization

58 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost 24 / 154 Fabrice Rossi Formalization

59 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method for all ε > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ε 24 / 154 Fabrice Rossi Formalization

60 No free lunch unfortunately, making no hypothesis on P costs a lot no free lunch results for binary classification and misclassification cost there is always a bad distribution: given a fixed machine learning method for all ε > 0 and all n, there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ 1/2 − ε arbitrarily slow convergence always happens: given a fixed machine learning method and a decreasing series (a_n) with limit 0 (and a_1 ≤ 1/16) there is (X, Y) such that L* = 0 and E{L(g_n)} ≥ a_n 24 / 154 Fabrice Rossi Formalization

61 Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n, for all ε > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ε 25 / 154 Fabrice Rossi Formalization

62 Estimation difficulties in fact, estimating the Bayes risk L* (in binary classification) is difficult: given a fixed estimation algorithm for L*, denoted L̂_n, for all ε > 0 and all n, there is (X, Y) such that E{|L̂_n − L*|} ≥ 1/4 − ε regression is more difficult than classification: if η_n is an estimator of η(x) = E{Y | X = x} consistent in a weak sense: lim_n E{(η_n(X) − η(X))²} = 0, then for the plug-in classifier g_n(x) = I_{η_n(x) > 1/2}, we have lim_n E{L(g_n)} − L* ≤ 2 lim_n √(E{(η_n(X) − η(X))²}) = 0 25 / 154 Fabrice Rossi Formalization

63 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P 26 / 154 Fabrice Rossi Formalization

64 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient if P is arbitrarily complex, then no generalization can occur from a finite learning set e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses 26 / 154 Fabrice Rossi Formalization

65 Consequences no free lunch results show that we cannot get uniform convergence speed: they don't rule out universal (strong) consistency! they don't rule out PAO with hypotheses on P intuitive explanation: stationarity and independence are not sufficient if P is arbitrarily complex, then no generalization can occur from a finite learning set e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses but we don't want hypotheses on P... 26 / 154 Fabrice Rossi Formalization

66 Estimation and approximation solution: split the problem in two parts, estimation and approximation 27 / 154 Fabrice Rossi Formalization

67 Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! 27 / 154 Fabrice Rossi Formalization

68 Estimation and approximation solution: split the problem in two parts, estimation and approximation estimation: restrict the class of acceptable models to G, a set of measurable functions from X to Y study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G statistics needed! hypotheses on G replace hypotheses on P and allow uniform results! approximation: compare G to the set of all possible models in other words, study inf_{g∈G} L(g) − L* function approximation results (density), no statistics! 27 / 154 Fabrice Rossi Formalization

69 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost 28 / 154 Fabrice Rossi Formalization

70 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose gn in G by optimizing a quality measure defined via D n 28 / 154 Fabrice Rossi Formalization

71 Back to machine learning algorithms many ad hoc algorithms: k nearest neighbors tree based algorithms Adaboost but most algorithms are based on the following scheme: fix a model class G choose g_n in G by optimizing a quality measure defined via D_n examples: linear regression (X is a Hilbert space, Y = R): G = {x ↦ ⟨w, x⟩} find w_n by minimizing the mean squared error: w_n = arg min_w (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² linear classification (X = R^p, Y = {1,..., c}): G = {x ↦ arg max_k (Wx)_k} find W e.g., by minimizing a mean squared error for a disjunctive coding of the classes 28 / 154 Fabrice Rossi Formalization
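
[Editor's illustration, not part of the slides] The linear regression example above in a few lines of Python: the mean squared error over the linear class is minimized in closed form (here via least squares). The data-generating process is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative data: X in R^{n x p}, y drawn from an assumed linear-plus-noise phenomenon
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# empirical risk minimization over the linear class:
# w_n = argmin_w (1/n) * sum_i (y_i - <w, x_i>)^2, solved in closed form
w_n, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_n)
```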

72 Criterion the best solution would be to choose g n by minimizing L(g n ) over G, but: we cannot compute L(gn ) (P is unknown!) even if we knew L(g n ), optimization might be intractable (L has no reason to be convex, for instance) 29 / 154 Fabrice Rossi Formalization

73 Criterion the best solution would be to choose g n by minimizing L(g n ) over G, but: we cannot compute L(gn ) (P is unknown!) even if we knew L(g n ), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L n (g) = 1 n n c(g(x i ), Y i ) i=1 29 / 154 Fabrice Rossi Formalization

74 Criterion the best solution would be to choose g n by minimizing L(g n ) over G, but: we cannot compute L(gn ) (P is unknown!) even if we knew L(g n ), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L n (g) = 1 n n c(g(x i ), Y i ) i=1 naive justification by the strong law of large numbers (SLLN): when g is fixed, lim n L n (g) = L(g) 29 / 154 Fabrice Rossi Formalization

75 Criterion the best solution would be to choose g_n by minimizing L(g_n) over G, but: we cannot compute L(g_n) (P is unknown!) even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance) partial solution, the empirical risk: L_n(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_n L_n(g) = L(g) almost surely the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV) 29 / 154 Fabrice Rossi Formalization
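
[Editor's illustration, not part of the slides] The empirical risk is just an average of costs over the dataset; a minimal generic helper and a usage example with the quadratic cost (the data here are an illustrative assumption):

```python
import numpy as np

def empirical_risk(g, cost, X, Y):
    """L_n(g) = (1/n) * sum_i c(g(X_i), Y_i) on a dataset D_n = (X_i, Y_i)_{i=1..n}."""
    return np.mean(cost(g(X), Y))

rng = np.random.default_rng(0)
X = rng.uniform(size=100)
Y = 2 * X + 0.1 * rng.normal(size=100)
print(empirical_risk(lambda x: 2 * x, lambda pred, y: (pred - y) ** 2, X, Y))
```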

76 Empirical risk minimization generic machine learning method: fix a model class G choose g n in G by g n = arg min g G L n(g) denote L G = inf g G L(g) 30 / 154 Fabrice Rossi Formalization

77 Empirical risk minimization generic machine learning method: fix a model class G choose g n in G by g n = arg min g G L n(g) denote L G = inf g G L(g) does that work? E {L(g n )} L G? E {L(g n )} L? the SLLN does not help as g n depends on n 30 / 154 Fabrice Rossi Formalization

78 Empirical risk minimization generic machine learning method: fix a model class G choose g n in G by g n = arg min g G L n(g) denote L G = inf g G L(g) does that work? E {L(g n )} L G? E {L(g n )} L? the SLLN does not help as g n depends on n we are looking for bounds: L(gn) < L n (gn) + B (bound the risk using the empirical risk) L(gn) < L G + C (guarantee that the risk is close to the best possible one in the class) 30 / 154 Fabrice Rossi Formalization

79 Empirical risk minimization generic machine learning method: fix a model class G choose g_n in G by g_n = arg min_{g∈G} L_n(g) denote L*_G = inf_{g∈G} L(g) does that work? E{L(g_n)} → L*_G? E{L(g_n)} → L*? the SLLN does not help as g_n depends on n we are looking for bounds: L(g_n) < L_n(g_n) + B (bound the risk using the empirical risk) L(g_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class) L*_G − L* is handled differently (approximation vs estimation) 30 / 154 Fabrice Rossi Formalization

80 Complexity control The overfitting issue controlling L n (g n) L(g n) is not possible in arbitrary classes G 31 / 154 Fabrice Rossi Formalization

81 Complexity control The overfitting issue controlling L n (gn) L(gn) is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L n (gn) = 0 if P X has a density (via the classifier defined by the 1-nn rule) but L(g n ) > 0 when L > 0! the 1-nn rule is overfitting 31 / 154 Fabrice Rossi Formalization

82 Complexity control The overfitting issue controlling L n (gn) L(gn) is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L n (gn) = 0 if P X has a density (via the classifier defined by the 1-nn rule) but L(g n ) > 0 when L > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L G > L (for arbitrary P) 31 / 154 Fabrice Rossi Formalization

83 Complexity control The overfitting issue controlling L n (gn) L(gn) is not possible in arbitrary classes G a simple example: binary classification G: all piecewise functions on arbitrary partitions of X obviously L n (gn) = 0 if P X has a density (via the classifier defined by the 1-nn rule) but L(g n ) > 0 when L > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L G > L (for arbitrary P) how to reach L? (will be studied in part III) 31 / 154 Fabrice Rossi Formalization

84 Complexity control The overfitting issue controlling |L_n(g_n) − L(g_n)| is not possible in arbitrary classes G a simple example: binary classification G: all piecewise constant functions on arbitrary partitions of X obviously L_n(g_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule) but L(g_n) > 0 when L* > 0! the 1-nn rule is overfitting the only solution is to reduce the size of G (see part II) this implies L*_G > L* (for arbitrary P) how to reach L*? (will be studied in part III) in other words, how to balance estimation error and approximation error 31 / 154 Fabrice Rossi Formalization
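
[Editor's illustration, not part of the slides] A quick numerical sketch of the 1-nn overfitting example: on a toy problem with label noise (so L* > 0, here 0.2 by assumption), the 1-nearest-neighbour rule has zero empirical risk on the learning set but a clearly positive risk, estimated on a large independent sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.2):
    """Toy problem with non-zero Bayes risk: true class = I{x > 0}, labels flipped w.p. `noise`."""
    x = rng.uniform(-1, 1, size=n)
    y = (x > 0).astype(int)
    flip = rng.uniform(size=n) < noise
    return x, np.where(flip, 1 - y, y)

def one_nn_predict(x_train, y_train, x):
    """1-nearest-neighbour rule: a piecewise constant classifier on a data-dependent partition."""
    idx = np.abs(x[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

x_tr, y_tr = make_data(500)
x_te, y_te = make_data(10_000)
print("empirical risk:", (one_nn_predict(x_tr, y_tr, x_tr) != y_tr).mean())  # 0.0
print("risk estimate :", (one_nn_predict(x_tr, y_tr, x_te) != y_te).mean())  # well above L* = 0.2
```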

85 Increasing complexity key idea: choose g n in G n, i.e., in a class that depends on the datasize 32 / 154 Fabrice Rossi Formalization

86 Increasing complexity key idea: choose g n in G n, i.e., in a class that depends on the datasize estimation part: statistics make sure that Ln (gn) L(gn) is under control when gn = arg min g Gn L n (g n ) 32 / 154 Fabrice Rossi Formalization

87 Increasing complexity key idea: choose g n in G n, i.e., in a class that depends on the datasize estimation part: statistics make sure that Ln (gn) L(gn) is under control when gn = arg min g Gn L n (g n ) approximation part: functional analysis make sure that limn L G n = L related to density arguments (universal approximation) 32 / 154 Fabrice Rossi Formalization

88 Increasing complexity key idea: choose g_n in G_n, i.e., in a class that depends on the data size estimation part: statistics make sure that |L_n(g_n) − L(g_n)| is under control when g_n = arg min_{g∈G_n} L_n(g) approximation part: functional analysis make sure that lim_n L*_{G_n} = L* related to density arguments (universal approximation) this is the simplest approach but also the least realistic: G_n depends only on the data size 32 / 154 Fabrice Rossi Formalization

89 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes 33 / 154 Fabrice Rossi Formalization

90 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes more precisely: compute g n,d by empirical risk minimization in G d choose gn by minimizing over d L n (g n,d) + r(n, d) 33 / 154 Fabrice Rossi Formalization

91 Structural risk minimization key idea: use again more and more powerful classes G d but also compare models between classes more precisely: compute g n,d by empirical risk minimization in G d choose gn by minimizing over d L n (g n,d) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G d to favor models for which the risk is correctly estimated by the empirical risk it corresponds to a complexity measure for Gd 33 / 154 Fabrice Rossi Formalization

92 Structural risk minimization key idea: use again more and more powerful classes G_d but also compare models between classes more precisely: compute g_{n,d} by empirical risk minimization in G_d choose g_n by minimizing over d: L_n(g_{n,d}) + r(n, d) r(n, d) is a penalty term: it decreases with n and increases with the complexity of G_d to favor models for which the risk is correctly estimated by the empirical risk it corresponds to a complexity measure for G_d more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost 33 / 154 Fabrice Rossi Formalization
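
[Editor's illustration, not part of the slides] A toy structural risk minimization sketch: G_d is the class of polynomials of degree at most d, g_{n,d} is obtained by ERM (least squares), and the selected degree minimizes L_n(g_{n,d}) + r(n, d). The penalty r(n, d) = sqrt((d + 1)/n) is an illustrative choice of mine, not the penalty from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)

def erm_in_Gd(d):
    """ERM in G_d = polynomials of degree <= d, with the quadratic cost."""
    coef = np.polyfit(x, y, deg=d)
    emp_risk = np.mean((np.polyval(coef, x) - y) ** 2)
    return coef, emp_risk

def r(n, d):
    # penalty term: decreases with n, increases with the complexity of G_d (assumed form)
    return np.sqrt((d + 1) / n)

scores = {d: erm_in_Gd(d)[1] + r(n, d) for d in range(11)}
print("selected degree:", min(scores, key=scores.get))
```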

93 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) 34 / 154 Fabrice Rossi Formalization

94 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L G = L define g n by g n = arg min g G (L n(g) + λ n R(g)) 34 / 154 Fabrice Rossi Formalization

95 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L G = L define g n by g n = arg min g G (L n(g) + λ n R(g)) R(g) measures the complexity of the model the trade-off between the risk and the complexity, λ n, must be carefully chosen 34 / 154 Fabrice Rossi Formalization

96 Regularization key idea: use a large class but optimize a penalized version of the empirical risk (see part IV) more precisely: choose G such that L*_G = L* define g_n by g_n = arg min_{g∈G} (L_n(g) + λ_n R(g)) R(g) measures the complexity of the model the trade-off between the risk and the complexity, λ_n, must be carefully chosen this is the (one of the) best solution(s) 34 / 154 Fabrice Rossi Formalization
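
[Editor's illustration, not part of the slides] A minimal concrete instance of penalized empirical risk minimization: a linear class with the quadratic cost and R(w) = ‖w‖², i.e. ridge regression, which has a closed-form solution. The data and the value of λ are illustrative assumptions.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """g_n = argmin_w ( (1/n)*||y - Xw||^2 + lam*||w||^2 ), solved in closed form:
    (X^T X / n + lam * I) w = X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_erm(X, y, lam=0.1))
```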

97 Summary from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n a good ML algorithm builds a universally strongly consistent model, i.e. for all P, L(g_n) → L* a.s. a very general class of ML algorithms is based on empirical risk minimization, i.e. g_n = arg min_{g∈G} (1/n) Σ_{i=1}^n c(g(X_i), Y_i) consistency is obtained by controlling the estimation error and the approximation error 35 / 154 Fabrice Rossi Formalization

98 Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization 36 / 154 Fabrice Rossi Formalization

99 Outline part II: empirical risk and capacity measures part III: capacity control part IV: empirical loss and regularization not in this lecture (see the references): ad hoc methods such as k nearest neighbours and trees advanced topics such as noise conditions and data dependent bounds clustering Bayesian point of view and many other things / 154 Fabrice Rossi Formalization

100 Part II Empirical risk and capacity measures 37 / 154 Fabrice Rossi

101 Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 38 / 154 Fabrice Rossi

102 Empirical risk Empirical risk minimisation (ERM) relies on the link between L_n(g) and L(g) the law of large numbers is too limited: asymptotic only g must be fixed we need: 1. quantitative finite distance results, i.e., PAO like bounds on P{|L_n(g) − L(g)| > ε} 2. uniform results, i.e. bounds on P{sup_{g∈G} |L_n(g) − L(g)| > ε} 39 / 154 Fabrice Rossi Concentration

103 Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} 40 / 154 Fabrice Rossi Concentration

104 Concentration inequalities in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation: L_n(g) − L(g) = (1/n) Σ_{i=1}^n c(g(X_i), Y_i) − E{c(g(X), Y)} some standard bounds: Markov: P{|X| ≥ t} ≤ E{|X|}/t Bienaymé-Chebyshev: P{|X − E{X}| ≥ t} ≤ Var(X)/t² BC gives, for n independent real valued random variables X_i, denoting S_n = Σ_{i=1}^n X_i: P{|S_n − E{S_n}| ≥ t} ≤ Σ_{i=1}^n Var(X_i)/t² 40 / 154 Fabrice Rossi Concentration

105 Hoeffding's inequality Hoeffding, 1963 hypothesis: X_1,..., X_n, n independent r.v., X_i ∈ [a_i, b_i], S_n = Σ_{i=1}^n X_i result: P{S_n − E{S_n} ≥ ε} ≤ exp(−2ε²/Σ_{i=1}^n (b_i − a_i)²) and P{S_n − E{S_n} ≤ −ε} ≤ exp(−2ε²/Σ_{i=1}^n (b_i − a_i)²) Quantitative distribution free finite distance law of large numbers! 41 / 154 Fabrice Rossi Concentration

106 Direct application with a bounded cost c(u, v) [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] 42 / 154 Fabrice Rossi Concentration

107 Direct application with a bounded cost c(u, v) [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] U i = 1 n c(g(x i), Y i ), U i [ a n, b n ] Warning: g cannot depend on the (X i, Y i ) (or the U i are no longer independent) 42 / 154 Fabrice Rossi Concentration

108 Direct application with a bounded cost c(u, v) [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification [a, b] = [0, 1] U i = 1 n c(g(x i), Y i ), U i [ a n, b n ] Warning: g cannot depend on the (X i, Y i ) (or the U i are no longer independent) n i=1 (b i a i ) 2 = (b a)2 n and therefore P { L n (g) L(g) ɛ} 2e 2nɛ2 /(b a) 2 42 / 154 Fabrice Rossi Concentration

109 Direct application with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as a bounded Y in regression or the misclassification cost in binary classification ([a, b] = [0, 1]) U_i = (1/n) c(g(X_i), Y_i), U_i ∈ [a/n, b/n] Warning: g cannot depend on the (X_i, Y_i) (or the U_i are no longer independent) Σ_{i=1}^n (b_i − a_i)² = (b − a)²/n and therefore P{|L_n(g) − L(g)| ≥ ε} ≤ 2 exp(−2nε²/(b − a)²) then L_n(g) → L(g) in probability and L_n(g) → L(g) a.s. (Borel-Cantelli) 42 / 154 Fabrice Rossi Concentration
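
[Editor's illustration, not part of the slides] A quick simulation of this bound: for a fixed model g with 0/1 cost and true risk L(g) = 0.3 (an assumed value), the per-observation costs are i.i.d. Bernoulli(0.3), and the observed frequency of deviations |L_n(g) − L(g)| ≥ ε stays below the Hoeffding bound 2 exp(−2nε²).

```python
import numpy as np

rng = np.random.default_rng(0)

L, n, eps, reps = 0.3, 500, 0.05, 20_000
costs = rng.uniform(size=(reps, n)) < L           # i.i.d. 0/1 costs with mean L(g) = 0.3
deviation_freq = (np.abs(costs.mean(axis=1) - L) >= eps).mean()
hoeffding_bound = 2 * np.exp(-2 * n * eps ** 2)   # (b - a) = 1 here
print(deviation_freq, "<=", hoeffding_bound)
```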

110 Limitations g cannot depend on the (X_i, Y_i) (independent test set) intuitively: with δ = 2 exp(−2nε²/(b − a)²), for each g, the probability to draw from P a D_n on which |L_n(g) − L(g)| ≤ (b − a) √(log(2/δ)/(2n)) is at least 1 − δ but the sets of correct D_n differ for each g conversely for a fixed D_n, the bound is valid only for some of the models this cannot be used to study g_n = arg min_{g∈G} L_n(g) 43 / 154 Fabrice Rossi Concentration

111 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) 44 / 154 Fabrice Rossi Concentration

112 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a) √(log(1/δ)/(2p)) 44 / 154 Fabrice Rossi Concentration

113 The test set Hoeffding's inequality justifies the test set approach (hold out estimate) Given a dataset D_m: split the dataset into two disjoint sets: a learning set D_n and a test set D_p (with p + n = m) use a ML algorithm to build g_n using only D_n claim that L(g_n) ≈ L_p(g_n) = (1/p) Σ_{i=n+1}^{m} c(g_n(X_i), Y_i) In fact, with probability at least 1 − δ, L(g_n) ≤ L_p(g_n) + (b − a) √(log(1/δ)/(2p)) Example: classification (a = 0, b = 1), with a good classifier L_p(g_n) = 0.05 to be 99% sure that L(g_n) ≤ 0.06, we need p ≥ log(100)/(2 × 0.01²) ≈ 23 000 (!!!) 44 / 154 Fabrice Rossi Concentration
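
[Editor's illustration, not part of the slides] The required test set size follows directly from the hold-out bound above: (b − a)√(log(1/δ)/(2p)) ≤ margin gives p ≥ (b − a)² log(1/δ)/(2 · margin²). A two-line check of the example (margin 0.01, δ = 0.01):

```python
import math

def required_test_size(margin, delta, b_minus_a=1.0):
    """Smallest p with (b - a) * sqrt(log(1/delta) / (2p)) <= margin."""
    return math.ceil(b_minus_a ** 2 * math.log(1 / delta) / (2 * margin ** 2))

# to be 99% sure that L(g_n) <= L_p(g_n) + 0.01 (e.g. 0.05 -> 0.06):
print(required_test_size(margin=0.01, delta=0.01))  # 23026
```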

114 Proof of Hoeffding's inequality Chernoff's bounding technique (an application of Markov's inequality): P{X ≥ t} = P{e^{sX} ≥ e^{st}} ≤ E{e^{sX}} e^{−st} the bound is controlled via s lemma: if E{X} = 0 and X ∈ [a, b], then for all s, E{e^{sX}} ≤ e^{s²(b−a)²/8} (x ↦ e^{sx} is convex, so E{e^{sX}} ≤ e^{φ(s(b−a))}, then we bound φ) this is used for X = S_n − E{S_n} 45 / 154 Fabrice Rossi Concentration

115 Proof of Hoeffding's inequality P{S_n − E{S_n} ≥ ε} ≤ e^{−sε} E{e^{s Σ_{i=1}^n (X_i − E{X_i})}} = e^{−sε} Π_{i=1}^n E{e^{s(X_i − E{X_i})}} (independence) ≤ e^{−sε} Π_{i=1}^n e^{s²(b_i − a_i)²/8} (lemma) = e^{−sε} e^{s² Σ_{i=1}^n (b_i − a_i)²/8} = e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²} with s = 4ε/Σ_{i=1}^n (b_i − a_i)² (minimization) 46 / 154 Fabrice Rossi Concentration

116 Remarks Hoeffding's inequality is useful in many other contexts: Large-scale machine learning (e.g., [Domingos and Hulten, 2001]): enormous dataset (e.g., cannot be loaded in memory) process growing subsets until the model quality is known with sufficient confidence monitor drifting via confidence bounds Regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]): define strategies by minimizing a regret measure get bounds on the regret estimations (e.g., trade-off between exploration and exploitation in multi-armed bandits) Many improvements/complements are available, such as: Bernstein's inequality McDiarmid's bounded difference inequality 47 / 154 Fabrice Rossi Concentration

117 Uniform bounds we deal with the estimation error: given the ERM choice: g n = arg min g G L n (g) what about L(gn) inf g G L(g)? 48 / 154 Fabrice Rossi Concentration

118 Uniform bounds we deal with the estimation error: given the ERM choice: g n = arg min g G L n (g) what about L(gn) inf g G L(g)? we need uniform bounds: L n (gn) L(gn) sup L n (g) L(g) g G L(g n) inf g G L(g) = L(g n) L n (g n) + L n (g n) inf g G L(g) L(gn) L n (gn) + sup L n (g) L(g) g G 2 sup L n (g) L(g) g G 48 / 154 Fabrice Rossi Concentration

119 Uniform bounds we deal with the estimation error: given the ERM choice: g_n = arg min_{g∈G} L_n(g) what about L(g_n) − inf_{g∈G} L(g)? we need uniform bounds: |L_n(g_n) − L(g_n)| ≤ sup_{g∈G} |L_n(g) − L(g)| L(g_n) − inf_{g∈G} L(g) = L(g_n) − L_n(g_n) + L_n(g_n) − inf_{g∈G} L(g) ≤ |L(g_n) − L_n(g_n)| + sup_{g∈G} |L_n(g) − L(g)| ≤ 2 sup_{g∈G} |L_n(g) − L(g)| therefore uniform bounds give an answer to both questions: the quality of the empirical risk as an estimate of the risk and the quality of the model selected by ERM 48 / 154 Fabrice Rossi Concentration

120 Finite model class union bound : P {A ou B} P {A} + P {B} and therefore { } P max U i ɛ 1 i m m P {U i ɛ} i=1 49 / 154 Fabrice Rossi Concentration

121 Finite model class union bound : P {A ou B} P {A} + P {B} and therefore then when G is finite { P { } P max U i ɛ 1 i m sup L n (g) L(g) ɛ g G } m P {U i ɛ} i=1 2 G e 2nɛ2 /(b a) 2 so L(g n) p.s. inf g G L(g) (but in general inf g G L(g) > L ) 49 / 154 Fabrice Rossi Concentration

122 Finite model class union bound: P{A or B} ≤ P{A} + P{B} and therefore P{max_{1 ≤ i ≤ m} U_i ≥ ε} ≤ Σ_{i=1}^m P{U_i ≥ ε} then when G is finite P{sup_{g∈G} |L_n(g) − L(g)| ≥ ε} ≤ 2 |G| e^{−2nε²/(b−a)²} so L(g_n) → inf_{g∈G} L(g) a.s. (but in general inf_{g∈G} L(g) > L*) quantitative distribution free uniform finite distance law of large numbers! 49 / 154 Fabrice Rossi Concentration

123 Bound versus size with probability at least 1 − δ, L(g_n) ≤ inf_{g∈G} L(g) + 2(b − a) √((log |G| + log(2/δ))/(2n)) inf_{g∈G} L(g) decreases with |G| (more models) but the bound increases with |G|: trade-off between flexibility and estimation quality remark: log₂ |G| is the number of bits needed to encode the model choice; this is a capacity measure 50 / 154 Fabrice Rossi Concentration
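
[Editor's illustration, not part of the slides] Plugging numbers into the finite-class bound shows the trade-off: the estimation term grows only logarithmically with |G|. The values of |G|, n and δ below are illustrative assumptions.

```python
import math

def finite_class_gap(n, card_G, delta, b_minus_a=1.0):
    """Estimation-error term 2(b - a) * sqrt((log|G| + log(2/delta)) / (2n)),
    valid with probability at least 1 - delta."""
    return 2 * b_minus_a * math.sqrt((math.log(card_G) + math.log(2 / delta)) / (2 * n))

# e.g. |G| = 2**20 models (20 bits to encode the choice), n = 10000 observations, delta = 0.01
print(finite_class_gap(n=10_000, card_G=2 ** 20, delta=0.01))
```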

124 Outline Concentration Hoeffding inequality Uniform bounds Vapnik-Chervonenkis Dimension Definition Application to classification Proof Covering numbers Definition and results Computing covering numbers Summary 51 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

125 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as Glin = {x arg max k (W x) k } are infinite in addition infg Glin L(g) > 0 in general, even when L = 0 (nonlinear problems) 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

126 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as Glin = {x arg max k (W x) k } are infinite in addition infg Glin L(g) > 0 in general, even when L = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

127 Infinite model class so far we have uniform bounds when G is finite: even simple classes such as G_lin = {x ↦ arg max_k (Wx)_k} are infinite in addition inf_{g∈G_lin} L(g) > 0 in general, even when L* = 0 (nonlinear problems) solution: evaluate the capacity of a class rather than its size first step: binary classification (the simplest case) a model is then a measurable set A ⊂ X the capacity of G measures the variability of the available shapes it is evaluated with respect to the data: if A ∩ D_n = B ∩ D_n, then A and B are identical 52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

128 Growth function abstract setting: F is a set of measurable functions from R d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R d ) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

129 Growth function abstract setting: F is a set of measurable functions from R d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R d ) given z 1 R d,..., z n R d, we define F z1,...,z n = {u {0, 1} n f F, u = (f (z 1 ),..., f (z n ))} interpretation: each u describes a binary partition of z 1,..., z n and F z1,...,z n is then the set of partitions implemented by F 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

130 Growth function abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words a set of (indicator functions of) measurable subsets of R^d) given z_1 ∈ R^d,..., z_n ∈ R^d, we define F_{z_1,...,z_n} = {u ∈ {0, 1}^n : ∃f ∈ F, u = (f(z_1),..., f(z_n))} interpretation: each u describes a binary partition of z_1,..., z_n and F_{z_1,...,z_n} is then the set of partitions implemented by F Growth function S_F(n) = sup_{z_1 ∈ R^d,..., z_n ∈ R^d} |F_{z_1,...,z_n}| interpretation: maximal number of binary partitions implementable via F (distribution free worst case analysis) 53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

131 Vapnik-Chervonenkis dimension S F (n) 2 n vocabulary: S F (n) is the n-th shatter coefficient of F if F z1,...,z n = 2 n, F shatters z 1,..., z n 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

132 Vapnik-Chervonenkis dimension S F (n) 2 n vocabulary: S F (n) is the n-th shatter coefficient of F if F z1,...,z n = 2 n, F shatters z 1,..., z n Vapnik-Chervonenkis dimension VCdim (F) = sup{n N + S F (n) = 2 n } 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

133 Vapnik-Chervonenkis dimension S_F(n) ≤ 2^n vocabulary: S_F(n) is the n-th shatter coefficient of F if |F_{z_1,...,z_n}| = 2^n, F shatters z_1,..., z_n Vapnik-Chervonenkis dimension VCdim(F) = sup{n ∈ N^+ : S_F(n) = 2^n} interpretation: if S_F(n) < 2^n: for all z_1,..., z_n, there is a partition u ∈ {0, 1}^n such that u ∉ F_{z_1,...,z_n}; for any dataset of size n, F can fail to reach zero empirical risk if S_F(n) = 2^n: there is at least one set z_1,..., z_n shattered by F 54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
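
[Editor's illustration, not part of the slides] A brute-force way to get a feel for shatter coefficients: sample many random half-plane parameters (a, b, c) and count how many distinct labelings of a point set are realized. Random search only gives a lower bound on |F_{z_1..z_n}|, but for 3 points in general position it typically finds all 2^3 labelings, while 4 points are never shattered (VCdim of half-planes in R^2 is 3).

```python
import numpy as np

rng = np.random.default_rng(0)

def achieved_labelings(points, n_trials=200_000):
    """Crude lower bound on |F_{z_1..z_n}| for half-planes f(z) = I{a*x + b*y + c >= 0},
    obtained by sampling random parameters (a, b, c)."""
    W = rng.normal(size=(n_trials, 3))                 # rows are (a, b, c)
    signs = (points @ W[:, :2].T + W[:, 2] >= 0).T     # one row of labels per (a, b, c)
    return {tuple(row) for row in signs.astype(int)}

three = rng.normal(size=(3, 2))   # 3 points in R^2, almost surely in general position
four = rng.normal(size=(4, 2))    # 4 points in R^2
print(len(achieved_labelings(three)), "of", 2 ** 3)   # typically all 8: shattered
print(len(achieved_labelings(four)), "of", 2 ** 4)    # always < 16: not shattered
```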

134 Example points from R^2, F = {I_{ax+by+c ≥ 0}} 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

135 Example points from R^2, F = {I_{ax+by+c ≥ 0}}, S_F(3) = 2^3 [figure, slides 135-138: half-planes realizing the 2^3 labelings of three points] 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

139 Example points from R^2, F = {I_{ax+by+c ≥ 0}}, S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

140 Example points from R^2, F = {I_{ax+by+c ≥ 0}}, S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 S_F(4) < 2^4 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

141 Example points from R^2, F = {I_{ax+by+c ≥ 0}}, S_F(3) = 2^3 even if we have sometimes |F_{z_1,z_2,z_3}| < 2^3 S_F(4) < 2^4 VCdim(F) = 3 55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

142 Linear models result if G is a p dimensional vector space of real valued functions defined on R d then }) VCdim ({f : R d {0, 1} g G, f (z) = I {g(z) 0} p 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

143 Linear models result if G is a p dimensional vector space of real valued functions defined on R d then }) VCdim ({f : R d {0, 1} g G, f (z) = I {g(z) 0} p proof given z 1,..., z p+1, let F from G to R p+1 be F(g) = (g(z 1 ),..., g(z p+1 )) 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

144 Linear models result if G is a p dimensional vector space of real valued functions defined on R d then }) VCdim ({f : R d {0, 1} g G, f (z) = I {g(z) 0} p proof given z 1,..., z p+1, let F from G to R p+1 be F(g) = (g(z 1 ),..., g(z p+1 )) as dim(f(g)) p, there is a non zero γ = (γ 1,..., γ p+1 ) such that p+1 i=1 γ ig(z i ) = 0 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension

145 Linear models result if G is a p dimensional vector space of real valued functions defined on R^d then VCdim({f : R^d → {0, 1} : ∃g ∈ G, f(z) = I_{g(z) ≥ 0}}) ≤ p proof given z_1,..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z_1),..., g(z_{p+1})) as dim(F(G)) ≤ p, there is a non zero γ = (γ_1,..., γ_{p+1}) such that Σ_{i=1}^{p+1} γ_i g(z_i) = 0 for all g ∈ G if z_1,..., z_{p+1} is shattered, there is also g_j such that g_j(z_i) = δ_ij and therefore γ_j = 0 for all j, which contradicts γ ≠ 0 56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension


More information

The CUSUM algorithm a small review. Pierre Granjon

The CUSUM algorithm a small review. Pierre Granjon The CUSUM algorithm a small review Pierre Granjon June, 1 Contents 1 The CUSUM algorithm 1.1 Algorithm............................... 1.1.1 The problem......................... 1.1. The different steps......................

More information

Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon

Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms Stéphan Clémençon LTCI UMR CNRS No. 5141 - Telecom ParisTech - Journée Traitement de Masses de Données du Laboratoire JL Lions

More information

Introduction to Online Learning Theory

Introduction to Online Learning Theory Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1

Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1 Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1 J. Zhang Institute of Applied Mathematics, Chongqing University of Posts and Telecommunications, Chongqing

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Several Views of Support Vector Machines

Several Views of Support Vector Machines Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

More information

Non Parametric Inference

Non Parametric Inference Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.

More information

Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle

More information

Date: April 12, 2001. Contents

Date: April 12, 2001. Contents 2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Trading regret rate for computational efficiency in online learning with limited feedback

Trading regret rate for computational efficiency in online learning with limited feedback Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai

More information

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

More information

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Lecture 6 Online and streaming algorithms for clustering

Lecture 6 Online and streaming algorithms for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Tail inequalities for order statistics of log-concave vectors and applications

Tail inequalities for order statistics of log-concave vectors and applications Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

(Quasi-)Newton methods

(Quasi-)Newton methods (Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

More information

Language Modeling. Chapter 1. 1.1 Introduction

Language Modeling. Chapter 1. 1.1 Introduction Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Fuzzy Probability Distributions in Bayesian Analysis

Fuzzy Probability Distributions in Bayesian Analysis Fuzzy Probability Distributions in Bayesian Analysis Reinhard Viertl and Owat Sunanta Department of Statistics and Probability Theory Vienna University of Technology, Vienna, Austria Corresponding author:

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

More information

Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

More information

1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09

1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09 1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch

More information

Week 1: Introduction to Online Learning

Week 1: Introduction to Online Learning Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider

More information

1. Prove that the empty set is a subset of every set.

1. Prove that the empty set is a subset of every set. 1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

More information

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

More information

Random graphs with a given degree sequence

Random graphs with a given degree sequence Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

IN most approaches to binary classification, classifiers are

IN most approaches to binary classification, classifiers are 3806 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 11, NOVEMBER 2005 A Neyman Pearson Approach to Statistical Learning Clayton Scott, Member, IEEE, Robert Nowak, Senior Member, IEEE Abstract The

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Polarization codes and the rate of polarization

Polarization codes and the rate of polarization Polarization codes and the rate of polarization Erdal Arıkan, Emre Telatar Bilkent U., EPFL Sept 10, 2008 Channel Polarization Given a binary input DMC W, i.i.d. uniformly distributed inputs (X 1,...,

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

The Ergodic Theorem and randomness

The Ergodic Theorem and randomness The Ergodic Theorem and randomness Peter Gács Department of Computer Science Boston University March 19, 2008 Peter Gács (Boston University) Ergodic theorem March 19, 2008 1 / 27 Introduction Introduction

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

1 Another method of estimation: least squares

1 Another method of estimation: least squares 1 Another method of estimation: least squares erm: -estim.tex, Dec8, 009: 6 p.m. (draft - typos/writos likely exist) Corrections, comments, suggestions welcome. 1.1 Least squares in general Assume Y i

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

Hypothesis Testing for Beginners

Hypothesis Testing for Beginners Hypothesis Testing for Beginners Michele Piffer LSE August, 2011 Michele Piffer (LSE) Hypothesis Testing for Beginners August, 2011 1 / 53 One year ago a friend asked me to put down some easy-to-read notes

More information

ALMOST COMMON PRIORS 1. INTRODUCTION

ALMOST COMMON PRIORS 1. INTRODUCTION ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type

More information

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

More information

A Stochastic 3MG Algorithm with Application to 2D Filter Identification

A Stochastic 3MG Algorithm with Application to 2D Filter Identification A Stochastic 3MG Algorithm with Application to 2D Filter Identification Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2 1 Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est,

More information

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

Comparison of machine learning methods for intelligent tutoring systems

Comparison of machine learning methods for intelligent tutoring systems Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information