Statistique en grande dimension

Lecturer: Dalalyan A. Scribe: Thomas F.-X.

First lecture

1 Introduction

1.1 Classical statistics

Parametric statistics: $Z_1, \ldots, Z_n$ i.i.d. with common distribution $P_\theta$. We make the assumption $\theta \in \Theta \subset \mathbb{R}^d$.

Known: $Z_1, \ldots, Z_n$ and $\Theta$. Unknown: $\theta$ or $P_\theta$.

Important assumption: $d$ is fixed and $n \to +\infty$.

In this case, the maximum likelihood (ML) estimator is known to be asymptotically (the most) efficient (consistent): $\hat\theta^{\mathrm{ML}}$ satisfies, as $n \to +\infty$,
\[
\mathbb{E}_{P_\theta}\big[\|\hat\theta^{\mathrm{ML}} - \theta\|^2\big] = \frac{C}{n}\,(1 + o(1)). \qquad (1)
\]
Thus $\theta$ is estimated at the rate $1/\sqrt{n}$ (the parametric rate).

Observation. If $d = d_n$ is such that $\lim_{n \to +\infty} d_n = +\infty$, then the whole parametric theory becomes unusable. Moreover, the ML estimator is no longer the best estimator!

1.2 Nonparametric statistics

We observe $Z_1, \ldots, Z_n$ i.i.d. with distribution $P$, unknown, such that $P \in \{P_\theta, \theta \in \Theta\}$, but where $\Theta$ is either of infinite dimension, or of finite dimension $d = d_n$ tending to $+\infty$ with the sample size.

Examples:
\[
\Theta = \{f : [0,1] \to \mathbb{R},\ f \text{ Lipschitz with constant } L\} = \{f : [0,1] \to \mathbb{R} : \forall x, y,\ |f(x) - f(y)| \le L\,|x - y|\}, \qquad (2)
\]
\[
\Theta = \Big\{\theta = (\theta_1, \theta_2, \ldots) : \textstyle\sum_{j=1}^{\infty} \theta_j^2 < +\infty\Big\} = \ell^2. \qquad (3)
\]

General approach. We approximate $\Theta$ by an increasing sequence $(\Theta_k)$ of subsets of $\Theta$ such that $\Theta_k$ is of dimension $d_k$. Proceeding as if $\theta$ belonged to $\Theta_k$ (which is not necessarily the case), we use a parametric method to define an estimator $\hat\theta_k$ of $\theta$. This gives us a family of estimators $(\hat\theta_k)$.

Main question. How should $k$ be chosen so as to minimize the risk of $\hat\theta_k$? If $k$ is small, we face an underfitting phenomenon; conversely, if $k$ is large, an overfitting phenomenon. The sketch below illustrates this trade-off.
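To make the underfitting/overfitting trade-off concrete, here is a minimal numerical sketch (my own illustration, not part of the lecture): polynomial least-squares fits of increasing degree $k$ play the role of the nested models $\Theta_k$; the true regression function, the noise level and the sample size are assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = rng.uniform(0.0, 1.0, n)
    f_true = lambda t: np.sin(2 * np.pi * t)      # assumed true regression function
    y = f_true(x) + 0.3 * rng.standard_normal(n)  # noisy observations

    x_new = np.linspace(0.0, 1.0, 1000)
    for k in (1, 3, 20):                          # k plays the role of dim(Theta_k)
        coeffs = np.polyfit(x, y, deg=k)          # parametric (least-squares) fit in Theta_k
        risk = np.mean((np.polyval(coeffs, x_new) - f_true(x_new)) ** 2)
        print(f"k = {k:2d}  approximate risk = {risk:.4f}")
    # Typically the risk is large for k = 1 (underfitting), smallest near k = 3,
    # and grows again for k = 20 (overfitting).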

1.3 Principal models in nonparametric statistics

Density model. We have $X_1, \ldots, X_n$ i.i.d. with a density $f$ defined on $\mathbb{R}^p$, and
\[
P(X \in A) = \int_A f(x)\,dx.
\]
The assumptions imposed on $f$ are very weak, as opposed to the parametric setting. For instance, a typical assumption in the parametric setting is that $f$ is the Gaussian density
\[
f(x) = \frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}}\,\exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big],
\]
whereas a common assumption on $f$ in the nonparametric framework is: $f$ is smooth, say, twice continuously differentiable with a bounded second derivative.

Regression model. We observe $Z_i = (X_i, Y_i)$, with input $X_i$, output $Y_i$ and error $\varepsilon_i$:
\[
Y_i = f(X_i) + \varepsilon_i.
\]
The function $f$ is called the regression function. Here, the goal is to estimate $f$ without assuming any parametric structure on it.

Practical example: marketing. Each $i$ represents a consumer, and $X_i$ gathers the features of that consumer. A typical question is: how do I identify the different relevant groups of consumers? A typical answer is to use clustering algorithms. We assume that $X_1, \ldots, X_n$ are i.i.d. with density $f$. Then we estimate $f$ in a nonparametric manner by $\hat f$, and the clusters are defined as the regions around the local maxima of the function $\hat f$ (see the sketch at the end of this section).

1.4 Machine learning

Essentially the same as nonparametric statistics. The main focus here is on the algorithms (rather than on the models), their statistical performance and their computational complexity.
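As a rough sketch of the marketing example above (my own illustration; the two-group mixture generating the data is an assumption), one can estimate the density with a Gaussian kernel and read the clusters off the local maxima of $\hat f$:

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)
    # Hypothetical one-dimensional consumer feature with two latent groups
    # (the mixture below is an assumption made for the illustration)
    x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.8, 200)])

    f_hat = gaussian_kde(x)              # nonparametric (kernel) density estimate
    grid = np.linspace(-5.0, 5.0, 1001)
    dens = f_hat(grid)

    # Cluster centers = local maxima of the estimated density f_hat
    interior = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
    print("estimated cluster centers:", np.round(grid[1:-1][interior], 2))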

2 Main concepts and notations

Observations: $Z_1, \ldots, Z_n$ i.i.d. $\sim P$.

Unsupervised learning: $Z_i = X_i$.

Supervised learning: $Z_i = (X_i, Y_i)$, where $X_i$ is an example (or a feature vector) and $Y_i$ a label.

Aim. To learn the distribution $P$, or some properties of it.

Prediction. We assume that a new feature $X$ (drawn from the same probability distribution as $X_1, \ldots, X_n$) is observed. The aim is to predict the label associated with $X$. To measure the quality of a prediction, we need a loss function $\ell(y, \tilde y)$ ($y$ is the true label, $\tilde y$ is the predicted label). In practice, both $y$ and $\tilde y$ are random variables; furthermore, $y$ and its distribution are unknown, so $\ell$ is hard to compute!

Risk function. This is the expectation of the loss.

Definition 1. Assume that $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ and that $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. A predictor, or prediction algorithm, is any mapping
\[
\hat g : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}^{\mathcal{X}}.
\]
The risk of a prediction function $g \in \mathcal{Y}^{\mathcal{X}}$ is
\[
R_P[g] = \mathbb{E}_P\big[\ell(Y, g(X))\big] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, g(x))\,dP(x, y).
\]
The risk of a predictor $\hat g$ is $R_P[\hat g]$, which is random since $\hat g$ depends on the data.

Examples:
1. Binary classification: $\mathcal{Y} = \{0, 1\}$, with any $\mathcal{X}$, and
\[
\ell(y, \tilde y) = \begin{cases} 0 & \text{if } y = \tilde y, \\ 1 & \text{otherwise,} \end{cases} \quad \text{i.e. } \ell(y, \tilde y) = \mathbb{1}(y \neq \tilde y) = (y - \tilde y)^2.
\]
2. Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$, with any $\mathcal{X}$, and $\ell(y, \tilde y) = (y - \tilde y)^2$.
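A small Monte Carlo sketch (my own, under an assumed joint distribution of $(X, Y)$) showing how the risk $R_P[g]$ of a fixed prediction function can be approximated for the two example losses; it also illustrates that the 0-1 and squared losses coincide on $\{0,1\}$-valued labels:

    import numpy as np

    rng = np.random.default_rng(2)
    m = 100_000                                 # Monte Carlo sample size
    X = rng.uniform(0.0, 1.0, m)
    eta = 0.2 + 0.6 * X                         # assumed P(Y = 1 | X = x)
    Y = (rng.uniform(0.0, 1.0, m) < eta).astype(float)

    g = lambda x: (x > 0.5).astype(float)       # a fixed prediction function

    # Binary classification: R_P[g] = E[1(Y != g(X))]
    print("0-1 risk    :", np.mean(Y != g(X)))
    # Least squares: R_P[g] = E[(Y - g(X))^2]; equal to the 0-1 risk here,
    # since both Y and g(X) take values in {0, 1}
    print("squared risk:", np.mean((Y - g(X)) ** 2))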

3 Excess risk and Bayes predictor

We have $Z_i = (X_i, Y_i)$ and
\[
R_P[g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, g(x))\,P(dx, dy), \qquad P(dx, dy) = P_{Y|X}(dy \mid X = x)\,P_X(dx).
\]

Definition 2. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the Bayes predictor, or oracle, is the prediction function minimizing the risk:
\[
g^* \in \arg\min_{g \in \mathcal{Y}^{\mathcal{X}}} R_P[g].
\]

Remark 1. In practice, $g^*$ is unavailable, since it depends on $P$, which is unknown. The ultimate goal is to do almost as well as the oracle. A predictor $\hat g_n$ will be considered a good one if
\[
\lim_{n \to +\infty} \underbrace{R_P[\hat g_n] - R_P[g^*]}_{\text{excess risk}} = 0.
\]

Definition 3. We say that the predictor $\hat g_n$ is consistent (respectively, universally consistent) if for the given $P$ (respectively, for every $P$) we have
\[
\lim_{n \to +\infty} \mathbb{E}_P\big[R_P[\hat g_n]\big] - R_P[g^*] = 0.
\]

Theorem 1.
1. Suppose that, for every $x \in \mathcal{X}$, the infimum of $y \mapsto \mathbb{E}_P[\ell(Y, y) \mid X = x]$ is attained. Then the function $g^*$ defined by
\[
g^*(x) \in \arg\min_{y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, y) \mid X = x]
\]
is a Bayes predictor.
2. In the case of binary classification, $\mathcal{Y} = \{0, 1\}$ and $\ell(y, \tilde y) = \mathbb{1}(y \neq \tilde y)$,
\[
g^*(x) = \mathbb{1}\big(\eta^*(x) > \tfrac{1}{2}\big), \quad \text{where } \eta^*(x) = P[Y = 1 \mid X = x].
\]
Furthermore, the excess risk can be computed by
\[
R_P[g] - R_P[g^*] = \mathbb{E}_P\big[(g(X) - g^*(X))(1 - 2\eta^*(X))\big]. \qquad (4)
\]
3. In the case of least-squares regression,
\[
g^*(x) = \eta^*(x), \quad \text{where } \eta^*(x) = \mathbb{E}_P[Y \mid X = x].
\]
Furthermore, for any $\eta : \mathcal{X} \to \mathcal{Y}$, we have
\[
R_P[\eta] - R_P[\eta^*] = \mathbb{E}_P\big[(\eta(X) - \eta^*(X))^2\big].
\]

Proof. 1. Let $g \in \mathcal{Y}^{\mathcal{X}}$ and let
\[
g^*(x) \in \arg\min_{y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, y) \mid X = x].
\]
We have
\[
R_P[g] = \mathbb{E}_P[\ell(Y, g(X))] = \int \mathbb{E}_P[\ell(Y, g(x)) \mid X = x]\,P_X(dx) \ge \int \mathbb{E}_P[\ell(Y, g^*(x)) \mid X = x]\,P_X(dx) = R_P[g^*].
\]

2. Using the first assertion,
\[
g^*(x) \in \arg\min_{y \in \{0,1\}} P[Y \neq y \mid X = x] = \arg\min_{y \in \{0,1\}} \big(1 - P[Y = y \mid X = x]\big) = \arg\max_{y \in \{0,1\}} P[Y = y \mid X = x] = \arg\max_{y \in \{0,1\}} \big(\eta^*(x)\,\mathbb{1}(y = 1) + (1 - \eta^*(x))\,\mathbb{1}(y = 0)\big).
\]
Therefore,
\[
g^*(x) = \begin{cases} 0 & \text{if } P[Y = 1 \mid X = x] \le \tfrac{1}{2}, \\ 1 & \text{otherwise.} \end{cases}
\]
To check (4), it suffices to remark that (since $g(X)$ and $Y$ take values in $\{0,1\}$, so that $g(X)^2 = g(X)$ and $Y^2 = Y$)
\[
R_P[g] = \mathbb{E}_P\big[(g(X) - Y)^2\big] = \mathbb{E}_P[g(X)^2] + \mathbb{E}_P[Y^2] - 2\mathbb{E}_P[Y g(X)] = \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\mathbb{E}_P\big[\mathbb{E}_P[Y g(X) \mid X]\big] = \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\mathbb{E}_P\big[g(X)\,\mathbb{E}_P[Y \mid X]\big] = \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\mathbb{E}_P[g(X)\,\eta^*(X)] = \mathbb{E}_P\big[g(X)(1 - 2\eta^*(X))\big] + \mathbb{E}_P[Y].
\]
Writing the same identity for $g^*$ and taking the difference of these two identities, we get the desired result.

3. In view of the first assertion of the theorem, we have
\[
g^*(x) \in \arg\min_{y \in \mathbb{R}} \mathbb{E}_P\big[(Y - y)^2 \mid X = x\big] = \arg\min_{y \in \mathbb{R}} \varphi(y),
\]
where $\varphi(y) = \mathbb{E}_P[Y^2 \mid X = x] - 2y\,\mathbb{E}_P[Y \mid X = x] + y^2$ is a second-order polynomial. The minimization of such a polynomial is straightforward and leads to
\[
\arg\min_{y \in \mathbb{R}} \varphi(y) = \mathbb{E}_P[Y \mid X = x].
\]
This shows that the Bayes predictor is equal to the regression function $\eta^*(x)$. The risk of an arbitrary prediction function $\eta$ is
\[
R_P[\eta] = \mathbb{E}_P\big[(Y - \eta(X))^2\big] = \mathbb{E}_P\Big[\mathbb{E}_P\big[(Y - \eta(X))^2 \mid X\big]\Big] = \mathbb{E}_P\Big[\mathbb{E}_P\big[(Y - \eta^*(X))^2 + 2(Y - \eta^*(X))(\eta^* - \eta)(X) + (\eta^* - \eta)^2(X) \mid X\big]\Big] = R_P[\eta^*] + 0 + \mathbb{E}_P\big[(\eta^* - \eta)^2(X)\big],
\]
where the cross-product term vanishes since
\[
\mathbb{E}_P\big[(Y - \eta^*(X))(\eta^* - \eta)(X) \mid X\big] = (\eta^* - \eta)(X)\,\mathbb{E}_P\big[Y - \eta^*(X) \mid X\big] = 0.
\]
This completes the proof of the theorem.
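A quick numerical sanity check of assertion 2 (my own sketch; the choice of $\eta^*$ is an assumption): the excess risk of an arbitrary classifier $g$, computed directly as a difference of 0-1 risks, should match formula (4).

    import numpy as np

    rng = np.random.default_rng(3)
    m = 200_000
    X = rng.uniform(0.0, 1.0, m)
    eta = X ** 2                               # assumed eta*(x) = P(Y = 1 | X = x)
    Y = (rng.uniform(0.0, 1.0, m) < eta).astype(float)

    g_star = (eta > 0.5).astype(float)         # Bayes classifier 1(eta* > 1/2)
    g = (X > 0.3).astype(float)                # an arbitrary competitor

    direct = np.mean(Y != g) - np.mean(Y != g_star)
    formula = np.mean((g - g_star) * (1 - 2 * eta))   # right-hand side of (4)
    print(f"direct excess risk: {direct:.4f}, formula (4): {formula:.4f}")

Both printed values agree up to Monte Carlo error (about 0.19 with this choice of $\eta^*$ and $g$).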

3.1 Link between binary classification and regression

Plug-in rule.
1. We start by estimating $\eta^*(x)$ by $\hat\eta_n(x)$;
2. We define $\hat g_n(x) = \mathbb{1}\big(\hat\eta_n(x) > \tfrac{1}{2}\big)$.

Question: how good is the plug-in rule $\hat g_n$?

Proposition 1. Let $\hat\eta$ be an estimator of the regression function $\eta^*$, and let $\hat g(x) = \mathbb{1}(\hat\eta(x) > \tfrac{1}{2})$. Then we have
\[
R_{\mathrm{class}}[\hat g] - R_{\mathrm{class}}[g^*] \le 2\sqrt{R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*]},
\]
where $R_{\mathrm{class}}$ denotes the risk with the 0-1 loss and $R_{\mathrm{reg}}$ the risk with the squared loss.

Proof. Let $\eta : \mathcal{X} \to \mathbb{R}$, let $g(x) = \mathbb{1}(\eta(x) > \tfrac{1}{2})$, and let us compute the excess risk of $g$. We have
\[
R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}_P\big[(g(X) - g^*(X))(1 - 2\eta^*(X))\big].
\]
Since $g$ and $g^*$ are both indicator functions and, therefore, take only the values 0 and 1, their difference is nonzero if and only if one of them is equal to 1 and the other one is equal to 0. This leads to
\[
R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}_P\big[\mathbb{1}(\eta(X) \le \tfrac{1}{2} < \eta^*(X))\,(2\eta^*(X) - 1)\big] + \mathbb{E}_P\big[\mathbb{1}(\eta^*(X) \le \tfrac{1}{2} < \eta(X))\,(1 - 2\eta^*(X))\big] = 2\,\mathbb{E}_P\big[\mathbb{1}\big(\tfrac{1}{2} \in [\eta(X), \eta^*(X)]\big)\,\big|\eta^*(X) - \tfrac{1}{2}\big|\big],
\]
where $[\eta(X), \eta^*(X)]$ denotes the interval with these endpoints, in either order. If $\eta(X) \le \tfrac{1}{2}$ and $\eta^*(X) > \tfrac{1}{2}$ (or the other way around), then $|\eta^*(X) - \tfrac{1}{2}| \le |\eta^*(X) - \eta(X)|$, and thus
\[
R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] \le 2\,\mathbb{E}_P\big[\mathbb{1}\big(\tfrac{1}{2} \in [\eta(X), \eta^*(X)]\big)\,|\eta^*(X) - \eta(X)|\big] \le 2\,\mathbb{E}_P\big[|\eta(X) - \eta^*(X)|\big] \le 2\sqrt{\mathbb{E}_P\big[(\eta(X) - \eta^*(X))^2\big]} = 2\sqrt{R_{\mathrm{reg}}[\eta] - R_{\mathrm{reg}}[\eta^*]},
\]
where the penultimate inequality follows from Cauchy-Schwarz (or Jensen) and the final equality from assertion 3 of Theorem 1. Since this inequality is true for every deterministic $\eta$, we get the desired property (applying it conditionally on the sample gives the claim for the data-dependent $\hat\eta$).
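Finally, a small empirical sketch of Proposition 1 (my own illustration; both the regression function $\eta^*$ and the nearest-neighbor estimator $\hat\eta$ are assumptions made for the example): we build the plug-in classifier from $\hat\eta$ and compare its classification excess risk with the bound $2\sqrt{R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*]}$.

    import numpy as np

    rng = np.random.default_rng(4)

    def eta_star(x):                    # assumed regression function P(Y=1|X=x)
        return 0.5 + 0.4 * np.sin(2 * np.pi * x)

    # Training sample
    n = 2000
    Xtr = rng.uniform(0.0, 1.0, n)
    Ytr = (rng.uniform(0.0, 1.0, n) < eta_star(Xtr)).astype(float)

    def eta_hat(x, k=50):               # naive k-nearest-neighbor estimate of eta*
        d = np.abs(Xtr[None, :] - np.asarray(x)[:, None])
        idx = np.argpartition(d, k, axis=1)[:, :k]
        return Ytr[idx].mean(axis=1)

    # Evaluate both excess risks on a fine grid (X ~ Uniform(0, 1))
    grid = np.linspace(0.0, 1.0, 2001)
    eh, es = eta_hat(grid), eta_star(grid)
    g_hat, g_star = (eh > 0.5), (es > 0.5)

    # Excess classification risk via formula (4): E[1(g_hat != g*) |1 - 2 eta*|]
    excess_class = np.mean((g_hat != g_star) * np.abs(1 - 2 * es))
    excess_reg = np.mean((eh - es) ** 2)
    print(f"classification excess risk: {excess_class:.4f}")
    print(f"bound 2*sqrt(reg excess)  : {2 * np.sqrt(excess_reg):.4f}")

As the proposition predicts, the first printed value stays below the second; the bound is typically loose, which reflects the square root on the right-hand side.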