High-Dimensional Statistics (Statistique en grande dimension)


 Johnathan Fleming
 10 months ago
Lecturer: Dalalyan A. Scribe: Thomas F.X.
First lecture

1 Introduction

1.1 Classical statistics

Parametric statistics: $Z_1, \dots, Z_n$ are iid with common distribution $P_\theta$, and we assume that $\theta \in \Theta \subset \mathbb{R}^d$.

Known: $Z_1, \dots, Z_n$ and $\Theta$.
Unknown: $\theta$ (or, equivalently, $P_\theta$).

Important assumption: $d$ is fixed and $n \to +\infty$.

In this setting the maximum likelihood estimator (MLE) is known to be consistent and asymptotically the most efficient: as $n \to +\infty$, $\hat\theta^{\mathrm{ML}}$ satisfies

$$E_P\big[\|\hat\theta^{\mathrm{ML}} - \theta\|^2\big] = \frac{C}{n}\,\big(1 + o(1)\big).$$

The quadratic risk therefore decays like $1/n$; this is the parametric rate.

Observation. If $d = d_n$ with $\lim_{n \to +\infty} d_n = +\infty$, then the whole parametric theory becomes unusable. Moreover, the MLE is no longer the best estimator!

1.2 Nonparametric statistics

We observe $Z_1, \dots, Z_n$ iid with unknown distribution $P$, such that $P \in \{P_\theta, \theta \in \Theta\}$, but where $\Theta$ is either infinite-dimensional or of finite dimension $d = d_n$ that tends to $+\infty$ with the sample size.

Examples:

$$\Theta = \{f : [0,1] \to \mathbb{R},\ f \text{ Lipschitz with constant } L\} = \{f : [0,1] \to \mathbb{R},\ \forall x, y,\ |f(x) - f(y)| \le L|x - y|\}$$

$$\Theta = \Big\{\theta = (\theta_1, \theta_2, \dots),\ \sum_{j=1}^{\infty} \theta_j^2 < +\infty\Big\} = \ell_2$$

General approach: we approximate $\Theta$ by an increasing sequence $\Theta_k$ of subsets of $\Theta$ such that $\Theta_k$ has dimension $d_k$. Proceeding as if $\theta$ belonged to $\Theta_k$ (which is not necessarily the case), we use a parametric method to define an estimator $\hat\theta_k$ of $\theta$. This yields a family of estimators $\{\hat\theta_k\}$.

Main question. How should $k$ be chosen to minimize the risk of $\hat\theta_k$? If $k$ is too small, we face underfitting. Conversely, if $k$ is too large, we face overfitting.

1.3 Principal models in nonparametric statistics

Density model. We have $X_1, \dots, X_n$ iid with a density $f$ defined on $\mathbb{R}^p$, so that

$$P(X_1 \in A) = \int_A f(x)\,dx.$$

The assumptions imposed on $f$ are very weak compared with the parametric setting. For instance, a typical assumption in the parametric setting is that $f$ is the Gaussian density

$$f(x) = \frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big],$$

whereas a common assumption in the nonparametric framework is that $f$ is smooth, say, twice continuously differentiable with a bounded second derivative.

Regression model. We observe $Z_i = (X_i, Y_i)$, with input $X_i$, output $Y_i$ and error $\varepsilon_i$:

$$Y_i = f(X_i) + \varepsilon_i.$$

The function $f$ is called the regression function. Here, the goal is to estimate $f$ without assuming any parametric structure on it.

Practical examples. Marketing. Each $i$ represents a consumer and $X_i$ gathers the features of that consumer. A typical question is how to identify the relevant groups of consumers, and a typical answer is to use a clustering algorithm. We assume that $X_1, \dots, X_n$ are iid with density $f$. We then estimate $f$ in a nonparametric manner by $\hat f$. The clusters are defined as regions around the local maxima of $\hat f$.

1.4 Machine Learning

Machine learning is essentially the same as nonparametric statistics. The main focus is on the algorithms (rather than on the models), their statistical performance and their computational complexity.
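The clustering recipe from the marketing example (estimate the density, then look for its local maxima) can be sketched in one dimension as follows. This is only an illustration under assumptions that are not in the notes: a Gaussian kernel density estimator stands in for $\hat f$, the bandwidth `0.5`, the simulated two-group sample and the grid search for local maxima are all hypothetical choices.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return x -> f_hat(x), a kernel density estimate with a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return f_hat


def local_maxima(f, grid):
    """Grid points where f is strictly larger than at both neighbours."""
    values = [f(x) for x in grid]
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


# Hypothetical data: two groups of "consumers" described by one feature.
random.seed(0)
sample = [random.gauss(-2.0, 0.5) for _ in range(200)]
sample += [random.gauss(2.0, 0.5) for _ in range(200)]

f_hat = gaussian_kde(sample, bandwidth=0.5)   # bandwidth chosen by hand
grid = [-5.0 + 0.05 * i for i in range(201)]  # grid on [-5, 5]
modes = local_maxima(f_hat, grid)             # candidate cluster centres
print(modes)
```

The bandwidth plays the role of the index $k$ in the general approach: a very small bandwidth tends to produce many spurious modes (overfitting), a very large one merges the groups (underfitting).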
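2 Main concepts and notations

Observations: $Z_1, \dots, Z_n$ iid with distribution $P$.

Unsupervised learning: $Z_i = X_i$.
Supervised learning: $Z_i = (X_i, Y_i)$, where $X_i$ is an example (or feature) and $Y_i$ a label.

Aim. To learn the distribution $P$ or some of its properties.

Prediction. We assume that a new feature $X$ (drawn from the same probability distribution as $X_1, \dots, X_n$) is observed. The aim is to predict the label associated with $X$.

To measure the quality of a prediction, we need a loss function $\ell(y, \tilde y)$, where $y$ is the true label and $\tilde y$ is the predicted label. In practice, both $y$ and $\tilde y$ are random variables; furthermore, $y$ and its distribution are unknown, so $\ell$ is hard to compute!

Risk function. This is the expectation of the loss.

Definition 1. Assume that $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ and that $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. A predictor, or prediction algorithm, is any mapping

$$\hat g : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}^{\mathcal{X}}.$$

The risk of a prediction function $g : \mathcal{X} \to \mathcal{Y}$ is

$$R_P[g] = E_P\big[\ell(Y, g(X))\big].$$

The risk of a predictor $\hat g$ is $R_P[\hat g]$, which is random since $\hat g$ depends on the data:

$$R_P[\hat g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, \hat g(x))\,dP(x, y).$$

Examples:

1. Binary classification: $\mathcal{Y} = \{0, 1\}$, with any $\mathcal{X}$, and
$$\ell(y, \tilde y) = \begin{cases} 0, & \text{if } y = \tilde y, \\ 1, & \text{otherwise} \end{cases} \;=\; \mathbf{1}(y \ne \tilde y) = (y - \tilde y)^2.$$
2. Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$, with any $\mathcal{X}$, and $\ell(y, \tilde y) = (y - \tilde y)^2$.

3 Excess risk and Bayes predictor

We have $Z_i = (X_i, Y_i)$ and

$$R_P[g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, g(x))\,P(dx, dy), \qquad P(dx, dy) = P_{Y|X}(dy \mid X = x)\,P_X(dx).$$

Definition 2. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the Bayes predictor, or oracle, is the prediction function minimizing the risk:

$$g^* \in \arg\min_{g \in \mathcal{Y}^{\mathcal{X}}} R_P[g].$$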
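To make the two loss functions concrete, here is a minimal sketch that evaluates them on a small labelled sample. The empirical average used here is an assumption added for illustration (the notes define only the population risk $R_P[g]$), and the data set and the rule `threshold_rule` are hypothetical.

```python
def zero_one_loss(y, y_pred):
    """0-1 loss: 0 if the prediction equals the label, 1 otherwise."""
    return 0.0 if y == y_pred else 1.0


def squared_loss(y, y_pred):
    """Least-squares loss (y - y_pred)^2."""
    return (y - y_pred) ** 2


def empirical_risk(g, loss, data):
    """Average loss of the prediction function g over (x, y) pairs.

    This sample average is only a stand-in for the risk R_P[g],
    which is an expectation under the unknown distribution P.
    """
    return sum(loss(y, g(x)) for x, y in data) / len(data)


def threshold_rule(x):
    """Hypothetical prediction function: predict 1 when x > 1/2."""
    return 1 if x > 0.5 else 0


# Hypothetical sample of (feature, label) pairs with labels in {0, 1}.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]

risk_01 = empirical_risk(threshold_rule, zero_one_loss, data)
risk_sq = empirical_risk(threshold_rule, squared_loss, data)
print(risk_01, risk_sq)
```

Because labels and predictions both take values in $\{0, 1\}$, the two averages coincide, which mirrors the identity $\mathbf{1}(y \ne \tilde y) = (y - \tilde y)^2$ stated in the binary classification example.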
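Remark. In practice, $g^*$ is unavailable, since it depends on $P$, which is unknown. The ultimate goal is to do almost as well as the oracle. A predictor $\hat g_n$ will be considered a good one if

$$\lim_{n \to +\infty} \big(R_P[\hat g_n] - R_P[g^*]\big) = 0,$$

where the difference $R_P[\hat g_n] - R_P[g^*]$ is called the excess risk.

Definition 3. We say that the predictor $\hat g_n$ is consistent (universally consistent) if, for every $P$, we have

$$\lim_{n \to +\infty} E_P\big[R_P[\hat g_n]\big] - R_P[g^*] = 0.$$

Theorem 1.

1. Suppose that, for every $x \in \mathcal{X}$, the infimum of $y \mapsto E_P[\ell(Y, y) \mid X = x]$ is reached. Then the function $g^*$ defined by
$$g^*(x) \in \arg\min_{y \in \mathcal{Y}} E_P\big[\ell(Y, y) \mid X = x\big]$$
is a Bayes predictor.

2. In the case of binary classification, $\mathcal{Y} = \{0, 1\}$ and $\ell(y, \tilde y) = \mathbf{1}(y \ne \tilde y)$, we have
$$g^*(x) = \mathbf{1}\big(\eta^*(x) > \tfrac{1}{2}\big), \quad \text{where } \eta^*(x) = P[Y = 1 \mid X = x].$$
Furthermore, the excess risk can be computed by
$$R_P[g] - R_P[g^*] = E_P\big[(g(X) - g^*(X))(1 - 2\eta^*(X))\big]. \qquad (4)$$

3. In the case of least-squares regression,
$$g^*(x) = \eta^*(x), \quad \text{where } \eta^*(x) = E_P[Y \mid X = x].$$
Furthermore, for any $\eta : \mathcal{X} \to \mathcal{Y}$, we have
$$R_P[\eta] - R_P[\eta^*] = E_P\big[(\eta(X) - \eta^*(X))^2\big].$$

Proof.

1. Let $g \in \mathcal{Y}^{\mathcal{X}}$ and let $g^*(x) \in \arg\min_{y \in \mathcal{Y}} E_P[\ell(Y, y) \mid X = x]$. We have
$$R_P[g] = E_P\big[\ell(Y, g(X))\big] = \int_{\mathcal{X}} E_P\big[\ell(Y, g(x)) \mid X = x\big]\,P_X(dx) \ge \int_{\mathcal{X}} E_P\big[\ell(Y, g^*(x)) \mid X = x\big]\,P_X(dx) = R_P[g^*].$$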
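2. Using the first assertion,
$$g^*(x) \in \arg\min_{y \in \{0,1\}} E_P\big[\mathbf{1}(Y \ne y) \mid X = x\big] = \arg\min_{y \in \{0,1\}} P(Y \ne y \mid X = x) = \arg\max_{y \in \{0,1\}} P(Y = y \mid X = x) = \arg\max_{y \in \{0,1\}} \big\{\eta^*(x)\mathbf{1}(y = 1) + (1 - \eta^*(x))\mathbf{1}(y = 0)\big\}.$$
Therefore,
$$g^*(x) = \begin{cases} 0, & \text{if } P(Y = 1 \mid X = x) \le \tfrac{1}{2}, \\ 1, & \text{otherwise.} \end{cases}$$
To check (4), it suffices to remark that, since $g(X)$ and $Y$ take values in $\{0, 1\}$ (so that $g(X)^2 = g(X)$ and $Y^2 = Y$),
$$\begin{aligned}
R_P[g] &= E_P\big[(g(X) - Y)^2\big] \\
&= E_P[g(X)^2] + E_P[Y^2] - 2E_P[Y g(X)] \\
&= E_P[g(X)] + E_P[Y] - 2E_P\big[E_P(Y g(X) \mid X)\big] \\
&= E_P[g(X)] + E_P[Y] - 2E_P\big[g(X) E_P(Y \mid X)\big] \\
&= E_P[g(X)] + E_P[Y] - 2E_P\big[g(X)\eta^*(X)\big] \\
&= E_P\big[g(X)(1 - 2\eta^*(X))\big] + E_P[Y].
\end{aligned}$$
Writing the same identity for $g^*$ and taking the difference of the two identities, we get the desired result.

3. In view of the first assertion of the theorem, we have
$$g^*(x) \in \arg\min_{y \in \mathbb{R}} E_P\big[(Y - y)^2 \mid X = x\big] = \arg\min_{y \in \mathbb{R}} \varphi(y),$$
where $\varphi(y) = E_P[Y^2 \mid X = x] - 2y\,E_P[Y \mid X = x] + y^2$ is a second-order polynomial. The minimization of such a polynomial is straightforward and leads to
$$\arg\min_{y \in \mathbb{R}} \varphi(y) = E_P[Y \mid X = x].$$
This shows that the Bayes predictor is equal to the regression function $\eta^*(x)$. The risk of an arbitrary prediction function $\eta$ is
$$\begin{aligned}
R_P[\eta] &= E_P\big[(Y - \eta(X))^2\big] = E_P\Big[E_P\big[(Y - \eta(X))^2 \mid X\big]\Big] \\
&= E_P\Big[E_P\big[(Y - \eta^*(X))^2 \mid X\big] + 2E_P\big[(Y - \eta^*(X))(\eta^* - \eta)(X) \mid X\big] + (\eta^* - \eta)^2(X)\Big] \\
&= R_P[\eta^*] + E_P\big[(\eta^* - \eta)^2(X)\big],
\end{aligned}$$
where the cross-product term vanishes since
$$E_P\big[(Y - \eta^*(X))(\eta^* - \eta)(X) \mid X\big] = (\eta^* - \eta)(X)\,E_P\big[Y - \eta^*(X) \mid X\big] = 0.$$
This completes the proof of the theorem.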
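Both excess-risk identities of the theorem can be checked exactly on a small finite example. The sketch below is illustrative only: the three-point feature space, the marginal `p_x`, the conditional probabilities `eta_star`, the conditional laws `cond_y` and the candidate `eta` are all hypothetical values chosen for the check, and `m_star` is simply the name used here for the regression function $E_P[Y \mid X = x]$ of the second example.

```python
from itertools import product

# Hypothetical finite feature space with marginal distribution P_X.
p_x = {"a": 0.2, "b": 0.5, "c": 0.3}

# --- Binary classification: eta_star[x] = P(Y = 1 | X = x) ---
eta_star = {"a": 0.9, "b": 0.4, "c": 0.5}
g_star = {x: int(eta_star[x] > 0.5) for x in p_x}  # Bayes classifier


def class_risk(g):
    """P(Y != g(X)) = sum_x P_X(x) * P(Y != g(x) | X = x)."""
    return sum(
        p_x[x] * (eta_star[x] if g[x] == 0 else 1.0 - eta_star[x])
        for x in p_x
    )


# Compare both sides of identity (4) for every classifier g: X -> {0, 1}.
class_checks = []
for values in product([0, 1], repeat=len(p_x)):
    g = dict(zip(p_x, values))
    excess = class_risk(g) - class_risk(g_star)
    formula = sum(
        p_x[x] * (g[x] - g_star[x]) * (1.0 - 2.0 * eta_star[x]) for x in p_x
    )
    class_checks.append((excess, formula))

# --- Regression: conditional law of Y given X = x as (value, prob) pairs ---
cond_y = {
    "a": [(0.0, 0.5), (2.0, 0.5)],
    "b": [(1.0, 0.25), (3.0, 0.75)],
    "c": [(-1.0, 1.0)],
}
m_star = {x: sum(v * q for v, q in cond_y[x]) for x in p_x}  # E[Y | X = x]


def reg_risk(eta):
    """E[(Y - eta(X))^2] computed exactly from p_x and cond_y."""
    return sum(
        p_x[x] * sum(q * (v - eta[x]) ** 2 for v, q in cond_y[x])
        for x in p_x
    )


eta = {"a": 0.3, "b": 2.0, "c": 0.0}  # arbitrary candidate function
reg_excess = reg_risk(eta) - reg_risk(m_star)
reg_formula = sum(p_x[x] * (eta[x] - m_star[x]) ** 2 for x in p_x)

print(class_checks)
print(reg_excess, reg_formula)
```

In each printed pair the two numbers agree up to floating-point rounding, and every classification excess risk is nonnegative, as expected for the oracle.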
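3.1 Link between binary classification and regression

Plug-in rule.

1. We start by estimating $\eta^*(x)$ by $\hat\eta_n(x)$.
2. We define $\hat g_n(x) = \mathbf{1}\big(\hat\eta_n(x) > \tfrac{1}{2}\big)$.

Question: how good is the plug-in rule $\hat g_n$?

Proposition 1. Let $\hat\eta$ be an estimator of the regression function $\eta^*$, and let $\hat g(x) = \mathbf{1}\big(\hat\eta(x) > \tfrac{1}{2}\big)$. Then we have

$$R_{\mathrm{class}}[\hat g] - R_{\mathrm{class}}[g^*] \le 2\sqrt{R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*]}.$$

Proof. Let $\eta : \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$ and $g(x) = \mathbf{1}\big(\eta(x) > \tfrac{1}{2}\big)$, and let us compute the excess risk of $g$. We have

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = E_P\big[(g(X) - g^*(X))(1 - 2\eta^*(X))\big].$$

Since $g$ and $g^*$ are both indicator functions and therefore take only the values 0 and 1, their difference is nonzero if and only if one of them is equal to 1 and the other one is equal to 0. This leads to

$$\begin{aligned}
R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] &= E_P\big[\mathbf{1}\big(\eta(X) \le \tfrac{1}{2} < \eta^*(X)\big)(2\eta^*(X) - 1)\big] + E_P\big[\mathbf{1}\big(\eta^*(X) \le \tfrac{1}{2} < \eta(X)\big)(1 - 2\eta^*(X))\big] \\
&= 2E_P\big[\mathbf{1}(D)\,\big|\eta^*(X) - \tfrac{1}{2}\big|\big],
\end{aligned}$$

where $D$ denotes the event that exactly one of $\eta(X)$ and $\eta^*(X)$ exceeds $\tfrac{1}{2}$, so that $\tfrac{1}{2}$ lies between them. If $\eta(X) \le \tfrac{1}{2}$ and $\eta^*(X) > \tfrac{1}{2}$, then $|\eta^*(X) - \tfrac{1}{2}| \le |\eta^*(X) - \eta(X)|$, and the same bound holds in the symmetric case. Thus

$$\begin{aligned}
R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] &\le 2E_P\big[\mathbf{1}(D)\,|\eta(X) - \eta^*(X)|\big] \\
&\le 2E_P\big[|\eta(X) - \eta^*(X)|\big] \\
&\le 2\sqrt{E_P\big[(\eta(X) - \eta^*(X))^2\big]} = 2\sqrt{R_{\mathrm{reg}}[\eta] - R_{\mathrm{reg}}[\eta^*]}.
\end{aligned}$$

Since this inequality is true for every deterministic $\eta$, we get the desired property.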
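The bound of the proposition can be probed numerically. The sketch below is a sanity check under assumptions, not part of the notes: the finite feature space, the marginal `p_x`, the true conditional probabilities `eta_star` and the randomly drawn candidates `eta_hat` are hypothetical, and the two excess risks are computed exactly through the identities of the theorem.

```python
import math
import random

# Hypothetical finite feature space: marginal P_X and eta*(x) = P(Y=1 | X=x).
p_x = [0.2, 0.5, 0.3]
eta_star = [0.9, 0.4, 0.55]


def plug_in(eta):
    """Plug-in rule: predict 1 where the regression estimate exceeds 1/2."""
    return [int(e > 0.5) for e in eta]


def class_excess(eta_hat):
    """Classification excess risk of the plug-in rule, via identity (4)."""
    g = plug_in(eta_hat)
    g_star = plug_in(eta_star)
    return sum(
        p * (gi - gs) * (1.0 - 2.0 * e)
        for p, gi, gs, e in zip(p_x, g, g_star, eta_star)
    )


def reg_excess(eta_hat):
    """Regression excess risk E[(eta_hat(X) - eta*(X))^2]."""
    return sum(p * (eh - e) ** 2 for p, eh, e in zip(p_x, eta_hat, eta_star))


random.seed(1)
trials = []
for _ in range(1000):
    # Arbitrary (random) candidate for the regression function.
    eta_hat = [random.random() for _ in p_x]
    trials.append((class_excess(eta_hat), 2.0 * math.sqrt(reg_excess(eta_hat))))

# Smallest slack between the bound and the classification excess risk.
smallest_slack = min(bound - excess for excess, bound in trials)
print(smallest_slack)
```

A nonnegative slack on every draw is what the proposition predicts; the check says nothing about how tight the bound is in general.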