Basics of Statistical Machine Learning



CS761 Spring 2013 Advanced Machine Learning
Lecturer: Xiaojin Zhu (jerryzhu@cs.wisc.edu)

Modern machine learning is rooted in statistics. You will find many familiar concepts here with a different name.

1 Parametric vs. Nonparametric Statistical Models

A statistical model H is a set of distributions. In machine learning, we call H the hypothesis space. A parametric model is one that can be parametrized by a finite number of parameters. We write the PDF f(x) = f(x; θ) to emphasize the parameter θ ∈ R^d. In general,

    H = { f(x; θ) : θ ∈ Θ ⊂ R^d },    (1)

where Θ is the parameter space. We will often use the notation

    E_θ(g) = ∫ g(x) f(x; θ) dx    (2)

to denote the expectation of a function g with respect to f(x; θ). Note that the subscript in E_θ does not mean integrating over all θ. This notation is standard but unfortunately confusing: it means "with respect to this fixed θ". In machine learning terms, it means "with respect to different training sets all sampled from this θ". We will see integration over all θ when discussing Bayesian methods.

Example 1. Consider the parametric model H = {N(µ, 1) : µ ∈ R}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is µ̂ = (1/n) Σ_i x_i.

All (parametric) models are wrong. Some are more useful than others.

A nonparametric model is one which cannot be parametrized by a fixed number of parameters.

Example 2. Consider the nonparametric model H = {F : Var_F(X) < ∞}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is again µ̂ = (1/n) Σ_i x_i.

Example 3. In a naive Bayes classifier we are interested in computing the conditional p(y | x; θ) ∝ p(y; θ) Π_{i=1}^d p(x_i | y; θ). Is this a parametric or a nonparametric model? The model is specified by H = {p(x, y; θ)}, where θ contains the parameters of the class prior multinomial distribution p(y) (a finite number of parameters) and of the class-conditional distributions p(x_i | y) for each dimension. The latter can be parametric (such as a multinomial over the vocabulary, or a Gaussian) or nonparametric (such as 1D kernel density estimation). Therefore, naive Bayes can be either parametric or nonparametric, although in practice the former is more common.

Should we prefer parametric or nonparametric models? A nonparametric model makes weaker assumptions and is therefore preferable in that respect. However, parametric models converge faster and are often more practical.

In machine learning we are often interested in a function of the distribution, T(F), for example the mean. We call T a statistical functional, viewing F, the distribution itself, as a function of x. However, we will also abuse notation and call θ = T(F) a "parameter" even for nonparametric models.
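As a concrete companion to Examples 1 and 2, here is a minimal Python sketch (the distributions, sample size, and seed are illustrative choices, not from the notes): the same estimator, the sample mean, is applied once under the parametric model H = {N(µ, 1) : µ ∈ R} and once under a nonparametric model that only assumes finite variance.

    # A minimal sketch, assuming mu_true = 2.0 and n = 1000 (illustrative values).
    import numpy as np

    rng = np.random.default_rng(0)
    mu_true, n = 2.0, 1000

    # Parametric setting (Example 1): the data really come from N(mu, 1).
    x_param = rng.normal(loc=mu_true, scale=1.0, size=n)

    # Nonparametric setting (Example 2): any F with finite variance; here a shifted
    # exponential with mean mu_true, chosen purely for illustration.
    x_nonparam = rng.exponential(scale=1.0, size=n) + (mu_true - 1.0)

    # The same estimator, the sample mean, is used in both cases.
    print(x_param.mean(), x_nonparam.mean())   # both close to mu_true = 2.0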

2 Estimation

Given X_1, ..., X_n ~ F ∈ H, an estimator θ̂_n is any function of X_1, ..., X_n that attempts to estimate a parameter θ. This is the "learning" in machine learning! A familiar case in machine learning is classification, where X_i = (x_i, y_i) and θ̂_n is the set of parameters of the classifier learned from such training data. Note the subscript n for the training set size. Clearly, θ̂_n is a random variable because the training set is random. Also note that the phrase "training set" is a misnomer because it is not a set: we allow multiple instances of the same element. Some people therefore prefer the term "training sample".

An estimator is consistent if

    θ̂_n → θ in probability.    (3)

Consistency is a fundamental, desirable property of good machine learning algorithms. Here the sequence of random variables is indexed by the training set size. Would you like a learning algorithm which gets worse with more training data?

Because θ̂_n is a random variable, we can talk about its expectation

    E_θ(θ̂_n),    (4)

where E_θ is taken with respect to the joint distribution f(x_1, ..., x_n; θ) = Π_{i=1}^n f(x_i; θ). The bias of the estimator is

    bias(θ̂_n) = E_θ(θ̂_n) − θ.    (5)

An estimator is unbiased if bias(θ̂_n) = 0. The standard error of an estimator is

    se(θ̂_n) = √Var_θ(θ̂_n).    (6)

Example 4. Don't confuse the standard error with the standard deviation of f. Let µ̂ = (1/n) Σ_i x_i, where x_i ~ N(0, 1). The standard deviation of x_i is 1 regardless of n. In contrast, se(µ̂) = 1/√n = n^(−1/2), which decreases with n.

The mean squared error of an estimator is

    mse(θ̂_n) = E_θ((θ̂_n − θ)^2).    (7)

Theorem 1. mse(θ̂_n) = bias^2(θ̂_n) + se^2(θ̂_n) = bias^2(θ̂_n) + Var_θ(θ̂_n).

If bias(θ̂_n) → 0 and Var_θ(θ̂_n) → 0, then mse(θ̂_n) → 0. This implies θ̂_n → θ in L^2, and hence θ̂_n → θ in probability, so θ̂_n is consistent. Why are we interested in the mse? Normally we aren't; we will see other quality measures later.
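A quick simulation sketch of Example 4 and Theorem 1 follows (the sample size n, the number of replicated training sets, and the seed are illustrative choices, not from the notes): it draws many training sets of size n, computes µ̂ on each, and checks that the standard error is close to 1/√n and that mse ≈ bias^2 + se^2.

    # A small simulation sketch of Example 4 and Theorem 1; n, reps, and the seed
    # are illustrative choices, not from the notes.
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 100, 20000
    theta_true = 0.0                          # x_i ~ N(0, 1), so the true mean is 0

    # Draw `reps` independent training sets of size n; compute mu_hat on each one.
    mu_hats = rng.normal(loc=theta_true, scale=1.0, size=(reps, n)).mean(axis=1)

    bias = mu_hats.mean() - theta_true        # close to 0: the sample mean is unbiased
    se = mu_hats.std(ddof=1)                  # close to 1/sqrt(n) = 0.1 (Example 4)
    mse = np.mean((mu_hats - theta_true) ** 2)

    print(bias, se, mse, bias**2 + se**2)     # mse is approximately bias^2 + se^2 (Theorem 1)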

3 Maximum Likelihood

For parametric statistical models, a common estimator is the maximum likelihood estimator. Let x_1, ..., x_n be iid with PDF f(x; θ), where θ ∈ Θ. The likelihood function is

    L_n(θ) = f(x_1, ..., x_n; θ) = Π_{i=1}^n f(x_i; θ).    (8)

The log likelihood function is ℓ_n(θ) = log L_n(θ). The maximum likelihood estimator (MLE) is

    θ̂_n = argmax_{θ ∈ Θ} L_n(θ) = argmax_{θ ∈ Θ} ℓ_n(θ).    (9)

Example 5. The MLE for p(head) from n coin flips is count(head)/n, sometimes called estimating the probability by the frequency. This is also true for multinomials. The MLE for X_1, ..., X_n ~ N(µ, σ^2) is µ̂ = (1/n) Σ_i X_i and σ̂^2 = (1/n) Σ_i (X_i − µ̂)^2. These agree with our intuition. However, the MLE does not always agree with intuition. For example, the MLE for X_1, ..., X_n ~ uniform(0, θ) is θ̂ = max(X_1, ..., X_n). You would think θ is larger, no?

The MLE has several nice properties. The Kullback-Leibler divergence between two PDFs is

    KL(f‖g) = ∫ f(x) log(f(x)/g(x)) dx.    (10)

The model H is identifiable if for all θ, ψ ∈ Θ, θ ≠ ψ implies KL(f(x; θ)‖f(x; ψ)) > 0. That is, different parameters correspond to different PDFs.

Theorem 2. When H is identifiable, under certain conditions (see Wasserman, Theorem 9.13) the MLE θ̂_n → θ* in probability, where θ* is the true value of the parameter θ. That is, the MLE is consistent.

Given n iid observations, the Fisher information is defined as

    I_n(θ) = n E_θ[(∂/∂θ ln f(x; θ))^2] = −n E_θ[∂^2/∂θ^2 ln f(x; θ)].    (11)

Example 6. Consider n iid observations x_i ∈ {0, 1} from a Bernoulli distribution with true parameter p, so f(x; p) = p^x (1 − p)^(1−x). It follows that ∂^2/∂θ^2 ln f(x; θ), evaluated at p, is −x/p^2 − (1 − x)/(1 − p)^2. Taking the expectation over x under f(x; p), negating, and multiplying by n, we arrive at I_n(p) = n/(p(1 − p)).

Informally, the Fisher information measures the curvature of the log likelihood function around θ. A sharp peak around θ means the true parameter is distinct and should be easier to learn from n samples. The Fisher information is sometimes used in active learning to select queries. Note that the Fisher information is not a random variable: it does not depend on the particular n items observed, only on the sample size n. Fisher information is not Shannon information.

Theorem 3 (Asymptotic Normality of the MLE). Let se = √Var_θ(θ̂_n). Under appropriate regularity conditions, se ≈ √(1/I_n(θ)), and

    (θ̂_n − θ) / se ⇝ N(0, 1),    (12)

where ⇝ denotes convergence in distribution. Furthermore, let ŝe = √(1/I_n(θ̂_n)). Then

    (θ̂_n − θ) / ŝe ⇝ N(0, 1).    (13)

This says that the MLE is asymptotically distributed as N(θ, 1/I_n(θ̂_n)). There is uncertainty, determined by both the sample size n and the Fisher information. It turns out that this uncertainty is fundamental: no (unbiased) estimator can do better. This is captured by the Cramér-Rao bound below. In other words, no (unbiased) machine learning algorithm can estimate the true parameter any better. Such information is very useful for designing machine learning algorithms.
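The following sketch illustrates Example 6 and Theorem 3 for the Bernoulli model (the true p, the sample size, the number of replications, and the seed are illustrative choices, not from the notes): over many simulated training sets, the standardized MLE (p̂ − p)/ŝe with ŝe = √(1/I_n(p̂)) should look approximately standard normal.

    # A sketch of Example 6 and Theorem 3 for the Bernoulli model; p_true, n, reps,
    # and the seed are illustrative choices, not from the notes.
    import numpy as np

    rng = np.random.default_rng(0)
    p_true, n, reps = 0.3, 500, 10000

    x = rng.binomial(1, p_true, size=(reps, n))
    p_hat = x.mean(axis=1)                    # MLE: count(head)/n, one per training set

    I_n = n / (p_hat * (1.0 - p_hat))         # plug-in Fisher information I_n(p_hat)
    se_hat = np.sqrt(1.0 / I_n)               # estimated standard error of the MLE
    z = (p_hat - p_true) / se_hat             # Theorem 3: approximately N(0, 1)

    print(z.mean(), z.std())                  # roughly 0 and 1 for large n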

Theorem 4 (Cramér-Rao Lower Bound). Let θ̂_n be any unbiased estimator (not necessarily the MLE) of θ. Then its variance is lower bounded by the inverse Fisher information:

    Var_θ(θ̂_n) ≥ 1/I_n(θ).    (14)

The Fisher information can be generalized to the high-dimensional case. Let θ be a parameter vector. The Fisher information matrix has (i, j)-th element

    I_ij(θ) = −E_θ[∂^2 ln f(x; θ) / (∂θ_i ∂θ_j)].    (15)

An unbiased estimator that achieves the Cramér-Rao lower bound is said to be efficient. It is asymptotically efficient if it achieves the bound as n → ∞.

Theorem 5. The MLE is asymptotically efficient.

However, a biased estimator can sometimes achieve lower mse.

4 Bayesian Inference

The statistical methods discussed so far are frequentist methods:

- Probability refers to limiting relative frequency.
- Data are random.
- Estimators are random because they are functions of data.
- Parameters are fixed, unknown constants not subject to probabilistic statements.
- Procedures are subject to probabilistic statements; for example, a 95% confidence interval traps the true parameter value 95% of the time.

Classifiers, even when learned with deterministic procedures, are random because the training set is random. The PAC bound is similarly frequentist. Most procedures in machine learning are frequentist methods.

An alternative is the Bayesian approach:

- Probability refers to degree of belief.
- Inference about a parameter θ is performed by producing a probability distribution on it.

Typically, one starts with a prior distribution p(θ). One also chooses a likelihood function p(x | θ); note that this is viewed as a function of θ, not x. After observing data x, one applies Bayes' theorem to obtain the posterior distribution p(θ | x):

    p(θ | x) = p(θ) p(x | θ) / ∫ p(θ') p(x | θ') dθ'  ∝  p(θ) p(x | θ),    (16)

where Z = ∫ p(θ') p(x | θ') dθ' is known as the normalizing constant. The posterior distribution is a complete characterization of the parameter. Sometimes one uses the mode of the posterior as a simple point estimate, known as the maximum a posteriori (MAP) estimate of the parameter:

    θ_MAP = argmax_θ p(θ | x).    (17)

Note that MAP is not a proper Bayesian approach.
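As a small illustration of Eqs. (16) and (17), the sketch below works through a Beta-Bernoulli model, the two-category special case of the Dirichlet-multinomial conjugate pair used in Example 7 below (the prior hyperparameters and the coin-flip counts are illustrative assumptions, not from the notes). Conjugacy makes the posterior available in closed form, so the normalizing constant of Eq. (16) never has to be computed explicitly and the MAP estimate can be written down directly.

    # A minimal sketch, assuming a Beta(a0, b0) prior and Bernoulli (coin flip) data;
    # the numbers below are illustrative, not from the notes.
    a0, b0 = 2.0, 2.0                    # prior p(theta) = Beta(a0, b0)
    heads, tails = 7, 3                  # observed data x

    # Conjugacy: posterior p(theta | x) = Beta(a0 + heads, b0 + tails).
    a_post, b_post = a0 + heads, b0 + tails

    # MAP estimate (Eq. 17): the mode of Beta(a, b) with a, b > 1 is (a-1)/(a+b-2).
    theta_map = (a_post - 1.0) / (a_post + b_post - 2.0)

    # For comparison: the posterior mean and the frequentist point estimate (MLE).
    theta_post_mean = a_post / (a_post + b_post)
    theta_mle = heads / (heads + tails)

    print(theta_map, theta_post_mean, theta_mle)   # 0.666..., 0.642..., 0.7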

Prediction under an unknown parameter is done by integrating it out:

    p(x | Data) = ∫ p(x | θ) p(θ | Data) dθ.    (18)

Here lies the major difference between the frequentist and Bayesian approaches in machine learning practice. A frequentist approach would produce a point estimate θ̂ from Data and predict with p(x | θ̂). In contrast, the Bayesian approach needs to integrate over different θ's. In general this integration is intractable, and hence Bayesian machine learning has focused on either finding special distributions for which the integration is tractable, or finding efficient approximations.

Example 7. Let θ be a d-dimensional multinomial parameter. Let the prior be a Dirichlet, p(θ) = Dir(α_1, ..., α_d). The likelihood is multinomial, p(x | θ) = Multi(x | θ), where x is a training count vector. These two distributions are called conjugate to each other because the posterior is again a Dirichlet: p(θ | x) = Dir(α_1 + x_1, ..., α_d + x_d). Now let's look at the predictive distribution for some test count vector x. If θ ~ Dir(β), the result of integrating θ out is

    p(x | β) = ∫ p(x | θ) p(θ | β) dθ    (19)
             = [(Σ_k x_k)! / Π_k (x_k!)] · [Γ(Σ_k β_k) / Γ(Σ_k (β_k + x_k))] · Π_k [Γ(β_k + x_k) / Γ(β_k)].    (20)

This is an example where the integration has a happy ending: it has a simple(?) closed form. This is known as the Dirichlet compound multinomial distribution, also known as the multivariate Pólya distribution.

Where does the prior p(θ) come from? Ideally it comes from domain knowledge. One major advantage of Bayesian approaches is the principled way to incorporate prior knowledge in the form of the prior. A non-informative, or flat, prior is used when there does not seem to be a reason to prefer any particular parameter. This may, however, create improper priors. For example, let X ~ N(θ, σ^2) with σ^2 known. A flat prior p(θ) = c > 0 would be improper because ∫ p(θ) dθ = ∞, so it is not a density. Nonetheless, the posterior distribution is well-defined. A flat prior is also not transformation invariant, whereas Jeffreys' prior, p(θ) ∝ I(θ)^(1/2), is. It should be pointed out that in practice the choice of prior is often dictated by computational convenience, in particular conjugacy.
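As a numerical sanity check on Eqs. (19) and (20), here is a short Python sketch (the hyperparameters β, the test count vector x, and the Monte Carlo sample size are illustrative choices, not from the notes). It evaluates the closed form of Eq. (20) in log space for numerical stability and compares it against a Monte Carlo approximation of the integral in Eq. (19).

    # A numerical sketch of Eqs. (19)-(20); beta, x, and the Monte Carlo sample size
    # are illustrative choices, not from the notes.
    import numpy as np
    from math import lgamma

    def log_dcm(x, beta):
        """Log of the Dirichlet compound multinomial p(x | beta), Eq. (20)."""
        x, beta = np.asarray(x, float), np.asarray(beta, float)
        log_coef = lgamma(x.sum() + 1.0) - sum(lgamma(xk + 1.0) for xk in x)
        log_ratio = lgamma(beta.sum()) - lgamma(beta.sum() + x.sum())
        log_prod = sum(lgamma(bk + xk) - lgamma(bk) for bk, xk in zip(beta, x))
        return log_coef + log_ratio + log_prod

    beta = [1.0, 2.0, 3.0]                    # Dirichlet hyperparameters (assumed)
    x = [2, 1, 3]                             # test count vector (assumed)
    closed_form = np.exp(log_dcm(x, beta))

    # Monte Carlo check of Eq. (19): average Multi(x | theta) over theta ~ Dir(beta).
    rng = np.random.default_rng(0)
    thetas = rng.dirichlet(beta, size=200000)
    log_coef = lgamma(sum(x) + 1.0) - sum(lgamma(xk + 1.0) for xk in x)
    monte_carlo = np.mean(np.exp(log_coef + (np.log(thetas) * np.array(x)).sum(axis=1)))

    print(closed_form, monte_carlo)           # the two values should be close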