Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood


Least Squares Max(min)imization. Function to minimize w.r.t. b_0, b_1: Q = Σ_{i=1}^n (Y_i − (b_0 + b_1 X_i))². Minimize this by maximizing −Q: find the partial derivatives and set both equal to zero, ∂Q/∂b_0 = 0 and ∂Q/∂b_1 = 0.

Normal Equations. The result of this minimization step is called the normal equations; b_0 and b_1 are called point estimators of β_0 and β_1 respectively: Σ Y_i = n b_0 + b_1 Σ X_i and Σ X_i Y_i = b_0 Σ X_i + b_1 Σ X_i². This is a system of two equations in two unknowns. The solution is given by...

Solution to Normal Equations. After a lot of algebra one arrives at b_1 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² and b_0 = Ȳ − b_1 X̄, where X̄ = Σ X_i / n and Ȳ = Σ Y_i / n.
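These closed-form estimators can be sketched in a few lines of Python (NumPy assumed; the example data, chosen to lie exactly on a line, are purely for illustration):

```python
import numpy as np

def least_squares_fit(x, y):
    """Closed-form least squares estimates (b0, b1) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    # b1 = sum((X_i - Xbar)(Y_i - Ybar)) / sum((X_i - Xbar)^2)
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # b0 = Ybar - b1 * Xbar
    b0 = ybar - b1 * xbar
    return b0, b1

# Data generated exactly on the line Y = 2 + 3X are recovered exactly
b0, b1 = least_squares_fit([1, 2, 3, 4], [5, 8, 11, 14])
# b0 -> 2.0, b1 -> 3.0
```

With noisy data the same formulas give the line that minimizes Q over all choices of (b_0, b_1).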

Least Squares Fit

Guess #1

Guess #2

Looking Ahead: Matrix Least Squares. Stack the observations as Y = Xb, where Y = (Y_1, Y_2, ..., Y_n)ᵀ, X is the n×2 matrix whose i-th row is (X_i, 1), and b = (b_1, b_0)ᵀ. The solution to this equation is the solution to least squares linear regression (and to maximum likelihood under the normal error distribution assumption).
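The matrix form can be solved directly with a linear algebra routine; a minimal sketch using NumPy's `np.linalg.lstsq` (the small data set is invented for illustration), checked against the closed-form solution:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 6.0, 8.5, 10.0])

# Design matrix with columns (X_i, 1), matching the ordering (b_1, b_0) above
X = np.column_stack([x, np.ones_like(x)])
b1, b0 = np.linalg.lstsq(X, y, rcond=None)[0]

# Same answer as the closed-form normal-equations solution
xbar, ybar = x.mean(), y.mean()
b1_cf = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0_cf = ybar - b1_cf * xbar
```

In practice the matrix route is preferred because it generalizes immediately to multiple predictors.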

Questions to Ask. Is the relationship really linear? What is the distribution of the errors? Is the fit good? How much of the variability of the response is accounted for by including the predictor variable? Is the chosen predictor variable the best one?

Is This Better?

Goals for First Half of Course. How to do linear regression. Familiarization with software tools. How to interpret standard linear regression results. How to derive tests. How to assess and address deficiencies in regression models.

Estimators for β_0, β_1, σ². We want to establish properties of the estimators for β_0, β_1, and σ² so that we can construct hypothesis tests and so forth. We will start by establishing some properties of the regression solution.

Properties of Solution. The i-th residual is defined to be e_i = Y_i − Ŷ_i. The sum of the residuals is zero: Σ_i e_i = Σ_i (Y_i − b_0 − b_1 X_i) = Σ Y_i − n b_0 − b_1 Σ X_i = 0, by the first normal equation.

Properties of Solution. The sum of the observed values Y_i equals the sum of the fitted values Ŷ_i: Σ_i Ŷ_i = Σ_i (b_0 + b_1 X_i) = Σ_i (Ȳ − b_1 X̄ + b_1 X_i) = nȲ − b_1 n X̄ + b_1 n X̄ = nȲ = Σ_i Y_i.

Properties of Solution. The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial: Σ_i X_i e_i = Σ_i X_i (Y_i − b_0 − b_1 X_i) = Σ X_i Y_i − b_0 Σ X_i − b_1 Σ X_i² = 0, by the second normal equation.

Properties of Solution. The regression line always goes through the point (X̄, Ȳ).
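All three properties are easy to verify numerically; a sketch on simulated data (the true line 1.5 + 0.8X and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=50)

# Closed-form least squares fit
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
e = y - (b0 + b1 * x)            # residuals e_i = Y_i - Yhat_i

sum_e = e.sum()                  # ~0: residuals sum to zero
sum_xe = (x * e).sum()           # ~0: X-weighted residuals sum to zero
on_line = b0 + b1 * xbar - ybar  # ~0: line passes through (Xbar, Ybar)
```

All three quantities are zero up to floating-point round-off, regardless of how noisy the data are.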

Estimating Error Term Variance σ 2 Review estimation in non-regression setting. Show estimation results for regression setting.

Estimation Review. An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g. the sample mean Ȳ = (1/n) Σ_{i=1}^n Y_i.

Point Estimators and Bias. Point estimator: θ̂ = f({Y_1, ..., Y_n}). Unknown quantity / parameter: θ. Definition: the bias of an estimator is B(θ̂) = E{θ̂} − θ.

One Sample Example

Distribution of Estimator. If the estimator is a function of the samples and the distribution of the samples is known, then the distribution of the estimator can (often) be determined. Methods: distribution (CDF) functions, transformations, moment generating functions, Jacobians (change of variable).

Example. Samples from a Normal(µ, σ²) distribution: Y_i ~ Normal(µ, σ²). Estimate the population mean: θ = µ, θ̂ = Ȳ = (1/n) Σ_{i=1}^n Y_i.

Sampling Distribution of the Estimator. First moment: E{θ̂} = E{(1/n) Σ_{i=1}^n Y_i} = (1/n) Σ_{i=1}^n E{Y_i} = nµ/n = µ = θ. This is an example of an unbiased estimator: B(θ̂) = E{θ̂} − θ = 0.

Variance of Estimator. Definition: the variance of an estimator is σ²{θ̂} = E{(θ̂ − E{θ̂})²}. Remember: σ²{cY} = c² σ²{Y}, and σ²{Σ_{i=1}^n Y_i} = Σ_{i=1}^n σ²{Y_i}, but only if the Y_i are independent with finite variance.

Example Estimator Variance. For the normal mean estimator, σ²{θ̂} = σ²{(1/n) Σ_{i=1}^n Y_i} = (1/n²) Σ_{i=1}^n σ²{Y_i} = nσ²/n² = σ²/n. Note the assumptions (independence, finite variance).

Central Limit Theorem Review. Let Y_1, Y_2, ..., Y_n be iid random variables with E{Y_i} = µ and σ²{Y_i} = σ² < ∞. Define U_n = √n (Ȳ − µ) / σ, where Ȳ = (1/n) Σ_{i=1}^n Y_i. (1) Then the distribution function of U_n converges to a standard normal distribution function as n → ∞. Alternatively, P(a ≤ U_n ≤ b) → ∫_a^b (1/√(2π)) e^{−u²/2} du. (2)

Distribution of sample mean estimator
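The convergence can be seen by simulation with a deliberately non-normal population; a sketch using Exponential(1) data (an arbitrary choice, convenient because its mean and standard deviation are both 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 50_000
mu = sigma = 1.0   # Exponential(1) has mean 1 and standard deviation 1

# Standardize the sample mean of each of 50,000 samples of size n
ybar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
u = np.sqrt(n) * (ybar - mu) / sigma

# For a standard normal, P(-1.96 <= U <= 1.96) ~ 0.95
coverage = np.mean((u >= -1.96) & (u <= 1.96))
```

Even though the underlying data are strongly skewed, at n = 100 the standardized mean already matches the normal 95% interval closely.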

Bias Variance Trade-off. The mean squared error of an estimator is MSE(θ̂) = E{[θ̂ − θ]²}. It can be re-expressed as MSE(θ̂) = σ²{θ̂} + B(θ̂)².

MSE = VAR + BIAS² Proof.
MSE(θ̂) = E{(θ̂ − θ)²}
= E{([θ̂ − E{θ̂}] + [E{θ̂} − θ])²}
= E{[θ̂ − E{θ̂}]²} + 2 E{[E{θ̂} − θ][θ̂ − E{θ̂}]} + E{[E{θ̂} − θ]²}
= σ²{θ̂} + 2 [E{θ̂} − θ] E{θ̂ − E{θ̂}} + B(θ̂)²
= σ²{θ̂} + 2 [E{θ̂} − θ] · 0 + B(θ̂)²
= σ²{θ̂} + B(θ̂)²
The cross term vanishes because E{θ̂} − θ is a constant and E{θ̂ − E{θ̂}} = 0.

Trade-off. Think of variance as confidence and bias as correctness; the everyday intuitions (largely) apply. Sometimes choosing a biased estimator can result in an overall lower MSE if it exhibits lower variance. Bayesian methods (later in the course) specifically introduce bias.
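Both the decomposition and the trade-off can be checked by simulation; a sketch in which the sample mean is compared against a deliberately biased shrinkage estimator 0.5·Ȳ (the shrinkage factor, µ = 0.5, σ = 2, and n = 10 are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.5, 2.0, 10, 200_000

ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
shrunk = 0.5 * ybar   # deliberately biased: shrink the estimate toward 0

mse_unbiased = np.mean((ybar - mu) ** 2)    # ~ sigma^2/n = 0.40
mse_shrunk = np.mean((shrunk - mu) ** 2)    # ~ 0.25*(sigma^2/n) + 0.25**2 = 0.1625

# Decomposition check: MSE = variance + bias^2
var_s = shrunk.var()
bias_s = shrunk.mean() - mu
```

Here the shrinkage estimator's squared bias (0.0625) is more than paid for by its fourfold variance reduction, so its MSE is lower; with a larger true µ the bias term would dominate and the sample mean would win.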

Estimating Error Term Variance σ². Regression model: the variance of each observation Y_i is σ² (the same as for the error term ɛ_i). Each Y_i comes from a different probability distribution, with a different mean that depends on the level X_i. The deviation of an observation Y_i must therefore be calculated around its own estimated mean Ŷ_i.

s² estimator for σ². s² = MSE = SSE/(n − 2) = Σ(Y_i − Ŷ_i)²/(n − 2) = Σ e_i²/(n − 2). MSE is an unbiased estimator of σ²: E{MSE} = σ². The sum of squares SSE has n − 2 degrees of freedom associated with it. Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
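The unbiasedness of SSE/(n − 2), and the downward bias of the naive SSE/n, can be seen by repeating the regression many times; a sketch with arbitrary illustrative choices (true line 2 + 0.5X, σ = 1, n = 20):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 20, 20_000, 1.0
x = np.linspace(0.0, 10.0, n)

# 20,000 replicated regressions with the same design, fit in one vectorized pass
y = 2.0 + 0.5 * x + rng.normal(0.0, sigma, size=(reps, n))
xbar = x.mean()
ybar = y.mean(axis=1, keepdims=True)
b1 = ((x - xbar) * (y - ybar)).sum(axis=1) / np.sum((x - xbar) ** 2)
b0 = ybar[:, 0] - b1 * xbar
resid = y - (b0[:, None] + b1[:, None] * x)
sse = (resid ** 2).sum(axis=1)

mean_mse = np.mean(sse / (n - 2))   # ~ sigma^2 = 1.0 (unbiased)
mean_ml = np.mean(sse / n)          # ~ (n-2)/n * sigma^2 = 0.9 (biased low)
```

Averaged over the replications, SSE/(n − 2) centers on σ² while SSE/n falls short by exactly the factor (n − 2)/n.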

Normal Error Regression Model. No matter how the error terms ɛ_i are distributed, the least squares method provides unbiased point estimators of β_0 and β_1 that also have minimum variance among all unbiased linear estimators. To set up interval estimates and make tests, however, we need to specify the distribution of the ɛ_i. We will assume that the ɛ_i are normally distributed.

Normal Error Regression Model. Y_i = β_0 + β_1 X_i + ɛ_i, i = 1, ..., n, where Y_i is the value of the response variable in the i-th trial, β_0 and β_1 are parameters, X_i is a known constant (the value of the predictor variable in the i-th trial), and ɛ_i ~iid N(0, σ²). Note this is different: now we know the distribution of the errors.

Notational Convention. When you see ɛ_i ~iid N(0, σ²), it is read as: the ɛ_i are independently and identically distributed according to a normal distribution with mean 0 and variance σ². Examples: θ ~ Poisson(λ), z ~ G(θ).

Maximum Likelihood Principle The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.

Likelihood Function. If X_i ~ F(Θ), i = 1, ..., n, then the likelihood function is L({X_i}_{i=1}^n, Θ) = Π_{i=1}^n F(X_i; Θ).

Example, N(10, 3) Density, Single Obs.

Example, N(10, 3) Density, Single Obs. Again

Example, N(10, 3) Density, Multiple Obs.

Maximum Likelihood Estimation. The likelihood function L({X_i}_{i=1}^n, Θ) = Π_{i=1}^n F(X_i; Θ) can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters. To do this, find solutions (analytically or by following the gradient) to dL({X_i}_{i=1}^n, Θ)/dΘ = 0.

Important Trick. (Almost) never maximize the likelihood function itself; maximize the log likelihood function instead: log(L({X_i}_{i=1}^n, Θ)) = log(Π_{i=1}^n F(X_i; Θ)) = Σ_{i=1}^n log(F(X_i; Θ)). Quite often the log of the density is easier to work with mathematically.
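There is also a numerical reason for the trick: a product of many densities underflows double precision long before the sum of their logs misbehaves. A minimal sketch (the 2000 standard-normal observations are an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=2000)   # 2000 standard-normal observations

# N(0, 1) density evaluated at each observation
dens = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

raw = np.prod(dens)            # underflows to exactly 0.0 in double precision
log_l = np.sum(np.log(dens))   # finite: roughly -1.4 per observation
```

The raw likelihood is a product of 2000 numbers each well below 1, far smaller than the smallest representable double, while the log likelihood is an unremarkable finite negative number.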

ML Normal Regression. Likelihood function:
L(β_0, β_1, σ²) = Π_{i=1}^n (1/(2πσ²)^{1/2}) exp(−(1/(2σ²)) (Y_i − β_0 − β_1 X_i)²)
= (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)²),
which, if you maximize (how?) w.r.t. the parameters, gives you...

Maximum Likelihood Estimator(s). For β_0: b_0, the same as in the least squares case. For β_1: b_1, the same as in the least squares case. For σ²: σ̂² = Σ_i (Y_i − Ŷ_i)² / n. Note that the ML estimator σ̂² is biased, since s² is unbiased and s² = MSE = (n/(n − 2)) σ̂².
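A sketch confirming both claims numerically: the least squares coefficients maximize the normal log likelihood, and σ̂² differs from s² by exactly the factor n/(n − 2). The simulated data (true line 1 + 2X, σ = 1.5, n = 200) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0.0, 5.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)

def loglik(b0, b1, s2):
    """Normal-error regression log likelihood at (b0, b1, s2)."""
    resid = y - b0 - b1 * x
    return -n / 2.0 * np.log(2.0 * np.pi * s2) - np.sum(resid ** 2) / (2.0 * s2)

# Least squares point estimates (= ML estimates of beta_0, beta_1)
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

s2_ml = np.sum((y - b0 - b1 * x) ** 2) / n         # biased ML estimate
s2_unb = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)  # unbiased s^2 = MSE

# The least squares solution attains a higher log likelihood than perturbed ones
best = loglik(b0, b1, s2_ml)
worse = max(loglik(b0 + 0.2, b1, s2_ml), loglik(b0, b1 + 0.2, s2_ml))
```

Any perturbation of (b_0, b_1) strictly lowers the log likelihood, as the derivation predicts.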

Comments Least squares minimizes the squared error between the prediction and the true output The normal distribution is fully characterized by its first two central moments (mean and variance) Food for thought: What does the bias in the ML estimator of the error variance mean? And where does it come from?