Lecture 6: Regression Analysis
MIT 18.S096, Dr. Kempthorne, Fall 2013
Outline
1. Regression Analysis
Multiple Linear Regression: Setup

Data set: n cases, i = 1, 2, ..., n
- Response (dependent) variable: $y_i$, i = 1, 2, ..., n
- p explanatory (independent) variables: $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,p})^T$, i = 1, 2, ..., n

Goal of regression analysis: extract/exploit the relationship between $y_i$ and $\mathbf{x}_i$.

Examples:
- Prediction
- Causal inference
- Approximation
- Functional relationships
General Linear Model

For each case i, the conditional distribution $[y_i \mid \mathbf{x}_i]$ is given by

$$ y_i = \hat{y}_i + \epsilon_i, \qquad \hat{y}_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} $$

where
- $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ are the p regression parameters (constant over all cases), and
- $\epsilon_i$ is the residual (error) variable (varies over the cases).

Extensive breadth of possible models:
- Polynomial approximation: $x_{i,j} = (x_i)^j$, i.e., the explanatory variables are different powers of the same variable $x_i$.
- Fourier series: $x_{i,j} = \sin(j x_i)$ or $\cos(j x_i)$, i.e., the explanatory variables are different sine/cosine terms of a Fourier series expansion.
- Time series regressions: time is indexed by i, and the explanatory variables include lagged response values.

Note: linearity of $\hat{y}_i$ (in the regression parameters) is maintained even when the x's are nonlinear, as the simulation sketch below illustrates.
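As a concrete illustration (not part of the original slides; all names are illustrative), a minimal Python/NumPy sketch that simulates data from a polynomial regression, a model that is nonlinear in x but linear in the parameters β:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Polynomial design matrix: columns are x, x^2, x^3 of the same variable x.
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([x**j for j in range(1, p + 1)])

beta_true = np.array([2.0, -1.0, 0.5])   # regression parameters (constant over cases)
eps = rng.normal(0.0, 0.3, size=n)       # residual variables (vary over cases)
y = X @ beta_true + eps                  # y_i = yhat_i + eps_i
```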
Steps for Fitting a Model

(1) Propose a model in terms of:
- the response variable Y (specify the scale),
- the explanatory variables $X_1, X_2, \ldots, X_p$ (include different functions of the explanatory variables if appropriate), and
- assumptions about the distribution of ε over the cases.
(2) Specify/define a criterion for judging different estimators.
(3) Characterize the best estimator and apply it to the given data.
(4) Check the assumptions in (1).
(5) If necessary, modify the model and/or assumptions and go to (1).
Specifying Assumptions in (1) for the Residual Distribution

- Gauss-Markov: zero mean, constant variance, uncorrelated.
- Normal-linear models: the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ random variables.
- Generalized Gauss-Markov: zero mean and a general covariance matrix (possibly correlated, possibly heteroscedastic).
- Non-normal/non-Gaussian distributions (e.g., Laplace, Pareto; contaminated normal: some fraction $(1 - \delta)$ of the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ r.v.'s and the remaining fraction $\delta$ follows some contamination distribution).
Specifying the Estimator Criterion in (2)

- Least squares
- Maximum likelihood
- Robust (contamination-resistant)
- Bayes (assume the $\beta_j$ are r.v.'s with a known prior distribution)
- Accommodating incomplete/missing data

Case Analyses for (4): Checking Assumptions

- Residual analysis. The model errors $\epsilon_i$ are unobservable; the model residuals for the fitted regression parameters $\hat\beta_j$ are
$$ e_i = y_i - [\hat\beta_1 x_{i,1} + \hat\beta_2 x_{i,2} + \cdots + \hat\beta_p x_{i,p}] $$
- Influence diagnostics (identify cases which are highly influential)
- Outlier detection
Ordinary Least Squares
Ordinary Least Squares Estimates

Least squares criterion: for $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$, define

$$ Q(\beta) = \sum_{i=1}^n [y_i - \hat{y}_i]^2 = \sum_{i=1}^n [y_i - (\beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p})]^2 $$

The ordinary least-squares (OLS) estimate $\hat\beta$ minimizes $Q(\beta)$.

Matrix notation:

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} $$
Solving for the OLS Estimate $\hat\beta$

$$ \hat{\mathbf{y}} = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = \mathbf{X}\beta \quad\text{and}\quad Q(\beta) = \sum_{i=1}^n (y_i - \hat y_i)^2 = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) $$

The OLS estimate $\hat\beta$ solves $\partial Q(\beta)/\partial\beta_j = 0$, j = 1, 2, ..., p:

$$ \frac{\partial Q(\beta)}{\partial \beta_j} = \frac{\partial}{\partial \beta_j}\Big( \sum_{i=1}^n [y_i - (x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,p}\beta_p)]^2 \Big) = \sum_{i=1}^n 2(-x_{i,j})[y_i - (x_{i,1}\beta_1 + \cdots + x_{i,p}\beta_p)] = -2\,\mathbf{X}_{[j]}^T(\mathbf{y} - \mathbf{X}\beta) $$

where $\mathbf{X}_{[j]}$ is the jth column of X.
$$ \frac{\partial Q}{\partial \beta} = \begin{pmatrix} \partial Q/\partial\beta_1 \\ \partial Q/\partial\beta_2 \\ \vdots \\ \partial Q/\partial\beta_p \end{pmatrix} = -2 \begin{pmatrix} \mathbf{X}_{[1]}^T(\mathbf{y} - \mathbf{X}\beta) \\ \mathbf{X}_{[2]}^T(\mathbf{y} - \mathbf{X}\beta) \\ \vdots \\ \mathbf{X}_{[p]}^T(\mathbf{y} - \mathbf{X}\beta) \end{pmatrix} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) $$

So the OLS estimate $\hat\beta$ solves the Normal Equations:

$$ \mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat\beta) = 0 \;\iff\; \mathbf{X}^T\mathbf{X}\hat\beta = \mathbf{X}^T\mathbf{y} \;\implies\; \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$

N.B. For $\hat\beta$ to exist (uniquely), $(\mathbf{X}^T\mathbf{X})$ must be invertible, i.e., X must have full column rank.
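A minimal numerical sketch of the normal-equations solution, continuing the simulated data above (in practice one solves the linear system rather than forming the explicit inverse):

```python
# Check the full-column-rank condition required for uniqueness.
assert np.linalg.matrix_rank(X) == X.shape[1]

# Solve X^T X beta_hat = X^T y directly, avoiding an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```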
(Ordinary) Least Squares Fit

OLS estimate: $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p)^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Fitted values:

$$ \hat{\mathbf{y}} = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = \begin{pmatrix} x_{1,1}\hat\beta_1 + \cdots + x_{1,p}\hat\beta_p \\ x_{2,1}\hat\beta_1 + \cdots + x_{2,p}\hat\beta_p \\ \vdots \\ x_{n,1}\hat\beta_1 + \cdots + x_{n,p}\hat\beta_p \end{pmatrix} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y} $$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the (n × n) Hat Matrix.
The Hat Matrix H projects $R^n$ onto the column space of X.

Residuals: $\hat\epsilon_i = y_i - \hat y_i$, i = 1, 2, ..., n:

$$ \hat\epsilon = \begin{pmatrix} \hat\epsilon_1 \\ \hat\epsilon_2 \\ \vdots \\ \hat\epsilon_n \end{pmatrix} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{H})\mathbf{y} $$

Normal equations: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat\beta) = \mathbf{X}^T\hat\epsilon = \mathbf{0}_p$

N.B. The least-squares residual vector $\hat\epsilon$ is orthogonal to the column space of X.
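Both the projection property of H and the orthogonality of the residuals can be checked numerically; a sketch continuing the example above (H is formed explicitly only for illustration, which is wasteful for large n):

```python
# Hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
resid = y - y_hat

assert np.allclose(H @ H, H)        # H is idempotent: a projection
assert np.allclose(X.T @ resid, 0)  # residuals orthogonal to the column space of X
```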
The Gauss-Markov Theorem
The Gauss-Markov Theorem: Assumptions

Data

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \quad\text{and}\quad \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} $$

follow a linear model satisfying the Gauss-Markov assumptions if y is an observation of the random vector $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)^T$ such that
- $E(\mathbf{Y} \mid \mathbf{X}, \beta) = \mathbf{X}\beta$, where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the p-vector of regression parameters, and
- $Cov(\mathbf{Y} \mid \mathbf{X}, \beta) = \sigma^2\mathbf{I}_n$, for some $\sigma^2 > 0$.

I.e., the random variables generating the observations are uncorrelated and have constant variance $\sigma^2$ (conditional on X and β).
For known constants $c_1, c_2, \ldots, c_p, c_{p+1}$, consider the problem of estimating

$$ \theta = c_1\beta_1 + c_2\beta_2 + \cdots + c_p\beta_p + c_{p+1} $$

Under the Gauss-Markov assumptions, the estimator

$$ \hat\theta = c_1\hat\beta_1 + c_2\hat\beta_2 + \cdots + c_p\hat\beta_p + c_{p+1} $$

where $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ are the least squares estimates, is:
1) an Unbiased Estimator of θ, and
2) a Linear Estimator of θ, that is, $\hat\theta = \sum_{i=1}^n b_i y_i$ for some known (given X) constants $b_i$.

Theorem: Under the Gauss-Markov assumptions, the estimator $\hat\theta$ has the smallest ("Best") variance among all Linear Unbiased Estimators of θ; i.e., $\hat\theta$ is BLUE.
The Gauss-Markov Theorem: Proof

Proof: Without loss of generality, assume $c_{p+1} = 0$ and define $\mathbf{c} = (c_1, c_2, \ldots, c_p)^T$. The least squares estimate of $\theta = \mathbf{c}^T\beta$ is

$$ \hat\theta = \mathbf{c}^T\hat\beta = \mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \equiv \mathbf{d}^T\mathbf{y} $$

a linear estimate in y given by the coefficient vector $\mathbf{d} = (d_1, d_2, \ldots, d_n)^T$.

Consider an alternative linear estimate of θ: $\tilde\theta = \mathbf{b}^T\mathbf{y}$ with fixed coefficients $\mathbf{b} = (b_1, \ldots, b_n)^T$. Define $\mathbf{f} = \mathbf{b} - \mathbf{d}$ and note that

$$ \tilde\theta = \mathbf{b}^T\mathbf{y} = (\mathbf{d} + \mathbf{f})^T\mathbf{y} = \hat\theta + \mathbf{f}^T\mathbf{y} $$

If $\tilde\theta$ is unbiased then, because $\hat\theta$ is unbiased,

$$ 0 = E(\mathbf{f}^T\mathbf{y}) = \mathbf{f}^T E(\mathbf{y}) = \mathbf{f}^T(\mathbf{X}\beta) \quad\text{for all } \beta \in R^p $$

⟹ f is orthogonal to the column space of X
⟹ f is orthogonal to $\mathbf{d} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}$.
If $\tilde\theta$ is unbiased, then the orthogonality of f to d implies

$$\begin{aligned} Var(\tilde\theta) &= Var(\mathbf{b}^T\mathbf{y}) = Var(\mathbf{d}^T\mathbf{y} + \mathbf{f}^T\mathbf{y}) \\ &= Var(\mathbf{d}^T\mathbf{y}) + Var(\mathbf{f}^T\mathbf{y}) + 2\,Cov(\mathbf{d}^T\mathbf{y}, \mathbf{f}^T\mathbf{y}) \\ &= Var(\hat\theta) + Var(\mathbf{f}^T\mathbf{y}) + 2\,\mathbf{d}^T Cov(\mathbf{y})\,\mathbf{f} \\ &= Var(\hat\theta) + Var(\mathbf{f}^T\mathbf{y}) + 2\,\mathbf{d}^T(\sigma^2\mathbf{I}_n)\mathbf{f} \\ &= Var(\hat\theta) + Var(\mathbf{f}^T\mathbf{y}) + 2\sigma^2\,\mathbf{d}^T\mathbf{f} \\ &= Var(\hat\theta) + Var(\mathbf{f}^T\mathbf{y}) + 2\sigma^2 \cdot 0 \\ &\ge Var(\hat\theta) \end{aligned}$$
Generalized Least Squares
Generalized Least Squares (GLS) Estimates

Consider generalizing the Gauss-Markov assumptions for the linear regression model to $\mathbf{Y} = \mathbf{X}\beta + \epsilon$, where the random n-vector ε satisfies $E[\epsilon] = \mathbf{0}_n$ and $E[\epsilon\epsilon^T] = \sigma^2\Sigma$, with
- $\sigma^2$ an unknown scale parameter, and
- Σ a known (n × n) positive definite matrix specifying the relative variances and correlations of the component observations.

Transform the data (Y, X) to $\mathbf{Y}^* = \Sigma^{-1/2}\mathbf{Y}$ and $\mathbf{X}^* = \Sigma^{-1/2}\mathbf{X}$; the model becomes

$$ \mathbf{Y}^* = \mathbf{X}^*\beta + \epsilon^*, \quad\text{where } E[\epsilon^*] = \mathbf{0}_n \text{ and } E[\epsilon^*(\epsilon^*)^T] = \sigma^2\mathbf{I}_n $$

By the Gauss-Markov Theorem, the BLUE ("GLS" estimate) of β is

$$ \hat\beta = [(\mathbf{X}^*)^T\mathbf{X}^*]^{-1}(\mathbf{X}^*)^T\mathbf{Y}^* = [\mathbf{X}^T\Sigma^{-1}\mathbf{X}]^{-1}\mathbf{X}^T\Sigma^{-1}\mathbf{Y} $$
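A sketch of the GLS computation, whitening with a Cholesky factor $\Sigma = LL^T$ so that $L^{-1}$ plays the role of $\Sigma^{-1/2}$ (the AR(1)-style Σ below is an illustrative assumption, not from the slides):

```python
# Illustrative known covariance structure: Sigma_{i,j} = rho^{|i-j|}.
rho = 0.5
idx = np.arange(n)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))

# Whiten the data: with Sigma = L L^T, applying L^{-1} gives errors with
# covariance sigma^2 I_n, since L^{-1} Sigma L^{-T} = I_n.
L = np.linalg.cholesky(Sigma)
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)

# OLS on the transformed data is the GLS estimate for the original data.
beta_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
```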
Normal Regression Models: Distribution Theory
Normal Linear Regression Models

$$ Y_i = x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,p}\beta_p + \epsilon_i = \mu_i + \epsilon_i $$

Assume $\{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}$ are i.i.d. $N(0, \sigma^2)$. Then

$$ [Y_i \mid x_{i,1}, x_{i,2}, \ldots, x_{i,p}, \beta, \sigma^2] \sim N(\mu_i, \sigma^2), \quad\text{independent over } i = 1, 2, \ldots, n $$

Conditioning on X, β, and $\sigma^2$:

$$ \mathbf{Y} = \mathbf{X}\beta + \epsilon, \quad\text{where } \epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T \sim N_n(\mathbf{0}_n, \sigma^2\mathbf{I}_n) $$
Distribution Theory

$$ \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = E(\mathbf{Y} \mid \mathbf{X}, \beta, \sigma^2) = \mathbf{X}\beta $$
$$ \Sigma = Cov(\mathbf{Y} \mid \mathbf{X}, \beta, \sigma^2) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma^2 \end{pmatrix} = \sigma^2\mathbf{I}_n $$

That is, $\Sigma_{i,j} = Cov(Y_i, Y_j \mid \mathbf{X}, \beta, \sigma^2) = \sigma^2\delta_{i,j}$.

Apply moment-generating functions (MGFs) to derive:
- the joint distribution of $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)^T$, and
- the joint distribution of $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p)^T$.
MGF of Y

For the n-variate r.v. Y and a constant n-vector $\mathbf{t} = (t_1, \ldots, t_n)^T$,

$$\begin{aligned} M_{\mathbf{Y}}(\mathbf{t}) &= E(e^{\mathbf{t}^T\mathbf{Y}}) = E(e^{t_1Y_1 + t_2Y_2 + \cdots + t_nY_n}) \\ &= E(e^{t_1Y_1})\,E(e^{t_2Y_2})\cdots E(e^{t_nY_n}) \quad\text{(by independence)} \\ &= M_{Y_1}(t_1)\,M_{Y_2}(t_2)\cdots M_{Y_n}(t_n) \\ &= \prod_{i=1}^n e^{t_i\mu_i + \frac{1}{2}t_i^2\sigma^2} = e^{\sum_{i=1}^n t_i\mu_i + \frac{1}{2}\sum_{i,k=1}^n t_i\Sigma_{i,k}t_k} = e^{\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}} \end{aligned}$$

⟹ $\mathbf{Y} \sim N_n(\mu, \Sigma)$, multivariate normal with mean μ and covariance Σ.
MGF of $\hat\beta$

For the p-variate r.v. $\hat\beta$ and a constant p-vector $\tau = (\tau_1, \ldots, \tau_p)^T$,

$$ M_{\hat\beta}(\tau) = E(e^{\tau^T\hat\beta}) = E(e^{\tau_1\hat\beta_1 + \tau_2\hat\beta_2 + \cdots + \tau_p\hat\beta_p}) $$

Defining $A = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, we can express $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = A\mathbf{Y}$, so

$$ M_{\hat\beta}(\tau) = E(e^{\tau^T A\mathbf{Y}}) = E(e^{\mathbf{t}^T\mathbf{Y}}) = M_{\mathbf{Y}}(\mathbf{t}) = e^{\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}}, \quad\text{with } \mathbf{t} = A^T\tau $$
MGF of $\hat\beta$ (continued)

For $M_{\hat\beta}(\tau) = e^{\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}}$, plug in:

$$ \mathbf{t} = A^T\tau = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\tau, \qquad \mu = \mathbf{X}\beta, \qquad \Sigma = \sigma^2\mathbf{I}_n $$

This gives:

$$ \mathbf{t}^T\mu = \tau^T\beta \qquad\text{and}\qquad \mathbf{t}^T\Sigma\mathbf{t} = \tau^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T[\sigma^2\mathbf{I}_n]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\tau = \tau^T[\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}]\tau $$

So the MGF of $\hat\beta$ is

$$ M_{\hat\beta}(\tau) = e^{\tau^T\beta + \frac{1}{2}\tau^T[\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}]\tau} \implies \hat\beta \sim N_p(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}) $$
Marginal Distributions of the Least Squares Estimates

Because $\hat\beta \sim N_p(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$, the marginal distribution of each $\hat\beta_j$ is

$$ \hat\beta_j \sim N(\beta_j, \sigma^2 C_{j,j}) $$

where $C_{j,j}$ is the jth diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$.
The Q-R Decomposition of X

Consider expressing the (n × p) matrix X of explanatory variables as X = QR, where
- Q is an (n × p) orthonormal matrix, i.e., $Q^TQ = \mathbf{I}_p$, and
- R is a (p × p) upper-triangular matrix.

The columns of $Q = [Q_{[1]}, Q_{[2]}, \ldots, Q_{[p]}]$ can be constructed by performing the Gram-Schmidt orthonormalization procedure on the columns of $\mathbf{X} = [\mathbf{X}_{[1]}, \mathbf{X}_{[2]}, \ldots, \mathbf{X}_{[p]}]$.
If

$$ R = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,p-1} & r_{1,p} \\ 0 & r_{2,2} & \cdots & r_{2,p-1} & r_{2,p} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & r_{p-1,p-1} & r_{p-1,p} \\ 0 & 0 & \cdots & 0 & r_{p,p} \end{pmatrix} $$

then

$$ \mathbf{X}_{[1]} = Q_{[1]}r_{1,1} \implies r_{1,1}^2 = \mathbf{X}_{[1]}^T\mathbf{X}_{[1]} \text{ and } Q_{[1]} = \mathbf{X}_{[1]}/r_{1,1} $$

$$ \mathbf{X}_{[2]} = Q_{[1]}r_{1,2} + Q_{[2]}r_{2,2} \implies Q_{[1]}^T\mathbf{X}_{[2]} = Q_{[1]}^TQ_{[1]}r_{1,2} + Q_{[1]}^TQ_{[2]}r_{2,2} = 1\cdot r_{1,2} + 0\cdot r_{2,2} = r_{1,2} $$

(known since $Q_{[1]}$ is specified).
With $r_{1,2}$ and $Q_{[1]}$ specified, we can solve for $r_{2,2}$:

$$ Q_{[2]}r_{2,2} = \mathbf{X}_{[2]} - Q_{[1]}r_{1,2} $$

Take the squared norm of both sides:

$$ r_{2,2}^2 = \mathbf{X}_{[2]}^T\mathbf{X}_{[2]} - 2r_{1,2}Q_{[1]}^T\mathbf{X}_{[2]} + r_{1,2}^2 $$

(all terms on the RHS are known). With $r_{2,2}$ specified,

$$ Q_{[2]} = \frac{1}{r_{2,2}}\left[\mathbf{X}_{[2]} - r_{1,2}Q_{[1]}\right] $$

Etc. (solve successively for the elements of R and the columns of Q).
With the Q-R decomposition X = QR ($Q^TQ = \mathbf{I}_p$, and R is (p × p) upper-triangular):

- $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = R^{-1}Q^T\mathbf{y}$ (plug in X = QR and simplify)
- $Cov(\hat\beta) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2 R^{-1}(R^{-1})^T$
- $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = QQ^T$ (giving $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ and $\hat\epsilon = (\mathbf{I}_n - \mathbf{H})\mathbf{y}$)

A numerical sketch follows.
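This is also how OLS is typically computed in practice, since the Q-R route avoids forming $\mathbf{X}^T\mathbf{X}$ (which squares the condition number of the problem). A NumPy sketch, continuing the earlier example:

```python
# Thin QR: Q is (n x p) with orthonormal columns, R is (p x p) upper-triangular.
Q, R = np.linalg.qr(X)

# beta_hat = R^{-1} Q^T y; solve the triangular system instead of inverting R.
beta_qr = np.linalg.solve(R, Q.T @ y)

# Fitted values via the hat matrix H = Q Q^T, without ever forming H explicitly.
assert np.allclose(Q @ (Q.T @ y), X @ beta_qr)
```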
More Distribution Theory

Assume $\mathbf{y} = \mathbf{X}\beta + \epsilon$, where the $\{\epsilon_i\}$ are i.i.d. $N(0, \sigma^2)$; i.e., $\epsilon \sim N_n(\mathbf{0}_n, \sigma^2\mathbf{I}_n)$, or $\mathbf{y} \sim N_n(\mathbf{X}\beta, \sigma^2\mathbf{I}_n)$.

Theorem*: For any (m × n) matrix A of rank m ≤ n, the random normal vector y transformed by A, $\mathbf{z} = A\mathbf{y}$, is also a random normal vector:

$$ \mathbf{z} \sim N_m(\mu_z, \Sigma_z), \quad\text{where } \mu_z = A\,E(\mathbf{y}) = A\mathbf{X}\beta \text{ and } \Sigma_z = A\,Cov(\mathbf{y})\,A^T = \sigma^2 AA^T $$

Earlier, $A = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ yielded the distribution of $\hat\beta = A\mathbf{y}$. With a different definition of A (and z) we give an easy proof of the following theorem.
Theorem: For the normal linear regression model $\mathbf{y} = \mathbf{X}\beta + \epsilon$, where X (n × p) has rank p and $\epsilon \sim N_n(\mathbf{0}_n, \sigma^2\mathbf{I}_n)$:

(a) $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ and $\hat\epsilon = \mathbf{y} - \mathbf{X}\hat\beta$ are independent r.v.'s.
(b) $\hat\beta \sim N_p(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$.
(c) $\sum_{i=1}^n \hat\epsilon_i^2 = \hat\epsilon^T\hat\epsilon \sim \sigma^2\chi^2_{n-p}$ (a scaled chi-squared r.v.).
(d) For each j = 1, 2, ..., p,

$$ t_j = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{C_{j,j}}} \sim t_{n-p} \text{ (t distribution)}, \quad\text{where } \hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n \hat\epsilon_i^2 \text{ and } C_{j,j} = [(\mathbf{X}^T\mathbf{X})^{-1}]_{j,j} $$
Proof: Note that (d) follows immediately from (a), (b), (c).

Define $A = \begin{pmatrix} Q^T \\ W^T \end{pmatrix}$, where
- A is an (n × n) orthogonal matrix (i.e., $A^T = A^{-1}$), and
- Q is the column-orthonormal matrix in a Q-R decomposition of X.

Note: W can be constructed by continuing the Gram-Schmidt orthonormalization process (which was used to construct Q from X) with the augmented matrix $[\mathbf{X}, \mathbf{I}_n]$. Then consider

$$ \mathbf{z} = A\mathbf{y} = \begin{pmatrix} Q^T\mathbf{y} \\ W^T\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{z}_Q \\ \mathbf{z}_W \end{pmatrix}, \quad \mathbf{z}_Q\ (p \times 1),\ \mathbf{z}_W\ ((n-p) \times 1) $$
The distribution of $\mathbf{z} = A\mathbf{y}$ is $N_n(\mu_z, \Sigma_z)$, where

$$ \mu_z = A[\mathbf{X}\beta] = \begin{pmatrix} Q^T \\ W^T \end{pmatrix}[QR\beta] = \begin{pmatrix} Q^TQ \\ W^TQ \end{pmatrix}[R\beta] = \begin{pmatrix} \mathbf{I}_p \\ \mathbf{0}_{(n-p)\times p} \end{pmatrix}[R\beta] = \begin{pmatrix} R\beta \\ \mathbf{0}_{n-p} \end{pmatrix} $$

$$ \Sigma_z = A[\sigma^2\mathbf{I}_n]A^T = \sigma^2[AA^T] = \sigma^2\mathbf{I}_n \quad\text{(since } A^T = A^{-1}\text{)} $$
Thus

$$ \mathbf{z} = \begin{pmatrix} \mathbf{z}_Q \\ \mathbf{z}_W \end{pmatrix} \sim N_n\left(\begin{pmatrix} R\beta \\ \mathbf{0}_{n-p} \end{pmatrix}, \sigma^2\mathbf{I}_n\right) \implies \begin{cases} \mathbf{z}_Q \sim N_p(R\beta, \sigma^2\mathbf{I}_p) \\ \mathbf{z}_W \sim N_{n-p}(\mathbf{0}_{n-p}, \sigma^2\mathbf{I}_{n-p}) \end{cases} $$

and $\mathbf{z}_Q$ and $\mathbf{z}_W$ are independent. The Theorem follows by showing:

(a*) $\hat\beta = R^{-1}\mathbf{z}_Q$ and $\hat\epsilon = W\mathbf{z}_W$ (i.e., $\hat\beta$ and $\hat\epsilon$ are functions of different, independent vectors).
(b*) Deducing the distribution of $\hat\beta = R^{-1}\mathbf{z}_Q$ by applying Theorem* with $A = R^{-1}$ and $\mathbf{y} = \mathbf{z}_Q$.
(c*) $\hat\epsilon^T\hat\epsilon = \mathbf{z}_W^T\mathbf{z}_W$ = sum of (n - p) squared r.v.'s which are i.i.d. $N(0, \sigma^2)$, i.e., a scaled chi-squared r.v. $\sigma^2\chi^2_{n-p}$.
Proof of (a*): $\hat\beta = R^{-1}\mathbf{z}_Q$ follows from $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ and X = QR with $Q^TQ = \mathbf{I}_p$. For the residuals,

$$\begin{aligned} \hat\epsilon &= \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat\beta = \mathbf{y} - (QR)(R^{-1}\mathbf{z}_Q) = \mathbf{y} - Q\mathbf{z}_Q \\ &= \mathbf{y} - QQ^T\mathbf{y} = (\mathbf{I}_n - QQ^T)\mathbf{y} = WW^T\mathbf{y} \quad\text{(since } \mathbf{I}_n = A^TA = QQ^T + WW^T\text{)} \\ &= W\mathbf{z}_W \end{aligned}$$
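Parts (b)-(d) are what standard regression output is built from. A sketch of the coefficient t-statistics for testing $\beta_j = 0$, continuing the earlier simulated example (the scipy usage is illustrative):

```python
from scipy import stats

n_obs, p_dim = X.shape
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n_obs - p_dim)   # unbiased estimate of sigma^2
C = np.linalg.inv(X.T @ X)                     # C_{j,j} on the diagonal
se = np.sqrt(sigma2_hat * np.diag(C))          # standard errors of beta_hat_j

t_stats = beta_hat / se                        # t_j under H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n_obs - p_dim)
```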
Maximum Likelihood Estimation
Maximum-Likelihood Estimation

Consider the normal linear regression model: $\mathbf{y} = \mathbf{X}\beta + \epsilon$, where the $\{\epsilon_i\}$ are i.i.d. $N(0, \sigma^2)$, i.e., $\epsilon \sim N_n(\mathbf{0}_n, \sigma^2\mathbf{I}_n)$, or $\mathbf{y} \sim N_n(\mathbf{X}\beta, \sigma^2\mathbf{I}_n)$.

Definitions:
- The likelihood function is $L(\beta, \sigma^2) = p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2)$, where $p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2)$ is the joint probability density function (pdf) of the conditional distribution of y given the data X (known) and the parameters $(\beta, \sigma^2)$ (unknown).
- The maximum likelihood estimates of $(\beta, \sigma^2)$ are the values maximizing $L(\beta, \sigma^2)$, i.e., those which make the observed data y most likely in terms of its pdf.
Because the $y_i$ are independent r.v.'s with $y_i \sim N(\mu_i, \sigma^2)$, where $\mu_i = \sum_{j=1}^p \beta_j x_{i,j}$,

$$ L(\beta, \sigma^2) = \prod_{i=1}^n p(y_i \mid \beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\left(y_i - \sum_{j=1}^p \beta_j x_{i,j}\right)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2}(\mathbf{y}-\mathbf{X}\beta)^T(\sigma^2\mathbf{I}_n)^{-1}(\mathbf{y}-\mathbf{X}\beta)} $$

The maximum likelihood estimates $(\hat\beta, \hat\sigma^2)$ maximize the log-likelihood function (dropping constant terms):

$$ \log L(\beta, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}(\mathbf{y}-\mathbf{X}\beta)^T(\sigma^2\mathbf{I}_n)^{-1}(\mathbf{y}-\mathbf{X}\beta) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}Q(\beta) $$

where $Q(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$ is exactly the least-squares criterion! So the OLS estimate $\hat\beta$ is also the ML estimate.

The ML estimate of $\sigma^2$ solves $\partial\log L(\hat\beta, \sigma^2)/\partial(\sigma^2) = 0$, i.e.,

$$ -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}(-1)(\sigma^2)^{-2}Q(\hat\beta) = 0 \implies \hat\sigma^2_{ML} = Q(\hat\beta)/n = \Big(\sum_{i=1}^n \hat\epsilon_i^2\Big)/n \quad\text{(biased!)} $$
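The bias is easy to see: by part (c) of the distribution theorem above, $E[Q(\hat\beta)] = \sigma^2(n-p)$, so $Q(\hat\beta)/n$ underestimates $\sigma^2$ by the factor $(n-p)/n$. A short numerical comparison, continuing the earlier example:

```python
Q_min = resid @ resid                      # Q(beta_hat) = sum of squared residuals

sigma2_ml = Q_min / n_obs                  # ML estimate: biased low by (n - p)/n
sigma2_unbiased = Q_min / (n_obs - p_dim)  # degrees-of-freedom-corrected estimate
```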
Generalized M-Estimators
For data (y, X), fit the linear regression model

$$ y_i = \mathbf{x}_i^T\beta + \epsilon_i, \quad i = 1, 2, \ldots, n $$

by specifying $\beta = \hat\beta$ to minimize $Q(\beta) = \sum_{i=1}^n h(y_i, \mathbf{x}_i, \beta, \sigma^2)$. The choice of the function h(·) distinguishes different estimators:

(1) Least squares: $h(y_i, \mathbf{x}_i, \beta, \sigma^2) = (y_i - \mathbf{x}_i^T\beta)^2$
(2) Mean absolute deviation (MAD): $h(y_i, \mathbf{x}_i, \beta, \sigma^2) = |y_i - \mathbf{x}_i^T\beta|$
(3) Maximum likelihood (ML): assume the $y_i$ are independent with pdfs $p(y_i \mid \beta, \mathbf{x}_i, \sigma^2)$; then $h(y_i, \mathbf{x}_i, \beta, \sigma^2) = -\log p(y_i \mid \beta, \mathbf{x}_i, \sigma^2)$
(4) Robust M-estimator: $h(y_i, \mathbf{x}_i, \beta, \sigma^2) = \chi(y_i - \mathbf{x}_i^T\beta)$, where χ(·) is even and monotone increasing on (0, ∞).
(5) Quantile estimator: for a fixed quantile τ, 0 < τ < 1,

$$ h(y_i, \mathbf{x}_i, \beta, \sigma^2) = \begin{cases} \tau\,|y_i - \mathbf{x}_i^T\beta|, & \text{if } y_i \ge \mathbf{x}_i^T\beta \\ (1-\tau)\,|y_i - \mathbf{x}_i^T\beta|, & \text{if } y_i < \mathbf{x}_i^T\beta \end{cases} $$

E.g., τ = 0.90 corresponds to the 90th quantile / upper decile, and τ = 0.50 corresponds to the MAD estimator.
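The asymmetric "check" loss is straightforward to code; below is a sketch of h(·) summed over the cases (a full quantile-regression fit would minimize this criterion over β, e.g., by linear programming):

```python
def quantile_criterion(y, X, beta, tau):
    """Q(beta) for the quantile estimator: tau-weighted loss above the fit,
    (1 - tau)-weighted loss below it."""
    r = y - X @ beta
    return np.sum(np.where(r >= 0, tau, 1.0 - tau) * np.abs(r))

# tau = 0.5 recovers the MAD criterion up to a factor of 1/2.
```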
MIT OpenCourseWare
18.S096 Topics in Mathematics with Applications in Finance, Fall 2013
For information about citing these materials or our Terms of Use, visit: