Regression Models for Time Series Analysis
1 Regression Models for Time Series Analysis

Benjamin Kedem (University of Maryland, College Park, MD) and Konstantinos Fokianos (University of Cyprus, Nicosia, Cyprus). Wiley, New York, 2002.
2 References

Cox (1975). Partial likelihood. Biometrika, 62.
Fahrmeir and Tutz (2001). Multivariate Statistical Modelling Based on GLM. 2nd ed., Springer, NY.
Fokianos (1996). Categorical Time Series: Prediction and Control. Ph.D. Thesis, University of Maryland.
Fokianos and Kedem (1998). Prediction and classification of non-stationary categorical time series. J. Multivariate Analysis, 67.
Kedem (1980). Binary Time Series. Marcel Dekker, NY.
Kedem and Fokianos (2002). Regression Models for Time Series Analysis. Wiley, NY.

3 McCullagh and Nelder (1989). Generalized Linear Models. 2nd ed., Chapman & Hall, London.
Nelder and Wedderburn (1972). Generalized linear models. JRSS, A, 135.
Slud (1982). Consistency and efficiency of inference with the partial likelihood. Biometrika, 69.
Slud and Kedem (1994). Partial likelihood analysis of logistic regression and autoregression. Statistica Sinica, 4.
Wong (1986). Theory of partial likelihood. Annals of Statistics, 14.
4 Part I: GLM and Time Series Overview

Extension of the Nelder and Wedderburn (1972), McCullagh and Nelder (1989) GLM to time series is possible due to:
- An increasing sequence of histories relative to an observer.
- Partial likelihood.
- The partial score is a martingale.
- Well-behaved covariates.
5 Partial Likelihood

Suppose we observe a pair of jointly distributed time series, (X_t, Y_t), t = 1, ..., N, where {Y_t} is a response series and {X_t} is a time-dependent random covariate. Employing the rules of conditional probability, the joint density of all the X, Y observations can be expressed as

    f_\theta(x_1, y_1, ..., x_N, y_N) = f_\theta(x_1) \prod_{t=2}^{N} f_\theta(x_t \mid d_t) \prod_{t=1}^{N} f_\theta(y_t \mid c_t),    (1)

where

    d_t = (y_1, x_1, ..., y_{t-1}, x_{t-1}),    c_t = (y_1, x_1, ..., y_{t-1}, x_{t-1}, x_t).

The second product on the right-hand side of (1) constitutes a partial likelihood according to Cox (1975).
6 Let F_0 \subseteq F_1 \subseteq F_2 \subseteq ... be an increasing sequence of \sigma-fields, and let Y_1, Y_2, ... be a sequence of random variables on some common probability space such that Y_t is F_t-measurable, with

    Y_t \mid F_{t-1} \sim f_t(y_t; \theta),

where \theta \in R^p is a fixed parameter. The partial likelihood (PL) function relative to \theta, F_t, and the data Y_1, Y_2, ..., Y_N is given by the product

    PL(\theta; y_1, ..., y_N) = \prod_{t=1}^{N} f_t(y_t; \theta).    (2)
7 The General Regression Problem

{Y_t} is a response time series with the corresponding p-dimensional covariate process

    Z_{t-1} = (Z_{(t-1)1}, ..., Z_{(t-1)p})'.

Define F_{t-1} = \sigma\{Y_{t-1}, Y_{t-2}, ..., Z_{t-1}, Z_{t-2}, ...\}. Note: Z_{t-1} may already include past Y_t's. The conditional expectation of the response given the past is

    \mu_t = E[Y_t \mid F_{t-1}].

The problem is to relate \mu_t to the covariates.
8 Time Series Following GLM

1. Random Component. The conditional distribution of the response given the past belongs to the exponential family of distributions in natural or canonical form,

    f(y_t; \theta_t, \phi \mid F_{t-1}) = \exp\left\{ \frac{y_t \theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t; \phi) \right\},    (3)

with \alpha_t(\phi) = \phi / \omega_t, dispersion \phi, and prior weight \omega_t.

2. Systematic Component. There is a monotone function g(\cdot) such that

    g(\mu_t) = \eta_t = \sum_{j=1}^{p} \beta_j Z_{(t-1)j} = Z_{t-1}' \beta,    (4)

where g(\cdot) is the link function and \eta_t is the linear predictor of the model.
9 Typical choices for \eta_t = Z_{t-1}' \beta could be

    \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 X_t \cos(\omega_0 t)

or

    \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-1} X_t + \beta_4 X_{t-1}

or

    \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}^7 + \beta_3 Y_{t-1} \log(X_{t-12}).
10 GLM Equations

Differentiating the identity \int f(y; \theta_t, \phi \mid F_{t-1}) \, dy = 1 implies the relationships

    \mu_t = E[Y_t \mid F_{t-1}] = b'(\theta_t),    (5)

    Var[Y_t \mid F_{t-1}] = \alpha_t(\phi) b''(\theta_t) \equiv \alpha_t(\phi) V(\mu_t).    (6)

Since Var[Y_t \mid F_{t-1}] > 0, it follows that b' is monotone. Therefore, equation (5) implies that

    \theta_t = (b')^{-1}(\mu_t).    (7)

We see that \theta_t itself is a monotone function of \mu_t, and hence it can be used to define a link function. The link function

    g(\mu_t) = \theta_t(\mu_t) = \eta_t = Z_{t-1}' \beta    (8)

is called the canonical link function.
11 Example: Poisson Time Series

    f(y_t; \theta_t, \phi \mid F_{t-1}) = \exp\{ (y_t \log \mu_t - \mu_t) - \log y_t! \},

so that E[Y_t \mid F_{t-1}] = \mu_t, b(\theta_t) = \mu_t = \exp(\theta_t), V(\mu_t) = \mu_t, \phi = 1, and \omega_t = 1. The canonical link is

    g(\mu_t) = \theta_t(\mu_t) = \log \mu_t = \eta_t = Z_{t-1}' \beta.

As an example, if Z_{t-1} = (1, X_t, Y_{t-1})', then

    \log \mu_t = \beta_0 + \beta_1 X_t + \beta_2 Y_{t-1},

with {X_t} standing for some covariate process, or a possible trend, or a possible seasonal component.
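As a quick numerical sketch (not from the text), the Poisson model above can be simulated and its log partial likelihood evaluated directly. The parameter values, seed, and series length below are arbitrary illustrations:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

# Hypothetical parameters for log mu_t = b0 + b1 * X_t + b2 * Y_{t-1}
beta = np.array([0.5, 0.3, 0.1])
N = 500

X = rng.normal(size=N)            # exogenous covariate process
Y = np.zeros(N)
y_prev = 0.0
for t in range(N):
    mu_t = np.exp(beta[0] + beta[1] * X[t] + beta[2] * y_prev)
    Y[t] = rng.poisson(mu_t)      # Y_t | F_{t-1} ~ Poisson(mu_t)
    y_prev = Y[t]

def log_partial_likelihood(b, X, Y):
    """l(b) = sum_t [ y_t * eta_t - exp(eta_t) - log(y_t!) ]."""
    y_lag = np.concatenate(([0.0], Y[:-1]))     # Y_{t-1}, with Y_0 := 0
    eta = b[0] + b[1] * X + b[2] * y_lag        # linear predictor Z'_{t-1} b
    return np.sum(Y * eta - np.exp(eta)) - sum(lgamma(y + 1) for y in Y)

# The log partial likelihood is (with overwhelming probability) larger at
# the true parameter than far away from it.
print(log_partial_likelihood(beta, X, Y) > log_partial_likelihood(beta + 1.0, X, Y))
```

Note that the conditioning on Y_{t-1} is what makes this a partial, rather than full, likelihood computation.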
12 Example: Binary Time Series

{Y_t} takes the values 0, 1. Let \pi_t = P(Y_t = 1 \mid F_{t-1}). Then

    f(y_t; \theta_t, \phi \mid F_{t-1}) = \exp\left\{ y_t \log\left( \frac{\pi_t}{1 - \pi_t} \right) + \log(1 - \pi_t) \right\}.

The canonical link gives the logistic regression model

    g(\pi_t) = \theta_t(\pi_t) = \log \frac{\pi_t}{1 - \pi_t} = \eta_t = Z_{t-1}' \beta.    (9)

Note:

    \pi_t = F_l(\eta_t).    (10)
13 Partial Likelihood Inference

Given a time series {Y_t}, t = 1, ..., N, conditionally distributed as (3), the partial likelihood of the observed series is

    PL(\beta) = \prod_{t=1}^{N} f(y_t; \theta_t, \phi \mid F_{t-1}).    (11)

Then from (3), the log partial likelihood, l(\beta), is given by

    l(\beta) = \sum_{t=1}^{N} \log f(y_t; \theta_t, \phi \mid F_{t-1}) \equiv \sum_{t=1}^{N} l_t
             = \sum_{t=1}^{N} \left\{ \frac{y_t \theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t, \phi) \right\}
             = \sum_{t=1}^{N} \left\{ \frac{y_t u(Z_{t-1}' \beta) - b(u(Z_{t-1}' \beta))}{\alpha_t(\phi)} + c(y_t, \phi) \right\},    (12)

where u = (b')^{-1} \circ g^{-1} maps the linear predictor to the natural parameter, by (7) and (4).
14 The partial score is the p-dimensional vector of derivatives with respect to \beta = (\beta_1, \beta_2, ..., \beta_p)',

    S_N(\beta) \equiv \nabla l(\beta) = \sum_{t=1}^{N} Z_{t-1} \frac{\partial \mu_t}{\partial \eta_t} \frac{Y_t - \mu_t(\beta)}{\sigma_t^2(\beta)},    (13)

with \sigma_t^2(\beta) = Var[Y_t \mid F_{t-1}]. The partial score vector process {S_t(\beta)}, t = 1, ..., N, is defined from the partial sums

    S_t(\beta) = \sum_{s=1}^{t} Z_{s-1} \frac{\partial \mu_s}{\partial \eta_s} \frac{Y_s - \mu_s(\beta)}{\sigma_s^2(\beta)}.    (14)
15 The solution of the score equation

    S_N(\beta) = \nabla \log PL(\beta) = 0    (15)

is denoted by \hat{\beta} and is referred to as the maximum partial likelihood estimator (MPLE) of \beta. The system of equations (15) is non-linear and is customarily solved by the Fisher scoring method, an iterative algorithm resembling the Newton-Raphson procedure. Before turning to the Fisher scoring algorithm in our context of conditional inference, it is necessary to introduce several important matrices.
16 An important role in partial likelihood inference is played by the cumulative conditional information matrix, G_N(\beta), defined by a sum of conditional covariance matrices:

    G_N(\beta) = \sum_{t=1}^{N} Cov\left[ Z_{t-1} \frac{\partial \mu_t}{\partial \eta_t} \frac{Y_t - \mu_t(\beta)}{\sigma_t^2(\beta)} \,\Big|\, F_{t-1} \right]
               = \sum_{t=1}^{N} Z_{t-1} \left( \frac{\partial \mu_t}{\partial \eta_t} \right)^2 \frac{1}{\sigma_t^2(\beta)} Z_{t-1}'
               = Z' W(\beta) Z.

We also need the observed information H_N(\beta) \equiv -\nabla \nabla' l(\beta). Define R_N(\beta) from the difference

    H_N(\beta) = G_N(\beta) - R_N(\beta).

Fact: For canonical links, R_N(\beta) = 0.
17 Fisher Scoring

In Newton-Raphson, replace H_N(\beta) by its conditional expectation:

    \hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + G_N^{-1}(\hat{\beta}^{(k)}) S_N(\hat{\beta}^{(k)}).

Fisher scoring becomes Newton-Raphson for canonical links. Fisher scoring simplifies to iterative reweighted least squares (IRLS):

    \hat{\beta}^{(k+1)} = (Z' W(\hat{\beta}^{(k)}) Z)^{-1} Z' W(\hat{\beta}^{(k)}) q^{(k)},

where q^{(k)} is the adjusted (working) dependent variable.
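The IRLS form of Fisher scoring can be sketched in a few lines of numpy, written here for the canonical logit link (where it coincides with Newton-Raphson). The function name, simulated design, and coefficients are illustrative assumptions, not from the text:

```python
import numpy as np

def fisher_scoring_logit(Z, y, n_iter=25):
    """Fisher scoring as iterative reweighted least squares for the
    canonical logit link:
        beta <- (Z' W Z)^{-1} Z' W q,
    with W = diag(pi_t (1 - pi_t)) and adjusted dependent variable
        q_t = eta_t + (y_t - pi_t) / (pi_t (1 - pi_t))."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        eta = Z @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)                 # working weights W
        q = eta + (y - pi) / w              # adjusted response q^{(k)}
        ZtW = Z.T * w                       # Z' W (row-wise broadcast)
        beta = np.linalg.solve(ZtW @ Z, ZtW @ q)
    return beta

# Simulated check with an arbitrary static design.
rng = np.random.default_rng(0)
N = 2000
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(Z @ beta_true)))).astype(float)

beta_hat = fisher_scoring_logit(Z, y)
score = Z.T @ (y - 1.0 / (1.0 + np.exp(-(Z @ beta_hat))))
print(np.linalg.norm(score) < 1e-6)   # the score equation S_N(beta_hat) = 0 is solved
```

The same loop applies to time-series designs, since the covariate matrix rows may contain lagged responses.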
18 Asymptotic Theory: Assumption A

A1. The true parameter \beta belongs to an open set B \subseteq R^p.

A2. The covariate vector Z_{t-1} almost surely lies in a nonrandom compact subset \Gamma of R^p, such that P[\sum_{t=1}^{N} Z_{t-1} Z_{t-1}' > 0] = 1. In addition, Z_{t-1}' \beta lies almost surely in the domain H of the inverse link function h = g^{-1} for all Z_{t-1} \in \Gamma and \beta \in B.

A3. The inverse link function h defined in (A2) is twice continuously differentiable and \partial h(\gamma) / \partial \gamma \neq 0.
19 A4. There is a probability measure \nu on R^p such that \int_{R^p} z z' \nu(dz) is positive definite, and such that under (3) and (4), for Borel sets A \subseteq R^p,

    \frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in A]} \to \nu(A)

in probability as N \to \infty, at the true value of \beta.

A4 calls for asymptotically well-behaved covariates:

    \frac{1}{N} \sum_{t=1}^{N} f(Z_{t-1}) \to \int_{R^p} f(z) \, \nu(dz)

in probability as N \to \infty. Thus, there exists a p \times p limiting information matrix per observation, G(\beta), such that

    \frac{G_N(\beta)}{N} \to G(\beta)    (16)

in probability, as N \to \infty.
20 Slud and Kedem (1994), Fokianos and Kedem (1998):

1. {S_t(\beta)} relative to {F_t}, t = 1, ..., N, is a martingale.
2. R_N(\beta)/N \to 0 in probability.
3. S_N(\beta)/\sqrt{N} \to N_p(0, G(\beta)) in distribution.
4. \sqrt{N}(\hat{\beta} - \beta) \to N_p(0, G^{-1}(\beta)) in distribution.
21 A 100(1 - \alpha)% prediction interval for \mu_t(\beta) (with h = g^{-1}) is

    \mu_t(\hat{\beta}) \pm z_{\alpha/2} \, h'(Z_{t-1}' \hat{\beta}) \sqrt{ \frac{Z_{t-1}' G^{-1}(\hat{\beta}) Z_{t-1}}{N} }.

Hypothesis Testing

Let \tilde{\beta} be the MPLE of \beta obtained under H_0 with r < p restrictions,

    H_0: \beta_1 = \cdots = \beta_r = 0,

and let \hat{\beta} be the unrestricted MPLE. The log partial likelihood ratio statistic

    \lambda_N = 2 \{ l(\hat{\beta}) - l(\tilde{\beta}) \}    (17)

converges in distribution to \chi_r^2.
22 More generally, assume C is a known r \times p matrix with full rank r, r < p. Under the general linear hypothesis

    H_0: C\beta = \beta_0  against  H_1: C\beta \neq \beta_0,    (18)

the Wald statistic satisfies

    N \{ C\hat{\beta} - \beta_0 \}' \{ C G^{-1}(\hat{\beta}) C' \}^{-1} \{ C\hat{\beta} - \beta_0 \} \to \chi_r^2

in distribution.
23 Diagnostics

l(y; y): maximum log partial likelihood corresponding to the saturated model.
l(\hat{\mu}; y): maximum log partial likelihood from the reduced model.

Scaled deviance:

    D \equiv 2 \{ l(y; y) - l(\hat{\mu}; y) \} \approx \chi_{N-p}^2.

Model selection criteria:

    AIC(p) = -2 \log PL(\hat{\beta}) + 2p,    BIC(p) = -2 \log PL(\hat{\beta}) + p \log N.
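For a Poisson fit, these diagnostics follow directly from the log partial likelihoods. A small sketch (the helper name and the convention y log y = 0 at y = 0 are mine):

```python
import numpy as np
from math import lgamma

def poisson_deviance_aic_bic(y, mu_hat, p):
    """Scaled deviance D = 2(l(y; y) - l(mu_hat; y)) plus AIC and BIC for a
    Poisson model with p parameters; l is the Poisson log partial likelihood."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    log_fact = sum(lgamma(v + 1) for v in y)
    ll_fit = np.sum(y * np.log(mu_hat) - mu_hat) - log_fact
    # Saturated model sets mu_t = y_t; by convention y*log(y) = 0 at y = 0.
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0)), 0.0)
    ll_sat = np.sum(ylogy - y) - log_fact
    D = 2.0 * (ll_sat - ll_fit)
    return D, -2.0 * ll_fit + 2 * p, -2.0 * ll_fit + p * np.log(N)

y = np.array([1.0, 2.0, 3.0, 0.0])
D0, _, _ = poisson_deviance_aic_bic(y, np.maximum(y, 1e-8), p=4)
print(abs(D0) < 1e-6)   # deviance vanishes when the fit is saturated
```

Comparing D, AIC, and BIC across candidate linear predictors is exactly how the mortality models below are ranked.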
24 Analysis of Mortality Count Data in LA

Weekly data from Los Angeles County during a period of 10 years, from January 1, 1970, to December 31, 1979: weekly sampled, filtered time series. N = 508.

Response:
  Y    Total mortality (filtered)
Weather:
  T    Temperature
  RH   Relative humidity
Pollution:
  CO   Carbon monoxide
  SO2  Sulfur dioxide
  NO2  Nitrogen dioxide
  HC   Hydrocarbons
  OZ   Ozone
  KM   Particulates
25 [Figure: Weekly data of filtered total mortality and temperature, and log-filtered CO, with the corresponding estimated autocorrelation functions.]
26 Covariates and \eta_t used in the Poisson regression (S = SO2, N = NO2). To recover \eta_t, insert the \beta's; for Model 2, \eta_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}, etc.

Model 0: T_t + RH_t + CO_t + S_t + N_t + HC_t + OZ_t + KM_t
Model 1: Y_{t-1}
Model 2: Y_{t-1} + Y_{t-2}
Model 3: Y_{t-1} + Y_{t-2} + T_{t-1}
Model 4: Y_{t-1} + Y_{t-2} + T_{t-1} + log(CO_t)
Model 5: Y_{t-1} + Y_{t-2} + T_{t-1} + T_{t-2} + log(CO_t)
Model 6: Y_{t-1} + Y_{t-2} + T_t + T_{t-1} + log(CO_t)
27 Comparison of the 7 Poisson regression models in terms of p, D, df, AIC, and BIC. N = 508. Choose Model 4:

    \log(\hat{\mu}_t) = \hat{\beta}_0 + \hat{\beta}_1 Y_{t-1} + \hat{\beta}_2 Y_{t-2} + \hat{\beta}_3 T_{t-1} + \hat{\beta}_4 \log(CO_t).
28 [Figure: Observed and predicted weekly filtered total mortality from Model 4: Poisson regression of Mrt(t) on Mrt(t-1), Mrt(t-2), Tmp(t-1), and log(Crb(t)).]
29 Comparison of Residuals

[Figure: Working residuals from Models 0 and 4, and their respective estimated autocorrelation functions.]
30 Part II: Binary Time Series

Example of a binary time series (bts): two categories obtained by clipping,

    Y_t \equiv I_{[X_t \in C]} = 1 if X_t \in C, 0 if X_t \notin C,    (19)

    Y_t \equiv I_{[X_t \geq r]} = 1 if X_t \geq r, 0 if X_t < r.    (20)

Other examples, at time t = 1, 2, ...: (Rain, No Rain), (S&P Up, S&P Down), etc.
31 {Y_t} takes the values 0 or 1, t = 1, 2, 3, ..., and {Z_{t-1}} is a p-dimensional stochastic covariate process. Against the backdrop of the general framework presented above, we wish to relate

    \mu_t(\beta) = \pi_t(\beta) = P_\beta(Y_t = 1 \mid F_{t-1})    (21)

to the covariates. For this we need good links!
32 The standard logistic distribution is

    F_l(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}},    -\infty < x < \infty.

Then F_l^{-1}(x) = \log(x / (1 - x)) is the natural link under some conditions.
33 Fact: For any bts, there are \theta_j such that

    \log\left\{ \frac{P(Y_t = 1 \mid Y_{t-1} = y_{t-1}, ..., Y_1 = y_1)}{P(Y_t = 0 \mid Y_{t-1} = y_{t-1}, ..., Y_1 = y_1)} \right\} = \theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p},

or

    \pi_t(\beta) = \frac{1}{1 + \exp[-(\theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p})]}.

Fact: Consider an AR(p) time series

    X_t = \gamma_0 + \gamma_1 X_{t-1} + \cdots + \gamma_p X_{t-p} + \lambda \epsilon_t,

where the \epsilon_t are i.i.d. logistically distributed. Define Y_t = I_{[X_t \geq r]}. Then

    \pi_t(\beta) = \frac{1}{1 + \exp[-(\gamma_0 - r + \gamma_1 X_{t-1} + \cdots + \gamma_p X_{t-p})/\lambda]}.
34 This motivates logistic regression:

    \pi_t(\beta) \equiv P_\beta(Y_t = 1 \mid F_{t-1}) = F_l(\beta' Z_{t-1}) = \frac{1}{1 + \exp[-\beta' Z_{t-1}]},

or equivalently, the link function is

    logit(\pi_t(\beta)) \equiv \log\left\{ \frac{\pi_t(\beta)}{1 - \pi_t(\beta)} \right\} = \beta' Z_{t-1}.

This is the canonical link.
35 Link functions for binary time series:

    logit:      \beta' Z_{t-1} = \log\{\pi_t(\beta) / (1 - \pi_t(\beta))\}
    probit:     \beta' Z_{t-1} = \Phi^{-1}\{\pi_t(\beta)\}
    log-log:    \beta' Z_{t-1} = -\log\{-\log(\pi_t(\beta))\}
    C-log-log:  \beta' Z_{t-1} = \log\{-\log(1 - \pi_t(\beta))\}

Note: Here all the inverse links are cdf's. In what follows we always assume the inverse link is a differentiable cdf F(x).
36 The partial likelihood of \beta takes on the simple product form

    PL(\beta) = \prod_{t=1}^{N} [\pi_t(\beta)]^{y_t} [1 - \pi_t(\beta)]^{1 - y_t}
              = \prod_{t=1}^{N} [F(\beta' Z_{t-1})]^{y_t} [1 - F(\beta' Z_{t-1})]^{1 - y_t}.

We have under Assumption A:

    \sqrt{N}(\hat{\beta} - \beta) \to N_p(0, G^{-1}(\beta))

in distribution. For the canonical link (logistic regression),

    \frac{G_N(\beta)}{N} \to G(\beta) = \int_{R^p} \frac{e^{\beta' z}}{(1 + e^{\beta' z})^2} \, z z' \, \nu(dz).
37 Illustration of asymptotic normality. Consider

    logit(\pi_t(\beta)) = \beta_1 + \beta_2 \cos\left( \frac{2\pi t}{12} \right) + \beta_3 Y_{t-1},

so that Z_{t-1} = (1, \cos(2\pi t / 12), Y_{t-1})'.

[Figure: Logistic autoregression with a sinusoidal component. (a) Y_t. (b) \pi_t(\beta), where logit(\pi_t(\beta)) = \beta_1 + \beta_2 \cos(2\pi t / 12) + \beta_3 y_{t-1}.]
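This illustration can be reproduced in outline: simulate the logistic autoregression with \beta = (0.3, 0.75, 1)' and compute the MPLE by Newton-Raphson, which for this canonical link coincides with Fisher scoring. The seed and sample size below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([0.3, 0.75, 1.0])
N = 2000

# Simulate the logistic autoregression with a sinusoidal component.
y = np.zeros(N)
Z = np.zeros((N, 3))
y_prev = 0.0
for t in range(N):
    Z[t] = (1.0, np.cos(2 * np.pi * (t + 1) / 12), y_prev)
    pi_t = 1.0 / (1.0 + np.exp(-Z[t] @ beta_true))
    y[t] = float(rng.random() < pi_t)
    y_prev = y[t]

# Newton-Raphson for the maximum partial likelihood estimator.
beta = np.zeros(3)
for _ in range(30):
    pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    score = Z.T @ (y - pi)                  # partial score S_N(beta)
    G = (Z.T * (pi * (1 - pi))) @ Z         # conditional information G_N(beta)
    beta = beta + np.linalg.solve(G, score)

print(np.round(beta, 2))   # close to (0.3, 0.75, 1.0) for large N
```

Repeating this over many simulated paths and standardizing the estimates produces histograms like those on the next slide.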
38 [Figure: Histograms of the normalized MPLE's of \beta_1, \beta_2, \beta_3, where \beta = (0.3, 0.75, 1)' and N = 200. Each histogram consists of 1000 estimates.]
39 Goodness of Fit

Let C_1, ..., C_k be a partition of R^p. For j = 1, ..., k, define

    M_j \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]} Y_t,    E_j(\beta) \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]} \pi_t(\beta).

Put

    M \equiv (M_1, ..., M_k)',    E(\beta) \equiv (E_1(\beta), ..., E_k(\beta))'.
40 Slud and Kedem (1994), Kedem and Fokianos (2002): With

    \sigma_j^2 \equiv \int_{C_j} F(\beta' z)(1 - F(\beta' z)) \, \nu(dz),

we have

    \chi^2(\beta) \equiv \frac{1}{N} \sum_{j=1}^{k} (M_j - E_j(\beta))^2 / \sigma_j^2 \to \chi_k^2

in distribution. In practice one needs to adjust the degrees of freedom when replacing \beta by \hat{\beta}.
41 When \hat{\beta} and (M - E(\beta)) are obtained from the same data set,

    E(\chi^2(\hat{\beta})) \approx k - \sum_{j=1}^{k} (B' G(\beta) B)_{jj} / \sigma_j^2.

When \hat{\beta} and (M - E(\beta)) are obtained from independent data sets,

    E(\chi^2(\hat{\beta})) \approx k + \sum_{j=1}^{k} (B' G(\beta) B)_{jj} / \sigma_j^2.
42 Illustration of the distribution of \chi^2(\beta) using Q-Q plots. Consider the previous logistic regression model with a periodic component, and use the partition

    C_1 = \{Z : Z_1 = 1, -1 \leq Z_2 < 0, Z_3 = 0\}
    C_2 = \{Z : Z_1 = 1, -1 \leq Z_2 < 0, Z_3 = 1\}
    C_3 = \{Z : Z_1 = 1, 0 \leq Z_2 \leq 1, Z_3 = 0\}
    C_4 = \{Z : Z_1 = 1, 0 \leq Z_2 \leq 1, Z_3 = 1\}

Then k = 4, M_j is the sum of those Y_t's for which Z_{t-1} is in C_j, j = 1, 2, 3, 4, and the E_j(\beta) are obtained similarly. Estimate \sigma_j^2 by

    \hat{\sigma}_j^2 = \frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]} \pi_t(\beta)(1 - \pi_t(\beta)).
43 The \chi_4^2 approximation is quite good.

[Figure: Q-Q plots of the \chi^2 statistic against the true \chi_4^2 quantiles, for N = 200 with \beta = (0.3, 0.75, 1)' and for N = 400 with \beta = (0.3, 1, 2)'.]
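The statistic is straightforward to compute. The sketch below evaluates \chi^2(\beta) at the true \beta for one simulated path of the same model, using the four-set partition above; the seed and sample size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([0.3, 0.75, 1.0])
N = 400

# Simulate the logistic autoregression with a sinusoidal component.
y = np.zeros(N)
Z = np.zeros((N, 3))
y_prev = 0.0
for t in range(N):
    Z[t] = (1.0, np.cos(2 * np.pi * (t + 1) / 12), y_prev)
    y[t] = float(rng.random() < 1.0 / (1.0 + np.exp(-Z[t] @ beta)))
    y_prev = y[t]

pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))

# The partition C_1, ..., C_4 by the sign of Z_2 and the value of Z_3.
in_C = [
    (Z[:, 1] < 0) & (Z[:, 2] == 0),
    (Z[:, 1] < 0) & (Z[:, 2] == 1),
    (Z[:, 1] >= 0) & (Z[:, 2] == 0),
    (Z[:, 1] >= 0) & (Z[:, 2] == 1),
]

chi2 = 0.0
for ind in in_C:
    M_j = np.sum(y[ind])                      # observed count in cell j
    E_j = np.sum(pi[ind])                     # expected count in cell j
    sigma2_j = np.mean(ind * pi * (1 - pi))   # estimate of sigma_j^2
    chi2 += (M_j - E_j) ** 2 / sigma2_j
chi2 /= N
print(float(np.round(chi2, 2)))   # one draw from an (approximately) chi^2_4 law
```

Collecting this value over many replications and plotting the empirical quantiles against \chi_4^2 quantiles gives the Q-Q plots described above.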
44 Example: Modeling successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming.

    Y_t = 1 if the duration of the t-th eruption is greater than 3 minutes, and 0 if it is less than 3 minutes.
45 Candidate models \eta_t = \beta' Z_{t-1} for Old Faithful:

Model 1: \beta_0 + \beta_1 Y_{t-1}
Model 2: \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}
Model 3: \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-3}
Model 4: \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-3} + \beta_4 Y_{t-4}

Comparison of the models (in terms of p, X^2, D, AIC, BIC) using the logistic link shows that Model 2 is best; probit regression gives similar results. The fitted probability is

    \hat{\pi}_t = \pi_t(\hat{\beta}) = \frac{1}{1 + \exp\{-(\hat{\beta}_0 + \hat{\beta}_1 Y_{t-1} + \hat{\beta}_2 Y_{t-2})\}}.
46 Part III: Categorical Time Series

EEG sleep state classified, or quantized, in four categories as follows: 1: quiet sleep, 2: indeterminate sleep, 3: active sleep, 4: awake. Here the sleep state categories or levels are assigned integer values. This is an example of a categorical time series {Y_t}, t = 1, ..., N, taking the values 1, ..., 4. The integer assignment is arbitrary: why not the values 7.1, 15.8, 19.24, 71.17, or any other scale?
47 Assume m categories. The t-th observation of any categorical time series, regardless of the measurement scale, can be represented by the vector Y_t = (Y_{t1}, ..., Y_{tq})' of length q = m - 1, with elements

    Y_{tj} = 1 if the j-th category is observed at time t, and 0 otherwise,

for t = 1, ..., N and j = 1, ..., q. A binary time series is the special case with m = 2, q = 1.
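This indicator representation is easy to construct in code; a small sketch (the function name is mine):

```python
import numpy as np

def to_indicator(series, m):
    """Represent a categorical series with m categories by vectors of
    length q = m - 1: Y_tj = 1 iff category j is observed at time t.
    Category m is identified by an all-zero row."""
    series = np.asarray(series)
    N, q = len(series), m - 1
    Y = np.zeros((N, q), dtype=int)
    for j in range(1, m):                  # categories 1, ..., q
        Y[series == j, j - 1] = 1
    return Y

# Sleep-state example with m = 4 categories:
states = [1, 3, 4, 2, 3]
Y = to_indicator(states, m=4)
print(Y.tolist())  # [[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Note that category m (here, state 4) needs no column of its own, since its indicator is determined by the other q columns.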
48 Write, for j = 1, ..., q,

    \pi_{tj} = E[Y_{tj} \mid F_{t-1}] = P(Y_{tj} = 1 \mid F_{t-1}),

and define \pi_t = (\pi_{t1}, ..., \pi_{tq})'. The m-th category is recovered from

    Y_{tm} = 1 - \sum_{j=1}^{q} Y_{tj},    \pi_{tm} = 1 - \sum_{j=1}^{q} \pi_{tj}.

Let {Z_{t-1}}, t = 1, ..., N, be a p \times q matrix that represents a covariate process: Y_{tj} corresponds to a vector of length p of random time-dependent covariates, which forms the j-th column of Z_{t-1}.
49 Assume the general regression model

    \pi_t(\beta) = (\pi_{t1}(\beta), ..., \pi_{tq}(\beta))' = (h_1(Z_{t-1}' \beta), ..., h_q(Z_{t-1}' \beta))' = h(Z_{t-1}' \beta).    (*)

The inverse link function h is defined on R^q and takes values in R^q. We shall only examine nominal and ordinal time series.
50 Nominal Time Series

Nominal categorical variables lack natural ordering. A model for nominal time series is the multinomial logit model,

    \pi_{tj}(\beta) = \frac{\exp(\beta_j' z_{t-1})}{1 + \sum_{l=1}^{q} \exp(\beta_l' z_{t-1})},    j = 1, ..., q.

Note that

    \pi_{tm}(\beta) = \frac{1}{1 + \sum_{l=1}^{q} \exp(\beta_l' z_{t-1})}.
51 The multinomial logit model is a special case of (*). Indeed, define \beta to be the qd-vector \beta = (\beta_1', ..., \beta_q')', with each \beta_j of dimension d, and Z_{t-1} the qd \times q block-diagonal matrix

    Z_{t-1} = diag(z_{t-1}, ..., z_{t-1}).

Let h stand for the vector-valued function whose components h_j, j = 1, ..., q, are given by

    \pi_{tj}(\beta) = h_j(\eta_t) = \frac{\exp(\eta_{tj})}{1 + \sum_{l=1}^{q} \exp(\eta_{tl})},    j = 1, ..., q,

with \eta_t = (\eta_{t1}, ..., \eta_{tq})' = Z_{t-1}' \beta.
52 Ordinal Time Series

Ordinal variables are measured on a scale endowed with a natural ordering. A model for ordinal time series requires a latent, or auxiliary, variable. Put

    X_t = -\gamma' z_{t-1} + e_t,

where:
1. The e_t are i.i.d. with cdf F.
2. \gamma is a d-dimensional vector of parameters.
3. z_{t-1} is a d-dimensional covariate vector.

Define a categorical time series {Y_t} from the levels of {X_t}:

    Y_t = j \iff Y_{tj} = 1 \iff \theta_{j-1} \leq X_t < \theta_j,

where -\infty = \theta_0 < \theta_1 < ... < \theta_m = \infty.
53 Then

    \pi_{tj} = P(\theta_{j-1} \leq X_t < \theta_j \mid F_{t-1}) = F(\theta_j + \gamma' z_{t-1}) - F(\theta_{j-1} + \gamma' z_{t-1}),

for j = 1, ..., m. There are many possibilities depending on F.

Special case: the Proportional Odds Model, with F(x) = F_l(x) = 1/(1 + \exp(-x)). Then we have, for j = 1, ..., q,

    \log\left\{ \frac{P[Y_t \leq j \mid F_{t-1}]}{P[Y_t > j \mid F_{t-1}]} \right\} = \theta_j + \gamma' z_{t-1}.
54 The proportional odds model has the form (*) with p = q + d:

    \beta = (\theta_1, ..., \theta_q, \gamma')'

and Z_{t-1} the (q + d) \times q matrix whose j-th column stacks the j-th unit vector of R^q on top of z_{t-1}:

    Z_{t-1} = \begin{pmatrix} I_q \\ z_{t-1} \cdots z_{t-1} \end{pmatrix}.

Now set h = (h_1, ..., h_q)', and let

    \pi_{t1}(\beta) = h_1(\eta_t) = F(\eta_{t1}),
    \pi_{tj}(\beta) = h_j(\eta_t) = F(\eta_{tj}) - F(\eta_{t(j-1)}),    j = 2, ..., q,

where \eta_t = (\eta_{t1}, ..., \eta_{tq})' = Z_{t-1}' \beta.
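A small numerical sketch of the proportional odds category probabilities; the thresholds, \gamma, and covariate vector below are hypothetical values of mine:

```python
import numpy as np

def prop_odds_probs(theta, gamma, z):
    """Category probabilities under the proportional odds model:
        pi_tj = F(theta_j + gamma'z) - F(theta_{j-1} + gamma'z),
    with F the logistic cdf and theta_0 = -inf, theta_m = +inf."""
    F = lambda x: 1.0 / (1.0 + np.exp(-x))
    cuts = np.concatenate(([-np.inf], theta, [np.inf]))
    eta = gamma @ z
    cdf = F(cuts + eta)                    # F(-inf) = 0, F(+inf) = 1
    return np.diff(cdf)                    # pi_t1, ..., pi_tm

theta = np.array([-1.0, 0.5, 2.0])         # increasing thresholds (m = 4)
gamma = np.array([0.8, -0.3])
z = np.array([1.0, 2.0])                   # hypothetical covariate vector

pi = prop_odds_probs(theta, gamma, z)
print(round(float(pi.sum()), 10))   # 1.0 -- the m probabilities sum to one
```

Since the thresholds are increasing, every difference of cdf values is nonnegative, so the output is a valid probability vector over the m categories.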
55 Partial Likelihood Estimation

Introduce the multinomial probability

    f(y_t; \beta \mid F_{t-1}) = \prod_{j=1}^{m} \pi_{tj}(\beta)^{y_{tj}}.

The partial likelihood is a product of the multinomial probabilities,

    PL(\beta) = \prod_{t=1}^{N} f(y_t; \beta \mid F_{t-1}) = \prod_{t=1}^{N} \prod_{j=1}^{m} \pi_{tj}^{y_{tj}}(\beta),

so that the log partial likelihood is given by

    l(\beta) \equiv \log PL(\beta) = \sum_{t=1}^{N} \sum_{j=1}^{m} y_{tj} \log \pi_{tj}(\beta).
56 Under a modified Assumption A:

    \frac{G_N(\beta)}{N} \to \int Z U(\beta) \Sigma(\beta) U'(\beta) Z' \, \nu(dz) = G(\beta)

in probability, and, in distribution,

    \sqrt{N}(\hat{\beta} - \beta) \to N_p(0, G^{-1}(\beta)),

    \sqrt{N}\left( \pi_t(\hat{\beta}) - \pi_t(\beta) \right) \to N_q(0, Z_{t-1}' D_t(\beta) G^{-1}(\beta) D_t'(\beta) Z_{t-1}).
57 Example: Sleep State

Covariates: heart rate and temperature. Sleep states: 1: quiet sleep, 2: indeterminate sleep, 3: active sleep, 4: awake. Ordinal categorical time series with the ordering 4 < 1 < 2 < 3.

[Figure: Time plots of the sleep state, heart rate, and temperature series.]
58 Fit proportional odds models, compared by AIC:

Model 1: 1 + Y_{t-1}
Model 2: 1 + Y_{t-1} + log R_t
Model 3: 1 + Y_{t-1} + log R_t + T_t
Model 4: 1 + Y_{t-1} + T_t
Model 5: 1 + Y_{t-1} + Y_{t-2} + log R_t
Model 6: 1 + Y_{t-1} + log R_t + log R_{t-1}
59 Model 2: 1 + Y_{t-1} + log R_t. With "\leq" referring to the ordering 4 < 1 < 2 < 3,

    \log\left[ \frac{P(Y_t \leq 4 \mid F_{t-1})}{P(Y_t > 4 \mid F_{t-1})} \right] = \theta_1 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,

    \log\left[ \frac{P(Y_t \leq 1 \mid F_{t-1})}{P(Y_t > 1 \mid F_{t-1})} \right] = \theta_2 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,

    \log\left[ \frac{P(Y_t \leq 2 \mid F_{t-1})}{P(Y_t > 2 \mid F_{t-1})} \right] = \theta_3 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t.

Estimates: \hat{\theta}_1 = , \hat{\theta}_2 = , \hat{\theta}_3 = , \hat{\gamma}_1 = , \hat{\gamma}_2 = 9.533, \hat{\gamma}_3 = 4.755, \hat{\gamma}_4 = . The corresponding standard errors are , , , 0.872, 0.630, and .
60 [Figure: (a) Observed versus (b) predicted sleep states for Model 2, applied to the testing data set.]
7 Generalized Estimating Equations
Chapter 7 The procedure extends the generalized linear model to allow for analysis of repeated measurements or other correlated observations, such as clustered data. Example. Public health of cials can
Analysis of ordinal data with cumulative link models estimation with the R-package ordinal
Analysis of ordinal data with cumulative link models estimation with the R-package ordinal Rune Haubo B Christensen June 28, 2015 1 Contents 1 Introduction 3 2 Cumulative link models 4 2.1 Fitting cumulative
Interpretation of Somers D under four simple models
Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms
CS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,
Examples. David Ruppert. April 25, 2009. Cornell University. Statistics for Financial Engineering: Some R. Examples. David Ruppert.
Cornell University April 25, 2009 Outline 1 2 3 4 A little about myself BA and MA in mathematics PhD in statistics in 1977 taught in the statistics department at North Carolina for 10 years have been in
Nominal and ordinal logistic regression
Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome
Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes
Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on
Introduction to Predictive Modeling Using GLMs
Introduction to Predictive Modeling Using GLMs Dan Tevet, FCAS, MAAA, Liberty Mutual Insurance Group Anand Khare, FCAS, MAAA, CPCU, Milliman 1 Antitrust Notice The Casualty Actuarial Society is committed
Chapter 6: Multivariate Cointegration Analysis
Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration
13. Poisson Regression Analysis
136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often
Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
Incorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care.
Incorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care University of Florida 10th Annual Winter Workshop: Bayesian Model Selection and
Estimating an ARMA Process
Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators
171:290 Model Selection Lecture II: The Akaike Information Criterion
171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information
Logistic Regression (a type of Generalized Linear Model)
Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge
University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.
University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording
15.1 The Structure of Generalized Linear Models
15 Generalized Linear Models Due originally to Nelder and Wedderburn (1972), generalized linear models are a remarkable synthesis and extension of familiar regression models such as the linear models described
Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]
Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance
Profitable Strategies in Horse Race Betting Markets
University of Melbourne Department of Mathematics and Statistics Master Thesis Profitable Strategies in Horse Race Betting Markets Author: Alexander Davis Supervisor: Dr. Owen Jones October 18, 2013 Abstract:
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Penalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
Nonnested model comparison of GLM and GAM count regression models for life insurance data
Nonnested model comparison of GLM and GAM count regression models for life insurance data Claudia Czado, Julia Pfettner, Susanne Gschlößl, Frank Schiller December 8, 2009 Abstract Pricing and product development
It is important to bear in mind that one of the first three subscripts is redundant since k = i -j +3.
IDENTIFICATION AND ESTIMATION OF AGE, PERIOD AND COHORT EFFECTS IN THE ANALYSIS OF DISCRETE ARCHIVAL DATA Stephen E. Fienberg, University of Minnesota William M. Mason, University of Michigan 1. INTRODUCTION
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.
ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
Chapter 10 Introduction to Time Series Analysis
Chapter 1 Introduction to Time Series Analysis A time series is a collection of observations made sequentially in time. Examples are daily mortality counts, particulate air pollution measurements, and
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Master s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
UNDERGRADUATE DEGREE DETAILS : BACHELOR OF SCIENCE WITH
QATAR UNIVERSITY COLLEGE OF ARTS & SCIENCES Department of Mathematics, Statistics, & Physics UNDERGRADUATE DEGREE DETAILS : Program Requirements and Descriptions BACHELOR OF SCIENCE WITH A MAJOR IN STATISTICS
Multivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn
A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps
Factor analysis. Angela Montanari
Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number
