Regression Models for Time Series Analysis




Regression Models for Time Series Analysis
Benjamin Kedem (University of Maryland, College Park, MD) and Konstantinos Fokianos (University of Cyprus, Nicosia, Cyprus)
Wiley, New York, 2002

Cox (1975). Partial likelihood. Biometrika, 62, 69-76.
Fahrmeir and Tutz (2001). Multivariate Statistical Modelling Based on GLM. 2nd ed., Springer, NY.
Fokianos (1996). Categorical Time Series: Prediction and Control. Ph.D. Thesis, University of Maryland.
Fokianos and Kedem (1998). Prediction and classification of non-stationary categorical time series. J. Multivariate Analysis, 67, 277-296.
Kedem (1980). Binary Time Series. Marcel Dekker, NY.
Kedem and Fokianos (2002). Regression Models for Time Series Analysis. Wiley, NY.

McCullagh and Nelder (1989). Generalized Linear Models. 2nd ed., Chapman & Hall, London.
Nelder and Wedderburn (1972). Generalized linear models. JRSS, A, 135, 370-384.
Slud (1982). Consistency and efficiency of inference with the partial likelihood. Biometrika, 69, 547-552.
Slud and Kedem (1994). Partial likelihood analysis of logistic regression and autoregression. Statistica Sinica, 4, 89-106.
Wong (1986). Theory of partial likelihood. Annals of Statistics, 14, 88-123.

Part I: GLM and Time Series Overview

Extension of the Nelder and Wedderburn (1972) and McCullagh and Nelder (1989) GLM framework to time series is possible due to:
- An increasing sequence of histories relative to an observer.
- Partial likelihood.
- The partial score is a martingale.
- Well-behaved covariates.

Partial Likelihood

Suppose we observe a pair of jointly distributed time series, (X_t, Y_t), t = 1, ..., N, where {Y_t} is a response series and {X_t} is a time-dependent random covariate. Employing the rules of conditional probability, the joint density of all the X, Y observations can be expressed as

$$f_\theta(x_1, y_1, \ldots, x_N, y_N) = f_\theta(x_1) \prod_{t=2}^{N} f_\theta(x_t \mid d_t) \prod_{t=1}^{N} f_\theta(y_t \mid c_t), \qquad (1)$$

where d_t = (y_1, x_1, ..., y_{t-1}, x_{t-1}) and c_t = (y_1, x_1, ..., y_{t-1}, x_{t-1}, x_t). The second product on the right-hand side of (1) constitutes a partial likelihood according to Cox (1975).

Let F_0 ⊆ F_1 ⊆ F_2 ⊆ ... be an increasing sequence of σ-fields, and let Y_1, Y_2, ... be a sequence of random variables on some common probability space such that Y_t is F_t-measurable, with

$$Y_t \mid \mathcal{F}_{t-1} \sim f_t(y_t; \theta),$$

where θ ∈ R^p is a fixed parameter. The partial likelihood (PL) function relative to θ, F_t, and the data Y_1, ..., Y_N is given by the product

$$PL(\theta; y_1, \ldots, y_N) = \prod_{t=1}^{N} f_t(y_t; \theta). \qquad (2)$$

The General Regression Problem

{Y_t} is a response time series with the corresponding p-dimensional covariate process Z_{t-1} = (Z_{(t-1)1}, ..., Z_{(t-1)p})'. Define

$$\mathcal{F}_{t-1} = \sigma\{Y_{t-1}, Y_{t-2}, \ldots, Z_{t-1}, Z_{t-2}, \ldots\}.$$

Note: Z_{t-1} may already include past Y_t's. The conditional expectation of the response given the past is

$$\mu_t = E[Y_t \mid \mathcal{F}_{t-1}].$$

The problem is to relate µ_t to the covariates.

Time Series Following GLM

1. Random Component. The conditional distribution of the response given the past belongs to the exponential family of distributions in natural or canonical form,

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\left\{\frac{y_t \theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t; \phi)\right\}, \qquad (3)$$

with α_t(φ) = φ/ω_t, dispersion parameter φ, and prior weight ω_t.

2. Systematic Component. There is a monotone function g(·) such that

$$g(\mu_t) = \eta_t = \sum_{j=1}^{p} \beta_j Z_{(t-1)j} = Z_{t-1}'\beta, \qquad (4)$$

where g(·) is the link function and η_t is the linear predictor of the model.

Typical choices for η_t = Z_{t-1}'β include

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 X_t \cos(\omega_0 t)$$

or

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-1} X_t + \beta_4 X_{t-1}$$

or

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}^7 + \beta_3 Y_{t-1} \log(X_{t-12}).$$

GLM Equations: From $\int f(y; \theta_t, \phi \mid \mathcal{F}_{t-1})\,dy = 1$ we obtain the relationships

$$\mu_t = E[Y_t \mid \mathcal{F}_{t-1}] = b'(\theta_t), \qquad (5)$$

$$\mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}] = \alpha_t(\phi)\, b''(\theta_t) \equiv \alpha_t(\phi)\, V(\mu_t). \qquad (6)$$

Since Var[Y_t | F_{t-1}] > 0, it follows that b' is monotone. Therefore, equation (5) implies that

$$\theta_t = (b')^{-1}(\mu_t). \qquad (7)$$

We see that θ_t itself is a monotone function of µ_t, and hence it can be used to define a link function. The link function

$$g(\mu_t) = \theta_t(\mu_t) = \eta_t = Z_{t-1}'\beta \qquad (8)$$

is called the canonical link function.

Example: Poisson Time Series.

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\{y_t \log \mu_t - \mu_t - \log y_t!\},$$

with E[Y_t | F_{t-1}] = µ_t, b(θ_t) = µ_t = exp(θ_t), V(µ_t) = µ_t, φ = 1, and ω_t = 1. The canonical link is

$$g(\mu_t) = \theta_t(\mu_t) = \log \mu_t = \eta_t = Z_{t-1}'\beta.$$

As an example, if Z_{t-1} = (1, X_t, Y_{t-1})', then

$$\log \mu_t = \beta_0 + \beta_1 X_t + \beta_2 Y_{t-1},$$

with {X_t} standing for some covariate process, or a possible trend, or a possible seasonal component.
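To make the mechanics concrete, here is a minimal Python sketch (not from the book) that simulates a Poisson time series obeying the canonical log link with Z_{t-1} = (1, X_t, Y_{t-1})'; the coefficient values and the seasonal covariate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 508
beta = np.array([2.0, 0.3, 0.005])          # hypothetical (beta_0, beta_1, beta_2)
X = np.cos(2 * np.pi * np.arange(N) / 52)   # toy seasonal covariate X_t
Y = np.zeros(N)
for t in range(1, N):
    eta = beta[0] + beta[1] * X[t] + beta[2] * Y[t - 1]   # eta_t = Z'_{t-1} beta
    Y[t] = rng.poisson(np.exp(eta))          # canonical link: log mu_t = eta_t
```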

Example: Binary Time Series

{Y_t} takes the values 0, 1. Let π_t = P(Y_t = 1 | F_{t-1}). Then

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\left\{y_t \log\left(\frac{\pi_t}{1 - \pi_t}\right) + \log(1 - \pi_t)\right\}.$$

The canonical link gives the logistic regression model

$$g(\pi_t) = \theta_t(\pi_t) = \log\frac{\pi_t}{1 - \pi_t} = \eta_t = Z_{t-1}'\beta. \qquad (9)$$

Note:

$$\pi_t = F_l(\eta_t). \qquad (10)$$

Partial Likelihood Inference

Given a time series {Y_t}, t = 1, ..., N, conditionally distributed as in (3), the partial likelihood of the observed series is

$$PL(\beta) = \prod_{t=1}^{N} f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}). \qquad (11)$$

Then from (3), the log partial likelihood l(β) is given by

$$l(\beta) = \sum_{t=1}^{N} \log f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \sum_{t=1}^{N} l_t = \sum_{t=1}^{N}\left\{\frac{y_t \theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t, \phi)\right\} = \sum_{t=1}^{N}\left\{\frac{y_t u(Z_{t-1}'\beta) - b(u(Z_{t-1}'\beta))}{\alpha_t(\phi)} + c(y_t, \phi)\right\}. \qquad (12)$$

Write ∇ = (∂/∂β_1, ..., ∂/∂β_p)'. The partial score is a p-dimensional vector,

$$S_N(\beta) \equiv \nabla l(\beta) = \sum_{t=1}^{N} Z_{t-1}\, \frac{\partial \mu_t}{\partial \eta_t}\, \frac{(Y_t - \mu_t(\beta))}{\sigma_t^2(\beta)}, \qquad (13)$$

with σ_t²(β) = Var[Y_t | F_{t-1}]. The partial score vector process {S_t(β)}, t = 1, ..., N, is defined from the partial sums

$$S_t(\beta) = \sum_{s=1}^{t} Z_{s-1}\, \frac{\partial \mu_s}{\partial \eta_s}\, \frac{(Y_s - \mu_s(\beta))}{\sigma_s^2(\beta)}. \qquad (14)$$

The solution of the score equation

$$S_N(\beta) = \nabla \log PL(\beta) = 0 \qquad (15)$$

is denoted by β̂ and is referred to as the maximum partial likelihood estimator (MPLE) of β. The system of equations (15) is nonlinear and is customarily solved by the Fisher scoring method, an iterative algorithm resembling the Newton-Raphson procedure. Before turning to the Fisher scoring algorithm in our context of conditional inference, it is necessary to introduce several important matrices.

An important role in partial likelihood inference is played by the cumulative conditional information matrix G_N(β), defined by a sum of conditional covariance matrices,

$$G_N(\beta) = \sum_{t=1}^{N} \mathrm{Cov}\left[\left. Z_{t-1}\, \frac{\partial \mu_t}{\partial \eta_t}\, \frac{(Y_t - \mu_t(\beta))}{\sigma_t^2(\beta)} \,\right|\, \mathcal{F}_{t-1}\right] = \sum_{t=1}^{N} Z_{t-1} \left(\frac{\partial \mu_t}{\partial \eta_t}\right)^2 \frac{1}{\sigma_t^2(\beta)}\, Z_{t-1}' = Z' W(\beta) Z.$$

We also need H_N(β) ≡ −∇∇'l(β). Define R_N(β) from the difference

$$H_N(\beta) = G_N(\beta) - R_N(\beta).$$

Fact: For canonical links, R_N(β) = 0.

Fisher Scoring: In Newton-Raphson, replace H_N(β) by its conditional expectation:

$$\hat\beta^{(k+1)} = \hat\beta^{(k)} + G_N^{-1}(\hat\beta^{(k)})\, S_N(\hat\beta^{(k)}).$$

Fisher scoring becomes Newton-Raphson for canonical links. Fisher scoring simplifies to iterative reweighted least squares:

$$\hat\beta^{(k+1)} = \left(Z' W(\hat\beta^{(k)}) Z\right)^{-1} Z' W(\hat\beta^{(k)})\, q^{(k)}.$$
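As a concrete illustration, the following sketch implements Fisher scoring for the canonical-link Poisson model of the earlier example (where it coincides with Newton-Raphson); the design matrix Zmat, with rows Z'_{t-1}, and the response y are assumed to come from data such as the simulation above.

```python
import numpy as np

def fisher_scoring_poisson(Zmat, y, n_iter=50, tol=1e-8):
    """Fisher scoring / IRLS for the log-linear Poisson partial likelihood."""
    beta = np.zeros(Zmat.shape[1])
    for _ in range(n_iter):
        mu = np.exp(Zmat @ beta)              # mu_t = exp(Z'_{t-1} beta)
        S = Zmat.T @ (y - mu)                 # partial score S_N(beta), eq. (13)
        G = Zmat.T @ (mu[:, None] * Zmat)     # G_N(beta) = Z' W(beta) Z
        step = np.linalg.solve(G, S)
        beta = beta + step                    # beta^(k+1) = beta^(k) + G^{-1} S
        if np.max(np.abs(step)) < tol:
            break
    return beta, G                            # G^{-1} at the MPLE estimates Cov(beta_hat)

# e.g., with the simulated series above:
# Zmat = np.column_stack([np.ones(N - 1), X[1:], Y[:-1]])
# beta_hat, G_hat = fisher_scoring_poisson(Zmat, Y[1:])
```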

Asymptotic Theory

Assumption A

A1. The true parameter β belongs to an open set B ⊆ R^p.

A2. The covariate vector Z_{t-1} almost surely lies in a nonrandom compact subset Γ of R^p, such that $P[\sum_{t=1}^{N} Z_{t-1} Z_{t-1}' > 0] = 1$. In addition, Z_{t-1}'β lies almost surely in the domain H of the inverse link function h = g^{-1} for all Z_{t-1} ∈ Γ and β ∈ B.

A3. The inverse link function h defined in (A2) is twice continuously differentiable and ∂h(γ)/∂γ ≠ 0.

A4. There is a probability measure ν on R^p such that $\int_{\mathbb{R}^p} z z'\, \nu(dz)$ is positive definite, and such that under (3) and (4), for Borel sets A ⊆ R^p,

$$\frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in A]} \to \nu(A)$$

in probability as N → ∞, at the true value of β.

A4 calls for asymptotically well-behaved covariates:

$$\frac{\sum_{t=1}^{N} f(Z_{t-1})}{N} \to \int_{\mathbb{R}^p} f(z)\, \nu(dz)$$

in probability as N → ∞. Thus, there exists a p × p limiting information matrix per observation, G(β), such that

$$\frac{G_N(\beta)}{N} \to G(\beta) \qquad (16)$$

in probability, as N → ∞.

Slud and Kedem (1994), Fokianos and Kedem (1998):

1. {S_t(β)} relative to {F_t}, t = 1, ..., N, is a martingale.
2. $R_N(\beta)/N \to 0$.
3. $S_N(\beta)/\sqrt{N} \Rightarrow N_p(0, G(\beta))$.
4. $\sqrt{N}(\hat\beta - \beta) \Rightarrow N_p(0, G^{-1}(\beta))$.

A 100(1 − α)% prediction interval (h = g^{-1}):

$$\mu_t(\beta) \in \mu_t(\hat\beta) \pm z_{\alpha/2}\, h'(Z_{t-1}'\hat\beta)\, \sqrt{\frac{Z_{t-1}' G^{-1}(\hat\beta) Z_{t-1}}{N}}.$$

Hypothesis Testing

Let β̃ be the MPLE of β obtained under H_0 with r < p restrictions, H_0: β_1 = ... = β_r = 0, and let β̂ be the unrestricted MPLE. The log partial likelihood ratio statistic

$$\lambda_N = 2\left\{l(\hat\beta) - l(\tilde\beta)\right\} \qquad (17)$$

converges to χ²_r.

More generally: assume C is a known r × p matrix with full rank r, r < p. Under the general linear hypothesis

$$H_0: C\beta = \beta_0 \quad \text{against} \quad H_1: C\beta \neq \beta_0, \qquad (18)$$

$$\{C\hat\beta - \beta_0\}' \{C G^{-1}(\hat\beta) C'\}^{-1} \{C\hat\beta - \beta_0\} \Rightarrow \chi_r^2.$$

Diagnostics

l(y; y): maximum log partial likelihood corresponding to the saturated model. l(µ̂; y): maximum log partial likelihood from the reduced model.

Scaled Deviance:

$$D \equiv 2\{l(y; y) - l(\hat\mu; y)\} \sim \chi_{N-p}^2 \ \text{(approximately)}.$$

$$\mathrm{AIC}(p) = -2 \log PL(\hat\beta) + 2p, \qquad \mathrm{BIC}(p) = -2 \log PL(\hat\beta) + p \log N.$$
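In code, these diagnostics are a few lines; a minimal sketch for the Poisson case (φ = 1), with the function name ours:

```python
import numpy as np
from scipy.special import gammaln

def poisson_diagnostics(y, mu_hat, p):
    """Scaled deviance, AIC, and BIC for a fitted Poisson model."""
    loglik = np.sum(y * np.log(mu_hat) - mu_hat - gammaln(y + 1))
    yy = np.maximum(y, 1.0)                  # avoid log(0); y*log(y) -> 0 at y = 0
    sat = np.sum(y * np.log(yy) - y - gammaln(y + 1))   # saturated: mu_t = y_t
    D = 2 * (sat - loglik)                   # scaled deviance
    aic = -2 * loglik + 2 * p
    bic = -2 * loglik + p * np.log(len(y))
    return D, aic, bic
```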

Analysis of Mortality Count Data in LA

Weekly data from Los Angeles County during a period of 10 years, from January 1, 1970, to December 31, 1979: weekly sampled filtered time series, N = 508.

Response:  Y = total mortality (filtered)
Weather:   T = temperature, RH = relative humidity
Pollution: CO = carbon monoxide, SO2 = sulfur dioxide, NO2 = nitrogen dioxide, HC = hydrocarbons, OZ = ozone, KM = particulates

[Figure: Weekly data of filtered total mortality and temperature, and log-filtered CO, with the corresponding estimated autocorrelation functions. N = 508.]

Covariates and η_t used in Poisson regression (S = SO2, N = NO2). To recover η_t, insert the β's; for Model 2, η_t = β_0 + β_1 Y_{t-1} + β_2 Y_{t-2}, etc.

Model 0: T_t + RH_t + CO_t + S_t + N_t + HC_t + OZ_t + KM_t
Model 1: Y_{t-1}
Model 2: Y_{t-1} + Y_{t-2}
Model 3: Y_{t-1} + Y_{t-2} + T_{t-1}
Model 4: Y_{t-1} + Y_{t-2} + T_{t-1} + log(CO_t)
Model 5: Y_{t-1} + Y_{t-2} + T_{t-1} + T_{t-2} + log(CO_t)
Model 6: Y_{t-1} + Y_{t-2} + T_t + T_{t-1} + log(CO_t)

Comparison of 7 Poisson regression models, N = 508.

Model  p  D       df   AIC     BIC
0      9  315.69  499  333.69  371.76
1      2  276.07  506  280.07  288.53
2      3  222.23  505  228.23  240.92
3      4  203.52  504  211.52  228.44
4      5  174.55  503  184.55  205.71
5      6  174.53  502  186.53  211.91
6      6  171.41  502  183.41  208.79

Choose Model 4:

$$\log(\hat\mu_t) = \hat\beta_0 + \hat\beta_1 Y_{t-1} + \hat\beta_2 Y_{t-2} + \hat\beta_3 T_{t-1} + \hat\beta_4 \log(\mathrm{CO}_t).$$
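For reference, a hedged sketch of how a model like Model 4 could be fit with statsmodels' GLM routine; the arrays mort, temp, and co below are random stand-ins for the actual filtered series, which are not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
mort = rng.poisson(180, 508).astype(float)   # stand-in for filtered mortality Y_t
temp = rng.normal(70, 8, 508)                # stand-in for temperature T_t
co = np.exp(rng.normal(0.5, 0.3, 508))       # stand-in for carbon monoxide CO_t

# design: 1, Y_{t-1}, Y_{t-2}, T_{t-1}, log(CO_t)
Z = sm.add_constant(np.column_stack(
        [mort[1:-1], mort[:-2], temp[1:-1], np.log(co[2:])]))
fit = sm.GLM(mort[2:], Z, family=sm.families.Poisson()).fit()
print(fit.deviance, fit.aic)                 # compare models as in the table above
```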

[Figure: Poisson regression of filtered LA mortality, Mrt(t) on Mrt(t-1), Mrt(t-2), Tmp(t-1), log(Crb(t)): observed and predicted weekly filtered total mortality from Model 4.]

[Figure: Comparison of residuals. Working residuals from Models 0 and 4, and their respective estimated autocorrelations.]

Part II: Binary Time Series

Example of a binary time series (bts): two categories obtained by clipping,

$$Y_t \equiv I_{[X_t \in C]} = \begin{cases} 1, & \text{if } X_t \in C \\ 0, & \text{if } X_t \notin C \end{cases} \qquad (19)$$

$$Y_t \equiv I_{[X_t \geq r]} = \begin{cases} 1, & \text{if } X_t \geq r \\ 0, & \text{if } X_t < r \end{cases} \qquad (20)$$

Other examples: at time t = 1, 2, ..., (Rain, No Rain), (S&P Up, S&P Down), etc.
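Clipping is one line in code; a toy sketch of (20) with an arbitrary series and threshold:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=300).cumsum() / 10 + rng.normal(size=300)  # toy series X_t
r = 0.0
Y = (X >= r).astype(int)     # Y_t = I[X_t >= r], a binary time series
```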

{Y_t} takes the values 0 or 1, t = 1, 2, 3, ..., and {Z_{t-1}} is p-dimensional covariate stochastic data. Against the backdrop of the general framework presented above, we wish to relate

$$\mu_t(\beta) = \pi_t(\beta) = P_\beta(Y_t = 1 \mid \mathcal{F}_{t-1}) \qquad (21)$$

to the covariates. For this we need good links!

The standard logistic distribution is

$$F_l(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}, \qquad -\infty < x < \infty.$$

Then F_l^{-1}(x) = log(x/(1 − x)) is the natural link under some conditions.

Fact: For any bts, there are θ_j such that

$$\log\left\{\frac{P(Y_t = 1 \mid Y_{t-1} = y_{t-1}, \ldots, Y_1 = y_1)}{P(Y_t = 0 \mid Y_{t-1} = y_{t-1}, \ldots, Y_1 = y_1)}\right\} = \theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p},$$

or

$$\pi_t(\beta) = \frac{1}{1 + \exp[-(\theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p})]}.$$

Fact: Consider an AR(p) time series

$$X_t = \gamma_0 + \gamma_1 X_{t-1} + \cdots + \gamma_p X_{t-p} + \lambda\epsilon_t,$$

where the ε_t are i.i.d. logistically distributed. Define Y_t = I_{[X_t ≥ r]}. Then

$$\pi_t(\beta) = \frac{1}{1 + \exp[-(\gamma_0 - r + \gamma_1 X_{t-1} + \cdots + \gamma_p X_{t-p})/\lambda]}.$$

This motivates logistic regression:

$$\pi_t(\beta) \equiv P_\beta(Y_t = 1 \mid \mathcal{F}_{t-1}) = F_l(\beta' Z_{t-1}) = \frac{1}{1 + \exp[-\beta' Z_{t-1}]},$$

or equivalently, the link function is

$$\mathrm{logit}(\pi_t(\beta)) \equiv \log\left\{\frac{\pi_t(\beta)}{1 - \pi_t(\beta)}\right\} = \beta' Z_{t-1}.$$

This is the canonical link.

Link functions for binary time series:

logit:      β'Z_{t-1} = log{π_t(β)/(1 − π_t(β))}
probit:     β'Z_{t-1} = Φ^{-1}{π_t(β)}
log-log:    β'Z_{t-1} = −log{−log(π_t(β))}
C-log-log:  β'Z_{t-1} = log{−log(1 − π_t(β))}

Note: Here all the inverse links are cdf's. In what follows we always assume the inverse link is a differentiable cdf F(x).

The partial likelihood of β takes on the simple product form

$$PL(\beta) = \prod_{t=1}^{N} [\pi_t(\beta)]^{y_t} [1 - \pi_t(\beta)]^{1-y_t} = \prod_{t=1}^{N} [F(\beta' Z_{t-1})]^{y_t} [1 - F(\beta' Z_{t-1})]^{1-y_t}.$$

We have under Assumption A:

$$\sqrt{N}(\hat\beta - \beta) \Rightarrow N_p(0, G^{-1}(\beta)).$$

For the canonical link (logistic regression),

$$\frac{G_N(\beta)}{N} \to G(\beta) = \int_{\mathbb{R}^p} \frac{e^{\beta' z}}{(1 + e^{\beta' z})^2}\, z z'\, \nu(dz).$$

Illustration of asymptotic normality. Let

$$\mathrm{logit}(\pi_t(\beta)) = \beta_1 + \beta_2 \cos\left(\frac{2\pi t}{12}\right) + \beta_3 Y_{t-1},$$

so that Z_{t-1} = (1, cos(2πt/12), Y_{t-1})'.

[Figure: Logistic autoregression with a sinusoidal component. (a) Y_t. (b) π_t(β), where logit(π_t(β)) = 0.3 + 0.75 cos(2πt/12) + y_{t-1}.]

[Figure: Histograms of the normalized MPLE's of b1, b2, b3, where β = (0.3, 0.75, 1)', N = 200. Each histogram consists of 1000 estimates.]
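The experiment behind these histograms can be reproduced in outline with the following sketch, which simulates the logistic autoregression with β = (0.3, 0.75, 1)' and computes the MPLE by an ordinary logistic fit on the lagged design (statsmodels is assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
beta, N, reps = np.array([0.3, 0.75, 1.0]), 200, 1000
est = np.zeros((reps, 3))
t = np.arange(1, N)
for r in range(reps):
    y = np.zeros(N, dtype=int)
    for s in range(1, N):
        eta = beta[0] + beta[1] * np.cos(2 * np.pi * s / 12) + beta[2] * y[s - 1]
        y[s] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
    Z = np.column_stack([np.ones(N - 1), np.cos(2 * np.pi * t / 12), y[:-1]])
    est[r] = sm.Logit(y[1:], Z).fit(disp=0).params   # MPLE via logistic fit
# histograms of sqrt(N) * (est - beta) should look approximately normal
```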

Goodness of Fit

Let C_1, ..., C_k be a partition of R^p. For j = 1, ..., k, define

$$M_j \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]}\, Y_t \quad \text{and} \quad E_j(\beta) \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]}\, \pi_t(\beta).$$

Put M ≡ (M_1, ..., M_k)' and E(β) ≡ (E_1(β), ..., E_k(β))'.

Slud and Kedem (1994), Kedem and Fokianos (2002): With

$$\sigma_j^2 \equiv \int_{C_j} F(\beta' z)(1 - F(\beta' z))\, \nu(dz),$$

$$\chi^2(\beta) \equiv \frac{1}{N} \sum_{j=1}^{k} (M_j - E_j(\beta))^2 / \sigma_j^2 \Rightarrow \chi_k^2.$$

In practice, one needs to adjust the degrees of freedom when replacing β by β̂.

When β̂ and (M − E(β)) are obtained from the same data set,

$$E(\chi^2(\hat\beta)) \approx k - \sum_{j=1}^{k} (B' G(\beta) B)_{jj} / \sigma_j^2.$$

When β̂ and (M − E(β)) are obtained from independent data sets,

$$E(\chi^2(\hat\beta)) \approx k + \sum_{j=1}^{k} (B' G(\beta) B)_{jj} / \sigma_j^2.$$

Illustration of the distribution of χ²(β) using Q-Q plots. Consider the previous logistic regression model with a periodic component, and use the partition

C_1 = {Z : Z_1 = 1, −1 ≤ Z_2 < 0, Z_3 = 0}
C_2 = {Z : Z_1 = 1, −1 ≤ Z_2 < 0, Z_3 = 1}
C_3 = {Z : Z_1 = 1, 0 ≤ Z_2 ≤ 1, Z_3 = 0}
C_4 = {Z : Z_1 = 1, 0 ≤ Z_2 ≤ 1, Z_3 = 1}

Then k = 4, M_j is the sum of those Y_t's for which Z_{t-1} is in C_j, j = 1, 2, 3, 4, and the E_j(β) are obtained similarly. Estimate σ_j² by

$$\hat\sigma_j^2 = \frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]}\, \pi_t(\beta)(1 - \pi_t(\beta)).$$

The χ²_4 approximation is quite good.

[Figure: Q-Q plots of the χ² statistic against true χ²_4 quantiles, for N = 200, β = (0.3, 0.75, 1)' and N = 400, β = (0.3, 1, 2)'.]
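A sketch of the goodness-of-fit computation for this partition, assuming a design Z and fitted probabilities pi_hat from a fit such as the one above; cell membership is read off the sign of Z_2 and the value of Z_3:

```python
import numpy as np

def chi_sq_stat(Z, y, pi_hat):
    """Chi-square goodness-of-fit statistic over the four cells C_1..C_4."""
    cells = 2 * (Z[:, 1] >= 0).astype(int) + Z[:, 2].astype(int)
    N = len(y)
    chi2 = 0.0
    for j in range(4):
        idx = cells == j
        M_j = y[idx].sum()                                   # observed count in C_j
        E_j = pi_hat[idx].sum()                              # expected count in C_j
        sigma2_j = (pi_hat[idx] * (1 - pi_hat[idx])).sum() / N
        chi2 += (M_j - E_j) ** 2 / sigma2_j
    return chi2 / N       # approximately chi-square, with adjusted df for beta_hat

# pi_hat = sm.Logit(y[1:], Z).fit(disp=0).predict(Z)
# chi_sq_stat(Z, y[1:], pi_hat)
```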

Example: Modeling successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming. Code each eruption as

Y_t = 1 if the duration is greater than 3 minutes, 0 if the duration is less than 3 minutes:

101110110101011010110101010
111110101010101010101010101
011111010101011010111011111
011101010101010101010101010
101101010101011101111111011
111011111110101010101011111
101010101110101011010111101
010101110101011011011101010
101101111111010101111011011
101101011101011111011101010
110101111111101010101010101
10

N = 299

Candidate models η_t = β'Z_{t-1} for Old Faithful:

1: β_0 + β_1 Y_{t-1}
2: β_0 + β_1 Y_{t-1} + β_2 Y_{t-2}
3: β_0 + β_1 Y_{t-1} + β_2 Y_{t-2} + β_3 Y_{t-3}
4: β_0 + β_1 Y_{t-1} + β_2 Y_{t-2} + β_3 Y_{t-3} + β_4 Y_{t-4}

Comparison of models using the logistic link (Model 2 is best; probit regression gives similar results):

Model  p  X²      D       AIC     BIC
1      2  165.00  227.38  231.38  238.46
2      3  165.00  215.53  221.53  232.15
3      4  165.00  215.08  223.08  237.24
4      5  164.97  213.99  223.99  241.69

$$\hat\pi_t = \pi_t(\hat\beta) = \frac{1}{1 + \exp\{-(\hat\beta_0 + \hat\beta_1 Y_{t-1} + \hat\beta_2 Y_{t-2})\}}.$$
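A hedged sketch of fitting Model 2 to the 0/1 series transcribed above with statsmodels; the AIC/BIC conventions in statsmodels may differ slightly from the table's, so treat the output as indicative:

```python
import numpy as np
import statsmodels.api as sm

s = ("101110110101011010110101010" "111110101010101010101010101"
     "011111010101011010111011111" "011101010101010101010101010"
     "101101010101011101111111011" "111011111110101010101011111"
     "101010101110101011010111101" "010101110101011011011101010"
     "101101111111010101111011011" "101101011101011111011101010"
     "110101111111101010101010101" "10")
y = np.array([int(c) for c in s])                            # N = 299
Z = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])  # 1, Y_{t-1}, Y_{t-2}
fit = sm.Logit(y[2:], Z).fit(disp=0)
pi_hat = fit.predict(Z)              # pi_t(beta_hat), as in the formula above
print(fit.params, fit.aic, fit.bic)
```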

Part III: Categorical Time Series

EEG sleep state, classified or quantized into four categories as follows: 1: quiet sleep, 2: indeterminate sleep, 3: active sleep, 4: awake. Here the sleep state categories or levels are assigned integer values. This is an example of a categorical time series {Y_t}, t = 1, ..., N, taking the values 1, ..., 4. This is an arbitrary integer assignment. Why not the values 7.1, 15.8, 19.24, 71.17? Any other scale?

Assume m categories. The t-th observation of any categorical time series, regardless of the measurement scale, can be represented by the vector Y_t = (Y_{t1}, ..., Y_{tq})' of length q = m − 1, with elements

$$Y_{tj} = \begin{cases} 1, & \text{if the } j\text{th category is observed at time } t \\ 0, & \text{otherwise} \end{cases}$$

for t = 1, ..., N and j = 1, ..., q. A binary time series is a special case with m = 2, q = 1.

Write, for j = 1, ..., q,

$$\pi_{tj} = E[Y_{tj} \mid \mathcal{F}_{t-1}] = P(Y_{tj} = 1 \mid \mathcal{F}_{t-1}),$$

and define π_t = (π_{t1}, ..., π_{tq})'. Then

$$Y_{tm} = 1 - \sum_{j=1}^{q} Y_{tj}, \qquad \pi_{tm} = 1 - \sum_{j=1}^{q} \pi_{tj}.$$

Let {Z_{t-1}}, t = 1, ..., N, be a p × q matrix that represents a covariate process. Y_{tj} corresponds to a vector of length p of random time-dependent covariates, which forms the jth column of Z_{t-1}.

Assume the general regression model

$$\pi_t(\beta) = \begin{pmatrix} \pi_{t1}(\beta) \\ \pi_{t2}(\beta) \\ \vdots \\ \pi_{tq}(\beta) \end{pmatrix} = \begin{pmatrix} h_1(Z_{t-1}'\beta) \\ h_2(Z_{t-1}'\beta) \\ \vdots \\ h_q(Z_{t-1}'\beta) \end{pmatrix} = h(Z_{t-1}'\beta). \qquad (*)$$

The inverse link function h is defined on R^q and takes values in R^q. We shall only examine nominal and ordinal time series.

Nominal Time Series. Nominal categorical variables lack natural ordering. A model for nominal time series is the multinomial logit model,

$$\pi_{tj}(\beta) = \frac{\exp(\beta_j' z_{t-1})}{1 + \sum_{l=1}^{q} \exp(\beta_l' z_{t-1})}, \qquad j = 1, \ldots, q.$$

Note that

$$\pi_{tm}(\beta) = \frac{1}{1 + \sum_{l=1}^{q} \exp(\beta_l' z_{t-1})}.$$

The multinomial logit model is a special case of (*). Indeed, with z_{t-1} of dimension d, define β to be the qd-dimensional vector β = (β_1', ..., β_q')' (so p = qd), and Z_{t-1} the qd × q matrix

$$Z_{t-1} = \begin{pmatrix} z_{t-1} & 0 & \cdots & 0 \\ 0 & z_{t-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_{t-1} \end{pmatrix}.$$

Let h stand for the vector-valued function whose components h_j, j = 1, ..., q, are given by

$$\pi_{tj}(\beta) = h_j(\eta_t) = \frac{\exp(\eta_{tj})}{1 + \sum_{l=1}^{q} \exp(\eta_{tl})}, \qquad j = 1, \ldots, q,$$

with η_t = (η_{t1}, ..., η_{tq})' = Z_{t-1}'β.
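For orientation, a toy sketch of a multinomial logit fit with statsmodels' MNLogit, using one-hot indicators of the lagged category as z_{t-1}; the data here are random placeholders, not a real categorical series:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
y = rng.integers(0, 3, size=400)          # toy nominal series with m = 3 levels
prev = np.eye(3)[y[:-1]][:, :2]           # indicators of the lagged category (q = 2)
Z = sm.add_constant(prev)
fit = sm.MNLogit(y[1:], Z).fit(disp=0)    # one coefficient vector beta_j per j
print(fit.params)
```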

Ordinal Time Series. Measured on a scale endowed with a natural ordering. A model for ordinal time series needs a latent or auxiliary variable. Put

$$X_t = -\gamma' z_{t-1} + e_t,$$

where:
1. the e_t are i.i.d. with cdf F;
2. γ is a d-dimensional vector of parameters;
3. z_{t-1} is a d-dimensional covariate vector.

Define a categorical time series {Y_t} from the levels of {X_t}:

$$Y_t = j \iff Y_{tj} = 1 \iff \theta_{j-1} \leq X_t < \theta_j,$$

with −∞ = θ_0 < θ_1 < ... < θ_m = ∞.

Then

$$\pi_{tj} = P(\theta_{j-1} \leq X_t < \theta_j \mid \mathcal{F}_{t-1}) = F(\theta_j + \gamma' z_{t-1}) - F(\theta_{j-1} + \gamma' z_{t-1}),$$

for j = 1, ..., m. There are many possibilities, depending on F.

Special case: the Proportional Odds Model, with

$$F(x) = F_l(x) = \frac{1}{1 + \exp(-x)}.$$

Then we have, for j = 1, ..., q,

$$\log\left\{\frac{P[Y_t \leq j \mid \mathcal{F}_{t-1}]}{P[Y_t > j \mid \mathcal{F}_{t-1}]}\right\} = \theta_j + \gamma' z_{t-1}.$$

The proportional odds model has the form (*) with p = q + d: β = (θ_1, ..., θ_q, γ')' and Z_{t-1} the (q + d) × q matrix

$$Z_{t-1} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ z_{t-1} & z_{t-1} & \cdots & z_{t-1} \end{pmatrix}.$$

Now set h = (h_1, ..., h_q)', and let

$$\pi_{t1}(\beta) = h_1(\eta_t) = F(\eta_{t1}), \qquad \pi_{tj}(\beta) = h_j(\eta_t) = F(\eta_{tj}) - F(\eta_{t(j-1)}), \quad j = 2, \ldots, q,$$

where η_t = (η_{t1}, ..., η_{tq})' = Z_{t-1}'β.
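A hedged sketch of fitting a proportional odds model with statsmodels' OrderedModel (distr='logit'); the latent-variable simulation mirrors the construction above, with γ, the cutpoints, and the covariate chosen purely for illustration:

```python
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(5)
z = rng.normal(size=(500, 1))                       # toy d = 1 covariate z_{t-1}
latent = -1.5 * z[:, 0] + rng.logistic(size=500)    # X_t = -gamma' z_{t-1} + e_t
y = np.digitize(latent, bins=[-1.0, 0.0, 1.0])      # m = 4 ordered categories
fit = OrderedModel(y, z, distr='logit').fit(method='bfgs', disp=0)
print(fit.params)                                   # gamma and the q = 3 cutpoints
```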

Partial likelihood estimation. Introduce the multinomial probability

$$f(y_t; \beta \mid \mathcal{F}_{t-1}) = \prod_{j=1}^{m} \pi_{tj}(\beta)^{y_{tj}}.$$

The partial likelihood is a product of the multinomial probabilities,

$$PL(\beta) = \prod_{t=1}^{N} f(y_t; \beta \mid \mathcal{F}_{t-1}) = \prod_{t=1}^{N} \prod_{j=1}^{m} \pi_{tj}^{y_{tj}}(\beta),$$

so that the partial log-likelihood is given by

$$l(\beta) \equiv \log PL(\beta) = \sum_{t=1}^{N} \sum_{j=1}^{m} y_{tj} \log \pi_{tj}(\beta).$$

Under a modified Assumption A:

$$\frac{G_N(\beta)}{N} \to \int_{\mathbb{R}^{p \times q}} Z\, U(\beta)\, \Sigma(\beta)\, U'(\beta)\, Z'\, \nu(dz) = G(\beta),$$

$$\sqrt{N}(\hat\beta - \beta) \Rightarrow N_p(0, G^{-1}(\beta)),$$

$$\sqrt{N}\left(\pi_t(\hat\beta) - \pi_t(\beta)\right) \Rightarrow N_q(0,\, Z_{t-1}' D_t(\beta) G^{-1}(\beta) D_t'(\beta) Z_{t-1}).$$

Example: Sleep State. Covariates: heart rate, temperature. N = 700. Categories: 1: quiet sleep, 2: indeterminate sleep, 3: active sleep, 4: awake. Ordinal categorical time series with ordering 4 < 1 < 2 < 3.

[Figure: Time series plots of sleep state, heart rate, and temperature.]

Fit proportional odds models.

Model  Covariates                        AIC
1      1 + Y_{t-1}                       401.56
2      1 + Y_{t-1} + log R_t             401.51
3      1 + Y_{t-1} + log R_t + T_t       403.32
4      1 + Y_{t-1} + T_t                 403.52
5      1 + Y_{t-1} + Y_{t-2} + log R_t   407.28
6      1 + Y_{t-1} + log R_{t-1}         403.40
7      1 + log R_t                       1692.31

Model 2: 1 + Y_{t-1} + log R_t.

$$\log\left[\frac{P(Y_t \leq 4 \mid \mathcal{F}_{t-1})}{P(Y_t > 4 \mid \mathcal{F}_{t-1})}\right] = \theta_1 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,$$

$$\log\left[\frac{P(Y_t \leq 1 \mid \mathcal{F}_{t-1})}{P(Y_t > 1 \mid \mathcal{F}_{t-1})}\right] = \theta_2 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,$$

$$\log\left[\frac{P(Y_t \leq 2 \mid \mathcal{F}_{t-1})}{P(Y_t > 2 \mid \mathcal{F}_{t-1})}\right] = \theta_3 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t.$$

Estimates: θ̂_1 = −30.352, θ̂_2 = −23.493, θ̂_3 = −20.349, γ̂_1 = 16.718, γ̂_2 = 9.533, γ̂_3 = 4.755, γ̂_4 = 3.556. The corresponding standard errors are 12.051, 12.012, 11.985, 0.872, 0.630, 0.501, and 2.470.

[Figure: (a) Observed versus (b) predicted sleep states for Model 2 of the table above, applied to the testing data set. N = 322.]