GAMs for large datasets and load forecasting
Simon Wood, University of Bath, U.K., in collaboration with Yannig Goude, EDF
French grid load
[Figure: half-hourly French grid load (GW) over roughly 150 days, with a zoomed view of days 174-182]
Half hourly load data from 1st Sept 2002. Pierrot and Goude (2011) successfully predicted one day ahead using 48 separate models, one for each half hour period of the day.
Why 48 models?
The 48 separate models each have a form something like
MW_t = f(MW_{t-48}) + Σ_j g_j(x_{jt}) + ε_t
where f and the g_j are smooth functions to be estimated, and the x_j are other covariates.
48 separate models have some disadvantages:
1. Continuity of effects across the day is not enforced.
2. Information is wasted, as the correlation between neighbouring time points is not utilised.
3. Fitting 48 models, each to 1/48 of the data, promotes estimate instability and creates a huge model checking task each time the model is updated (every day?).
But existing methods for fitting such smooth additive models could not cope with fitting one model to all the data.
Smooth additive models
The basic model is
y_i = Σ_j f_j(x_{ji}) + ε_i
where the f_j are smooth functions. Need to estimate the f_j (including how smooth).
Represent each f_j using a spline basis expansion
f_j(x_j) = Σ_k b_{jk}(x_j) β_{jk}
where the b_{jk} are basis functions and the β_{jk} are coefficients (maybe 10-100 of each). So the model becomes
y_i = Σ_j Σ_k b_{jk}(x_{ji}) β_{jk} + ε_i = X_i β + ε_i
... a linear model. But it is much too flexible.
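As a concrete illustration of the basis expansion, here is a minimal NumPy/SciPy sketch (not from the talk) that builds the model matrix for one smooth: each column is one basis function b_k evaluated at the data, so f_j is represented by a linear combination of columns. The helper name make_basis and the choice of 10 cubic B-splines are illustrative assumptions.

```python
# Sketch: represent one smooth f_j(x) by a B-spline basis expansion,
# so the additive model becomes an ordinary linear model y = X @ beta.
# make_basis and its defaults are illustrative, not from the talk.
import numpy as np
from scipy.interpolate import BSpline

def make_basis(x, n_basis=10, degree=3):
    """Evaluate n_basis cubic B-spline basis functions at the points x."""
    # Clamped knot vector: repeated end knots plus evenly spaced interior knots.
    n_interior = n_basis - degree - 1
    xl, xu = x.min(), x.max()
    interior = np.linspace(xl, xu, n_interior + 2)[1:-1]
    knots = np.concatenate([[xl] * (degree + 1), interior, [xu] * (degree + 1)])
    # Column k holds b_k(x); together the columns form the model matrix for f_j.
    return BSpline.design_matrix(x, knots, degree).toarray()

x = np.linspace(0.0, 1.0, 200)
X = make_basis(x)
print(X.shape)   # (200, 10): one row per data point, one column per basis function
# With a clamped knot vector the basis is a partition of unity:
# every row of X sums to one.
print(X.sum(axis=1).round(6).min(), X.sum(axis=1).round(6).max())
```

Stacking one such block per covariate (plus an intercept) gives the full model matrix X of the linear model above.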
Penalizing overfit
To avoid overfitting, we penalize function wiggliness (lack of smoothness) while fitting. In particular, let
J_j(f_j) = β^T S_j β
measure the wiggliness of f_j. Estimate β to minimize
Σ_i (y_i - X_i β)^2 + Σ_j λ_j β^T S_j β
where the λ_j are smoothing parameters. High λ_j ⇒ smooth f_j; low λ_j ⇒ wiggly f_j.
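The penalized criterion above has a closed-form minimizer: setting the gradient to zero gives (X^T X + Σ_j λ_j S_j) β = X^T y. A minimal sketch with a single smooth and a second-difference penalty (the random model matrix is a stand-in, not data from the talk):

```python
# Sketch: penalized least squares. The estimate solves
# (X^T X + lambda * S) beta = X^T y, with lambda controlling smoothness.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 12
X = rng.standard_normal((n, K))            # stand-in model matrix
beta_true = np.sin(np.arange(K))
y = X @ beta_true + rng.standard_normal(n) * 0.1

# Second-difference penalty matrix S = D^T D (as for P-splines).
D = np.diff(np.eye(K), n=2, axis=0)
S = D.T @ D

def fit(lam):
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# Larger lambda shrinks the fit toward coefficient sequences with zero
# second differences, so the measured wiggliness beta^T S beta falls.
rough = lambda b: float(b @ S @ b)
print(rough(fit(0.1)) > rough(fit(100.0)))   # True
```

In the full model each smooth gets its own λ_j and S_j, embedded as blocks of one overall penalty.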
Example basis-penalty: P-splines
Eilers and Marx have popularized the use of B-spline bases with discrete penalties. If b_k(x) is a B-spline and β_k an unknown coefficient, then
f(x) = Σ_{k=1}^K β_k b_k(x).
Wiggliness can be penalized by e.g.
P = Σ_{k=2}^{K-1} (β_{k-1} - 2β_k + β_{k+1})^2 = β^T S β.
[Figure: the B-spline basis functions b_k(x) and a resulting smooth curve f(x)]
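The discrete penalty is easy to verify numerically: the sum of squared second differences of the coefficients equals the quadratic form β^T S β with S = D^T D, where D is the second-difference matrix. A small sketch (coefficient vectors are invented examples):

```python
# Sketch: the P-spline penalty is a sum of squared second differences
# of the coefficients, expressible as beta^T S beta with S = D^T D.
import numpy as np

K = 8
D = np.diff(np.eye(K), n=2, axis=0)   # (K-2) x K second-difference matrix
S = D.T @ D

beta_linear = np.arange(K, dtype=float)            # linear in k
beta_wiggly = np.array([0., 1, 0, 1, 0, 1, 0, 1])  # alternating

def penalty(beta):
    # Direct sum of (beta[k-1] - 2*beta[k] + beta[k+1])^2 ...
    direct = float(np.sum((beta[:-2] - 2 * beta[1:-1] + beta[2:]) ** 2))
    # ... agrees with the quadratic form beta^T S beta.
    assert np.isclose(direct, beta @ S @ beta)
    return direct

print(penalty(beta_linear))   # 0.0 — straight-line coefficients are unpenalized
print(penalty(beta_wiggly))   # 24.0 — alternating coefficients are heavily penalized
```

Note that constants and straight lines lie in the penalty null space, so heavy smoothing shrinks toward a line, not toward zero.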
Example: varying the P-spline penalty
[Figure: the same B-spline basis fitted to data with the penalty Σ_i (β_{i-1} - 2β_i + β_{i+1})^2 taking the values 27, 16.4 and .4, giving progressively smoother fits]
Choosing how much to smooth
[Figure: fits with λ too high, λ about right, and λ too low]
Choose the λ_j to optimize the model's ability to predict data it wasn't fitted to. Generalized cross validation chooses the λ_j to minimize
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2
where F = (X^T X + Σ_j λ_j S_j)^{-1} X^T X and tr(F) = effective degrees of freedom.
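A minimal sketch of GCV-based smoothing parameter selection by grid search, on simulated data (the Gaussian-bump basis and the grid of λ values are illustrative assumptions; the usual GCV numerator carries an extra factor of n, which does not move the minimum):

```python
# Sketch: choose lambda by minimizing the GCV score
# V(lambda) = ||y - X beta_hat||^2 / [n - tr(F)]^2,
# F = (X^T X + lambda * S)^{-1} X^T X, tr(F) = effective degrees of freedom.
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 20
x = np.linspace(0.0, 1.0, n)
centers = np.linspace(0.0, 1.0, K)
# Gaussian-bump model matrix as a stand-in for a spline basis.
X = np.exp(-(x[:, None] - centers) ** 2 / (2 * 0.05 ** 2))
y = np.sin(2 * np.pi * x) + rng.standard_normal(n) * 0.2

D = np.diff(np.eye(K), n=2, axis=0)
S = D.T @ D

def gcv(lam):
    A = X.T @ X + lam * S
    beta = np.linalg.solve(A, X.T @ y)
    F = np.linalg.solve(A, X.T @ X)
    rss = float(np.sum((y - X @ beta) ** 2))
    edf = float(np.trace(F))
    return rss / (n - edf) ** 2

# Crude grid search over orders of magnitude; real software optimizes
# log(lambda) with derivative-based methods.
lams = 10.0 ** np.arange(-6, 7)
best = min(lams, key=gcv)
print(best)
```

The same score generalizes to several λ_j by replacing lam * S with Σ_j λ_j S_j and optimizing jointly.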
The main problem
We estimate β as the minimizer of
||y - Xβ||^2 + Σ_j λ_j β^T S_j β
and the λ_j to minimize
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2.
If X is too large, then we can easily run out of computer memory when trying to fit the model. A simple approach will let us fit the model without forming X whole.
A smaller equivalent fitting problem
Let X be n × p. Consider the QR decomposition X = QR, where R is p × p and Q has orthogonal columns. Let f = Q^T y and γ = ||y||^2 - ||f||^2. Then
||y - Xβ||^2 = ||f - Rβ||^2 + γ
and
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2 = (||f - Rβ̂||^2 + γ) / [n - tr(F)]^2
where F = (R^T R + Σ_j λ_j S_j)^{-1} R^T R. So once we have R and f we can estimate β and λ without X...
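The identity ||y - Xβ||^2 = ||f - Rβ||^2 + γ holds for every β, because Xβ lies in the column space of Q. A quick numerical check (random matrices, purely illustrative):

```python
# Sketch: the n-dependent fitting problem reduces to a p x p one.
# With X = QR, f = Q^T y and gamma = ||y||^2 - ||f||^2, we have
# ||y - X beta||^2 = ||f - R beta||^2 + gamma for every beta.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

Q, R = np.linalg.qr(X)          # "thin" QR: Q is n x p, R is p x p
f = Q.T @ y
gamma = y @ y - f @ f

beta = rng.standard_normal(p)   # any coefficient vector at all
lhs = np.sum((y - X @ beta) ** 2)
rhs = np.sum((f - R @ beta) ** 2) + gamma
print(np.allclose(lhs, rhs))    # True: beta and lambda can be estimated from R and f
```

So the n × p problem collapses to a p × p one, plus the scalar γ.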
QR updating for R and f
Let X = [X_0; X_1] and y = [y_0; y_1] (stacked row blocks).
Form X_0 = Q_0 R_0 and then [R_0; X_1] = Q_1 R.
Then X = QR where Q = diag(Q_0, I) Q_1.
Also f = Q^T y = Q_1^T [Q_0^T y_0; y_1].
Applying these results recursively, R and f can be accumulated from sub-blocks of X without forming X whole.
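The recursion above can be sketched directly: at each step, stack the current R over the next data block and re-decompose, rotating f by the same orthogonal factor. The block sizes and random data are illustrative.

```python
# Sketch of the QR update: accumulate R and f = Q^T y from row blocks
# of X without ever forming X whole.
import numpy as np

rng = np.random.default_rng(3)
p = 5
blocks = [(rng.standard_normal((50, p)), rng.standard_normal(50))
          for _ in range(4)]                       # (X_j, y_j) sub-blocks

R = np.zeros((0, p))
f = np.zeros(0)
for Xj, yj in blocks:
    # Stack the current R over the new block and re-decompose:
    # [R; X_j] = Q_1 R_new; f updates by the same orthogonal rotation.
    Q1, R = np.linalg.qr(np.vstack([R, Xj]))
    f = Q1.T @ np.concatenate([f, yj])

# Compare with a one-shot decomposition of the full X.
X = np.vstack([b[0] for b in blocks])
y = np.concatenate([b[1] for b in blocks])
Qfull, Rfull = np.linalg.qr(X)
# R is unique only up to row signs, so compare sign-invariant quantities.
print(np.allclose(R.T @ R, Rfull.T @ Rfull))   # both equal X^T X
print(np.allclose(R.T @ f, X.T @ y))           # accumulated f is consistent
```

Only the p × p matrix R and the p-vector f are ever held in memory, however large n grows.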
X^T X updating and Choleski
R is the Choleski factor of X^T X. Partitioning X row-wise into sub-matrices X_1, X_2, ..., we have
X^T X = Σ_j X_j^T X_j
which can be used to accumulate X^T X without needing to form X whole. At the same time accumulate X^T y = Σ_j X_j^T y_j.
The Choleski decomposition gives R^T R = X^T X, and f = R^{-T} X^T y.
This is twice as fast as the QR approach, but less numerically stable.
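A sketch of the Choleski alternative: only the p × p running sum X^T X and the p-vector X^T y are accumulated, then factored once at the end. Random blocks are illustrative stand-ins.

```python
# Sketch: accumulate X^T X and X^T y over row blocks, then obtain
# R^T R = X^T X by Choleski and f = R^{-T} X^T y by a triangular solve.
import numpy as np

rng = np.random.default_rng(4)
p = 5
blocks = [(rng.standard_normal((50, p)), rng.standard_normal(50))
          for _ in range(4)]

XtX = np.zeros((p, p))
Xty = np.zeros(p)
for Xj, yj in blocks:
    XtX += Xj.T @ Xj          # running sums: X never held whole in memory
    Xty += Xj.T @ yj

L = np.linalg.cholesky(XtX)   # lower triangular, XtX = L @ L.T
R = L.T                       # so R^T R = X^T X with R upper triangular
f = np.linalg.solve(L, Xty)   # f = R^{-T} X^T y, since R^T = L

# Check against the QR quantities from the full matrices (QR's R can
# differ from the Choleski factor by row signs, hence the abs()).
X = np.vstack([b[0] for b in blocks])
y = np.concatenate([b[1] for b in blocks])
Q, Rqr = np.linalg.qr(X)
print(np.allclose(np.abs(f), np.abs(Q.T @ y)))
```

Forming X^T X squares the condition number of the problem, which is why this route is faster than QR but less numerically stable.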
Generalizations
1. Generalized additive models (GAMs) allow distributions other than normal (and non-identity link functions).
2. GAMs can be estimated by iteratively fitting working linear models, using the R and f accumulation tricks on each.
3. REML, AIC etc. can also be used for λ_j estimation; REML can be made especially efficient in this context.
4. The accumulation methods are easy to parallelize.
5. Updating a model fit with new data is very easy.
6. AR1 residuals can also be handled easily in the non-generalized case.
Air pollution example
Around 5000 daily death rates for Chicago, along with time, ozone, pm10 and tmp (the last 3 averaged over the preceding 3 days). Peng and Welty (2004). An appropriate GAM is
death_i ~ Poi, log{E(death_i)} = f_1(time_i) + f_2(ozone_i, tmp_i) + f_3(pm10_i)
with f_1 and f_3 penalized cubic regression splines and f_2 a tensor product spline. The results suggest a very strong ozone-temperature interaction.
Air pollution Chicago results
[Figure: estimated effects s(time,136.85), the ozone-temperature surface te(o3,tmp,38.63) with ±1 s.e., and s(pm10,3.42)]
Air pollution Chicago results
To test the truth of the ozone-temperature interaction, it would be good to fit the equivalent data for around 100 other cities simultaneously. The model is
log{E(death_i)} = γ_j + α_k + f_k(o3_i, temp_i) + f_4(t_i)
if observation i is from city j and age group k (there are 3 age groups recorded).
The model has 82 coefs and is estimated from 1.2M data. Fitting takes 12.5 minutes using 4 cores of a $600 PC. The same model for just Chicago takes 11.5 minutes by the previous methods.
Air pollution all cities results
[Figure: estimated s(time,277.36) and ozone-temperature interaction surfaces te(o3,tmp) for the three age groups <65, 65-74 and >74]
Back to grid load
[Figure: half-hourly French grid load (GW) over roughly 150 days, with a zoomed view of days 174-182]
Now the 48 separate grid load models can be replaced by
L_i = γ_j + f_k(I_i, L_{i-48}) + g_1(t_i) + g_2(I_i, toy_i) + g_3(T_i, I_i) + g_4(T.24_i, T.48_i) + g_5(cloud_i) + ST_i h(I_i) + e_i
if observation i is from day of the week j and day class k, with
e_i = ρ e_{i-1} + ε_i and ε_i ~ N(0, σ²) (AR1).
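The AR1 error term fits into the least squares framework by pre-whitening: multiplying the model through by the (banded) inverse Choleski factor of the AR1 correlation turns the errors into iid innovations, and the same transform is applied to y and to the rows of X before accumulation. A minimal sketch of the transform itself, using a ρ near the value the talk reports (the simulated series is illustrative):

```python
# Sketch: pre-whitening an AR1 error series. If e_i = rho*e_{i-1} + eps_i,
# the transform z_1 = sqrt(1 - rho^2)*e_1, z_i = e_i - rho*e_{i-1}
# produces (approximately) white residuals.
import numpy as np

rho = 0.98                       # close to the estimate quoted for the load model
rng = np.random.default_rng(5)
n = 1000
e = np.zeros(n)                  # simulate an AR1 series with unit innovations
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.standard_normal()

z = np.empty(n)
z[0] = np.sqrt(1 - rho ** 2) * e[0]
z[1:] = e[1:] - rho * e[:-1]

# The whitened series should show negligible lag-1 autocorrelation,
# unlike the heavily correlated original.
r1_raw = np.corrcoef(e[:-1], e[1:])[0, 1]
r1_white = np.corrcoef(z[:-1], z[1:])[0, 1]
print(round(r1_raw, 2), round(r1_white, 2))
```

In the model fit, the identical row transform is applied block-by-block to X and y, so the AR1 structure costs essentially nothing extra.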
Fitting the grid load model
Using QR update, fitting takes under 3 minutes, and we estimate ρ = .98. Updating with new data then takes under 2 minutes.
We fitted up to 31 August 2008, and then predicted the next year's data, one day at a time, with model update. Predictive RMSE was 1156MW (124MW for fit); predictive MAPE was 1.62% (1.46% for fit). Setting ρ = 0 gives overfit, and worse predictive performance.
Residuals
[Figure: residuals over the fitting and prediction period (Sept 2002 - Aug 2009), and MAPE (%) by half-hour instant for each day of the week, Mo-Su]
Temperature effect
[Figure: estimated temperature effect surface — load L[t] against temperature T[t] and half-hour instant I[t]]
Lagged load effects
[Figure: estimated lagged load effect surfaces — L[t] against L[t-1] and instant I[t] — for the four day-class transitions WW, WH, HH and HW]