GAMs for large datasets and load forecasting
Simon Wood, University of Bath, U.K., in collaboration with Yannig Goude, EDF
French grid load
[Figure: half-hourly French grid load (GW) over roughly 150 days, with a zoomed view of days 174-182]
Half hourly load data from 1st Sept 2002. Pierrot and Goude (2011) successfully predicted one day ahead using 48 separate models, one for each half hour period of the day.
Why 48 models?
The 48 separate models each have a form something like
MW_t = f(MW_{t-48}) + Σ_j g_j(x_{jt}) + ε_t
where f and the g_j are smooth functions to be estimated, and the x_j are other covariates.
48 separate models have some disadvantages:
1. Continuity of effects across the day is not enforced.
2. Information is wasted, as the correlation between neighbouring time points is not utilised.
3. Fitting 48 models, each to 1/48 of the data, promotes estimate instability and creates a huge model checking task each time the model is updated (every day?).
But existing methods for fitting such smooth additive models could not cope with fitting one model to all the data.
Smooth additive models
The basic model is
y_i = Σ_j f_j(x_{ji}) + ε_i
where the f_j are smooth functions. Need to estimate the f_j (including how smooth).
Represent each f_j using a spline basis expansion
f_j(x_j) = Σ_k b_{jk}(x_j) β_{jk}
where the b_{jk} are basis functions and the β_{jk} are coefficients (maybe 10-100 of each). So the model becomes
y_i = Σ_j Σ_k b_{jk}(x_{ji}) β_{jk} + ε_i = X_i β + ε_i
... a linear model. But it is much too flexible.
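As a concrete illustration of the basis expansion, here is a minimal NumPy/SciPy sketch (not from the talk) that builds the model matrix for one smooth: each column is one basis function b_k evaluated at the data, so f_j is represented by a linear combination of columns. The helper name make_basis and the choice of 10 cubic B-splines are illustrative assumptions.

```python
# Sketch: represent one smooth f_j(x) by a B-spline basis expansion,
# so the additive model becomes an ordinary linear model y = X @ beta.
# make_basis and its defaults are illustrative, not from the talk.
import numpy as np
from scipy.interpolate import BSpline

def make_basis(x, n_basis=10, degree=3):
    """Evaluate n_basis cubic B-spline basis functions at the points x."""
    # Clamped knot vector: repeated end knots plus evenly spaced interior knots.
    n_interior = n_basis - degree - 1
    xl, xu = x.min(), x.max()
    interior = np.linspace(xl, xu, n_interior + 2)[1:-1]
    knots = np.concatenate([[xl] * (degree + 1), interior, [xu] * (degree + 1)])
    # Column k holds b_k(x); together the columns form the model matrix for f_j.
    return BSpline.design_matrix(x, knots, degree).toarray()

x = np.linspace(0.0, 1.0, 200)
X = make_basis(x)
print(X.shape)   # (200, 10): one row per data point, one column per basis function
# With a clamped knot vector the basis is a partition of unity:
# every row of X sums to one.
print(X.sum(axis=1).round(6).min(), X.sum(axis=1).round(6).max())
```

Stacking one such block per covariate (plus an intercept) gives the full model matrix X of the linear model above.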
Penalizing overfit
To avoid overfitting, we penalize function wiggliness (lack of smoothness) while fitting. In particular, let
J_j(f_j) = β^T S_j β
measure the wiggliness of f_j. Estimate β to minimize
Σ_i (y_i - X_i β)^2 + Σ_j λ_j β^T S_j β
where the λ_j are smoothing parameters. High λ_j ⇒ smooth f_j; low λ_j ⇒ wiggly f_j.
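The penalized criterion above has a closed-form minimizer: setting the gradient to zero gives (X^T X + Σ_j λ_j S_j) β = X^T y. A minimal sketch with a single smooth and a second-difference penalty (the random model matrix is a stand-in, not data from the talk):

```python
# Sketch: penalized least squares. The estimate solves
# (X^T X + lambda * S) beta = X^T y, with lambda controlling smoothness.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 12
X = rng.standard_normal((n, K))            # stand-in model matrix
beta_true = np.sin(np.arange(K))
y = X @ beta_true + rng.standard_normal(n) * 0.1

# Second-difference penalty matrix S = D^T D (as for P-splines).
D = np.diff(np.eye(K), n=2, axis=0)
S = D.T @ D

def fit(lam):
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# Larger lambda shrinks the fit toward coefficient sequences with zero
# second differences, so the measured wiggliness beta^T S beta falls.
rough = lambda b: float(b @ S @ b)
print(rough(fit(0.1)) > rough(fit(100.0)))   # True
```

In the full model each smooth gets its own λ_j and S_j, embedded as blocks of one overall penalty.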
Example basis-penalty: P-splines
Eilers and Marx have popularized the use of B-spline bases with discrete penalties. If b_k(x) is a B-spline and β_k an unknown coefficient, then
f(x) = Σ_{k=1}^K β_k b_k(x).
Wiggliness can be penalized by e.g.
P = Σ_{k=2}^{K-1} (β_{k-1} - 2β_k + β_{k+1})^2 = β^T S β.
[Figure: the B-spline basis functions b_k(x) and a resulting smooth curve f(x)]
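The discrete penalty is easy to verify numerically: the sum of squared second differences of the coefficients equals the quadratic form β^T S β with S = D^T D, where D is the second-difference matrix. A small sketch (coefficient vectors are invented examples):

```python
# Sketch: the P-spline penalty is a sum of squared second differences
# of the coefficients, expressible as beta^T S beta with S = D^T D.
import numpy as np

K = 8
D = np.diff(np.eye(K), n=2, axis=0)   # (K-2) x K second-difference matrix
S = D.T @ D

beta_linear = np.arange(K, dtype=float)            # linear in k
beta_wiggly = np.array([0., 1, 0, 1, 0, 1, 0, 1])  # alternating

def penalty(beta):
    # Direct sum of (beta[k-1] - 2*beta[k] + beta[k+1])^2 ...
    direct = float(np.sum((beta[:-2] - 2 * beta[1:-1] + beta[2:]) ** 2))
    # ... agrees with the quadratic form beta^T S beta.
    assert np.isclose(direct, beta @ S @ beta)
    return direct

print(penalty(beta_linear))   # 0.0 — straight-line coefficients are unpenalized
print(penalty(beta_wiggly))   # 24.0 — alternating coefficients are heavily penalized
```

Note that constants and straight lines lie in the penalty null space, so heavy smoothing shrinks toward a line, not toward zero.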
Example: varying the P-spline penalty
[Figure: the same B-spline basis fitted to data with the penalty Σ_i (β_{i-1} - 2β_i + β_{i+1})^2 taking the values 27, 16.4 and .4, giving progressively smoother fits]
Choosing how much to smooth
[Figure: fits with λ too high, λ about right, and λ too low]
Choose the λ_j to optimize the model's ability to predict data it wasn't fitted to. Generalized cross validation chooses the λ_j to minimize
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2
where F = (X^T X + Σ_j λ_j S_j)^{-1} X^T X and tr(F) = effective degrees of freedom.
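A minimal sketch of GCV-based smoothing parameter selection by grid search, on simulated data (the Gaussian-bump basis and the grid of λ values are illustrative assumptions; the usual GCV numerator carries an extra factor of n, which does not move the minimum):

```python
# Sketch: choose lambda by minimizing the GCV score
# V(lambda) = ||y - X beta_hat||^2 / [n - tr(F)]^2,
# F = (X^T X + lambda * S)^{-1} X^T X, tr(F) = effective degrees of freedom.
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 20
x = np.linspace(0.0, 1.0, n)
centers = np.linspace(0.0, 1.0, K)
# Gaussian-bump model matrix as a stand-in for a spline basis.
X = np.exp(-(x[:, None] - centers) ** 2 / (2 * 0.05 ** 2))
y = np.sin(2 * np.pi * x) + rng.standard_normal(n) * 0.2

D = np.diff(np.eye(K), n=2, axis=0)
S = D.T @ D

def gcv(lam):
    A = X.T @ X + lam * S
    beta = np.linalg.solve(A, X.T @ y)
    F = np.linalg.solve(A, X.T @ X)
    rss = float(np.sum((y - X @ beta) ** 2))
    edf = float(np.trace(F))
    return rss / (n - edf) ** 2

# Crude grid search over orders of magnitude; real software optimizes
# log(lambda) with derivative-based methods.
lams = 10.0 ** np.arange(-6, 7)
best = min(lams, key=gcv)
print(best)
```

The same score generalizes to several λ_j by replacing lam * S with Σ_j λ_j S_j and optimizing jointly.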
The main problem
We estimate β as the minimizer of
||y - Xβ||^2 + Σ_j λ_j β^T S_j β
and the λ_j to minimize
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2.
If X is too large, then we can easily run out of computer memory when trying to fit the model. A simple approach will let us fit the model without forming X whole.
A smaller equivalent fitting problem
Let X be n × p. Consider the QR decomposition X = QR, where R is p × p and Q has orthogonal columns. Let f = Q^T y and γ = ||y||^2 - ||f||^2. Then
||y - Xβ||^2 = ||f - Rβ||^2 + γ
and
V(λ) = ||y - Xβ̂||^2 / [n - tr(F)]^2 = (||f - Rβ̂||^2 + γ) / [n - tr(F)]^2
where F = (R^T R + Σ_j λ_j S_j)^{-1} R^T R. So once we have R and f we can estimate β and λ without X...
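The identity ||y - Xβ||^2 = ||f - Rβ||^2 + γ holds for every β, because Xβ lies in the column space of Q. A quick numerical check (random matrices, purely illustrative):

```python
# Sketch: the n-dependent fitting problem reduces to a p x p one.
# With X = QR, f = Q^T y and gamma = ||y||^2 - ||f||^2, we have
# ||y - X beta||^2 = ||f - R beta||^2 + gamma for every beta.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

Q, R = np.linalg.qr(X)          # "thin" QR: Q is n x p, R is p x p
f = Q.T @ y
gamma = y @ y - f @ f

beta = rng.standard_normal(p)   # any coefficient vector at all
lhs = np.sum((y - X @ beta) ** 2)
rhs = np.sum((f - R @ beta) ** 2) + gamma
print(np.allclose(lhs, rhs))    # True: beta and lambda can be estimated from R and f
```

So the n × p problem collapses to a p × p one, plus the scalar γ.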
QR updating for R and f
Let X = [X_0; X_1] and y = [y_0; y_1] (stacked row blocks).
Form X_0 = Q_0 R_0 and then [R_0; X_1] = Q_1 R.
Then X = QR where Q = diag(Q_0, I) Q_1.
Also f = Q^T y = Q_1^T [Q_0^T y_0; y_1].
Applying these results recursively, R and f can be accumulated from sub-blocks of X without forming X whole.
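The recursion above can be sketched directly: at each step, stack the current R over the next data block and re-decompose, rotating f by the same orthogonal factor. The block sizes and random data are illustrative.

```python
# Sketch of the QR update: accumulate R and f = Q^T y from row blocks
# of X without ever forming X whole.
import numpy as np

rng = np.random.default_rng(3)
p = 5
blocks = [(rng.standard_normal((50, p)), rng.standard_normal(50))
          for _ in range(4)]                       # (X_j, y_j) sub-blocks

R = np.zeros((0, p))
f = np.zeros(0)
for Xj, yj in blocks:
    # Stack the current R over the new block and re-decompose:
    # [R; X_j] = Q_1 R_new; f updates by the same orthogonal rotation.
    Q1, R = np.linalg.qr(np.vstack([R, Xj]))
    f = Q1.T @ np.concatenate([f, yj])

# Compare with a one-shot decomposition of the full X.
X = np.vstack([b[0] for b in blocks])
y = np.concatenate([b[1] for b in blocks])
Qfull, Rfull = np.linalg.qr(X)
# R is unique only up to row signs, so compare sign-invariant quantities.
print(np.allclose(R.T @ R, Rfull.T @ Rfull))   # both equal X^T X
print(np.allclose(R.T @ f, X.T @ y))           # accumulated f is consistent
```

Only the p × p matrix R and the p-vector f are ever held in memory, however large n grows.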
X^T X updating and Choleski
R is the Choleski factor of X^T X. Partitioning X row-wise into sub-matrices X_1, X_2, ..., we have
X^T X = Σ_j X_j^T X_j
which can be used to accumulate X^T X without needing to form X whole. At the same time accumulate X^T y = Σ_j X_j^T y_j.
The Choleski decomposition gives R^T R = X^T X, and f = R^{-T} X^T y.
This is twice as fast as the QR approach, but less numerically stable.
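A sketch of the Choleski alternative: only the p × p running sum X^T X and the p-vector X^T y are accumulated, then factored once at the end. Random blocks are illustrative stand-ins.

```python
# Sketch: accumulate X^T X and X^T y over row blocks, then obtain
# R^T R = X^T X by Choleski and f = R^{-T} X^T y by a triangular solve.
import numpy as np

rng = np.random.default_rng(4)
p = 5
blocks = [(rng.standard_normal((50, p)), rng.standard_normal(50))
          for _ in range(4)]

XtX = np.zeros((p, p))
Xty = np.zeros(p)
for Xj, yj in blocks:
    XtX += Xj.T @ Xj          # running sums: X never held whole in memory
    Xty += Xj.T @ yj

L = np.linalg.cholesky(XtX)   # lower triangular, XtX = L @ L.T
R = L.T                       # so R^T R = X^T X with R upper triangular
f = np.linalg.solve(L, Xty)   # f = R^{-T} X^T y, since R^T = L

# Check against the QR quantities from the full matrices (QR's R can
# differ from the Choleski factor by row signs, hence the abs()).
X = np.vstack([b[0] for b in blocks])
y = np.concatenate([b[1] for b in blocks])
Q, Rqr = np.linalg.qr(X)
print(np.allclose(np.abs(f), np.abs(Q.T @ y)))
```

Forming X^T X squares the condition number of the problem, which is why this route is faster than QR but less numerically stable.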
Generalizations
1. Generalized additive models (GAMs) allow distributions other than normal (and non-identity link functions).
2. GAMs can be estimated by iteratively fitting working linear models, using the R and f accumulation tricks on each.
3. REML, AIC etc. can also be used for λ_j estimation; REML can be made especially efficient in this context.
4. The accumulation methods are easy to parallelize.
5. Updating a model fit with new data is very easy.
6. AR1 residuals can also be handled easily in the non-generalized case.
Air pollution example
Around 5000 daily death rates for Chicago, along with time, ozone, pm10 and tmp (the last 3 averaged over the preceding 3 days). Peng and Welty (2004). An appropriate GAM is
death_i ~ Poi, log{E(death_i)} = f_1(time_i) + f_2(ozone_i, tmp_i) + f_3(pm10_i)
with f_1 and f_3 penalized cubic regression splines and f_2 a tensor product spline. The results suggest a very strong ozone-temperature interaction.
Air pollution Chicago results
[Figure: estimated effects s(time,136.85), the ozone-temperature surface te(o3,tmp,38.63) with ±1 s.e., and s(pm10,3.42)]
Air pollution Chicago results
To test the truth of the ozone-temperature interaction, it would be good to fit the equivalent data for around 100 other cities simultaneously. The model is
log{E(death_i)} = γ_j + α_k + f_k(o3_i, temp_i) + f_4(t_i)
if observation i is from city j and age group k (there are 3 age groups recorded).
The model has 82 coefs and is estimated from 1.2M data. Fitting takes 12.5 minutes using 4 cores of a $600 PC. The same model for just Chicago takes 11.5 minutes by the previous methods.
Air pollution all cities results
[Figure: estimated s(time,277.36) and ozone-temperature interaction surfaces te(o3,tmp) for the three age groups <65, 65-74 and >74]
Back to grid load
[Figure: half-hourly French grid load (GW) over roughly 150 days, with a zoomed view of days 174-182]
Now the 48 separate grid load models can be replaced by
L_i = γ_j + f_k(I_i, L_{i-48}) + g_1(t_i) + g_2(I_i, toy_i) + g_3(T_i, I_i) + g_4(T.24_i, T.48_i) + g_5(cloud_i) + ST_i h(I_i) + e_i
if observation i is from day of the week j and day class k, with
e_i = ρ e_{i-1} + ε_i and ε_i ~ N(0, σ²) (AR1).
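The AR1 error term fits into the least squares framework by pre-whitening: multiplying the model through by the (banded) inverse Choleski factor of the AR1 correlation turns the errors into iid innovations, and the same transform is applied to y and to the rows of X before accumulation. A minimal sketch of the transform itself, using a ρ near the value the talk reports (the simulated series is illustrative):

```python
# Sketch: pre-whitening an AR1 error series. If e_i = rho*e_{i-1} + eps_i,
# the transform z_1 = sqrt(1 - rho^2)*e_1, z_i = e_i - rho*e_{i-1}
# produces (approximately) white residuals.
import numpy as np

rho = 0.98                       # close to the estimate quoted for the load model
rng = np.random.default_rng(5)
n = 1000
e = np.zeros(n)                  # simulate an AR1 series with unit innovations
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.standard_normal()

z = np.empty(n)
z[0] = np.sqrt(1 - rho ** 2) * e[0]
z[1:] = e[1:] - rho * e[:-1]

# The whitened series should show negligible lag-1 autocorrelation,
# unlike the heavily correlated original.
r1_raw = np.corrcoef(e[:-1], e[1:])[0, 1]
r1_white = np.corrcoef(z[:-1], z[1:])[0, 1]
print(round(r1_raw, 2), round(r1_white, 2))
```

In the model fit, the identical row transform is applied block-by-block to X and y, so the AR1 structure costs essentially nothing extra.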
Fitting the grid load model
Using QR update, fitting takes under 3 minutes, and we estimate ρ = .98. Updating with new data then takes under 2 minutes.
We fitted up to 31 August 2008, and then predicted the next year's data, one day at a time, with model update. Predictive RMSE was 1156MW (124MW for fit); predictive MAPE was 1.62% (1.46% for fit). Setting ρ = 0 gives overfit, and worse predictive performance.
Residuals
[Figure: residuals over the fitting and prediction period (Sept 2002 - Aug 2009), and MAPE (%) by half-hour instant for each day of the week, Mo-Su]
Temperature effect
[Figure: estimated temperature effect surface — load L[t] against temperature T[t] and half-hour instant I[t]]
Lagged load effects
[Figure: estimated lagged load effect surfaces — L[t] against L[t-1] and instant I[t] — for the four day-class transitions WW, WH, HH and HW]