Demand Forecasting in Smart Grids
Piotr Mirowski, Sining Chen, Tin Kam Ho, and Chun-Nam Yu

Data analytics in smart grids can be leveraged to channel the data downpour from individual meters into knowledge valuable to electric power utilities and end-consumers. Short-term load forecasting (STLF) can address issues vital to a utility, but it has traditionally been done mostly at the system (city or country) level. In this case study, we exploit rich, multi-year, and high-frequency annotated data collected via a metering infrastructure to perform STLF on aggregates of power meters in a mid-sized city. For smart meter aggregates complemented with geo-specific weather data, we benchmark several state-of-the-art forecasting algorithms, including kernel methods for nonlinear regression, seasonal and temperature-adjusted auto-regressive models, and exponential smoothing and state-space models. We show how STLF accuracy improves at larger meter aggregation (at the feeder, substation, and system-wide levels). We provide an overview of our algorithms for load prediction and discuss system performance issues that impact real-time STLF. © 2014 Alcatel-Lucent.

Introduction

Smart grid deployments carry the promise of allowing better control and balance of energy supply and demand through near real time, continuous visibility into detailed energy generation and consumption patterns. Methods to extract knowledge from near real time and accumulated observations are hence critical to the extraction of value from the infrastructure investment. On the demand side, widespread deployment of smart meters that provide frequent readings allows insight into continuous traces of usage patterns that are unique to each premise and each aggregate at different levels of the distribution hierarchy. This in turn enables better designs and triggers of demand response actions and pricing strategies, and provides input to the planning for growth and changes in the distribution network.
Customers may also gain better awareness of their own consumption patterns. In this paper we report a study on demand prediction, where we analyzed near real time power consumption monitored by tens of thousands of smart meters in a medium-size U.S. city. We developed and adapted several short-term forecasting methods for predicting the load at several levels of aggregation. In this context, short-term load forecasting (STLF) refers to the prediction of power consumption levels in the next hour, next day, or up to a week ahead. Within this time scope, one can have reliable weather forecasts, which provide important input to the prediction, as historically the load in this city is highly influenced by weather because electricity is used for both heating and cooling.

Bell Labs Technical Journal 18(4), (2014). © 2014 Alcatel-Lucent. Published by Wiley Periodicals, Inc. Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: /bltj.21650
Panel 1. Abbreviations, Acronyms, and Terms

ACF: Auto-correlation function
ARIMA: Auto-regressive integrated moving average
ARIMAX: Auto-regressive integrated moving average with external inputs
ARMA: Auto-regressive moving average
DASARIMA: Dummy-adjusted seasonal auto-regressive integrated moving average
ENEL: Ente Nazionale per l'Energia Elettrica
GARCH: Generalized auto-regressive conditional heteroscedastic
GDP: Gross domestic product
HWT: Holt-Winters model
i.i.d.: Independently and identically distributed
KPSS: Kwiatkowski, Phillips, Schmidt, and Shin
LOESS: Locally-weighted scatterplot smoothing
LSE: Least square error
LTLF: Long-term load forecast
MAPE: Mean absolute percentage error
ML: Machine learning
MTLF: Medium-term load forecast
NCDC: National Climatic Data Center
NOAA: National Oceanic and Atmospheric Administration
PACF: Partial ACF
RAM: Random access memory
SARIMA: Seasonal auto-regressive integrated moving average
SARIMAX: Seasonal auto-regressive integrated moving average with external inputs
SSM: State-space model
ssvr: Sigma SVR
STLF: Short-term load forecast
SVM: Support vector machine
SVR: Support vector regression
Wh: Watt hour
WKR: Weighted kernel regression

Scenario of the Study

In our study, the meters are deployed at customer locations and their readings are sampled every 15 minutes. Each meter's network description includes its geographical location (latitude, longitude); date of installation and planned removal; type of customer served; as well as which pole, which feeder section, and which substation the meter is connected to. Weather data is collected by the utility company at the substation level, and consists of hourly temperature, wind speed, and wind chill temperature. Additional weather data, made available by the National Climatic Data Center (NCDC) and the National Oceanic and Atmospheric Administration (NOAA), provides further measurements, such as humidity or sky cover at the location of the city airport, and hourly weather forecasts up to seven days ahead.
The load prediction algorithms that we have investigated and implemented are embedded in a module of a data analytic system being developed for the utility company. The module receives meter measurements, converts them to power usage values, and aggregates usage at different levels: individual meters, feeder sections, distribution substations, and the system level. It then generates load forecasts at prediction horizons that range from 60 minutes (next-hour predictions) to 24 hours (next-day predictions), or even 168 hours (next-week predictions). As we will detail in a later section, the load forecasts operate independently for each meter and meter aggregate, communicating through a limited set of inputs/outputs with a database to read the latest weather forecasts and per-meter usage history and return the corresponding load forecasts. This procedure can be parallelized, which enables some degree of asynchronous behavior within the prediction timeframe (the load forecasts are made with a granularity of one hour). Short-term load forecasts generated at the meter (customer premise) level will provide the utility company with customer-level smart grid capabilities and help the company communicate with the customer about energy saving and billing issues. STLF generated at higher levels of aggregation (from feeder section to city-wide) will help in planning and operation of the relevant components of the electric grid.
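Because each meter aggregate is forecast independently through its own inputs and outputs, the dispatch loop parallelizes naturally. The following is a minimal sketch (not the authors' implementation): the per-aggregate forecaster is a hypothetical stand-in stubbed with a seasonal-naive rule (repeat the last day), purely to illustrate the parallel, independent dispatch per aggregate.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-aggregate forecaster: reads one
# aggregate's hourly usage history and returns an hourly forecast.
# The model here is a seasonal-naive stub (repeat the last 24 hours).
def forecast_aggregate(task):
    name, hourly_history, horizon = task
    last_day = hourly_history[-24:]
    return name, [last_day[h % 24] for h in range(horizon)]

def run_all_forecasts(histories, horizon=24, workers=4):
    tasks = [(name, hist, horizon) for name, hist in histories.items()]
    # Aggregates communicate only through their own inputs/outputs,
    # so their forecasts can be computed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(forecast_aggregate, tasks))

histories = {
    "feeder_001": [10.0 + (h % 24) for h in range(168)],
    "substation_A": [120.0] * 168,
}
forecasts = run_all_forecasts(histories)
print(len(forecasts["feeder_001"]))  # → 24
```

In a real deployment, `forecast_aggregate` would wrap one of the STLF models discussed later, and the histories would come from the database rather than from synthetic lists.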
State-of-the-Art in Load Forecasting

Electric load forecasting is a mature field of investigation, and its statistical methodologies have been implemented and deployed in industrial applications. Several meta-review papers provide a good overview of the demand prediction literature [13, 17, 25, 27] and identify three sub-fields, depending on the prediction horizon. The most prominent sub-field, short-term load forecasting (STLF), handles prediction horizons of one hour up to one week and typically relies on time series analysis and modeling. Daily, weekly, and sometimes yearly seasonality can be explicitly modeled. These methods consider variables such as the date (e.g., day of week and hour of the day), temperature (including weather forecasts), humidity, the temperature-humidity index, the wind chill index and, most importantly, historical load. Residential versus commercial or industrial use is rarely specified. Representative algorithms for STLF include time series models of linear dynamic systems involving load and weather regressors, typically relying on auto-regressive models such as the auto-regressive moving average (ARMA) [15] and the seasonal auto-regressive integrated moving average (SARIMA) [35]. State-space models offer a further refinement of linear dynamics by defining additional (so-called "hidden" or "latent") state variables representing underlying load dynamics and seasonality, either through explicit variables, as in the exponential smoothing methods [37], or through spline representations of the daily load [19]. An alternative approach to modeling load and weather dynamics is to consider nonlinear models and a machine learning approach. A popular class of algorithms for STLF, which we do not report on here but which has been used by several electric companies for system-wide predictions, is neural networks [22].
We focused instead on so-called kernel methods, starting from simple weighted kernel regression [6] all the way up to support vector machines [7] and kernel ridge regression. The section on Short-Term Load Forecasting Methodology provides more details about the methods that we implemented and investigated for this comparative study. The remaining two fields of investigation, not covered in this paper, are medium-term load forecasting (MTLF), handling horizons of one week up to one year, and long-term load forecasting (LTLF), with predictions at horizons of multiple years. These methods typically proceed by regression on input variables which, in addition to historical load and climate forecasts, typically incorporate demographic and economic factors such as the gross domestic product (GDP), real estate statistics, or population growth projections, as well as estimated demands of electric equipment. Our key finding was that most of the research has focused on large aggregated load data, typically at the city or even country level, where most individual variations are averaged out by the law of large numbers. These methods were seldom tried on individual meters or on meter aggregates such as distribution feeders and substations, with a few exceptions such as recent work on STLF in non-residential buildings [3] or a clustering analysis of individual meters and aggregate load forecasting on feeder sections in a neighborhood of Seoul [32]. In this paper, we propose to continue bridging this gap by systematically evaluating when state-of-the-art STLF algorithms break down, i.e., how their performance degrades as the number of meters considered decreases.

Other Datasets With Individual Meters

What makes the dataset we investigated unique is that it contains energy consumption data from the system-wide (city-wide) level down to the level of individual meters.
To our knowledge, few datasets of such complexity [32] have been investigated for short-term load forecasting, even though a few localized (e.g., building-specific) smart meter datasets have been studied [3]. The Italian energy provider ENEL has deployed over 32 million smart meters. Remote monitoring is done by sending the readings from each customer's location [31] through a low-bandwidth network to data aggregators located at substations. Data is sampled and stored at 15-minute frequency. The readings are sent about every two weeks or every month. The
motivation for the utility is the ability to leverage customized hourly-based tariffs [11] to price services for its customers. Although the individual meter data collected by ENEL has been used in studies on grouping individual customers by clustering their load profiles [16], we are unaware of analyses of these data from the perspective of load aggregates. At the individual home level, there are several studies on peak load prediction [33] and on energy disaggregation of individual appliances in households [23]. However, these datasets are much smaller (typically < 100 meters) than our current dataset. The frequency of load measurement is also much higher (one measurement every few seconds or milliseconds) and is not typical of smart meters currently under deployment.

The rest of the paper is organized as follows. We begin by explaining the structure of our unique, hierarchical dataset of load consumption coming from a mid-size U.S. city. We then provide an overview of key algorithms for short-term load forecasting that exploit both historical load and weather data. The section titled Short-Term Load Forecasting Results details the essentially state-of-the-art STLF results that we obtain at the system (city) level, and how STLF performance depends on the size of the load aggregate. We conclude with a discussion of performance, parallelism, and runtime issues raised by performing STLF at all levels of load aggregation, and introduce ensemble prediction, which leverages multiple STLF algorithms for improved predictions.

Smart Grid Data

The specificity of our study on short-term load forecasting lies in its unique dataset, consisting of over a hundred thousand individual meters interconnected in a hierarchy of feeders and substations. We provide details on how the meter data is aggregated and how we associate it with weather data.

System Hierarchy of Meters, Feeders and Substations

This study exploits two sets of data, collected in a mid-sized U.S.
city (population of about 200,000 inhabitants) over the course of several years:

1. System-wide data representing total city consumption (residential and industrial), collected over the course of 2007, 2008, and 2009 at hourly intervals. This dataset is typical of classic STLF studies.
2. Individual meter readings coming from over a hundred thousand meters installed at customer locations. Out of this rich dataset, we use 32,000 mostly residential meters that satisfied a number of consistency conditions detailed below. This data was collected between January 2011 and June 2012.

The individual meters measure consumption (in Watt hours, Wh) at 15-minute intervals and are referenced in a hierarchy of meter, pole, feeder section, substation, and district. Meter measurements included in our analysis are those from the residential and small business customers (with contract demand under 5,000 kW). Load predictions are made at these levels:

1. Single customer. Load measurements are derived from meter readings by differentiation: the value increment within a sampling time interval, divided by the duration of the sampling interval. Single-customer STLF performance and methods are not the object of this paper.
2. Feeder section. We define a feeder section as a subset of transformers connected to a feeder (such as serving a neighborhood). The network topology considered in this paper comprises about 300 unique feeders over the time period of the study. The historical load at a given time at the level of a feeder section is obtained by aggregating the load derived from all the meters connected to that feeder section.
3. Substation. Each substation serves a small geographical area. There were about 100 unique substations in the distribution network over the time period we consider. Aggregation at the substation level works in the same way as aggregation at the feeder section level.
4. System-wide.
This highest level of aggregation comprises all the residential and small business meters indexed in the distribution hierarchy. As explained in the next section, weather data are geo-located at the average (center) location of the meters connected to each feeder section or substation.
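The two mechanical steps above, differentiating cumulative meter readings into per-interval loads and rolling them up the hierarchy, can be sketched as follows. This is an illustrative sketch under simplifying assumptions (cumulative Wh readings every 15 minutes, a known meter-to-feeder mapping), not the authors' Perl pipeline; the function and group names are hypothetical.

```python
# Sketch of the load-aggregation step, assuming meters report cumulative
# Wh readings every 15 minutes and the meter-to-feeder mapping is known
# from the network metadata.

def interval_loads(cumulative_wh, interval_hours=0.25):
    """Differentiate cumulative readings into average power (W) per interval."""
    return [(b - a) / interval_hours
            for a, b in zip(cumulative_wh, cumulative_wh[1:])]

def aggregate(meter_loads, meter_to_group):
    """Sum the per-interval loads of all meters belonging to the same group."""
    totals = {}
    for meter, loads in meter_loads.items():
        group = meter_to_group[meter]
        acc = totals.setdefault(group, [0.0] * len(loads))
        for i, w in enumerate(loads):
            acc[i] += w
    return totals

# Two meters on the same (hypothetical) feeder: cumulative Wh over one hour.
readings = {"m1": [0, 100, 250, 450, 700], "m2": [0, 50, 120, 200, 300]}
meter_loads = {m: interval_loads(r) for m, r in readings.items()}
feeder = aggregate(meter_loads, {"m1": "feeder_A", "m2": "feeder_A"})
print(feeder["feeder_A"])  # → [600.0, 880.0, 1120.0, 1400.0]
```

Substation and system-wide aggregates follow by applying `aggregate` again with a feeder-to-substation mapping; the real pipeline must additionally handle the non-aligned time stamps, resets, and missing readings discussed below.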
From Consumer Meters to Consistent Load Aggregates

The main problem in building meter aggregates is the lack of consistency, across time, of the constituents of each aggregate. Fortunately, our dataset contained, in addition to meter readings, periodically updated metadata that described each meter, its connection to the feeder and substation, its geographical (latitude, longitude) coordinates, as well as customer-specific data. To obtain a consistent dataset for method evaluation, we used this metadata to discard meters that were disconnected and reconnected, keeping only meters that satisfied several consistency requirements: same owner, same feeder connection, and same geographical location throughout the evaluation period. Although our per-meter dataset lists nearly a hundred thousand meters, only a subset of 32,000 meters satisfies the consistency requirements and contains non-zero meter readings. We concentrate on the load aggregates derived from these 32,000 meters. Load aggregates at the feeder level are obtained by summing up the 15-minute loads from all the meters connected to that feeder. Similarly, the load aggregates at a substation are determined by adding up the loads of all the feeders connected to that substation. Each aggregate's load is then down-sampled to hourly time intervals. While aggregating the loads of individual meters, we had to handle non-aligned time stamps, meter reading resets, and missing or repeated readings, sometimes resorting to linear interpolation of the load. All processing for this 18-month dataset, representing about 100 GB of data, was done using Perl* and shell scripts. As shown in Figure 1, we were able to reconstruct a smooth load profile at the system-wide level.

Geo-Specific Weather Data

The area covered by the individual meters in our mid-sized U.S. city dataset encompasses a gently hilly area of about 40 km by 60 km, traversed by a river and subject to micro-climatic variations.
The weather data (temperature and wind speed) are measured hourly at 22 substations across that area. It is common to measure a difference of 15 degrees Fahrenheit in temperature between weather substations. Because the STLF methods detailed in the next section are temperature-dependent, we interpolate the temperatures at the locations of all meters and all meter aggregates. This interpolation is done with the simple Kriging algorithm [14]. Kriging refers here to temperature interpolation by regression against observed temperature values at a set of surrounding locations, each of them weighted according to spatial covariance. We employed the mgstat Matlab* toolbox for geo-statistics [18] and performed simple Kriging for each hour independently, using the latitude and longitude coordinates of about 400 feeder and substation meter load aggregates and the geographical coordinates and temperatures of the 22 weather substations. A similar procedure was adopted for wind speed. The final result is illustrated in Figure 2, which shows the temperature interpolation at the feeder and substation aggregate levels at two times of the year, and confirms the large temperature variations.

Short-Term Load Forecasting Methodology

Time series modeling for short-term load forecasting (STLF) has been widely used over the last 30 years, and a myriad of approaches have been developed. Kyriakides and Polycarpou [25] summarized these methods as follows:

1. Regression models that represent electricity load as a linear combination of variables related to weather factors, day type, and customer class.
2. Linear time series-based methods, including the ARMA model, the auto-regressive integrated moving average (ARIMA) model, the auto-regressive moving average with external inputs (ARIMAX) model, the generalized auto-regressive conditional heteroscedastic (GARCH) model, and state-space models.
3.
State-space models (SSMs), typically relying on a filtering-based (e.g., Kalman) technique and a characterization of dynamical systems.
4. Nonlinear time series modeling through machine learning methods such as nonlinear regression.

Principles of Statistical Learning for Time Series

In the sections that follow, we will discuss the temperature regression and load residual, linear
time series approaches, state-space models, and nonlinear time series models. Before delving into more detailed descriptions of the learning algorithms, we begin in this section by outlining their commonalities.

Figure 1. System load (MW per 15 min) aggregated from about 32,000 individual meters over 18 months: (a) January 2011 through June 2012; (b) August 2011 through September 2011.

Supervised learning of the predictor. Supervised learning consists of fitting a predictive model to a training dataset (X, L), which consists of pairs (x_i, L_i) of data points (or samples) x_i and associated target values L_i. In the case of load forecasting, samples x represent historical values of electric load, weather, or other types of data, collected over a short time interval (e.g., one day). The target labels L_i correspond to the electric load at the prediction horizon. The objective is to optimize a function f such that, for each data point x_i, the prediction f(x_i) is as close as possible to the ground truth target L_i. The discrepancy between the predictions and the target labels is quantified here by the mean absolute percentage error (MAPE), whose formula is given in the section on Short-Term Load Forecasting Results.
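Since MAPE is the evaluation criterion used throughout, here is a minimal sketch of its standard definition (the paper's exact formula appears in the results section); the function name is ours:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Assumes strictly positive actual loads, which holds for aggregate load data."""
    n = len(actual)
    return 100.0 / n * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))

# Three hourly loads: 10% over, 5% under, and exact → average error 5%.
loads = [100.0, 200.0, 400.0]
preds = [110.0, 190.0, 400.0]
print(round(mape(loads, preds), 2))  # → 5.0
```

Being scale-free, MAPE allows errors at a small feeder aggregate and at the system-wide level to be compared directly, which matters for the aggregation-size study later in the paper.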
Figure 2. Example of temperature variations and of spatial temperature interpolation using Kriging, at two different times of the year: temperatures at about 21,000 meters and weather stations, plotted by latitude and longitude, (a) on January 1 and (b) on June 30.

Training, validation and test sets. Good statistical learning algorithms are capable of extrapolating knowledge and generalizing to unseen data points. For this reason, we separate the known data points into a training (in-sample) set, used to fit the model f, and a test (out-of-sample) set, used exclusively to quantify the predictive power of f. In the experiments reported in the section on Short-Term Load Forecasting Results, we use one year of data for training and test the model on the calendar month immediately following. When evaluating STLF on the system-wide data, we retrain the model 24 times and provide predictions for January 2008 through December 2009. Using the 18-month aggregate dataset from January 2011 through June 2012, we trained six different STLF models for predicting results for January through June 2012.

Direct prediction versus iterated prediction in time series. In a time series prediction problem, as represented in Figure 3, the variable of interest (here, the load) might be present at the same time in the targets (output predictions) of the system and in its inputs, particularly when that variable is serially correlated or when it is produced by a dynamic system (e.g., the weather/climate or a model of human activities). Knowing the history of the immediately previous time samples of that variable helps in the prediction. In our study, we consider hourly load and weather data, and are interested in making load forecasts at prediction horizons ranging from h = 1 hour (next hour) to h = 168 hours (next week).
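The rolling train/test scheme described above (train on the preceding twelve months, test on the following calendar month, then slide forward) can be sketched as follows; the index arithmetic is illustrative, not the authors' code:

```python
def rolling_splits(n_months, train_months=12):
    """Yield (train_month_range, test_month) pairs for a rolling evaluation:
    each test month is predicted by a model trained on the 12 months before it."""
    for test_m in range(train_months, n_months):
        yield range(test_m - train_months, test_m), test_m

# 36 months of data (e.g., 2007 through 2009): 24 retrain/test rounds,
# with test months covering the second and third years.
splits = list(rolling_splits(36))
print(len(splits))                    # → 24
print(splits[0][1], splits[-1][1])    # → 12 35
```

With month 0 mapped to January 2007, test months 12 through 35 correspond exactly to the January 2008 through December 2009 prediction period stated above.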
Predictions at all these different horizons can be achieved in two different ways: direct prediction and iterated prediction. Let t denote the current time, and assume that we have access to historical load up to time t, as well as to weather forecasts up to time t + h.

Direct prediction. This predictor takes all the data known up to time t, for instance the load values over the past 24 hours (L_{t-23}, L_{t-22}, ..., L_{t-1}, L_t) and the temperature forecast at horizon h, namely T_{t+h}, and directly predicts the load L_{t+h} that will occur h hours ahead (see Figure 3b). Direct prediction has a high computational cost, because a different predictor needs to be trained for each prediction horizon (168 in our case).

Iterated prediction. This predictor is simply designed to make one-step-ahead predictions, at horizon h = 1. As the predictive model moves forward in time, the outputs of the predictor
(here, the load at time t + h) can in turn become its inputs (see Figure 3a), albeit introducing the prediction error directly into the model. This iterated prediction can be seen as the discretization of a dynamic system.

Figure 3. Direct prediction versus iterated prediction in a time series: (a) iterated prediction on load, with a 24-hour history of load values and the temperature at the prediction horizon; (b) direct prediction on load at horizon h = 3.

Temperature Regression and Load Residual

The simplest method for load forecasting relates the load to temperature. This is particularly relevant for residential and business-related consumption, where a significant portion of power usage might be due to electric heating in the winter and/or air conditioning in the summer. In our dataset, electricity was used to both heat and cool many buildings, alongside gas heating. The total load first decreases with temperature and then increases, the minimum occurring at or around 66 degrees Fahrenheit. We observed that this relationship varies slightly throughout the day. We investigated two approaches for load regression.
The first used local polynomial regression, i.e., locally-weighted scatterplot smoothing (LOESS) [8], to fit a surface of load on temperature and time of day (see Figure 4). Specifically, for the fit at a point x, a polynomial surface of degree 1 or 2 is fitted using points in a neighborhood of x, weighted by their distance from x, to minimize the least square error (LSE). The size of the neighborhood is controlled by a parameter α, chosen here to be 0.2 as a balance between smoothness and goodness of fit. The MAPE for this fit is between six and seven percent for system-wide prediction, with an average system-wide load of approximately 0.7M kWh, when the surface is fitted to the previous full year's data. The model is

log(L) = s(T, H) + ε

where L is the hourly load, T is the temperature, H is the hour of day, s is the smooth surface, and ε is the residual. The log transformation is used here to make the distribution more Gaussian-like and to stabilize the variance, so that the subsequent modeling assumptions hold. Note that the residuals ε are not independently and identically distributed (i.i.d.) and continue to exhibit a daily cyclic pattern. In the SARIMA, SSM, and Holt-Winters model (HWT) methods detailed in the following sections, those methods are applied to the residuals ε, not to the load time series.

Figure 4. Dependency among the temperature, the time of day, and the load, modeled as a smooth surface. Load is expressed on the logarithmic scale and the temperature is taken one hour prior to the load value.

A second method relies on fitting a cubic polynomial directly on the temperature values, using 24 sets of coefficients {a_i(H)}, i = 0, ..., 3, one for each hour H of the day. Temperature regression using cubic polynomials is a simple benchmark for STLF [21].
L = a_0(H) + a_1(H) T + a_2(H) T^2 + a_3(H) T^3 + ε

Note that we may use the apparent temperature, or the wind chill temperature, or an average of both, instead of the raw temperature. The apparent temperature (which takes into account the nonlinear heat index due to humidity) may improve the fit in some cases, particularly during the hot and humid summer season [36]. Similarly, the wind-speed-dependent wind chill temperature may help for winter load forecasts. We make our choices based on cross-validation performance. Hobby et al. [20] study the residential energy consumption measured at an aggregate of all residential meters by separating the weather- and illumination-dependent load consumption from the residual consumption. To fit the weather- and illumination-dependent component, they use 24 cubic spline surfaces, one per hour of the day, indexed by apparent temperature and illumination. They observe a strong cubic dependency of load on temperature and an almost negligible linear term due to illumination.

Linear Time Series Approaches

Linear time series models directly exploit the historical values of the load, and enable iterated load forecasts based on previously observed load values. Gross and Galiana [15] wrote the reference paper on short-term load forecasting using statistical linear time series models, in particular the auto-regressive moving average (ARMA) model. These models were later extended to cope with seasonality and non-stationarity in so-called seasonal
auto-regressive integrated moving average (SARIMA) models. Further extensions were made in the work of Soares and Medeiros [35], who compared a two-level seasonal auto-regressive model and a dummy-adjusted seasonal auto-regressive integrated moving average (DASARIMA) model on Brazilian electric load data.

Seasonal Auto-Regressive Integrated Moving Average Models

In seasonal auto-regressive integrated moving average (SARIMA) models, the seasonality component comes from the daily load cyclic pattern. In this paper we apply the SARIMA model to the residuals from the LOESS fit (we refer to this method as "residual SARIMA"). We also considered the SARIMAX model, i.e., SARIMA with exogenous variables, namely temperature. However, the temperature coefficient is difficult to interpret and the model offers poor prediction accuracy compared to residual SARIMA. In contrast, residual SARIMA explicitly models the relationship between the time series and the exogenous variable. It is especially appealing when changes in the exogenous variable(s) are concurrent with changes in the original time series, which is the case with temperature and power usage.

A SARIMA model has seven order parameters. We can write the SARIMA(p, d, q)(P, D, Q)_S model as:

Φ_P(B^S) φ_p(B) (1 - B)^d (1 - B^S)^D X_t = Θ_Q(B^S) θ_q(B) ε_t

where B is the lag operator, which satisfies B^i(X_t) = X_{t-i}; Φ_P(B^S), Θ_Q(B^S), and (1 - B^S)^D are the corresponding auto-regressive, moving average, and differencing parts for the seasonal component, while φ_p(B), θ_q(B), and (1 - B)^d are the corresponding auto-regressive, moving average, and differencing parts for the non-seasonal component. S is the period length (S = 24 with hourly load readings and a daily cyclic pattern). The procedure for determining the order parameters follows the Box-Jenkins methodology, examining the auto-correlation function (ACF) and partial ACF (PACF) of the differenced and original time series.
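As a minimal illustration of the differencing operators in this model (a sketch on synthetic data, not the authors' code), applying the non-seasonal and seasonal differences (1 - B)(1 - B^24) removes both a linear trend and a daily cycle from an hourly series:

```python
def difference(x, lag=1):
    """Apply the differencing operator (1 - B^lag) to a series."""
    return [x[i] - x[i - lag] for i in range(lag, len(x))]

# Synthetic hourly series: a linear trend plus a repeating 24-hour pattern.
daily = [5, 3, 2, 4, 8, 12, 15, 14, 13, 12, 11, 10,
         10, 11, 12, 13, 15, 18, 20, 19, 16, 12, 8, 6]
series = [0.1 * t + daily[t % 24] for t in range(24 * 7)]

# (1 - B)(1 - B^24): first the non-seasonal, then the seasonal difference.
d1 = difference(series, lag=1)
d24 = difference(d1, lag=24)

# Trend and daily cycle cancel (up to floating-point noise).
print(round(max(abs(v) for v in d24), 9))  # → 0.0
```

On real load residuals the doubly differenced series is of course not zero; it is the stationary remainder that the moving average terms θ_q(B) and Θ_Q(B^S) then model.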
Investigating the order parameters on the one-year training data, we concluded that d = 1, D = 1, p = 0, P = 0, while q = 1 and Q = 1, essentially ignoring the auto-regressive component. Stationarity of the differenced data was checked using the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) test [2, 24]. The p-value was greater than 0.1, suggesting stationarity of the differenced data. Note that for the residual SARIMA, shortening the training period for estimating the parameters of the model, from one year down to the month immediately preceding the prediction (test) period, offered a better fit.

State-Space Models

The state-space model (SSM) is an online, adaptive method for forecasting. SSMs introduce hidden (unknown) variables representing the quantity to be estimated. The state-space model most widely used across scientific disciplines is the Kalman filter. In their review paper, Pigazo and Moreno [30] described how the Kalman filter can predict electric load values from the previous load measurements, and then update that prediction using other regressors such as temperature data. Harvey and Koopman [19] modeled load time series through cubic spline interpolation on intra-daily and intra-weekly patterns, where the spline coefficients were time-varying and updated using a Kalman filter. Dordonnat et al. [10] defined a custom state space that took calendar days into account and used it to predict nationwide French electric load. Taylor and McSharry [37] reformulated the state-space model as a multi-level linear time series model, which can handle weekly and daily seasonality in electric load.

State-space model on the spline fit of load residuals. The SSM in [19] does not require offline training and updates the model parameters in real time as each reading comes in. This method has been successfully applied to the online monitoring of time-varying network streams [4].
In that SSM, the computation for each update is inexpensive thanks to Kalman filtering, making it an ideal method for online forecasting. It uses B-splines to model the daily cyclic pattern, as the nonlinear
trends in the load time series can be transformed into a linear model with respect to the spline basis. Moreover, a cyclic spline basis ensures the periodic constraint (namely, the daily cyclic pattern of the load). We place K equally spaced knots defining the cyclic spline bases that cover a full day (here K = 8 for the 24 hourly load readings of a given day). The state-space model consists of two equations: the observation equation, which generates the load data from the hidden variable, and the state equation, which describes the dynamics of the hidden (spline coefficient) variable. The observation equation is:

ε_t = B α_t + u_t,   u_t ~ N(0, σ_u² I)

where ε_t is the one-day time series of the 24 hourly load residuals on day t; B is a 24-by-K matrix of B-spline bases, each column corresponding to one spline; α_t is the vector of coefficients for the splines; and u_t is a vector of i.i.d. Gaussian white noise with standard deviation σ_u. The vector α_t characterizes the daily pattern on day t. To accommodate day-to-day variations in the daily pattern α_t, we use a random walk for the spline coefficients, specified by the state equation:

α_t = α_{t−1} + v_t,   v_t ~ N(0, σ_v² I)

where the spline coefficients on day t are equal to those on day t − 1, plus i.i.d. white noise of variance σ_v². The above SSM is fitted online with a Kalman filter, such that the updating is done for each incoming data point. This ensures that forecasts are made in an online fashion. Hyper-parameters are estimated empirically by fitting them to spline coefficients for individual days. We also applied this approach directly to the log-transformed load, without the regression on temperature (results not reported here). The performance is slightly worse than using the residuals, but still reasonable. This approach would work well if temperature forecasts were unavailable or unreliable.

Holt-Winters double seasonal exponential smoothing.
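The HWT recursions spelled out in the next paragraphs can be sketched directly in code. The smoothing coefficients and the small seasonal periods in the toy example below are illustrative placeholders, not the values fitted in the paper:

```python
def hwt_step(y, state, m1, m2, alpha=0.1, delta=0.2, omega=0.2):
    """One error-correction update of double seasonal exponential smoothing:
    the level l, the intra-day terms d (buffer of length m1), and the
    intra-week terms w (buffer of length m2) are all nudged by the
    one-step-ahead error e."""
    l, d, w, _ = state
    e = y - (l + d[0] + w[0])            # d[0] plays d_{t-m1}, w[0] plays w_{t-m2}
    return (l + alpha * e,
            d[1:] + [d[0] + delta * e],  # rotate the intra-day buffer
            w[1:] + [w[0] + omega * e],  # rotate the intra-week buffer
            e)

def hwt_forecast(state, k, m1, m2, phi=0.9):
    """k-step-ahead forecast: level + both seasonal terms + damped last error."""
    l, d, w, e = state
    return l + d[(k - 1) % m1] + w[(k - 1) % m2] + (phi ** k) * e
```

For hourly load one would use m1 = 24 and m2 = 168; the toy periods simply make a self-contained example converge quickly.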
The HWT model [37] is a variation on the state-space model designed specifically for data that have two seasonalities: an intra-day (24 h) seasonality and an intra-week (168 h) seasonality. The state equations involve three state variables, essentially corresponding to the smoothing, daily, and weekly effects in the data:

ŷ_t(k) = l_t + d_{t−m1+k} + w_{t−m2+k} + φ^k e_t
e_t = y_t − (l_{t−1} + d_{t−m1} + w_{t−m2})
l_t = l_{t−1} + α e_t
d_t = d_{t−m1} + δ e_t
w_t = w_{t−m2} + ω e_t

In the above equations, ŷ_t(k) is the k-step-ahead forecast of the load y, l is the exponentially smoothed first-order auto-regressive component of the load, d is the intra-day seasonal component of the load (m1 = 24 hours), and w is the intra-week seasonal component of the load (m2 = 168 hours); finally, e is the exponentially decaying error term. To initialize the values of the state variables, the model is run on about one month of data. The four coefficients α, δ, ω, and φ are fitted by least-squares optimization (i.e., by minimizing the error between the actual observed load and the predicted load); we use a simple heuristic search based on genetic algorithms to find their optimal values.

Nonlinear Time Series Models
Machine learning (ML) techniques focus on learning a prediction function that takes as input the historical load and other data such as weather, and outputs the predicted load. Unlike the statistical methods reviewed in the previous section, the ML methods chosen in our study enable us to learn a nonlinear prediction function. Parametric machine learning techniques focus on tuning the parameters of the load prediction function. Khotanzad et al. [22] described a state-of-the-art implementation of neural networks for load forecasting that has been used by several electrical companies. Fan and Chen [12] employed self-organizing maps to cluster the load and weather data into several regimes, before using them as inputs to a nonlinear regression function.
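This supervised framing, mapping a short history of load and concurrent weather to the next load value, amounts to building input/target pairs. A minimal sketch, with a hypothetical windowing that is not necessarily the paper's exact feature set:

```python
def make_supervised(load, temp, window):
    """Turn load and temperature series into (features, target) pairs:
    each input is the last `window` load values plus the concurrent
    temperature, and the target is the load at the next time step."""
    X, y = [], []
    for t in range(window, len(load)):
        X.append(load[t - window:t] + [temp[t]])
        y.append(load[t])
    return X, y
```

Any regression method, linear or nonlinear, can then be trained on the resulting (X, y) pairs.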
We focus in this paper on kernel-based methods, which learn the relationship between data samples: in this case, each sample corresponds to a pair of historical load and weather data, taken over a short time interval, and the electric load at the next time point. We compared three standard, proven techniques: weighted kernel regression (WKR) [6], support vector regression (SVR) [7], and kernel ridge regression with learnable feature coefficients. In addition to kernel methods, we investigated simple neural network models with one hidden layer. Although the latter achieved good performance at one-hour prediction horizons, they performed poorly on iterated forecasts, and the error would rapidly increase after a few iterations of the neural network predictor (results not reported). Research on modeling dynamic systems using one-hidden-layer neural networks has indeed shown that these nonlinear models are very sensitive to noise and can generate predictions that diverge from the training set patterns. More complex neural network models that provide stable iterated predictions and are capable of learning long-term dependencies [1] are beyond the scope of this paper. In parallel, it has been shown experimentally that kernel methods such as SVR provide more stable iterated predictions on highly nonlinear time series than the basic embodiment of neural networks [26]. While they do not model long-term dependencies, they at least provide a solution that is bounded and stays within the patterns seen in the training set. This statement does not apply to more complex neural network architectures (which involve state-space models and learn hidden representations of time series).

Weighted kernel regression. Weighted kernel regression (WKR) [28] is the simplest among the non-parametric regression algorithms.
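The Nadaraya-Watson estimator described in this subsection can be sketched in a few lines (an illustrative implementation, not the paper's code):

```python
import numpy as np

def wkr_predict(x, Y, L, sigma):
    """Weighted kernel regression (Nadaraya-Watson).

    x: (K,) query sample; Y: (T, K) training samples; L: (T,) training loads;
    sigma: Gaussian spread. Returns the kernel-weighted average of the loads."""
    d2 = np.sum((Y - x) ** 2, axis=1)   # squared distance to every training sample
    k = np.exp(-0.5 * d2 / sigma ** 2)  # Gaussian kernel weights
    return float(np.sum(L * k) / np.sum(k))
```

With a small sigma the prediction is dominated by the nearest training samples; with a large sigma it tends toward the global mean of the training loads, which is why the spread must be cross-validated.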
It consists of computing the Euclidean distance between the input sample x and each data point sample y(t) at time t in the training set, and then using it in a Gaussian kernel function k(x, y(t)) that can be seen as a symmetric measure of similarity between the two samples x and y(t). The Gaussian kernel takes a value equal to one when x and y(t) are identical, i.e., when their distance is equal to zero. The kernel function takes decreasing values down to zero as the input sample x becomes dissimilar from the training point y(t), i.e., as their distance increases:

k(x, y(t)) = exp(−(1/2) Σ_{k=1..K} (x_k − y_k(t))² / σ²)

The kernel function is used as the weight of data point y(t) in the decision function, which is a weighted interpolation over the entire training dataset:

L̂ = Σ_t L_t k(x, y(t)) / Σ_t k(x, y(t))

WKR assumes smoothness within the input data, controlled through a spread coefficient σ that depends on the dataset and is fitted by n-fold cross-validation on the training data. We resorted to five-fold cross-validation on five non-overlapping sets. More specifically, for each choice of hyper-parameters, we used 80 percent of the training data to fit the model and the remaining 20 percent to compute the prediction performance, and repeated that step five times.

Support vector regression. Support vector machines (SVMs) [9, 34] are a popular and efficient statistical learning tool that can be qualified as mostly non-parametric. SVMs are also called maximum margin classifiers, because their decision boundary is, by construction, as far as possible from the training data points, so that they remain well separated according to their labels. Maximum margin training enables better generalization of the classifier to unseen examples. The work on support vector regression (SVR) by Chen et al. [7] was indeed the winning entry in a competition on the prediction of electric load and can be considered a state-of-the-art method.
SVR relies on the definition of a kernel function k(x, y(t)) and on a decision function f(x) that is defined in terms of the kernel function between the sample x and the data points in the training set, but involving a minimal, sparse set of support vectors S = {y(t)}, each given a weight α_t. Learning in SVMs corresponds to finding a minimal set S of support vectors that minimizes the error on the training labels:

L̂ = Σ_{t∈S} α_t L_t k(x, y(t))
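Evaluating this sparse decision function, given support vectors and weights produced by training, can be sketched as follows. The L_t factor mirrors the equation in the text, whereas common SVR formulations absorb the label into the weight; all names here are illustrative:

```python
import numpy as np

def svr_predict(x, support, alphas, loads, sigma):
    """Evaluate a sparse kernel decision function.

    support: (S, K) support vectors selected at training time; alphas: (S,)
    learned weights; loads: (S,) associated load values; sigma: Gaussian spread.
    Unlike WKR, the sum runs only over the sparse support set."""
    d2 = np.sum((support - x) ** 2, axis=1)
    k = np.exp(-0.5 * d2 / sigma ** 2)
    return float(np.sum(alphas * loads * k))
```

The sparsity is what keeps prediction cheap: only the support vectors, typically a small fraction of the training set, enter the sum.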
We cross-validated the SVM's regularization coefficient C as well as the Gaussian spread coefficient using five-fold cross-validation.

Kernel ridge regression. Kernel ridge regression is a generalized version of support vector regression. One can see it as a trivial extension of SVR in which the Gaussian spread coefficient is tuned for each input regressor (feature) separately, using a gradient-descent optimization procedure and cross-validation [5]. This method, which we call sigma-SVR, differs from SVR by this simple equation:

k(x, y(t)) = exp(−(1/2) Σ_{k=1..K} (x_k − y_k(t))² / σ_k²)

Short-Term Load Forecasting Results
In our investigations, we used the standard demand prediction metric, the mean absolute percentage error (MAPE), which, for a set of N load values L_t (e.g., in watt-hours, Wh) and associated load forecasts L̂_t, is defined as:

MAPE = (1/N) Σ_{t=1..N} |L̂_t − L_t| / L_t

In previously published STLF studies on city-wide and country-wide load forecasting, the MAPE typically ranges from one to three percent at next-hour forecast horizons to about four percent at next-day horizons.

System-Wide Predictions
In a first series of experiments, we compared the performance of three iterated predictors relying on nonlinear time series models based on kernel methods: weighted kernel regression (WKR), support vector regression (SVR), and sigma-SVR (ssvr), on system-wide load from 2007 to 2009. We would train the predictors on one year of load and weather forecasts and make predictions for the following month, repeating this procedure 24 times for January 2008 through December 2009, and averaging the MAPE performance, for each prediction horizon, over all 24 months. Our approach essentially simulated an STLF system retrained every month to fit mid- to long-term evolutions of the city-wide load consumption and of the climate.
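The MAPE metric defined above can be computed as (a minimal sketch; the result is in percent):

```python
def mape(actual, forecast):
    """Mean absolute percentage error over paired load values and forecasts."""
    assert len(actual) == len(forecast) and all(a > 0 for a in actual)
    n = len(actual)
    return 100.0 * sum(abs(f - a) / a for a, f in zip(actual, forecast)) / n
```

Because the error is normalized by the actual load, MAPE is comparable across aggregates of very different sizes, which matters for the per-aggregate comparisons later in the paper.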
Unsurprisingly, as reported in Figure 5, the more complex kernel method that enabled us both to weigh each input feature (e.g., load at a specific time, time of day, temperature, or humidity forecast) individually and to select the support vectors, namely ssvr, achieved the best results: MAPE = 1.2 percent at the one-hour horizon and MAPE = 4.7 percent after h = 24 hours. The Steadman apparent temperature slightly outperformed raw temperature (decreasing the MAPE). We then compared the performance of iterated ssvr to direct prediction using ssvr, as well as to the remaining, linear, models, namely Holt-Winters double-exponential smoothing (HWT), state-space models with B-spline fit on load residue (SSM), and seasonal auto-regressive integrated moving average (SARIMA), all operating on the load residue after fitting the load on temperature and hour of the day (see "Temperature Regression and Load Residual"). As can be seen in Figure 6 (total system-wide load, 2008-2009) and Figure 7 (the aggregated load from 2012), the overall best algorithms were HWT and ssvr. HWT achieved MAPE = 4 percent at h = 24 on the 2008-2009 dataset, slightly outperforming ssvr. The performance on the aggregated (2012) dataset was worse, because the set of meters considered (32,000) was only a subset of the total city load. Figure 8 and Figure 9 show what these predictions actually look like, at h = 1 and h = 24 respectively.

Performance on Meter Aggregates
We observed that the load forecasting performance seemed to worsen for lower-level aggregates, and tried to verify the hypothesis that, independently of the method, aggregates with large forecast errors are those with very few meters.
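The iterated prediction scheme compared above, in which one-step forecasts are fed back as inputs to reach longer horizons, can be sketched generically (the toy one-step model below is hypothetical):

```python
def iterated_forecast(history, one_step_model, h):
    """Iterated multi-horizon forecasting: apply a one-step predictor h times,
    appending each forecast to the history as if it were an observation.
    one_step_model maps a history list to the next predicted value."""
    hist = list(history)
    out = []
    for _ in range(h):
        nxt = one_step_model(hist)
        out.append(nxt)
        hist.append(nxt)
    return out
```

A direct predictor, by contrast, is trained separately for each horizon h and never feeds its own output back, trading extra training cost for immunity to error accumulation.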
As can be seen in Figure 10, we trained about 400 STLF predictors on different meter aggregates (feeders, substations, and system-wide) and plotted the performance (MAPE at h = 1) versus the size of the meter aggregate (which we can measure, for instance, as the number of meters connected to that aggregate, or as the peak hourly load measured at that aggregate). The MAPE would decrease as a function of meter
aggregate size (the more meters in an aggregate, the better the MAPE). We hypothesize that aggregates connected to more meters tend to behave in a more predictable way: the effect of weather (temperature) is prominent and there is an averaging effect due to the large sample (hundreds or thousands) of meters. Some meter aggregates (see Figure 11) can nevertheless be relatively well predictable, despite their small size (here, 12 meters). At the substation or system level, accurate forecasts can be useful input to strategic cost-saving decisions. At the level of individual meters, the utility is not interested in predicting precisely how much electricity will be used every hour, but rather in detecting large spikes of abnormal activity. Such abnormal usage spikes could be indicative of a system failure in the home (e.g., a malfunctioning heat pump), and could be useful information to the customer. Accurate forecasts can serve as baselines for detecting such anomalies.

Figure 5. System-wide load forecasts using kernel methods for nonlinear time series modeling (WKR, SVR, and sigma-SVR, with temperature, Steadman apparent temperature, and humidity inputs), as MAPE (%) versus prediction horizon (h). These curves are the average of monthly MAPE performance over two years (2008-2009).

Discussion
In this section, we discuss the practical considerations for the implementation and deployment of a load forecasting system, including modularity and parallelization, running time considerations, and robustness of the forecasts.
Independent STLF for Each Meter Aggregate
As explained previously, the meters, feeders, and substations considered in this study of a mid-sized
U.S. city are interconnected in a hierarchical distribution network. Such a rich hierarchy invites a study of the correlations, or even interdependencies, among all metered electrical components. The obvious advantage lies in exploiting redundancies among all the meters (as households in the same urban area and under identical climatic conditions might present similar load consumption profiles).

Figure 6. System-wide load forecasts using various families of prediction algorithms (HWT, iterated and direct sigma-SVR, SSM, and SARIMA), using the total load consumption of a mid-sized U.S. city. The curves represent average monthly MAPE performance over two years, from 2008 to 2009.

From a systems perspective, it may be desirable to make the load prediction component as modular as possible and to forecast load independently for each meter or load aggregate. In this study, all the predictions at the same level of aggregation are considered independent from the point of view of load forecasting, despite the correlations between each feeder connected to a given substation and the substation itself. There are several justifications for this approach. First of all, the meters in our system are often updated asynchronously, or even suffer downtimes not necessarily related to power outages. It could therefore be very detrimental to the operation of the entire system to make it wait for synchronous meter updates.
Here, we allow for asynchronous data updates and load forecasts within the prediction timeframe, which happens at a granularity of one hour.
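Since each aggregate's forecaster is independent, the modules can run in parallel as soon as their own data arrive. A minimal sketch with hypothetical function names (a real module would read meter and weather data from the database and write forecasts back):

```python
from concurrent.futures import ThreadPoolExecutor

def forecast_aggregate(agg_id):
    """Stand-in for one independent STLF module; returns a placeholder
    24-hour forecast for the given aggregate identifier."""
    return agg_id, [0.0] * 24

def run_all(aggregate_ids, workers=8):
    """Run the per-aggregate forecasters in parallel.
    No module waits on any other, so a slow or stale aggregate
    never blocks the rest of the system."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(forecast_aggregate, aggregate_ids))
```

This mirrors the design choice in the text: independence is what makes the parallelization trivial.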
Figure 7. Comparison of different load forecasting algorithms (HWT, iterated and direct sigma-SVR, SSM, and SARIMA) for the 32,000-meter load aggregate. The curves represent average monthly MAPE performance over the six months from January to June 2012.

Secondly, enforcing independence at each level of aggregation enables us to trivially parallelize the operation of the STLF modules for all the aggregates. Each module's only points of input/output are database accesses, to read the latest meter historical load data as well as the associated geo-specific weather data, and to return forecasts at different horizons.

Running Time and Performance
The parallelism enabled by the independent STLF operations facilitates the implementation of our system in a multi-threaded environment. Essentially, the process for generating hourly load forecasts at an aggregate level can be run as soon as all the hourly weather and meter data for the aggregate components have been collected. The system does not need to wait for the completion of all the prediction processes, each of which takes care of updating the database with forecasts independently. For model development purposes, we have been using a 16-core 2.3 GHz Intel Xeon* Linux* server with 24 GB of random access memory (RAM), running Ubuntu*. The deployment system is a 32-core, 128 GB RAM Linux system running Red Hat. The sigma-SVR and HWT algorithms are implemented in Matlab
(or its open-source clone, Octave) and the SARIMA and SSM methods run in R. Our system avoids major computational bottlenecks at runtime. The SARIMA, SSM, and HWT methods can make essentially instantaneous forecasts on the 400 or so meter aggregates. The kernel-based predictions by the sigma-SVR algorithm require, for each meter aggregate and for each prediction horizon (up to 168), a few matrix multiplications, with matrix dimensions on the order of 10,000. The latter can bring the computational time to several minutes, once per hour. The largest computational requirements are due to training the prediction algorithms, which, as we explained, happens once a month. While the SARIMA and SSM methods are, again, negligible in terms of training time, it typically takes a few hours to cross-validate the state parameters of the HWT model and about one day to learn the feature and Lagrange coefficients of the sigma-SVR predictor. This is currently handled by scheduling learning for all the models over several days.

Figure 8. Predictions (in kWh) and prediction errors by four algorithms (HWT, ssvr, SARIMA, SSM) and a simple weather-fit model (in gray) over one week. These plots show the predictions at horizon h = 1 hour.

Ensemble Prediction
Given that we have four different prediction algorithms (HWT, ssvr, SARIMA, SSM), we can study methods for combining their predictions for potentially better accuracy and robustness to noise
and random errors. We conjecture that possibility after observing that the predictions of the four algorithms have largely uncorrelated errors, as visible in Figure 8 for an example of system-wide load forecasts over one week at the one-hour horizon, and in Figure 9, on the same data and time period, at the 24-hour horizon. Systematically generated ensembles are used extensively in numerical weather forecasting [29]. Our approach, on the other hand, needs to work with a small ensemble, each member of which has an independent ability to achieve a certain level of accuracy. In this case, simple combination strategies are desirable. We considered five simple schemes for combining the predictions:
1. Mean of the four predictions,
2. Median of the four predictions,
3. Switching among the four predictions, using the one with the smallest absolute error at the time when the prediction is made,
4. Mean of HWT and ssvr, and
5. Switching between HWT and ssvr, using the one with the smallest absolute error at the time when the prediction is made.

Figure 9. Predictions (in kWh) and prediction errors by four algorithms (HWT, ssvr, SARIMA, SSM) and a simple weather-fit model (in gray) over one week. These plots show the predictions at horizon h = 24 hours.

We summarize in Figure 12 the performance of these algorithms and the combined predictions for the system-wide aggregates from 2008 to 2009. The final performance of the mean of the HWT and ssvr predictions on the system-wide data reaches a MAPE of around 3 percent at a 24-hour
prediction horizon, down from the approximately four percent achieved by HWT alone. We can see that, by most of the performance criteria considered, either the mean or the median of the four predictors gives the best performance, and it is better than the best individual method, except at the one-hour-ahead horizon (which is best served by ssvr). Further investigation will examine to what extent this observation generalizes to smaller meter aggregates.

Figure 10. Relationship between load forecasting accuracy (MAPE at h = 1, in log10 scale, for HWT and sigma-SVR) and the size of the load aggregate (i.e., the log10 number of meters connected to the electrical structure). The monthly MAPE performance has been averaged over six months, from January to June 2012.

Conclusion
We methodically evaluated state-of-the-art STLF methods on a unique dataset consisting of load aggregates from individual meters, and showed a dependency of the load forecasting performance on the size of the aggregate. In this study, we considered load forecasting at each meter aggregate as an independent task, and did not fully exploit the pyramidal structure of the meter-feeder-substation network. Future investigations could explore such hierarchical time series prediction.

Acknowledgements
The authors wish to acknowledge the help and contributions of former and current members of Alcatel-Lucent Bell Labs: Gary Atkinson, Kenneth Budka, Jayant Deshpande, Frank Feather, Zhi He, Marina Thottan and Kim Young Jin, as well as the
Figure 11. Predictions (in Wh) by two algorithms (HWT and ssvr) and a simple weather-fit model (in gray) over three weeks in 2012, for a selected feeder connected to 12 individual meters. These plots show the predictions at horizon h = 1 hour.
Figure 12. Performance of the four predictors and their combinations, benchmarked against a predictor using only regression on temperature, for prediction horizons from one to 24 hours ahead. Panels show (a) mean absolute percentage error, (b) mean absolute error, (c) root mean squared error, (d) worst overshoot, and (e) worst undershoot.
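The five combination schemes listed above are simple enough to sketch directly (the per-horizon bookkeeping of each predictor's recent errors is omitted; names are illustrative):

```python
def combine_mean(preds):
    """Schemes 1 and 4: average the supplied predictions."""
    return sum(preds) / len(preds)

def combine_median(preds):
    """Scheme 2: median of the supplied predictions."""
    s = sorted(preds)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def combine_switch(preds, last_abs_errors):
    """Schemes 3 and 5: pick the predictor whose absolute error was
    smallest at the time the forecast is made."""
    best = min(range(len(preds)), key=lambda i: last_abs_errors[i])
    return preds[best]
```

Schemes 4 and 5 simply call the same functions with only the HWT and ssvr predictions.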
utility company for providing the meter and weather dataset used in this study.

*Trademarks
Linux is a trademark of Linus Torvalds. Matlab is a registered trademark of The Mathworks, Inc. Perl is a trademark of the Perl Foundation. Ubuntu is a registered trademark of Canonical Limited. Xeon is a registered trademark of Intel Corporation.

References
[1] Y. Bengio, P. Simard, and P. Frasconi, Learning Long-Term Dependencies with Gradient Descent Is Difficult, IEEE Trans. Neural Networks, 5:2 (1994).
[2] A. Bhargava, On the Theory of Testing for Unit Roots in Observed Time Series, Rev. Econom. Stud., 53:3 (1986).
[3] C. E. Borges, Y. K. Penya, and I. Fernández, Optimal Combined Short-Term Building Load Forecasting, Proc. IEEE PES Innovative Smart Grid Technol. Asia Conf. (ISGT '11) (Perth, Aus., 2011).
[4] J. Cao, A. Chen, T. Bu, and A. Buvaneswari, Monitoring Time-Varying Network Streams Using State-Space Models, Proc. 28th IEEE Internat. Conf. on Comput. Commun. (INFOCOM '09) (Rio de Janeiro, Bra., 2009).
[5] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing Multiple Parameters for Support Vector Machines, Mach. Learn., 46:1-3 (2002).
[6] W. Charytoniuk, M. S. Chen, and P. Van Olinda, Nonparametric Regression Based Short-Term Load Forecasting, IEEE Trans. Power Syst., 13:3 (1998).
[7] B.-J. Chen, M.-W. Chang, and C.-J. Lin, Load Forecasting Using Support Vector Machines: A Study on EUNITE Competition 2001, IEEE Trans. Power Syst., 19:4 (2004).
[8] W. S. Cleveland and S. J. Devlin, Locally-Weighted Regression: An Approach to Regression Analysis by Local Fitting, J. Amer. Statist. Assoc., 83:403 (1988).
[9] C. Cortes and V. Vapnik, Support-Vector Networks, Mach. Learn., 20:3 (1995).
[10] V. Dordonnat, S. J. Koopman, M. Ooms, A. Dessertaine, and J. Collet, An Hourly Periodic State Space Model for Modelling French National Electricity Load, Internat. J.
Forecasting, 24:4 (2008).
[11] Enel, Smart Metering System, <technology/zero_emission_life/smart_networks/smart_meters.aspx>, accessed Mar. 5.
[12] S. Fan and L. Chen, Short-Term Load Forecasting Based on an Adaptive Hybrid Method, IEEE Trans. Power Syst., 21:1 (2006).
[13] E. A. Feinberg and D. Genethliou, Load Forecasting, Applied Mathematics for Restructured Electric Power Systems: Optimization, Control, and Computational Intelligence (J. H. Chow, F. F. Wu, and J. A. Momoh, eds.), Springer, New York, 2005.
[14] P. Goovaerts, Geostatistics for Natural Resources Evaluation, Oxford University Press, New York.
[15] G. Gross and F. D. Galiana, Short-Term Load Forecasting, Proc. IEEE, 75:12 (1987).
[16] F. Gullo, G. Ponti, A. Tagarelli, S. Iiritano, M. Ruffolo, and D. Labate, Low-Voltage Electricity Customer Profiling Based on Load Data Clustering, Proc. 13th Internat. Database Eng. and Applications Symp. (IDEAS '09) (Cetraro, Calabria, Ita., 2009).
[17] H. Hahn, S. Meyer-Nieberg, and S. Pickl, Electric Load Forecasting Methods: Tools for Decision Making, European J. Oper. Res., 199:3 (2009).
[18] T. M. Hansen, mgstat: A Geostatistical Matlab Toolbox.
[19] A. Harvey and S. J. Koopman, Forecasting Hourly Electricity Demand Using Time-Varying Splines, J. Amer. Statist. Assoc., 88:424 (1993).
[20] J. D. Hobby, A. Shoshitaishvili, and G. H. Tucci, Analysis and Methodology to Segregate Residential Electricity Consumption in Different Taxonomies, IEEE Trans. Smart Grid, 3:1 (2012).
[21] T. Hong, P. Wang, and H. L. Willis, A Naïve Multiple Linear Regression Benchmark for Short Term Load Forecasting, Proc. IEEE Power and Energy Soc. Gen. Meeting (Detroit, MI, 2011).
[22] A. Khotanzad, R. Afkhami-Rohani, and D. Maratukulam, ANNSTLF Artificial Neural Network Short-Term Load Forecaster
Generation Three, IEEE Trans. Power Syst., 13:4 (1998).
[23] J. Z. Kolter and M. J. Johnson, REDD: A Public Data Set for Energy Disaggregation Research, Proc. KDD Workshop on Data Mining Applications in Sustainability (SustKDD '11) (San Diego, CA, 2011).
[24] D. Kwiatkowski, P. C. B. Phillips, P. Schmidt, and Y. Shin, Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root, J. Econometrics, 54:1-3 (1992).
[25] E. Kyriakides and M. Polycarpou, Short Term Electric Load Forecasting: A Tutorial, Trends in Neural Computation (K. Chen and L. Wang, eds.), Springer, Berlin, New York, 2007.
[26] D. Mattera and S. Haykin, Support Vector Machines for Dynamic Reconstruction of a Chaotic System, Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds.), MIT Press, Cambridge, MA, 1999.
[27] A. Muñoz, E. F. Sánchez-Úbeda, A. Cruz, and J. Marín, Short-Term Forecasting in Power Systems: A Guided Tour, Handbook of Power Systems II: Energy Systems (S. Rebennack, P. M. Pardalos, M. V. F. Pereira, and N. A. Iliadis, eds.), Springer, Heidelberg, New York, 2010.
[28] E. A. Nadaraya, On Estimating Regression, Theory Probab. Appl., 9:1 (1964).
[29] T. N. Palmer, G. J. Shutts, R. Hagedorn, F. J. Doblas-Reyes, T. Jung, and M. Leutbecher, Representing Model Uncertainty in Weather and Climate Prediction, Annual Review of Earth and Planetary Sciences, Volume 33 (R. Jeanloz, A. L. Albee, and K. C. Burke, eds.), Annual Reviews, Palo Alto, CA, May 2005.
[30] A. Pigazo and V. M. Moreno, Estimation of Electrical Power Quantities by Means of Kalman Filtering, Kalman Filter: Recent Advances and Applications (A. Pigazo and V. M. Moreno, eds.), In-Tech, Rijeka, Cro., Apr. 2009.
[31] S. Rogai, Keynote I. Telegestore Project Progresses and Results, IEEE Internat. Symp. on Power Line Commun. and Its Applications (ISPLC '07) (Pisa, Ita., 2007).
[32] J.-H. Shin, B.-J. Yi, Y.-I. Kim, H.-G. Lee, and K. H.
Ryu, Spatiotemporal Load-Analysis Model for Electric Power Distribution Facilities Using Consumer Meter-Reading Data, IEEE Trans. Power Delivery, 26:2 (2011).
[33] R. P. Singh, P. X. Gao, and D. J. Lizotte, On Hourly Home Peak Load Prediction, Proc. 3rd IEEE Internat. Conf. on Smart Grid Commun. (SmartGridComm '12) (Tainan, Twn., 2012).
[34] A. J. Smola and B. Schölkopf, A Tutorial on Support Vector Regression, Stat. Comput., 14:3 (2004).
[35] L. J. Soares and M. C. Medeiros, Modeling and Forecasting Short-Term Electricity Load: A Comparison of Methods with an Application to Brazilian Data, Internat. J. Forecasting, 24:4 (2008).
[36] R. G. Steadman, A Universal Scale of Apparent Temperature, J. Climate Applied Meteorology, 23:12 (1984).
[37] J. W. Taylor and P. E. McSharry, Short-Term Load Forecasting Methods: An Evaluation Based on European Data, IEEE Trans. Power Syst., 22:4 (2007).

(Manuscript approved October 2013)

PIOTR MIROWSKI is a member of technical staff at Bell Labs in Murray Hill, New Jersey. He obtained his Ph.D. in computer science at the Courant Institute of Mathematical Sciences at New York University, New York City, with a thesis in machine learning written under the supervision of Prof. Yann LeCun. He also has a master's degree in computer science from École Nationale Supérieure ENSEEIHT in Toulouse, France. Prior to joining Bell Labs he worked as a research engineer in geology at Schlumberger Research. During his Ph.D. studies, he interned at the NYU Medical Center (investigating epileptic seizure prediction from EEG), at Google, at the Quantitative Analytics department of Standard & Poor's, and at AT&T Labs Research. His current research focuses on machine learning methods for text analysis and query ranking, on computer vision and simultaneous localization and mapping for robotics, on indoor localization, on time series modeling and load forecasting for smart grids, and on deep learning.
SINING CHEN is a member of technical staff in the Statistics and Learning Research Department at Bell Labs in Murray Hill, New Jersey. She received a B.S. in applied mathematics from Tsinghua University, Beijing, China, and a Ph.D. in statistics from Duke University, Durham, North Carolina. After completing her doctorate, Dr. Chen worked at Johns Hopkins University, first as a postdoctoral fellow in the School of Medicine and then as an assistant professor in the School of Public Health. Prior to joining Bell Labs, she was an associate professor in the Department of Biostatistics at the University of Medicine and Dentistry of New Jersey. Her current research interests include Bayesian methods and forecasting.

DOI: /bltj Bell Labs Technical Journal 157

TIN KAM HO leads the Statistics of Communication Systems Research Activity at Bell Labs in Murray Hill. She pioneered research in multiple classifier systems, random decision forests, and data complexity analysis, and has pursued applications of automatic learning in many areas of science and engineering. She also led major efforts on modeling and monitoring large-scale optical transmission systems. Recently she has worked on wireless geo-location, video surveillance, smart grid data mining, and customer experience modeling. Her contributions were recognized by a Bell Labs President's Gold Award, two Bell Labs Teamwork Awards, a Young Scientist Award in 1999, and the 2008 Pierre Devijver Award for Statistical Pattern Recognition. She is an elected Fellow of the IAPR (International Association for Pattern Recognition) and the IEEE, and served as editor-in-chief of the journal Pattern Recognition Letters. She received a Ph.D. in computer science from the State University of New York (SUNY) at Buffalo.

CHUN-NAM YU is a member of technical staff in the Statistics and Learning Department at Bell Labs in Murray Hill, New Jersey. He received a B.A. degree in mathematics and computer science from Oxford University, United Kingdom, and an M.S. degree and Ph.D. in computer science from Cornell University, Ithaca, New York. Prior to joining Bell Labs, he was a postdoctoral fellow at the Alberta Innovates Centre of Machine Learning (AICML) at the University of Alberta in Edmonton, Alberta, Canada.
His research interests include structured output learning, graphical models, kernel methods, optimization, and biomedical applications.