Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence




Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling
SAS Institute, Inc., 100 SAS Campus Dr., Cary, NC 27513, USA
{taiyeong.lee, yongqiao.xiao, xiangxiang.meng, david.duling}@sas.com

ABSTRACT
One of the key tasks in time series data mining is to cluster time series. However, traditional clustering methods focus on the similarity of time series patterns in past time periods. In many cases, such as retail sales, we would prefer to cluster based on the future forecast values. In this paper, we show an approach that clusters forecasts, or forecast time series patterns, based on the Kullback-Leibler divergences among the forecast densities. We use the same normality assumption for the error terms that is used in the calculation of forecast confidence intervals from the forecast model, so the method does not require any additional computation to obtain the forecast densities for the Kullback-Leibler divergences. This makes our approach suitable for mining very large sets of time series. A simulation study and two real data sets are used to evaluate and illustrate our method. It is shown that using the Kullback-Leibler divergence results in better clustering when there is a degree of uncertainty in the forecasts.

Keywords
Time Series Clustering, Time Series Forecasting, Kullback-Leibler Divergence, Euclidean Distance

1. INTRODUCTION
Time series clustering has been used in many data mining areas such as retail, energy, weather, quality control charts, stock/financial data, and sequence/time series data generated by medical devices [3, 12, 14]. Typically, the observed data is used directly or indirectly as the source for time series clustering. For example, we can cluster the CO2 emission patterns of countries based on their historical data or based on features extracted from the historical data. Numerous similarity/dissimilarity/distance/divergence measures [4, 5, 8] have been proposed and studied. Another category of time series clustering methods is model-based clustering, which clusters time series using the parameter estimates of the models or other statistics based on the errors associated with the estimates [10, 13]. In [11], Liao summarized these time series clustering methods into three categories: raw-data based, extracted-feature based, and model based.

Instead of using the observed time series, extracted features of the observations, or models of the past time periods, we consider the forecasts themselves at a specific future time point or during a future time period. For retail stores, for example, we can cluster the stores based on their sales forecast distributions at a particular future time instead of the observed sales data. Alonso [1] used density forecasts for time series clustering at a specific future time point. However, since that method requires bootstrap samples, nonparametric forecast density estimation, and a specific distance measurement between the forecast densities, it is not an efficient approach for clustering a large number of time series. In this paper, we use the Kullback-Leibler divergence [9] for clustering the forecasts at a future time point. Under the normality assumption on the errors, the Kullback-Leibler distance can be computed directly from the forecast means and variances provided by the forecast model. We also extend our method to cluster the forecasts at all future points in the forecast horizon to capture forecast patterns that could evolve over time.
For instance, in the retail industry, business decisions such as stocking up or rearranging the shelves can be made after clustering the products based on their sales forecasts. Similarly, the clustering can be carried out at the store level, so that store sales or pricing policies can be set for each group of stores. Typically, the number of time series in the retail industry is very large, and the industry requires fast forecasting as well as fast clustering. The proposed method is suitable for clustering large numbers of forecasts.

The paper is organized as follows. In Sections 2 and 3, we describe the KL divergence as a distance measure between forecast densities and explain how to cluster the forecasts. Following that, a simulation study and real data analyses are presented.

2. DISTANCE MEASURE FOR CLUSTERING FORECASTS
Since forecasts are not observed values, the Euclidean distance between two forecast values may not be close to the true distance. Our proposed method uses a symmetric version of the Kullback-Leibler divergence to calculate the distance between the forecast densities under the normality assumption on the forecast error terms. In other words, both the mean (the forecast) and the variance (the forecast error variance) are used in the calculation of the distance.

2.1 Kullback-Leibler Divergence
Suppose P_0 and P_1 are the probability distributions of two continuous random variables. The Kullback-Leibler divergence of P_0 from P_1 is defined as

KLD(P_1 \| P_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)} \, dx \qquad (1)

where p_0 and p_1 are the density functions of P_0 and P_1. The Kullback-Leibler divergence KLD(P_1 \| P_0) is not a symmetric measure of the difference between P_0 and P_1, but in clustering we need a symmetric distance measure for the items (in this paper, time series) to be grouped. A well-known symmetric version of the Kullback-Leibler divergence is the average of the two divergences KLD(P_1 \| P_0) and KLD(P_0 \| P_1),

KLD_{avg}(P_1, P_0) = \frac{1}{2}\left\{ KLD(P_1 \| P_0) + KLD(P_0 \| P_1) \right\} = \frac{1}{2} \int \left( p_1(x) - p_0(x) \right) \log \frac{p_1(x)}{p_0(x)} \, dx \qquad (2)

This is also known as the J-divergence of P_0 and P_1 [7]. When P_1 and P_0 are two normal distributions, that is, P_1 \sim N(\mu_1, \sigma_1^2) and P_0 \sim N(\mu_0, \sigma_0^2), KLD_{avg} can be simplified as follows,

KLD(P_1 \| P_0) = \frac{1}{2\sigma_0^2}\left[ (\mu_1 - \mu_0)^2 + \sigma_1^2 - \sigma_0^2 \right] + \log \frac{\sigma_0}{\sigma_1}

KLD_{avg}(P_1, P_0) = \frac{1}{2}\left( \frac{1}{2\sigma_0^2} + \frac{1}{2\sigma_1^2} \right)(\mu_1 - \mu_0)^2 + \frac{(\sigma_1^2 - \sigma_0^2)^2}{4\sigma_0^2\sigma_1^2} \qquad (3)

In the rest of the paper, we refer to the symmetric version of the KL divergence in (3) as the KL distance.

2.2 KL and Euclidean Distances for Clustering Forecasts
For two forecasts f_0 and f_1 with forecast values µ̂_0, µ̂_1 and standard errors σ̂_0, σ̂_1, the Euclidean distance between the two forecasts is defined as

EUC(f_1, f_0) = (\hat{\mu}_1 - \hat{\mu}_0)^2 \qquad (4)

Consistent with the definition of the KL divergence for normal densities, we define the squared distance function here. Using the Euclidean distance for clustering forecast time series ignores the variance information (σ̂_0^2 and σ̂_1^2) of the underlying forecast distributions. In contrast, under the normality assumption, the Kullback-Leibler distance between the forecast distributions of f_0 and f_1 considers both the mean and the variance information of the forecasts, and it has the following relationship with the Euclidean distance,

KLD_{avg}(f_1, f_0) = \frac{1}{4}\left( \frac{1}{\hat{\sigma}_0^2} + \frac{1}{\hat{\sigma}_1^2} \right) EUC(f_1, f_0) + \frac{1}{4}\left( K + \frac{1}{K} \right) - \frac{1}{2} \qquad (5)

where K = σ̂_1^2 / σ̂_0^2 is the relative ratio of the noise variances of the two forecasts f_0 and f_1.

The following plots of normal density functions (Figure 1 and Figure 2) show that using the forecast values without considering their distributions (that is, using the EUC distance) may not be appropriate for clustering forecast values. The plots show the reverse relationship between what the KL and the Euclidean distances measure.

Figure 1: An example of forecasts with the same mean values but different errors.

Figure 2: An example of forecasts with different mean values but the same errors.
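As a concrete illustration of Equations (3) through (5), the following short Python sketch (not part of the original paper; the means and standard errors are assumed values chosen for illustration) computes the symmetric KL distance between two normal forecast densities and contrasts it with the squared Euclidean distance.

import numpy as np

def kld_avg(mu1, sigma1, mu0, sigma0):
    """Symmetric KL distance (Equation 3) between N(mu1, sigma1^2) and N(mu0, sigma0^2)."""
    euc = (mu1 - mu0) ** 2                      # squared Euclidean distance, Equation (4)
    k = sigma1 ** 2 / sigma0 ** 2               # relative ratio of the forecast error variances
    # Equation (5): KLD_avg = 1/4 (1/s0^2 + 1/s1^2) EUC + 1/4 (K + 1/K) - 1/2
    return 0.25 * (1.0 / sigma0 ** 2 + 1.0 / sigma1 ** 2) * euc + 0.25 * (k + 1.0 / k) - 0.5

# Figure 1 type scenario: identical forecast means, different forecast errors.
print(kld_avg(0.0, 1.0, 0.0, 3.0))     # about 1.78, even though EUC = 0

# Figure 2 type scenario: different forecast means, same large forecast errors.
print(kld_avg(0.0, 75.0, 50.0, 75.0))  # about 0.22, despite EUC = 2500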

When two forecast values are the same and the forecast distributions are ignored, the two forecasts are always clustered into the same category based on the mean difference (Euclidean distance). However, when we use the KL distance, clustering the forecast values may produce a different result even when the mean difference is zero (Figure 1). For example, consider the sales data from retail stores. The sales forecasts of two stores are both zero for the next week, but their standard deviations differ from each other, as shown in Figure 1. When we do not consider the forecast distributions, the two stores are clustered into the same segment and may get the same sales policy for the coming week. Contrary to Figure 1, Figure 2 shows two different forecast values of sales (0 and 50) with the same large standard deviations. Based on the KL distance, the forecast sales of the two stores in Figure 2 show less difference than the forecast sales of the two stores in Figure 1 (KL distance = 1.78 in Figure 1 vs. 0.22 in Figure 2). In other words, the two stores in Figure 1 are less likely to be clustered into the same segment than the two stores in Figure 2, even though their forecast values in Figure 1 are identical.

We also observe the following properties of the symmetric Kullback-Leibler divergence as defined in Equation 5.

Property 1. KLD_avg is not scale free; that is, it depends on the forecast errors. In particular, when σ̂_1 = σ̂_0, KLD_{avg} = \frac{1}{2\hat{\sigma}_0^2} EUC(\hat{\mu}_1, \hat{\mu}_0).

Property 1 is desirable for clustering the forecasts, since we want to distinguish the forecast mean values together with their errors. It indicates that when the errors of the forecasts are the same, the KL distance differs from the Euclidean distance by a ratio that depends on the error.

Property 2. Suppose there exists a constant c > 0 such that σ̂_0 = c σ̂_1. Then KLD_avg → 0 as σ̂_0 → ∞.

Property 2 implies that the KL distance cannot distinguish two forecasts when their errors are both very large. This indicates that the forecasting models are also very important when clustering the forecasts. If a poor forecast model is fit, we may end up with few clusters because the errors make the forecasts indistinguishable.

Property 3. Under the same condition as in Property 2, KLD_avg → ∞ as σ̂_0 = c σ̂_1 → 0.

Property 3 tells us that the KL distance cannot group two forecasts when their errors are very small. In theory, when we have perfect forecasts (errors are zero), there is no need to consider the errors in clustering. However, in practice this does not hold, since the errors increase as we forecast further ahead, as shown by the example in Equation 6.

2.3 Forecast Distributions for KL Divergence
To compute the KL distance among forecasts, we need to know the forecast density. As stated before, we utilize the forecast distributions that are used in the calculation of forecast confidence intervals to compute the KL distance. Since forecast confidence intervals are readily available in any forecasting software, this saves a lot of time and computing resources compared to [1], which needs a full forecast density estimation for the calculation of the distance matrix.

As an example, we show how to get the k-step-ahead forecast value and variance in the simple exponential smoothing model. Assume a Gaussian white noise process,

Y_t = \mu_t + \epsilon_t, \quad t = 1, 2, \ldots

The smoothing equation is S_t = \alpha Y_t + (1 - \alpha) S_{t-1}, and the k-step-ahead forecast of Y_t is S_t, i.e., \hat{Y}_t(k) = S_t. The simple exponential smoothing model uses an exponentially weighted moving average of the past values and is equivalent to an ARIMA(0,1,1) model without constant, so the model is (1 - B) Y_t = (1 - \theta B) \epsilon_t, where \theta = 1 - \alpha. Thus Y_t = \epsilon_t + \alpha \sum_{j=1}^{\infty} \epsilon_{t-j}. Therefore the variance of \hat{Y}_t(k) is

V(\hat{Y}_t(k)) = V(\epsilon_t)\left[ 1 + \sum_{j=1}^{k-1} \alpha^2 \right] = V(\epsilon_t)\left[ 1 + (k - 1)\alpha^2 \right]. \qquad (6)

Under the Gaussian white noise assumption, the k-step-ahead forecast distribution is N(\hat{Y}_t(k), V(\hat{Y}_t(k))). Therefore the KL distance between two forecasts at a future time point can be easily obtained using Equation 3.
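A minimal Python sketch of the simple exponential smoothing calculation above follows. It is illustrative only (the toy series and the smoothing weight alpha are assumed values); it returns the mean and variance that define the normal forecast density used by the KL distance.

import numpy as np

def ses_forecast_density(y, alpha, k, noise_var=None):
    """Simple exponential smoothing: k-step-ahead forecast mean and variance (Equation 6)."""
    s = y[0]
    fitted = []
    for value in y:
        fitted.append(s)                       # one-step-ahead forecast made before updating
        s = alpha * value + (1 - alpha) * s    # smoothing equation S_t = a*Y_t + (1-a)*S_{t-1}
    if noise_var is None:
        resid = np.asarray(y) - np.array(fitted)
        noise_var = resid.var(ddof=1)          # estimate V(eps) from the one-step residuals
    mean = s                                   # flat forecast: Y_hat_t(k) = S_t for every k
    var = noise_var * (1 + (k - 1) * alpha ** 2)   # Equation (6)
    return mean, var

# Toy series (assumed data) and its 5-step-ahead forecast density.
y = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5])
mu, var = ses_forecast_density(y, alpha=0.3, k=5)
print(mu, var)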
3. CLUSTERING THE FORECASTS
When a distance function has been defined between all pairs of forecasts, we can use any available clustering algorithm to cluster the forecasts. A hierarchical clustering algorithm needs a distance matrix between all the pairs, while the more scalable k-means clustering algorithm requires the distance between a group of points (typically represented by the centroid of the group) and any other single point. Thanks to the additive property of the normal distribution (the sum of two independent normal random variables still follows a normal distribution, with mean and variance equal to the sums of the individual means and variances, respectively), the KL distance between a group of points and any other single point can be easily computed as well. Therefore, we can use both the hierarchical and the k-means clustering algorithms with the KL distance for clustering the forecasts.
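The paper does not spell out how a cluster is represented as a single forecast density for the k-means case; one plausible reading of the additivity argument (an assumption on our part, not stated in the text) is that a cluster of independent normal forecasts is summarized by the distribution of their average, which is again normal, after which Equation (3) applies directly. A sketch under that assumption:

import numpy as np

def centroid_density(mus, sigmas):
    """Represent a cluster of independent normal forecasts N(mu_i, sigma_i^2) by the
    distribution of their average: N(mean of mu_i, sum(sigma_i^2) / n^2). (Assumed convention.)"""
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    n = len(mus)
    return mus.mean(), np.sqrt((sigmas ** 2).sum()) / n

def kld_avg(mu1, sigma1, mu0, sigma0):
    """Symmetric KL distance between two normal densities (Equation 3)."""
    k = sigma1 ** 2 / sigma0 ** 2
    return (0.25 * (1.0 / sigma0 ** 2 + 1.0 / sigma1 ** 2) * (mu1 - mu0) ** 2
            + 0.25 * (k + 1.0 / k) - 0.5)

# KL distance from a three-member cluster (assumed values) to a candidate forecast.
c_mu, c_sigma = centroid_density([1.0, 1.5, 0.8], [0.5, 0.6, 0.4])
print(kld_avg(2.0, 0.7, c_mu, c_sigma))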

When clustering the forecasts, we consider two scenarios: clustering the forecast values at a particular future time point, and clustering the forecast series over all future time points in the forecast horizon. Clustering the forecasts at a future time point helps us understand the forecasts and their clusters at that given time point, while clustering the forecast series helps us understand the overall forecast patterns.

3.1 Clustering at One Future Point
Let X̂_t(k) and Ŷ_t(k) be the k-step-ahead forecasts of two time series X_t and Y_t, and let σ̂_x(k) and σ̂_y(k) be the standard errors of the forecasts. The distance KLD_avg(X̂_t(k), Ŷ_t(k)) between the two forecasts can be calculated using Equation 5.

The steps for clustering the forecasts at a future time point are shown below. In this paper we consider hierarchical clustering, but the procedure can easily be modified for any non-hierarchical clustering algorithm such as k-means.

1. Apply forecasting models with a forecast lead time k.
2. Obtain the forecasts (X̂_t(k), Ŷ_t(k)) and their standard errors (σ̂_x(k), σ̂_y(k)) for each pair of the time series.
3. Calculate the KL distance matrix among all pairs of the time series.
4. Apply a clustering algorithm with the KL distances.
5. Obtain the clusters of the forecasts.

3.2 Clustering the Forecast Series
The clusters at different future time points may be different. To capture the changes of the whole forecast pattern, we can cluster the forecast series over all future time points. Given a total forecast lead h, we extend the KL distance as follows,

KLD_{avg}(\hat{X}_t, \hat{Y}_t) = \sum_{k=1}^{h} KLD_{avg}(\hat{X}_t(k), \hat{Y}_t(k)). \qquad (7)

Note that we still define the squared distance. The steps for clustering the forecast series are:

1. Apply forecasting models with total forecast lead h.
2. Obtain the forecasts (X̂_t(k), Ŷ_t(k)) and their standard errors (σ̂_x(k), σ̂_y(k)) at each lead time k, k = 1, 2, ..., h.
3. Calculate the KL distance matrix among all pairs of the time series using Equation 7.
4. Apply a clustering algorithm with the KL distances.
5. Obtain the clusters of the forecasts.

4. A SIMULATION STUDY
To demonstrate the performance of the proposed KL distance for clustering forecasts, we simulate two groups of time series with the same autoregressive AR(2) [2] structure but different intercepts. Each time series is of length 100, and there are 50 time series in each group:

X_t^{(i)} = \mu_i + 0.75\, X_{t-1}^{(i)} - 0.5\, X_{t-2}^{(i)} + \epsilon_t^{(i)}, \qquad \epsilon_t^{(i)} \sim N(0, \sigma_i^2), \quad t = 1, 2, \ldots, 100, \quad i = 1, 2.

The two groups have µ_1 = 0 and µ_2 = 1, respectively. The standard errors of the white noise for the two groups (σ_1 and σ_2) vary from 0.5 to 5 in steps of 0.5 in order to examine the performance difference between the KL distance and the Euclidean distance for time series with different signal-to-noise ratios (SNR). This yields 100 settings of SNR combinations (σ_1, σ_2), and for each setting we repeat the simulation 400 times. We fit AR(2) models to the simulated time series and obtain the forecast values and variances. For the synthetic data, since we know the group label of each series, we can easily compute the clustering error rate (CER). We report the mean clustering error rates of both distance measures for each SNR setting.
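The following Python sketch illustrates this simulation and the clustering procedure of Section 3 end to end. It is an illustrative reconstruction rather than the authors' SAS implementation: it assumes statsmodels for the AR(2) fitting and scipy for the hierarchical clustering, and it uses only a few series per group and a single noise setting so that it runs quickly.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

def simulate_ar2(mu, sigma, n=100, burn=50):
    """Simulate X_t = mu + 0.75 X_{t-1} - 0.5 X_{t-2} + eps_t with eps_t ~ N(0, sigma^2)."""
    x = np.zeros(n + burn)
    for t in range(2, n + burn):
        x[t] = mu + 0.75 * x[t - 1] - 0.5 * x[t - 2] + rng.normal(0.0, sigma)
    return x[burn:]

def kld_avg(mu1, s1, mu0, s0):
    """Symmetric KL distance between two normal forecast densities (Equation 3)."""
    k = s1 ** 2 / s0 ** 2
    return 0.25 * (1.0 / s0 ** 2 + 1.0 / s1 ** 2) * (mu1 - mu0) ** 2 + 0.25 * (k + 1.0 / k) - 0.5

h = 10                                                   # forecast lead (horizon)
series = [simulate_ar2(0.0, 1.0) for _ in range(5)] + \
         [simulate_ar2(1.0, 3.0) for _ in range(5)]      # two groups, five series each

# Steps 1-2: fit AR(2) models, collect forecast means and standard errors at each lead.
means, errors = [], []
for y in series:
    forecast = ARIMA(y, order=(2, 0, 0), trend="c").fit().get_forecast(steps=h)
    means.append(np.asarray(forecast.predicted_mean))
    errors.append(np.asarray(forecast.se_mean))

# Step 3: KL distance matrix over the whole forecast horizon (Equation 7).
n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = sum(kld_avg(means[i][k], errors[i][k], means[j][k], errors[j][k]) for k in range(h))
        dist[i, j] = dist[j, i] = d

# Steps 4-5: hierarchical clustering on the KL distances, cut into two clusters.
labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(labels)
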
Figure 3: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ_1, σ_2). The forecast lead is 10. Top: Euclidean distance; Bottom: KL distance.

The density plots in Figure 3 show the mean CER of both methods over the 400 simulations when clustering all future time points with forecast length 10. It is clear that the proposed KL distance outperforms the traditional Euclidean distance across a variety of SNR combinations. In particular, when one group of time series has a relatively high SNR compared with the other group, shown in the top-left and bottom-right corners of the density plots, the KL distance can identify the true grouping of the time series (mean CERs are close to zero) whereas the Euclidean distance produces poor clustering results (mean CERs are around 15%). When both groups of time series have high noise standard errors and the two standard errors are close, shown in the top-right corner of the density plots, the performance of both distance measures is poor.
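The clustering error rate itself is not defined formally in the text; the small Python sketch below uses a common convention (an assumption here): the misclassification rate minimized over all matchings of cluster labels to true group labels, which is well defined when the number of clusters equals the number of groups.

from itertools import permutations
import numpy as np

def clustering_error_rate(true_labels, cluster_labels):
    """CER: fraction of series assigned to the wrong group, minimized over label matchings.
    Assumes the number of distinct cluster labels equals the number of true groups."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    groups = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    best = 1.0
    for perm in permutations(clusters):
        mapping = dict(zip(perm, groups))              # candidate matching of clusters to groups
        mapped = np.array([mapping[c] for c in cluster_labels])
        best = min(best, float(np.mean(mapped != true_labels)))
    return best

# Example with two groups of five series each (labels assumed for illustration).
truth = [0] * 5 + [1] * 5
found = [1, 1, 1, 2, 1, 2, 2, 2, 2, 2]
print(clustering_error_rate(truth, found))   # -> 0.1
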

Figure 4: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ_1, σ_2). The forecast lead is 1. Top: Euclidean distance; Bottom: KL distance.

In Figure 4, we show the simulation results for the same SNR combinations as in Figure 3, but the clustering is based on one future forecast (forecast length 1). Clearly, the proposed KL distance measure still has better mean CERs than the traditional Euclidean measure when clustering one future time point. Consistent with the results in Figure 3, when the noise standard errors of the two groups of time series are different (the top-left and bottom-right corners of the density plots), the KL distance yields much smaller error rates than the Euclidean distance does. Comparing Figure 3 and Figure 4, we find that for the same SNRs, clustering the time series based on 10 forecast time points tends to produce better results than clustering based on just one future time point, which suggests that a sufficient forecast length is necessary to guarantee good clustering based on the forecasts.

5. REAL LIFE DATA STUDY
Two real life data sets are investigated to further evaluate the proposed KL distance: one data set contains the CO2 emissions of all the countries in the world (source: http://data.worldbank.org/indicator/en.atm.co2e.pc), and the other contains the weekly sales amounts by store of a retail chain. A summary of the data sets is shown in Table 1.

Data Set   Number of Series   Length   Frequency
CO2        146                48       Yearly
Sales      43                 152      Weekly

Table 1: Summary of the data sets.

The CO2 data consists of the CO2 emissions (metric tons per capita) of 216 countries from 1961 to 2008 (the CO2 data from 1960 to 1999 was used in [1]). There is one CO2 emission time series for each country. For a better comparison of the results, we remove the series with missing values, which leaves 146 complete series. The sales data is a small subset of the weekly sales data of a big retail chain in the USA, which has millions of products and thousands of stores. The subset has the aggregated sales of one department for 43 stores. Each store has 152 weeks of sales history from February 2008 to December 2010. There are 43 time series, and no exogenous variables besides the aggregated sales in the data.

For the real life data sets, we cannot compute the clustering error rate since we do not have true class labels available. In order to compare the quality of the clustering results, we perform a holdout test; that is, we hold out the most recent h periods of data for testing, and the forecasting models are built with the data prior to the holdout periods (the training data). For example, for the sales data, we hold out the most recent 12 weeks of data, and the forecasting models are built with the 140 weekly observations prior to those 12 weeks. After building the appropriate forecasting models, we use them to obtain forecast values and variances for the holdout periods. Since our focus is on the evaluation of different distance functions in the clustering of the forecasts rather than on the accuracy of the models, we simply fit the training data with the best ESM (Exponential Smoothing Model) for both the sales and the CO2 data sets. The candidate ESM models include the simple, double, linear, damped-trend, seasonal, and additive and multiplicative Winters methods. We select the best model as the one with the minimum Root Mean Squared Error (RMSE).
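A minimal sketch of this model-selection and holdout step is shown below. It is illustrative only: it uses statsmodels exponential smoothing rather than the SAS ESM procedures, covers only a few of the candidate model forms, and assumes the data for one series is available as a one-dimensional array or pandas Series y with at least h holdout points.

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def best_esm_forecast(y, h, seasonal_periods=None):
    """Fit a few candidate exponential smoothing models on the training part of y,
    keep the one with the smallest training RMSE, and forecast the h holdout periods."""
    train = y[:-h]
    candidates = [
        dict(trend=None, seasonal=None),                       # simple ESM
        dict(trend="add", seasonal=None),                      # linear (Holt) trend
        dict(trend="add", damped_trend=True, seasonal=None),   # damped trend
    ]
    if seasonal_periods:
        candidates.append(dict(trend="add", seasonal="add",
                               seasonal_periods=seasonal_periods))  # additive Winters
    best_fit, best_rmse = None, np.inf
    for spec in candidates:
        fit = ExponentialSmoothing(train, **spec).fit()
        rmse = float(np.sqrt(fit.sse / len(train)))            # training RMSE
        if rmse < best_rmse:
            best_fit, best_rmse = fit, rmse
    return best_fit.forecast(h)                                # forecast means for the holdout periods

Only the forecast means are returned here; the forecast error variances needed for the KL distance would have to come from the chosen model's prediction intervals (for the simple ESM, Equation 6 gives them directly).
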
After computing the distance matrix using either the KL or the Euclidean distance, hierarchical clustering is used.

5.1 CO2 Emission Data by Country
There are 146 complete time series after removing those with missing values. An initial exploration of the data indicates that there is an outlier series (the country Qatar), which has significantly higher CO2 emissions per capita than the rest, so we remove this outlier from the further analysis. For the remaining 145 series, we set the holdout period to 5 years. After clustering the forecasts in the holdout period, we plot the actual time series for each cluster of the forecasts, with the number of clusters set to 3, in Figure 5.

Figure 5: Clusters of the time series for the CO2 data: each line shows a series of the actual CO2 emissions of a country in the 5 holdout years, while the clusters are identified based on the clustering of the forecast series in the 5 holdout years using the Euclidean distance (EUC) and the KL distance (KLD).

We can see that the Euclidean distance separates the countries into 3 clusters: cluster 1 with relatively low CO2 emissions (106 countries), cluster 2 with medium CO2 emissions (29 countries), and cluster 3 with relatively high CO2 emissions (10 countries). When the errors in the forecasts are considered, the clusters are different: with the KL distance, clusters 1, 2, and 3 have 65, 52, and 28 countries, respectively. Figures 6 and 7 illustrate the difference in the clustering results between the KL distance and the Euclidean distance. Note that the dashed lines are the fits and forecasts, the solid lines are the actual time series (including the holdouts), and the filled areas are the confidence intervals for the forecast periods. Using the Euclidean distance, Afghanistan and Albania in Figure 6 are both in cluster 1. However, Albania is separated from cluster 1 into cluster 2 when using the KL distance. This is because there are larger errors in the forecasts for Albania, and thus the KL distance separates this time series from cluster 1. Figure 7 shows that Australia and the United Kingdom are in different clusters (3 and 2, respectively) when using the Euclidean distance. Instead, they are both put into cluster 3 when using the KL distance, because their forecast distributions are similar.

Figure 6: Two forecast series with different forecast errors that are separated by the KL distance: Albania is in cluster 2 and Afghanistan is in cluster 1. When using the Euclidean distance, both countries are in cluster 1. The clusters are shown in Figure 5.

Figure 7: Two forecast series with similar forecast errors that are grouped into one cluster by the KL distance: both Korea and Australia are in cluster 3. When using the Euclidean distance, Korea is in cluster 2 and Australia is in cluster 3. The clusters are shown in Figure 5.

We checked the accuracy of the forecasting models (the best ESM). It turned out that the best ESM models perform very well in forecasting the CO2 data, with a MAPE (Mean Absolute Percent Error) of about 10% in the holdout period. It is also interesting to observe that the cluster pattern changes over time when we cluster at each time point in the holdout. Table 2 illustrates the cluster membership at each time point in the holdout and the overall cluster membership for four countries with the KL distance. The actual values, forecast values, and confidence intervals for these countries in the holdout period are shown in Figure 8. Within the five years, most countries stay in the same cluster, but the cluster membership of a country can change because the forecasts change. For example, the forecasts for China have an upward trend, and its cluster changes from 1 to 2 in 2007. The errors in the forecasts can also contribute to changes of cluster membership from one future time point to another.

Country   2004   2005   2006   2007   2008   Overall
Kenya     1      1      1      1      1      1
China     1      1      1      2      2      2
Mexico    2      2      2      2      2      2
USA       3      3      3      3      3      3

Table 2: An illustration of the cluster changes over time for the CO2 emissions of four countries. The cluster membership for each year is obtained from the clustering of the forecasts at that year using the KL distance. The overall cluster membership is obtained from the clustering of the whole forecast time series using the KL distance. For illustration, the number of clusters is set to 3.

Figure 8: The time series plots of the four countries listed in Table 2. The clustering of one future forecast is performed at each of the five years in the holdout period. Dashed: the forecast values; Solid: the actual holdout values; filled areas: the forecast confidence intervals.

5.2 Retail Store Sales Data
We set the holdout period to 12 weeks. With the number of clusters set to 5, the actual time series for the clusters based on the forecasts in the holdout period are shown in Figure 9. With the Euclidean distance, the stores are grouped into 5 clusters of 17, 16, 5, 3, and 2 stores (clusters 1 through 5, respectively). With the KL distance, clusters 1 through 5 have 14, 6, 14, 7, and 2 stores, respectively. We checked the forecast accuracy of the models: the best ESM models have a MAPE of 40% in the holdout period. Note that for the sales data the forecasting models could be enhanced to include other model classes such as ARIMA [2] and UCM [6], and to include exogenous variables such as prices, holidays, promotions, and so on.

Figure 9: Clusters of the time series for the Sales data: each line shows a series of the actual sales in the 12 holdout weeks, while the clusters are identified based on the clustering of the forecast series in the 12 holdout weeks using the Euclidean distance (EUC) and the KL distance (KLD).

We then experiment without the holdout; that is, all available data points are used to build the forecasting models.

Forecast values for the next six weeks are obtained using the best ESM models, and the KL distances are calculated at three specific future weeks. The interesting forecast lead points and forecast values are marked with squares in Figure 10. Figure 11 shows the cluster changes over time. We can see that some stores that are grouped in the same cluster in the first week (2Jan2011) may be separated into different clusters in later weeks (30Jan2011). Thus the retail company may need to adjust its sales or promotion policy for each store and may set different business goals from week to week.

Figure 10: Forecasts for the Sales data: each line shows a series of the forecast values in the future 6 weeks. The squares mark the weeks with interesting forecast clustering pattern changes.

Figure 11: Cluster changes over time in the future for the Sales data. The cluster membership for each week is obtained from the clustering of the forecasts at that week using the KL distance.

6. CONCLUSIONS AND FUTURE WORK
In this paper we used the KL distance for clustering the forecast distributions of time series. The KL distance requires density functions, and clustering a large number of time series with full density estimation of the forecasts takes a lot of computing resources. We approximated the forecast density by a normal density with the forecast mean and variance, which are directly provided by the forecast model, so the KL distances among forecasts can be easily obtained. We demonstrated the advantage of using the KL distance over the Euclidean distance with simulations from autoregressive models. We also showed that the KL distance improved the clustering results on two real life data sets. We would like to enhance the KL distance to handle cases where the forecast errors are extremely large or small, in which cases the current KL distance may not perform well. Different clustering algorithms such as k-means can be applied as well. Combining Dynamic Time Warping [8] with the KL distance to detect forecast pattern changes for series of different lengths is another possible extension of this paper.

7. REFERENCES
[1] A. Alonso, J. Berrendero, A. Hernandez, and A. Justel. Time series clustering based on forecast densities. Computational Statistics & Data Analysis, 51(2):762-776, 2006.
[2] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, 3rd edition, 1994.
[3] P. S. P. Cowpertwait and T. F. Cox. Clustering population means under heterogeneity of variance with an application to a rainfall time series problem. Journal of the Royal Statistical Society, Series D (The Statistician), 41(1):113-121, 1992.
[4] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow., 1:1542-1552, August 2008.
[5] A. W.-C. Fu, E. Keogh, L. Y. H. Lau, and C. A. Ratanamahatana. Scaling and time warping in time series querying. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 649-660, 2005.
[6] A. C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.
[7] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. A, 186:453-461, 1946.
[8] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 406-417, 2002.
[9] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.
[10] M. Kumar and N. R. Patel. Clustering data with measurement errors. Computational Statistics & Data Analysis, 51:6084-6101, August 2007.
[11] T. W. Liao. Clustering of time series data - a survey. Pattern Recognition, 38:1857-1874, 2005.
[12] L. R. L., L. V., M. F. Macchiato, and M. Ragosta. Time modelling and spatial clustering of daily ambient temperature: an application in southern Italy. Environmetrics, 6(1):31-53, 1995.
[13] P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Science, Calcutta, 2(1):49-55, 1936.
[14] Y. Kakizawa, R. H. Shumway, and M. Taniguchi. Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441):328-340, 1998.