Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence


Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling
SAS Institute, Inc.
100 SAS Campus Dr., Cary, NC 27513, USA
{taiyeong.lee, yongqiao.xiao, xiangxiang.meng,

ABSTRACT
One of the key tasks in time series data mining is clustering time series. Traditional clustering methods, however, focus on the similarity of time series patterns in past time periods. In many cases, such as retail sales, we would prefer to cluster based on the future forecast values. In this paper, we show an approach to clustering forecasts, or forecast time series patterns, based on the Kullback-Leibler divergences among the forecast densities. We use the same normality assumption for the error terms as is used in the calculation of forecast confidence intervals from the forecast model, so the method requires no additional computation to obtain the forecast densities for the Kullback-Leibler divergences. This makes our approach suitable for mining very large sets of time series. A simulation study and two real data sets are used to evaluate and illustrate the method. We show that using the Kullback-Leibler divergence results in better clustering when there is a degree of uncertainty in the forecasts.

Keywords
Time Series Clustering, Time Series Forecasting, Kullback-Leibler Divergence, Euclidean Distance

1. INTRODUCTION
Time series clustering has been used in many data mining areas such as retail, energy, weather, quality control charts, stock/financial data, and sequence/time series data generated by medical devices [3, 12, 14]. Typically, the observed data is used directly or indirectly as the source for time series clustering. For example, we can cluster the CO2 emission patterns of countries based on their historical data, or based on features extracted from the historical data. Numerous similarity/dissimilarity/distance/divergence measures [4, 5, 8] have been proposed and studied.
Another category of time series clustering methods is model-based clustering, which clusters time series using the parameter estimates of the models, or other statistics based on the errors associated with the estimates [10, 13]. In [11], Liao summarized time series clustering methods into three categories: raw data based, extracted feature based, and model based.

Instead of using the observed time series, some extracted features of the observations, or models of the past time periods, we consider the forecasts themselves at a specific future time point or during a future time period. For retail stores, for example, we can cluster stores based on their sales forecast distributions at a particular future time, instead of on the observed sales data. Alonso et al. [1] used density forecasts for time series clustering at a specific future time point. However, since their method requires bootstrap samples, nonparametric forecast density estimation, and a specific distance measure between the forecast densities, it is not an efficient approach for clustering a large number of time series. In this paper, we use the Kullback-Leibler divergence [9] for clustering the forecasts at a future point. Under the normality assumption on the error, the Kullback-Leibler distance can be computed directly from the forecast means and variances provided by the forecast model. We also extend the method to cluster the forecasts at all future points in the forecast horizon, to capture forecast patterns that evolve over time. For instance, in the retail industry, business decisions such as stocking up or rearranging shelves can be made after clustering the products based on their sales forecasts. Similarly, the clustering can be carried out at the store level, so that sales or price policies can be made for each group of stores.
Typically, the number of time series in the retail industry is very large, and the industry requires fast forecasting as well as fast clustering. The proposed method is suitable for clustering large numbers of forecasts.

The paper is organized as follows. In Sections 2 and 3, we describe the KL divergence as a distance measure between forecast densities and explain how to cluster the forecasts. Following that, a simulation study and real data analyses are presented.

2. DISTANCE MEASURE FOR CLUSTERING FORECASTS
Since forecasts are not observed values, the Euclidean distance between two forecast values may not be close to the true distance. Our proposed method uses a symmetric version of the Kullback-Leibler divergence to calculate the distance

between the forecast densities under the normality assumption on the forecast error terms. In other words, both the mean (the forecast) and the variance (the forecast error variance) are used in the calculation of the distance.

2.1 Kullback-Leibler Divergence
Suppose P_0 and P_1 are the probability distributions of two continuous random variables. The Kullback-Leibler divergence of P_1 from P_0 is defined as

KLD(P_1 || P_0) = ∫ p_1(x) log(p_1(x)/p_0(x)) dx    (1)

where p_0 and p_1 are the density functions of P_0 and P_1. The Kullback-Leibler divergence KLD(P_1 || P_0) is not a symmetric measure of the difference between P_0 and P_1, but in clustering we need a symmetric distance measure for the items (in this paper, time series) to be grouped. A well-known symmetric version of the Kullback-Leibler divergence is the average of the two divergences KLD(P_1 || P_0) and KLD(P_0 || P_1),

KLD_avg(P_1, P_0) = (1/2){KLD(P_1 || P_0) + KLD(P_0 || P_1)} = (1/2) ∫ (p_1(x) - p_0(x)) log(p_1(x)/p_0(x)) dx    (2)

This is also known as the J-divergence of P_0 and P_1 [7].

When P_1 and P_0 are two normal distributions, that is, P_1 ~ N(µ_1, σ_1²) and P_0 ~ N(µ_0, σ_0²), the divergences simplify as follows:

KLD(P_1 || P_0) = (1/(2σ_0²))[(µ_1 - µ_0)² + σ_1² - σ_0²] + log(σ_0/σ_1)

KLD_avg(P_1, P_0) = (1/2)(1/(2σ_0²) + 1/(2σ_1²))(µ_1 - µ_0)² + (σ_1² - σ_0²)²/(4σ_0²σ_1²)    (3)

In the rest of the paper, we refer to the symmetric version of the KL divergence in (3) as the KL distance.

2.2 KL and Euclidean Distances for Clustering Forecasts
For two forecasts f_0 and f_1 with forecast values µ̂_0, µ̂_1 and standard errors σ̂_0, σ̂_1, the Euclidean distance between the two forecasts is defined as

EUC(f_1, f_0) = (µ̂_1 - µ̂_0)²    (4)

Consistent with the definition of the KL divergence for normal densities, we define the squared distance function here. Using the Euclidean distance for clustering forecast time series ignores the variance information (σ̂_0² and σ̂_1²) of the underlying forecast distributions. In contrast, under the normality assumption, the KL distance between the forecast distributions of f_0 and f_1 considers both the mean and the variance information of the forecasts, and it has the following relationship with the Euclidean distance:

KLD_avg(f_1, f_0) = (1/4)(1/σ̂_0² + 1/σ̂_1²) EUC(f_1, f_0) + (1/4)(K + 1/K - 2)    (5)

where K = σ̂_1²/σ̂_0² is the relative ratio of the noise variances of the two forecasts f_1 and f_0.

The following plots of normal density functions (Figure 1 and Figure 2) show that using the forecast values without considering their distributions (that is, using the EUC distance) may not be appropriate for clustering forecasts. The plots show the reverse relationship between what the KL and the Euclidean distances measure.

Figure 1: An example of forecasts with the same mean values but different errors.

Figure 2: An example of forecasts with different mean values but the same errors.

When two forecast values are the same and the forecast distributions are ignored, the two forecasts are always clustered into the same category based on the mean difference (Euclidean distance). However, when we use the KL distance, clustering forecast values may produce a different result even when the mean difference is zero (Figure 1). For example, consider the sales data from retail stores. The sales forecasts of two stores are both zero for the next week, but their standard deviations differ, as shown in Figure 1. When the forecast distributions are ignored, the two stores are clustered into the same segment and may get the same sales policy for the

coming week. Contrary to Figure 1, Figure 2 shows two different forecast values of sales (0 and 50) with the same large standard deviations. Based on the KL distance, the forecast sales of the two stores in Figure 2 differ less than the forecast sales of the two stores in Figure 1 (KL distance 0.22 in Figure 2 vs. 1.78 in Figure 1). In other words, the two stores in Figure 1 are less likely to be clustered into the same segment than the two stores in Figure 2, even though their forecast values are identical (Figure 1).

We also observe the following properties of the symmetric Kullback-Leibler divergence as defined in Equation (5).

Property 1. KLD_avg is not scale free; that is, it depends on the forecast errors. In particular, when σ̂_1 = σ̂_0, KLD_avg = EUC(f_1, f_0)/(2σ̂_0²).

Property 1 is desirable for clustering the forecasts, since we want to distinguish the forecast mean values together with their errors. It indicates that when the errors of the forecasts are the same, the KL distance differs from the Euclidean distance by a ratio that depends on the error.

Property 2. Suppose there exists a constant c > 0 such that σ̂_0 = c σ̂_1. Then as σ̂_0 → ∞, the mean term of KLD_avg vanishes and KLD_avg → (1/4)(K + 1/K - 2), a constant; in particular, KLD_avg → 0 when c = 1.

Property 2 implies that the KL distance cannot distinguish two forecasts when their errors are both very large. This indicates that the forecasting models are also very important when clustering the forecasts. If a poor forecasting model is fit, we may end up with few clusters, because the errors make the forecasts indistinguishable.

Property 3. Under the same condition as in Property 2, KLD_avg → ∞ when σ̂_0 = c σ̂_1 → 0 (provided µ̂_1 ≠ µ̂_0).

Property 3 tells us that the KL distance cannot group two forecasts when their errors are very small. In theory, when we have perfect forecasts (errors are zero), there is no need to consider the errors in clustering.
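To make the KL distance of Equation (3) and the two figure scenarios concrete, here is a minimal sketch (not from the paper). The means and standard errors below are illustrative assumptions chosen to reproduce the two distances quoted in the text (1.78 and 0.22); the actual parameters behind Figures 1 and 2 are not given.

```python
def kl_avg(mu0, sig0, mu1, sig1):
    """Symmetric KL distance of Equation (3) between N(mu0, sig0^2) and N(mu1, sig1^2)."""
    v0, v1 = sig0 ** 2, sig1 ** 2
    mean_term = 0.25 * (1.0 / v0 + 1.0 / v1) * (mu1 - mu0) ** 2
    var_term = (v1 - v0) ** 2 / (4.0 * v0 * v1)
    return mean_term + var_term

# Figure 1 style: identical forecast means (0 and 0) but different errors.
# With assumed standard errors 1 and 3 the KL distance is large even though
# the Euclidean distance is exactly 0.
d_fig1 = kl_avg(0.0, 1.0, 0.0, 3.0)

# Figure 2 style: different means (0 and 50) but the same large error.
# With an assumed common standard error of 75, the KL distance is small.
d_fig2 = kl_avg(0.0, 75.0, 50.0, 75.0)

print(round(d_fig1, 2), round(d_fig2, 2))  # -> 1.78 0.22
```

Note that `kl_avg(0, s, 3, s)` reduces to 9/(2s²), which is Property 1's relation EUC/(2σ̂²) for equal errors.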
However, in practice, this does not hold, since the errors increase as we forecast further ahead, as shown by the example in Equation (6).

2.3 Forecast Distributions for KL Divergence
To get the KL distance among forecasts, we need to know the forecast density. As stated before, we utilize the forecast distributions that are used in the calculation of forecast confidence intervals to compute the KL distance. Since forecast confidence intervals are readily available in any forecasting software, this saves a great deal of time and computing resources compared to [1], which needs a full forecast density estimation for the calculation of the distance matrix.

As an example, we show how to obtain the k-step-ahead forecast value and variance in the simple exponential smoothing model. Under the assumption of a Gaussian white noise process,

Y_t = µ_t + ε_t,  t = 1, 2, ...

the smoothing equation is S_t = αY_t + (1 - α)S_{t-1}, and the k-step-ahead forecast of Y_t is S_t, i.e., Ŷ_t(k) = S_t. The simple exponential smoothing model uses an exponentially weighted moving average of the past values, and it is equivalent to an ARIMA(0,1,1) model without a constant. So the model is (1 - B)Y_t = (1 - θB)ε_t, where θ = 1 - α. Thus Y_{t+k} = Ŷ_t(k) + ε_{t+k} + α Σ_{j=1}^{k-1} ε_{t+k-j}, and therefore the variance of the k-step-ahead forecast is

V(Ŷ_t(k)) = V(ε_t)[1 + Σ_{j=1}^{k-1} α²] = V(ε_t)[1 + (k - 1)α²].    (6)

Under the Gaussian white noise assumption, the k-step-ahead forecast distribution is N(Ŷ_t(k), V(Ŷ_t(k))). Therefore the KL distance between two forecasts at a future time point can be easily obtained using Equation (3).

3. CLUSTERING THE FORECASTS
When a distance function has been defined between all pairs of forecasts, we can use any available clustering algorithm to cluster the forecasts. A hierarchical clustering algorithm needs a distance matrix between all the pairs, while the more scalable k-means clustering algorithm requires the distance between a group of points (typically represented by the centroid of the group) and any other single point.
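The simple exponential smoothing forecast distribution of Equation (6) can be sketched as follows. This is a minimal illustration only: initializing the level at the first observation and estimating V(ε_t) from the one-step-ahead residuals are simplifying assumptions, not the paper's implementation.

```python
def ses_forecast_dist(y, alpha, k):
    """k-step-ahead forecast mean and variance under simple exponential smoothing.

    The forecast is flat, Y_hat_t(k) = S_t, and by Equation (6) its variance is
    V(eps) * (1 + (k - 1) * alpha**2).
    """
    s = y[0]                    # initialize the smoothed level at the first value
    resid = []
    for t in range(1, len(y)):
        resid.append(y[t] - s)  # one-step-ahead forecast error
        s = alpha * y[t] + (1 - alpha) * s
    var_eps = sum(e * e for e in resid) / len(resid)  # estimate of V(eps)
    return s, var_eps * (1 + (k - 1) * alpha ** 2)
```

The forecast variance grows linearly in the lead time k, which is exactly why the KL distance between two forecasts can change over the forecast horizon.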
Thanks to the additive property of the normal distribution (the sum of two independent normal random variables is again normal, with mean and variance equal to the sums of the individual means and variances), the KL distance between a group of points and any other single point can be computed easily as well. Therefore, we can use both the hierarchical and the k-means clustering algorithms with the KL distance for clustering the forecasts. When clustering the forecasts, we consider two scenarios: clustering the forecast values at a particular future time point, and clustering the forecast series over all future time points in the forecast horizon. Clustering the forecasts at a future time point helps us understand the forecasts and their clusters at that time point, while clustering the forecast series helps us understand the overall forecast patterns.

3.1 Clustering at One Future Point
Let X̂_t(k) and Ŷ_t(k) be the k-step-ahead forecasts of two time series X_t and Y_t, and let σ̂_x(k) and σ̂_y(k) be the standard errors of the forecasts. The KL distance KLD_avg(X̂_t(k), Ŷ_t(k)) between the two forecasts can be calculated using Equation (5).
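The clustering procedure of Section 3 can be sketched end to end with a toy average-linkage agglomerative implementation. This is an illustrative stand-in for whatever clustering routine is actually used, and the horizon sum of Equation (7) is included so the same code covers both the one-point and the forecast-series scenarios. Forecasts are represented here as (mean, variance) pairs, one per lead time.

```python
def kl_dist(mu0, var0, mu1, var1):
    """Symmetric KL distance (Equation (3)) between two normal forecast densities.

    Note: this version takes variances, not standard errors.
    """
    mean_term = 0.25 * (1.0 / var0 + 1.0 / var1) * (mu1 - mu0) ** 2
    var_term = (var1 - var0) ** 2 / (4.0 * var0 * var1)
    return mean_term + var_term

def horizon_kl(fa, fb):
    """Equation (7): sum the KL distances over all lead times in the horizon."""
    return sum(kl_dist(m0, v0, m1, v1) for (m0, v0), (m1, v1) in zip(fa, fb))

def hcluster(forecasts, n_clusters):
    """Average-linkage agglomerative clustering on the KL distance matrix."""
    n = len(forecasts)
    dist = [[horizon_kl(forecasts[i], forecasts[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two candidate clusters
                d = (sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                     / (len(clusters[a]) * len(clusters[b])))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

# Three forecast series over a 2-step horizon, as (mean, variance) pairs:
forecasts = [
    [(0.0, 1.0), (0.1, 1.2)],
    [(0.2, 1.1), (0.3, 1.3)],   # close to the first series
    [(9.0, 1.0), (9.5, 1.2)],   # far from both
]
print(hcluster(forecasts, 2))   # -> [[0, 1], [2]]
```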

The steps for clustering the forecasts at a future time point are shown below. In this paper we consider hierarchical clustering, but the procedure can easily be modified for any non-hierarchical clustering algorithm such as k-means.

1. Apply forecasting models with a forecast lead time k.
2. Obtain the forecasts (X̂_t(k), Ŷ_t(k)) and their standard errors (σ̂_x(k), σ̂_y(k)) for each pair of time series.
3. Calculate the KL distance matrix among all pairs of the time series.
4. Apply a clustering algorithm with the KL distances.
5. Obtain the clusters of the forecasts.

3.2 Clustering the Forecast Series
The clusters at different future time points may be different. To capture the changes of the whole forecast pattern, we can cluster the forecast series over all future time points. Given a total forecast lead h, we extend the KL distance as follows:

KLD_avg(X̂_t, Ŷ_t) = Σ_{k=1}^{h} KLD_avg(X̂_t(k), Ŷ_t(k)).    (7)

Note that we still define the squared distance. The steps for clustering the forecast series are:

1. Apply forecasting models with total forecast lead h.
2. Obtain the forecasts (X̂_t(k), Ŷ_t(k)) and their standard errors (σ̂_x(k), σ̂_y(k)) at each lead time k, k = 1, 2, ..., h.
3. Calculate the KL distance matrix among all pairs of the time series using Equation (7).
4. Apply a clustering algorithm with the KL distances.
5. Obtain the clusters of the forecasts.

4. A SIMULATION STUDY
To demonstrate the performance of the proposed KL distance for clustering forecasts, we simulate two groups of time series with the same autoregressive AR(2) [2] structure but different intercepts. Each time series is of length 100, and there are 50 time series in each group:

X_t^(i) = µ_i + X_{t-1} - 0.5 X_{t-2} + σ_i ε_t,  t = 1, 2, ..., 100,  i = 1, 2.

The two groups have µ_1 = 0 and µ_2 = 1, respectively. The standard errors of the white noise for the two groups (σ_1 and σ_2) vary from 0.5 to 5 in steps of 0.5, in order to examine the performance difference between the KL distance and the Euclidean distance for time series with different signal-to-noise ratios (SNR). This yields 100 settings of SNR combinations (σ_1, σ_2), and for each setting we repeat the simulation 400 times. We fit AR(2) models to the simulated time series and obtain the forecast values and variances. For the synthetic data, since we know the group label of each series, we can easily compute the clustering error rate (CER). We report the mean clustering error rate of both distance measures for each SNR setting.

Figure 3: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ_1, σ_2). The forecast lead is 10. Top: Euclidean distance; Bottom: KL distance.

The density plots in Figure 3 show the mean CER of both methods over the 400 simulations when clustering over all future time points with forecast length 10. It is clear that the proposed KL distance outperforms the traditional Euclidean distance over a variety of SNR combinations. Especially when one group of time series has a relatively high SNR compared with the other group (the top-left and bottom-right corners of the density plots), the KL distance identifies the true grouping of the time series (mean CERs close to zero) while the Euclidean distance produces poor clustering results (mean CERs around 15%). When both groups of time series have high noise standard errors and the two standard errors are close (the top-right corner of the density plots), the performance of both distance measures is poor.
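The simulation design above can be sketched as follows. The first AR coefficient is reconstructed as 1 (the printed equation is partly garbled), so treat the exact coefficients as an assumption; the intercept-to-mean relationship in the comment follows from that choice.

```python
import random

def simulate_ar2(mu, sigma, n=100, seed=0):
    """Simulate X_t = mu + X_{t-1} - 0.5*X_{t-2} + sigma*eps_t with eps_t ~ N(0, 1)."""
    rng = random.Random(seed)
    x = [mu, mu]                # simple startup values
    for _ in range(n - 2):
        x.append(mu + x[-1] - 0.5 * x[-2] + sigma * rng.gauss(0.0, 1.0))
    return x

# Two groups with intercepts mu_1 = 0 and mu_2 = 1. The stationary means are
# mu / (1 - 1 + 0.5) = 2*mu, so the groups fluctuate around 0 and 2.
group1 = [simulate_ar2(0.0, 0.5, seed=s) for s in range(50)]
group2 = [simulate_ar2(1.0, 0.5, seed=50 + s) for s in range(50)]
```

One would then fit an AR(2) model to each series, take the forecast means and variances, and cluster with the KL distance. At low noise the two groups separate cleanly; at high noise the labels blur, which is the regime where Figure 3 shows the KL distance at its largest advantage.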

Figure 4: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ_1, σ_2). The forecast lead is 1. Top: Euclidean distance; Bottom: KL distance.

In Figure 4, we show the simulation results for the same SNR combinations as in Figure 3, but the clustering is on one future forecast (forecast length 1). Clearly, the proposed KL distance measure still achieves better mean CERs than the traditional Euclidean measure when clustering at one future time point. Consistent with the results in Figure 3, when the noise standard errors of the two groups of time series differ (the top-left and bottom-right corners of the density plots), the KL distance yields much smaller error rates than the Euclidean distance does.

When comparing Figure 3 and Figure 4, we find that for the same SNRs, clustering based on 10 forecast time points tends to produce better results than clustering based on just one future time point, which suggests that a sufficient forecast length is necessary for good clustering based on the forecasts.

5. REAL LIFE DATA STUDY
Two real life data sets are investigated to further evaluate the proposed KL distance: one is the CO2 emissions of all the countries in the world, and the other is the weekly sales amounts by store of a retail chain. A summary of the data sets is shown in Table 1.

Data Set  Number of Series  Length  Frequency
CO2       216               48      Yearly
Sales     43                152     Weekly
Table 1: Summary of Data Sets

The CO2 data consists of the CO2 emissions (metric tons per capita) of 216 countries from 1961 to 2008 (the CO2 data from 1960 to 1999 was used in [1]). There is one CO2 emission time series per country. For a better comparison of the results, we remove the series with missing values, which leaves 146 complete series. The sales data is a small subset of the weekly sales data of a big retail chain in the USA, which has millions of products and thousands of stores. The subset contains the aggregated sales of one department for 43 stores. Each store has 152 weeks of sales history starting in February 2008. There are 43 time series, and no exogenous variables besides the aggregated sales are in the data.

For the real life data sets, we cannot compute the clustering error rate, since the true class labels are not available. In order to compare the quality of the clustering results, we perform a holdout test: we hold out the most recent h periods of data for testing, and the forecasting models are built with the data prior to the holdout periods (the training data). For example, for the sales data, we hold out the most recent 12 weeks of data, and the forecasting models are built with the 140 weeks of data prior to those 12 weeks. After building the appropriate forecasting models, we use them to obtain the forecast values and variances for the holdout periods. Since our focus is the evaluation of different distance functions for clustering the forecasts rather than the accuracy of the models, we simply fit the training data with the best ESM (Exponential Smoothing Model) for both the sales and the CO2 data sets. The candidate ESM models include the simple, double, linear, damped trend, seasonal, and additive and multiplicative Winters methods. We select the best model by the minimum Root Mean Squared Error (RMSE). After computing the distance matrix using either the KL or the Euclidean distance, hierarchical clustering is used.

5.1 CO2 Emission Data by Country
There are 146 complete time series after removing those with missing values.
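The holdout scheme just described can be sketched as follows. As a simplification (an assumption, not the paper's SAS procedure), the "best ESM" search is reduced here to picking the smoothing weight of a simple ESM by minimum one-step RMSE on the training data, then scoring MAPE on the holdout.

```python
import math

def ses_rmse(y, alpha):
    """One-step-ahead RMSE of simple exponential smoothing on y; also returns the final level."""
    s, sse = y[0], 0.0
    for t in range(1, len(y)):
        sse += (y[t] - s) ** 2
        s = alpha * y[t] + (1 - alpha) * s
    return math.sqrt(sse / (len(y) - 1)), s

def holdout_eval(y, h, alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Fit on y[:-h] by minimum RMSE over the alpha grid, then score MAPE on the h holdout points."""
    train, hold = y[:-h], y[-h:]
    best_alpha, best_rmse, level = None, None, None
    for a in alphas:
        r, s = ses_rmse(train, a)
        if best_rmse is None or r < best_rmse:
            best_alpha, best_rmse, level = a, r, s
    mape = 100.0 * sum(abs((v - level) / v) for v in hold) / h  # flat SES forecast
    return best_alpha, level, mape
```

In the paper's setting, the selected model also supplies the forecast variances of Equation (6), which is what the KL distance consumes alongside the forecast means.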
The initial exploration of the data indicates that there is an outlier series (the country Qatar), which has significantly higher CO2 emissions per capita than the rest, so we remove it from further analysis. For the remaining 145 series, we set the holdout period to 5. After clustering the forecasts in the holdout period, we plot the actual time series for each cluster of the forecasts, with the number of clusters set to 3, in Figure 5.

Figure 5: Clusters of the time series for the CO2 data: each line shows a series of the actual CO2 emissions by a country in the 5 holdout years, while the clusters are identified based on the clustering of the forecast series in the 5 holdout years using the Euclidean distance (EUC) and the KL distance (KLD).

We can see that the Euclidean distance separates the countries into 3 clusters: cluster 1 with relatively low CO2 emissions (106 countries), cluster 2 with medium CO2 emissions (29 countries), and cluster 3 with relatively high CO2 emissions (10 countries). When the errors in the forecasts are considered, the clusters are different: with the KL distance, clusters 1, 2 and 3 have 65, 52 and 28 countries, respectively. Figures 6 and 7 illustrate the difference in the clustering results between the KL distance and the Euclidean distance. Notice that the dashed lines are the fits and forecasts, the solid lines are the actual time series (including the holdouts), and the filled areas are the confidence intervals of the forecast periods.

Figure 6: Two forecast series with different forecast errors, separated by the KL distance: Albania in cluster 2 and Afghanistan in cluster 1. When using the Euclidean distance, both countries are in cluster 1. The clusters are shown in Figure 5.

Using the Euclidean distance, Afghanistan and Albania in Figure 6 are both in cluster 1. However, Albania is separated from cluster 1 into cluster 2 when using the KL distance, because the forecasts for Albania have larger errors. Figure 7 shows that Australia and Korea are in different clusters (3 and 2, respectively) when using the Euclidean distance. Instead, they are both put into cluster 3 when using the KL distance, because their forecast distributions are similar.
Figure 7: Two forecast series with similar forecast errors, grouped into one cluster by the KL distance: both Korea and Australia are in cluster 3. When using the Euclidean distance, Korea is in cluster 2 and Australia is in cluster 3. The clusters are shown in Figure 5.

We checked the accuracy of the forecasting models (the best ESM). It turned out that the best ESM models perform very well in forecasting the CO2 data, with a MAPE (Mean Absolute Percent Error) of about 10% in the holdout period.

It is also interesting to observe that the cluster pattern changes over time when we cluster at each time point in the holdout. Table 2 illustrates the cluster membership at each time point in the holdout and the overall cluster membership for four countries with the KL distance. The actual values, forecast values, and confidence intervals for these countries in the holdout period are shown in Figure 8. Within the five years, most countries stay in the same cluster. But the cluster membership of a country can change because of forecast changes. For example, the forecasts for China have an upward trend, and its cluster changes from 1 to 2. The errors in the forecasts can also contribute to changes of cluster membership from one future time point to another.

Table 2: An illustration of the cluster changes over time for the CO2 emissions of four countries (Kenya, China, Mexico, and the USA). The cluster membership for each year is obtained based on the clustering of the forecasts at that year using the KL distance. The overall cluster membership is obtained based on the clustering of the whole forecast time series using the KL distance. For illustration, the number of clusters is set to 3.

Figure 8: The time series plots of the four countries listed in Table 2. The clustering of one future forecast is performed at each of the five years in the holdout period. Dashed lines: the forecast values; solid lines: the actual holdout values; filled areas: the forecast confidence intervals.

5.2 Retail Store Sales Data
We set the holdout period to 12 weeks. With the number of clusters set to 5, the actual time series for the clusters based on the forecasts in the holdout period are shown in Figure 9. With the Euclidean distance, the stores are grouped into 5 clusters of 17, 16, 5, 3, and 2 stores, respectively. With the KL distance, clusters 1 through 5 have 14, 6, 14, 7, and 2 stores, respectively. We checked the forecast accuracy of the models: the best ESM models have a MAPE of about 40% in the holdout period. Notice that for the sales data the forecasting models could be enhanced with other models such as ARIMA [2] and UCM [6], and with exogenous variables such as prices, holidays, promotions, etc.

Figure 9: Clusters of the time series for the Sales data: each line shows a series of the actual sales in the 12 holdout weeks, while the clusters are identified based on the clustering of the forecast series in the 12 holdout weeks using the Euclidean distance (EUC) and the KL distance (KLD).

We then experiment without the holdout; that is, all available data points are used to build the forecasting models. Forecast values for the next six weeks are obtained using the best ESM models, and the KL distances are calculated at three specific future weeks. The interesting forecast lead points and forecast values are marked with squares in Figure 10. Figure 11 shows the cluster changes over time. We can see that some stores that are grouped in the same cluster in the first week (2Jan2011) may be separated into different clusters in later weeks (30Jan2011). Thus the retail company may need to adjust its sales or promotion policy for each store, and may set different business goals from week to week.

Figure 10: Forecasts for the Sales data: each line shows a series of the forecast values in the future 6 weeks. The squares mark the weeks with interesting changes in the forecast clustering pattern.

Figure 11: Cluster changes over time in the future for the Sales data. The cluster membership for each week is obtained based on the clustering of forecasts at that week using the KL distance.

6. CONCLUSIONS AND FUTURE WORK
In this paper we used the KL distance for clustering the forecast distributions of time series. The KL distance requires density functions, and clustering a large number of time series with full density estimation of the forecasts takes a lot of computing resources. We approximated the forecast density with a normal density using the forecast means and variances, which are directly provided by the forecast model, so the KL distances among the forecasts can be easily obtained. We demonstrated the advantage of the KL distance over the Euclidean distance with simulations from autoregressive models. We also showed that the KL distance improved the clustering results on two real life data sets. We would like to enhance the KL distance to handle cases in which the forecast errors are extremely large or small, where the current KL distance may not perform well. Different clustering algorithms such as k-means can be applied as well. Dynamic Time Warping [8] combined with the KL distance, to detect forecast pattern changes for series of different lengths, is another possible extension of this work.

7. REFERENCES
[1] A. Alonso, J. Berrendero, A. Hernandez, and A. Justel. Time series clustering based on forecast densities. Computational Statistics & Data Analysis, 51(2).
[2] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, 3rd edition.
[3] P. S. P. Cowpertwait and T. F. Cox. Clustering population means under heterogeneity of variance with an application to a rainfall time series problem. Journal of the Royal Statistical Society, Series D (The Statistician), 41(1).
[4] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow., 1, August.
[5] A. W.-C. Fu, E. Keogh, L. Y. H. Lau, and C. A. Ratanamahatana. Scaling and time warping in time series querying. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05.
[6] A. C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
[7] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. A, 186:453-461.
[8] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02.
[9] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.
[10] M. Kumar and N. R. Patel. Clustering data with measurement errors. Computational Statistics & Data Analysis, 51, August.
[11] T. W. Liao. Clustering of time series data: a survey. Pattern Recognition, 38.
[12] M. F. Macchiato, L. La Rotonda, V. Lapenna, and M. Ragosta. Time modelling and spatial clustering of daily ambient temperature: An application in southern Italy. Environmetrics, 6(1):31-53.
[13] P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Science, Calcutta, 2(1):49-55.
[14] Y. Kakizawa, R. H. Shumway, and M. Taniguchi. Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441), 1998.


The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Chapter 25 Specifying Forecasting Models

Chapter 25 Specifying Forecasting Models Chapter 25 Specifying Forecasting Models Chapter Table of Contents SERIES DIAGNOSTICS...1281 MODELS TO FIT WINDOW...1283 AUTOMATIC MODEL SELECTION...1285 SMOOTHING MODEL SPECIFICATION WINDOW...1287 ARIMA

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4 4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression

More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation. Dr. Thomas Jensen Expedia.com Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

More information

Time Series Analysis

Time Series Analysis JUNE 2012 Time Series Analysis CONTENT A time series is a chronological sequence of observations on a particular variable. Usually the observations are taken at regular intervals (days, months, years),

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

How To Run Statistical Tests in Excel

How To Run Statistical Tests in Excel How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting

More information

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Time Series Analysis

Time Series Analysis Time Series Analysis Forecasting with ARIMA models Andrés M. Alonso Carolina García-Martos Universidad Carlos III de Madrid Universidad Politécnica de Madrid June July, 2012 Alonso and García-Martos (UC3M-UPM)

More information

Financial TIme Series Analysis: Part II

Financial TIme Series Analysis: Part II Department of Mathematics and Statistics, University of Vaasa, Finland January 29 February 13, 2015 Feb 14, 2015 1 Univariate linear stochastic models: further topics Unobserved component model Signal

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Discrete vs. continuous random variables Examples of continuous distributions o Uniform o Exponential o Normal Recall: A random

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Time Series Analysis and Forecasting Methods for Temporal Mining of Interlinked Documents

Time Series Analysis and Forecasting Methods for Temporal Mining of Interlinked Documents Time Series Analysis and Forecasting Methods for Temporal Mining of Interlinked Documents Prasanna Desikan and Jaideep Srivastava Department of Computer Science University of Minnesota. @cs.umn.edu

More information

Chapter 27 Using Predictor Variables. Chapter Table of Contents

Chapter 27 Using Predictor Variables. Chapter Table of Contents Chapter 27 Using Predictor Variables Chapter Table of Contents LINEAR TREND...1329 TIME TREND CURVES...1330 REGRESSORS...1332 ADJUSTMENTS...1334 DYNAMIC REGRESSOR...1335 INTERVENTIONS...1339 TheInterventionSpecificationWindow...1339

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

Data Mining mit der JMSL Numerical Library for Java Applications

Data Mining mit der JMSL Numerical Library for Java Applications Data Mining mit der JMSL Numerical Library for Java Applications Stefan Sineux 8. Java Forum Stuttgart 07.07.2005 Agenda Visual Numerics JMSL TM Numerical Library Neuronale Netze (Hintergrund) Demos Neuronale

More information

Using simulation to calculate the NPV of a project

Using simulation to calculate the NPV of a project Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial

More information

TOURISM DEMAND FORECASTING USING A NOVEL HIGH-PRECISION FUZZY TIME SERIES MODEL. Ruey-Chyn Tsaur and Ting-Chun Kuo

TOURISM DEMAND FORECASTING USING A NOVEL HIGH-PRECISION FUZZY TIME SERIES MODEL. Ruey-Chyn Tsaur and Ting-Chun Kuo International Journal of Innovative Computing, Information and Control ICIC International c 2014 ISSN 1349-4198 Volume 10, Number 2, April 2014 pp. 695 701 OURISM DEMAND FORECASING USING A NOVEL HIGH-PRECISION

More information

I. Introduction. II. Background. KEY WORDS: Time series forecasting, Structural Models, CPS

I. Introduction. II. Background. KEY WORDS: Time series forecasting, Structural Models, CPS Predicting the National Unemployment Rate that the "Old" CPS Would Have Produced Richard Tiller and Michael Welch, Bureau of Labor Statistics Richard Tiller, Bureau of Labor Statistics, Room 4985, 2 Mass.

More information

2) The three categories of forecasting models are time series, quantitative, and qualitative. 2)

2) The three categories of forecasting models are time series, quantitative, and qualitative. 2) Exam Name TRUE/FALSE. Write 'T' if the statement is true and 'F' if the statement is false. 1) Regression is always a superior forecasting method to exponential smoothing, so regression should be used

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Demand Management Where Practice Meets Theory

Demand Management Where Practice Meets Theory Demand Management Where Practice Meets Theory Elliott S. Mandelman 1 Agenda What is Demand Management? Components of Demand Management (Not just statistics) Best Practices Demand Management Performance

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

Booth School of Business, University of Chicago Business 41202, Spring Quarter 2015, Mr. Ruey S. Tsay. Solutions to Midterm

Booth School of Business, University of Chicago Business 41202, Spring Quarter 2015, Mr. Ruey S. Tsay. Solutions to Midterm Booth School of Business, University of Chicago Business 41202, Spring Quarter 2015, Mr. Ruey S. Tsay Solutions to Midterm Problem A: (30 pts) Answer briefly the following questions. Each question has

More information

The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network

The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network , pp.67-76 http://dx.doi.org/10.14257/ijdta.2016.9.1.06 The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network Lihua Yang and Baolin Li* School of Economics and

More information

Energy Demand Forecasting Industry Practices and Challenges

Energy Demand Forecasting Industry Practices and Challenges Industry Practices and Challenges Mathieu Sinn (IBM Research) 12 June 2014 ACM e-energy Cambridge, UK 2010 2014 IBM IBM Corporation Corporation Outline Overview: Smarter Energy Research at IBM Industry

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

How To Predict Web Site Visits

How To Predict Web Site Visits Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many

More information

OUTLIER ANALYSIS. Data Mining 1

OUTLIER ANALYSIS. Data Mining 1 OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

Introduction to time series analysis

Introduction to time series analysis Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

More information

Module 6: Introduction to Time Series Forecasting

Module 6: Introduction to Time Series Forecasting Using Statistical Data to Make Decisions Module 6: Introduction to Time Series Forecasting Titus Awokuse and Tom Ilvento, University of Delaware, College of Agriculture and Natural Resources, Food and

More information

UNDERGRADUATE DEGREE DETAILS : BACHELOR OF SCIENCE WITH

UNDERGRADUATE DEGREE DETAILS : BACHELOR OF SCIENCE WITH QATAR UNIVERSITY COLLEGE OF ARTS & SCIENCES Department of Mathematics, Statistics, & Physics UNDERGRADUATE DEGREE DETAILS : Program Requirements and Descriptions BACHELOR OF SCIENCE WITH A MAJOR IN STATISTICS

More information

Forecasting the first step in planning. Estimating the future demand for products and services and the necessary resources to produce these outputs

Forecasting the first step in planning. Estimating the future demand for products and services and the necessary resources to produce these outputs PRODUCTION PLANNING AND CONTROL CHAPTER 2: FORECASTING Forecasting the first step in planning. Estimating the future demand for products and services and the necessary resources to produce these outputs

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

Probabilistic Forecasting of Medium-Term Electricity Demand: A Comparison of Time Series Models

Probabilistic Forecasting of Medium-Term Electricity Demand: A Comparison of Time Series Models Fakultät IV Department Mathematik Probabilistic of Medium-Term Electricity Demand: A Comparison of Time Series Kevin Berk and Alfred Müller SPA 2015, Oxford July 2015 Load forecasting Probabilistic forecasting

More information

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009 Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

More information

Improved Trend Following Trading Model by Recalling Past Strategies in Derivatives Market

Improved Trend Following Trading Model by Recalling Past Strategies in Derivatives Market Improved Trend Following Trading Model by Recalling Past Strategies in Derivatives Market Simon Fong, Jackie Tai Department of Computer and Information Science University of Macau Macau SAR [email protected],

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,

More information

Models for Product Demand Forecasting with the Use of Judgmental Adjustments to Statistical Forecasts

Models for Product Demand Forecasting with the Use of Judgmental Adjustments to Statistical Forecasts Page 1 of 20 ISF 2008 Models for Product Demand Forecasting with the Use of Judgmental Adjustments to Statistical Forecasts Andrey Davydenko, Professor Robert Fildes [email protected] Lancaster

More information

Ch.3 Demand Forecasting.

Ch.3 Demand Forecasting. Part 3 : Acquisition & Production Support. Ch.3 Demand Forecasting. Edited by Dr. Seung Hyun Lee (Ph.D., CPL) IEMS Research Center, E-mail : [email protected] Demand Forecasting. Definition. An estimate

More information

Aspen Collaborative Demand Manager

Aspen Collaborative Demand Manager A world-class enterprise solution for forecasting market demand Aspen Collaborative Demand Manager combines historical and real-time data to generate the most accurate forecasts and manage these forecasts

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

TIME SERIES ANALYSIS

TIME SERIES ANALYSIS TIME SERIES ANALYSIS L.M. BHAR AND V.K.SHARMA Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-0 02 [email protected]. Introduction Time series (TS) data refers to observations

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

SINGULAR SPECTRUM ANALYSIS HYBRID FORECASTING METHODS WITH APPLICATION TO AIR TRANSPORT DEMAND

SINGULAR SPECTRUM ANALYSIS HYBRID FORECASTING METHODS WITH APPLICATION TO AIR TRANSPORT DEMAND SINGULAR SPECTRUM ANALYSIS HYBRID FORECASTING METHODS WITH APPLICATION TO AIR TRANSPORT DEMAND K. Adjenughwure, Delft University of Technology, Transport Institute, Ph.D. candidate V. Balopoulos, Democritus

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Forecasting in STATA: Tools and Tricks

Forecasting in STATA: Tools and Tricks Forecasting in STATA: Tools and Tricks Introduction This manual is intended to be a reference guide for time series forecasting in STATA. It will be updated periodically during the semester, and will be

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images Małgorzata Charytanowicz, Jerzy Niewczas, Piotr A. Kowalski, Piotr Kulczycki, Szymon Łukasik, and Sławomir Żak Abstract Methods

More information

IDENTIFICATION OF DEMAND FORECASTING MODEL CONSIDERING KEY FACTORS IN THE CONTEXT OF HEALTHCARE PRODUCTS

IDENTIFICATION OF DEMAND FORECASTING MODEL CONSIDERING KEY FACTORS IN THE CONTEXT OF HEALTHCARE PRODUCTS IDENTIFICATION OF DEMAND FORECASTING MODEL CONSIDERING KEY FACTORS IN THE CONTEXT OF HEALTHCARE PRODUCTS Sushanta Sengupta 1, Ruma Datta 2 1 Tata Consultancy Services Limited, Kolkata 2 Netaji Subhash

More information

Applications of improved grey prediction model for power demand forecasting

Applications of improved grey prediction model for power demand forecasting Energy Conversion and Management 44 (2003) 2241 2249 www.elsevier.com/locate/enconman Applications of improved grey prediction model for power demand forecasting Che-Chiang Hsu a, *, Chia-Yon Chen b a

More information

The CUSUM algorithm a small review. Pierre Granjon

The CUSUM algorithm a small review. Pierre Granjon The CUSUM algorithm a small review Pierre Granjon June, 1 Contents 1 The CUSUM algorithm 1.1 Algorithm............................... 1.1.1 The problem......................... 1.1. The different steps......................

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Univariate and Multivariate Methods PEARSON. Addison Wesley

Univariate and Multivariate Methods PEARSON. Addison Wesley Time Series Analysis Univariate and Multivariate Methods SECOND EDITION William W. S. Wei Department of Statistics The Fox School of Business and Management Temple University PEARSON Addison Wesley Boston

More information

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of

More information

Overview of Factor Analysis

Overview of Factor Analysis Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,

More information

Outline: Demand Forecasting

Outline: Demand Forecasting Outline: Demand Forecasting Given the limited background from the surveys and that Chapter 7 in the book is complex, we will cover less material. The role of forecasting in the chain Characteristics of

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

Mining Airline Data for CRM Strategies

Mining Airline Data for CRM Strategies Proceedings of the 7th WSEAS International Conference on Simulation, Modelling and Optimization, Beijing, China, September 15-17, 2007 345 Mining Airline Data for CRM Strategies LENA MAALOUF, NASHAT MANSOUR

More information

On Entropy in Network Traffic Anomaly Detection

On Entropy in Network Traffic Anomaly Detection On Entropy in Network Traffic Anomaly Detection Jayro Santiago-Paz, Deni Torres-Roman. Cinvestav, Campus Guadalajara, Mexico November 2015 Jayro Santiago-Paz, Deni Torres-Roman. 1/19 On Entropy in Network

More information

Time Series Analysis

Time Series Analysis Time Series Analysis Identifying possible ARIMA models Andrés M. Alonso Carolina García-Martos Universidad Carlos III de Madrid Universidad Politécnica de Madrid June July, 2012 Alonso and García-Martos

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I

Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting

More information

Luciano Rispoli Department of Economics, Mathematics and Statistics Birkbeck College (University of London)

Luciano Rispoli Department of Economics, Mathematics and Statistics Birkbeck College (University of London) Luciano Rispoli Department of Economics, Mathematics and Statistics Birkbeck College (University of London) 1 Forecasting: definition Forecasting is the process of making statements about events whose

More information

Sections 2.11 and 5.8

Sections 2.11 and 5.8 Sections 211 and 58 Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/25 Gesell data Let X be the age in in months a child speaks his/her first word and

More information