Modeling of Sales Forecasting in Retail Using Soft Computing Techniques

Transcription

1 Modeling of Sales Forecasting in Retail Using Soft Computing Techniques Luís Francisco Oliveira Martelo Lobo da Costa Dissertação para obtenção do Grau de Mestre em Engenharia Mecânica Júri Presidente: Orientador: Co-Orientador: Vogais: Prof. José Rogério Caldas Pinto Prof. João Miguel da Costa Sousa Dr. Susana Margarida da Silva Vieira Eng. Hugo Miguel Lampreia Alexandre Prof. Luís Manuel Fernandes Mendonça Novembro

2

3 And the high destiny of the individual is to serve rather than to rule, or to impose himself in any other way. Albert Einstein The one thing that matters is the effort. Antoine de Saint-Exupery

4 Este trabalho reflecte as ideias dos seus autores que, eventualmente, poderão não coincidir com as do Instituto Superior Técnico.

5 Abstract The objective of this work is to address the problem of aggregate daily sales forecasting in retail. Intelligent modeling techniques were applied to this problem. Fuzzy classification and NARX models, feedforward classification and NARX neural networks and adaptive neuro-fuzzy systems were tested over three different forecasting periods in order to test the applicability of each model. A methodology on how to select the forecasting periods is presented. The different forecasting horizons consist of a stationary period, where there are no major variations from a regular weekly pattern; a stationary period with disturbances, where there are some events that have an impact on the weekly pattern; and a non-stationary period, which is constituted by several different events that have major impacts on the sales behavior. It is also presented a methodology on how to construct the models features. These features account the effects that major variables have on sales forecasting. These are the weekly and monthly seasonality, the macroeconomic environment translated into the purchasing power, the major promotions and holidays. Further, each model s parameters are developed. The models that presented accurate training performances are finally tested over the forecasting periods, allowing the obtention of accurate forecasts for the three periods, in particular considering stationary periods and well defined events. Keywords: Sales forecasting, Modeling, Soft computing techniques, Retail v

6 vi

7 Resumo O objectivo deste trabalho aborda o problema da previsão de vendas agregadas no retalho. Na sua abordagem, foram utilizadas técnicas de modelação inteligente. Classificadores fuzzy, modelos NARX fuzzy, redes neuronais de classificação e redes NARX, assim como sistemas neuro-fuzzy foram testados para três períodos de predição diferentes, de forma a testar a aplicabilidade destes modelos. A metodologia de selecção dos períodos de predição é apresentada. Os diferentes horizontes temporais de predição consistem num período estacionário, onde as vendas apresentam uma tendência semanal; um período estacionário com perturbações, onde alguns efeitos perturbam o padrão semanal usual; e um período não-estacionário, constituído maioritariamente por eventos que têm um impacto significativo no comportamento da curva de vendas. É também apresentada a metodologia usada na construção dos atributos constituintes dos modelos. Estes atributos pretendem considerar os efeitos das principais variáveis de influência nas vendas. Nestes eventos englobam-se a sazonalidade semanal, sazonalidade mensal, a tradução do efeito macro-económico no poder de compra, as principais promoções, e feriados ou principais dias festivos. Após as anteriores definições, apresenta-se o desenvolvimento de cada modelo e dos parâmetros constituintes. Os modelos que melhor desempenho apresentaram no treino são usados para teste (ou seja, previsão), para cada um dos períodos anteriormente definidos. Para cada período, obtiveram-se resultados exactos, em particular para períodos estacionários e eventos bem definidos. Palavras-chave: Previsão de vendas, Modelação, Técnicas de Soft computing, Retalho vii

8 viii

9 Acknowledgements I would like to acknowledge my supervisors Professor João Sousa and Doctor Susana Vieira for their guidance through this work. Their technical knowledge, good sense and optimism were very important to drive this work to a good end. I would like to thank SONAE SR for the opportunity of developing this interesting project. It was a period of intense learning, and the experience was a rich complement to my education. I was able to develop skills that for sure will be useful in the future. I would also like to thank Eng. Hugo Alexandre and his team, for receiving me, providing me all the information needed, and giving me great insights and ideas on how to develop this work. A special thanks to João Semeano, who shared this experience in the enterprise, and who was always available to discuss my doubts and to contribute with new ideas. A special acknowledgement to my family, for being always present with words of support and encouragement. To my parents, who always helped me to keep track and focused on the essential, and to my brother, a great company and with whom I spend incredible moments. Finally, I would like to thank all my colleagues and friends that were present in these last years. Thankfully, you are so many that is impossible to thank each one of you for your amazing support here. You know who you are. ix

10 x

11 Contents Acknowledgements ix Contents xi List of Figures xv List of Tables xvii Notation xxi 1 Introduction Sales forecasting Intelligent computing techniques for sales forecasting Main contributions Outline Modeling Data preprocessing Fuzzy Models Rule-based fuzzy models Linguistic models Takagi-Sugeno models Clustering Building models Neural Networks Models of a neuron Architecture Network learning Adaptive neuro-fuzzy inference systems (ANFIS) Architecture Hybrid learning xi

12 3 Preprocessing of sales data Period selection Stationary period Stationary period with disturbances Non-stationary period Feature construction Weekly seasonality Monthly seasonality Purchasing power of costumers Promotions Holidays or festive days Overall overview of prediction periods and inputs Intelligent modeling for sales forecasting Stationary period Fuzzy modeling Classification model NARX model Neural modeling Feedforward classification network NARX network ANFIS modeling Stationary period with disturbances Fuzzy modeling Classification model NARX model Neural modeling Feedforward classification network NARX network ANFIS Non-stationary period Fuzzy modeling Classification model NARX model Neural modeling Feedforward classification network NARX network ANFIS Summary xii

13 5 Results and discussion Stationary period stationary period with disturbances Non-stationary period Conclusions Future improvements Bibliography 81 Appendix 89 A Extended results 91 A.1 Modeling data preprocessing A.2 Intelligent modeling for sales forecasting A.2.1 Stationary period with disturbances A Fuzzy classification model A Fuzzy NARX model A Feedforward classification network A NARX network A.2.2 Non-stationary period A Fuzzy classification model A Fuzzy NARX model A Feedforward classification network A NARX network xiii

14 xiv

15 List of Figures 2.1 Model of a neuron Feedforward classification neural network NARX neural network Model of a neuron Weekly sales from week aa of year A to the week ee of year C per year Real sales against sales without seasonality and major promotions effects Yearly sales and yearly sales without seasonality and major promotions effect Daily sales in weeks aa to bb for all years together Daily sales in weeks aa to cc for all years together Daily sales in weeks 3 to 22 for all years together Weekly seasonality Monthly seasonality with weekly data Monthly seasonality Vector of daily sales for weeks aa to bb from year A to C Promotions in the data Promotions input Type 1 holiday (h 1 ) from year A to C h 2 holiday and week before for each year Week of the type 3 holiday (h 3 ) Holidays input All inputs Performance per number of clusters for the classification fuzzy model - stationary period Performance per number of clusters for the NARX fuzzy model - stationary period Performance per number of HL for the feedforward neural network - stationary period Performance per number of neurons in 1 HL for the feedforward neural network - stationary period Performance per number of HL for the NARX network - stationary period Performance per number of neurons in 1 HL for the NARX neural network - stationary period xv

16 4.7 Performance per number of delays in a 7 single neuron HL NARX network - stationary period Performance per number of membership functions for the ANFIS model - stationary period Sales forecasting with the fuzzy classification model - Stationary period Comparison of all the results - Stationary period Sales forecasting with the fuzzy classification model - stationary period with disturbances Comparison of all the results - stationary period with disturbances Sales forecasting with the fuzzy classification and NARX model - Non-stationary period Sales forecasting with the fuzzy classification model - Non-stationary period A.1 Yearly sales and yearly sales without seasonality and major promotions effect - zoom in weeks aa to bb A.2 Performance per number of clusters for the classification fuzzy model A.3 Performance per number of clusters for the narx fuzzy model A.4 Performance per number of HL for the feedforward NN A.5 Performance per number of neurons per HL for the feedforward NN A.6 Performance per number of HL for the NARX NNl A.7 Performance per number of neurons in each HL for the NARX NNl A.8 Performance per number of delays in the NARX NNl A.9 Performance per number of clusters for the classification fuzzy model A.10 Performance per number of clusters in the fuzzy NARX A.11 Performance per number of HL in the feedforward NN A.12 Performance per number of neurons in the feedforward NN A.13 Performance per number of HL in the feedforward NN A.14 Performance per number of neurons in each HL for the NARX NN A.15 Performance per number of delays for the NARX NN xvi

17 List of Tables 3.1 Weekly seasonality Monthly seasonality Purchasing power Promotion input Type 1 effect Type 2 effect Type 3 holidays input Prediction periods Performance per number of clusters for the classification model - stationary period Performance per number of clusters for the NARX fuzzy model - stationary period Performance per number of HL for the feedforward network - stationary period Performance per number of neurons in 1 HL for the feedforward network - stationary period Performance per number of HL for the NARX network - stationary period Performance per number of neurons in each HL for the NARX network - stationary period Performance per number of delays in a 7 single neuron HL NARX network - stationary period Performance per number of membership functions for the ANFIS model - stationary period Performance per number of clusters for the classification model - stationary period with disturbances Performance per number of clusters for the NARX fuzzy model - stationary period with disturbances Performance per number of HL for the feedforward network - stationary period with disturbances Performance per number of neurons in each HL for the feedforward network - stationary period with disturbances Performance per number of HL for the NARX network - stationary period with disturbances Performance per number of neurons in each HL for the NARX network - stationary period with disturbances Performance per number of delays for the NARX network - stationary period with disturbances xvii

18 4.16 Performance per number of membership functions for the ANFIS model - stationary period with disturbances Performance per number of clusters for the classification model - non-stationary period Performance per number of clusters for the NARX model - non-stationary period Performance per number of HL for the feedforward network - non-stationary period Performance per number of neurons in each for the feedforward network - non-stationary period Performance per number of HL for the NARX network - non-stationary period Performance per number of neurons in each HL for the NARX network - non-stationary period Performance per number of delays for the NARX network - non-stationary period Performance per number of membership functions for the ANFIS model - stationary period with disturbances Performance of each model in the stationary period Performance of each model in the stationary period with disturbances Performance of each model in the non-stationary period Comparison of the best models in testing - stationary period Application of the fuzzy classification model to the different business units - stationary period Comparison of the best models in testing - stationary period with disturbances Application of the fuzzy classification model to the different business units - stationary period with disturbances Comparison of the best models in testing - non-stationary period New classification best model performance - non-stationary period Application of the fuzzy classification model to the different business units - non-stationary period A.1 Performance per number of clusters for the classification model - stationary period with disturbances A.2 Performance per number of clusters for the NARX fuzzy model - stationary period with disturbances A.3 Performance per number of HL for the feedforward network - stationary period with disturbances A.4 Performance per number of neurons in each HL for the feedforward network - stationary period with disturbances A.5 Performance per number of HL for the NARX network - stationary period with disturbances 95 A.6 Performance per number of neurons in each HL for the NARX network - stationary period with disturbances xviii

19 A.7 Performance per number of delays for the NARX network - stationary period with disturbances A.8 Performance per number of clusters for the classification model - non-stationary period.. 97 A.9 Performance per number of clusters for the NARX model - non-stationary period A.10 Performance per number of HL for the feedforward network - non-stationary period A.11 Performance per number of neurons in each for the feedforward network - non-stationary period A.12 Performance per number of HL for the NARX network - non-stationary period A.13 Performance per number of neurons in each HL for the NARX network - non-stationary period A.14 Performance per number of delays for the NARX network - non-stationary period xix

20 xx

21 Notation Symbols A i Fuzzy set of the antecedent B i Fuzzy set of the consequent R i Fuzzy rule K Number of rules x Data sample X Database ŷ Model output y System output µ A (x), µ B (y) Menmbership functions µ R (x, y) Fuzzy relation f i Mapping function of the i th rule β i Degree of fulfullment of rule i v i Cluster center J Objective function U Fuzzy partition matrix V Matrix of cluster prototypes DijA 2 Inner-product distance norm A Non-inducing matrix that determines cluster shape λ Lagrange multipliers π i Parameter vector for the i t h rule θ Matrix of model parameters xxi

22 Ψ Matrix of vector inputs Φ Vector containing the regressor Υ Vector containing the regressand Γ Matrix containing the membership degree x kj Neuron input w kj Weight factor z Weighted sum of the neuron inputs b j Constant input, bias or threshold [x j, 1] T Expanded input vector [w T j, b j] Weights vector ϕ j (z) Neuron output ϕ j Activation function f(x) Output of the neural network t Time d x Delay on the inputs d y Delay on the outputs w i Firing strength p i, q i and r i ANFIS node parameters O ANFIS node output F Function of the fuzzy inference system H Identity function s Sales ws Weekly seasonality ms Monthly seasonality pw Purchasing power p Promotions h Holidays xxii

23 Acronyms ANFIS Adaptive neuro-fuzzy inference system ARIMA Auto-regressive moving average BET Bucharest Stock Exchange Trading Index COG Centre of gravity FCM Fuzzy C-means JIT Just in time HL Hidden layer LSE Least-squares estimator MAPE Mean absolute percentage error MAR Missing at random MCAR Missing completely at random MISO Multi-input single-output MLP Multi-layer perceptron MOM Mean of maxima MSE Mean squared error N Neuron NARX Non-linear auto-regressive with exogenous input NN Artificial neural networks P Parallel mode PCB Printed circuit board POS Point of sales RMSE Root mean squared error SP Series-parallel mode TDNN Time delay neural network TS Takagi-Sugeno VAF Variance accounted for VBR Variable bit rate xxiii

24 xxiv

25 Chapter 1 Introduction Due to the strong and growing competition existing nowadays, the majority of retailers are in a continuous effort for increasing profits and reducing costs [19]. In addition, the variations in consumers demand, which are caused by many factors like price, promotion, changing consumers preference, seasonality, or weather changes, contribute to a fluctuating market behavior [3]. In that sense, an accurate sales forecasting system is an efficient way to achieve higher profits and lower costs, by improving customers satisfaction, reducing product destruction, increasing sales revenue and desiging production plans efficiently [19]. Sales forecasting is the starting point for planning various phases of a firms operations [13], and a crucial task in supply chain management under dynamic market demand which, ultimately, affects retailers and other channel members in various ways [99]. Industry forecasts are especially useful to big retailers who may have a greater market share [3]. Due to ever-increasing global competition and an environment characterized by very short product life cycles and high market volatility [98], this subject plays an even more prominent role in supply chain management when the profitability and the long-term viability of a firm relies on effective and efficient sales forecasts [96]. Theoretically speaking, the way to improve the quality of forecasts is still an outstanding subject of attention [39]. For data containing trend and seasonal patterns, failure to account for these patterns may result in poor forecasts. Over the last few decades several methods such as Winters exponential smoothing, Box-Jenkins, autoregressive integrated moving average (ARIMA) model and multiple regression have been proposed and widely used to account for these patterns. More recently, artificial neural networks (NNs) have emerged as a technology with a great promise for identifying and modeling data patterns that are not easily described by traditional statistical methods in diverse areas as cognitive science, computer science, electrical engineering and finance [3]. Forecasting has earned global acceptance as a decisive part of business planning and decision-making in a diversity of areas such as sales planning, marketing research, pricing, production planning and scheduling [59],[75]. This chapter begins with the definition of sales forecasting, and its main variables. Afterwards is introduced the role that systems engineering and, more specifically, soft computing modeling techniques 1

26 have been having in sales forecasting. Finally, the main contributions of this work and its outline are presented. 1.1 Sales forecasting Sales forecasting refers to the prediction of future sales based on past historical data. Owing to competition and globalization, sales forecasting plays an even more important role as part of the commercial enterprise [98]. Accurate sales forecasting is crucial for profitable retail operations because without a good forecast, either too-much or too-little stocks would result, directly affecting the revenue and competitive position [1]. In that sense, forecasting models contribute to accurate sales estimates for a retailer, may allow its supply chain to effectively control the inventory to achieve just-in-time (JIT); scheduling and arranging the facility utilization, which decreases costs to the supplier, and, therefore, to the retailer [57]. The processes of science and decision making share an important characteristic: success in each depends upon the researcher or decision maker having some ability to anticipate the consequences of their actions [72]. Without sales forecasts, operations can only respond retroactively, leading to poor production planning, lost orders, inadequate customer service, and poorly utilized resources [35],[96]. Recent research has shown that effective sales forecasting enables improvements in supply chain performance [9],[96],[102]. Better forecasts of aggregate retail sales can improve the forecasts of individual retailers because changes in their sales levels are often systematic. For example, around Christmas time, sales of most retailers increase [3]. The growing importance of the forecasting function within companies is reflected in an increased level of commitment in terms of money, hiring of operational researchers and statisticians, and purchasing computer software [95]. In [75],[95], some factors which contributed to the importance of forecasting within organizations are pointed: The increasing complexity of organizations (e.g. number of submarkets served and products offered) and their environments (e.g. changes in technology and demand structures) has made it more difficult for decision makers to take all the factors relating to the future development of the organization into account; Organizations have moved towards more systematic decision making that involves explicit justifications for individual actions, and formalized forecasting is one way in which actions can be supported; The further development of forecasting methods and their practical application has allowed not only forecasting experts but also managers (decision makers) to understand and use these techniques, becoming useful tools within the hole organization. In process industries ranging from oil and gas industry [58], to the high-risk agrochemical and pharmaceutical industries [59],[64], customer demand has been clearly identified by recent studies as the top 2

27 business management driver. Given the importance of customer demand, it is easy to realize the potential surplus of an effective tool for customer demand forecasting in process industries [59]. Since managers in retails are waiting for a suitable tool to support their making the purchasing decisions, they usually rely on their own experience or consult the point of sales system (POS system) to predict the future sales and place purchasing orders. Few decision-makers adopt statistical methods, such as the moving average method or exponential smoothing, to deal with the daily problems commonly. In fact, most conventional sales forecasting methods use either factors or time-series data to determine the sales prediction. The relationship between the past time-series data and the sales prediction is always too complex to acquire an advantageous ordering suggestion by using the unsuited statistical approaches. Practically, the POS system actually provides some forecasting suggestions for the managers to place orders. However, most decision-makers still prefer to place the same quantity as usual or depend on their own intuition instead of model-based approaches. [19] In the present, the variations in consumers demand are caused by many factors like price, promotion, changing consumer preference, seasonality, or weather changes [3]. Regarding conventional sales forecasting methods, most of them used either factors or time series data to determine the forecast. However, the relationship between the factors or the past time series data independent variables and the sales data dependent variable is always quite complicated [57]. Due to nature of the relationships among independent and dependent variables, various computational intelligence methods have emerged over the last two decades as alternatives for building effective predictors. Some of these models are based on neural networks [101], fuzzy models, and hybrid approaches [68], adopting iterative adjustments of a unique model during a sequence of offline parameter updates, and considering all of the data available for the task at each iteration [62]. 1.2 Intelligent computing techniques for sales forecasting From an historical perspective, exponential smoothing methods and decomposition methods were the first forecasting approaches to be developed back in the mid-1950s. During the 1960s, as computer power became more available and cheaper, more sophisticated forecasting methods appeared [59]. Box Jenckins [14] methodology gave rise to the ARIMA models [27], [37],[59], [83]. Later on, during the 1970s and 1980s, even more sophisticated forecasting approaches were developed including econometric methods and Bayesian methods [75]. Time series forecasting models have been widely applied in sales forecasting, such as exponential smoothing models [36],[51], ARIMA models [37], expert systems [61], fuzzy systems [16], [21], and NN models [23],[81],[96] [100]. Soft, or intelligent, computing algorithms, which combined fuzzy theory with neural network has found a variety of applications in various fields, ranging from industrial environment control system, process parameters, semi-conductor machine capacity forecasting, business environment forecasting, financial analysis, stock index fluctuation forecasting, consumer loan, medical diagnosis and electricity demand forecasting [22]. 3

28 One of the major limitations of the traditional methods compared to soft computing is that they are essentially linear methods. In order to use them, users must specify the model form without the necessary genuine knowledge about the complex relationship in the data [23]. Forecasting with fuzzy systems is, generally, accomplished with a combination of other soft computing techniques such as neural networks or genetic algorithms; or a hybrid structure such as ANFIS. Kuo, [54], [55], [56], proposed a fuzzy neural networks with different approaches to learn fuzzy if-then rules for promotion, which were integrated in a neural network for forecasting using time series data. The model performed more accurately than statistical methods and single adaptive neural networks. In [17] a evolving fuzzy neural network was developed for PCB sales forecasting and states that the model can be applied practically as a sales forecasting tool in the PCB industry. ANFIS is used in [5] for estimation of Natural Gas demand. The model provides a better performance than adaptive neural networks and time series performance. ANFIS was used to forecast tourist arrivals in Taiwan in [20], and outperformed a fuzzy time series model, grey forecasting model and Markov residual model. [41] uses an approach based on genetic fuzzy systems and artificial neural networks for constructing a stock price forecasting expert system which outperformed previous methods. ANFIS was also used by [89] to forecast demand for thin-film transistor liquid display manufacturer and the model performed better than other four. One nonlinear model that receives extensive attention in forecasting is the NN model [101]. Inspired by the architecture of the human brain as well as the way it processes information, NNs are able to learn from the data and experience, identify the pattern or trend, and make generalization to the future [23]. The idea of using NNs for forecasting is not new. The first application dates back to Hu, [44], in his thesis, uses the Widrows adaptive linear network to weather forecasting. Werbos, [93], [94], first formulates the backpropagation and finds that NNs trained with backpropagation outperform the traditional statistical methods such as regression and Box-Jenkins approaches [101]. Research efforts on NNs for forecasting are considerable. The literature is vast and growing. Marquez et al., [65], and Hill et al.,[43], review the literature comparing NNs with statistical models in time series forecasting and regression-based forecasting. However, their review focuses on the relative performance of NNs and includes only a few papers[101]. In [4], the use of back-propagation (BP) NN model was presented to analyze the behavior of sales in a medium size enterprise and reported that the BP model generated better forecasts than ARIMA models with interventions did. A NN-based forecasting system to predict the weekly product demand in a German supermarket company was also developed [85]. A comparison between traditional methods with artificial neural networks forecasting aggregate sales was made and concluded that the NN model was able to effectively capture the dynamic nonlinear trend and seasonal patterns, as well as the interactions between them [3]. In [23] another comparison of the performance of NN models and various linear models for forecasting aggregate retail sales and reported that the overall best model is the NN model built on deseasonalized time series data. A proposed evolving NN forecasting model by integrating genetic algorithms and BP NN was concluded to generate more accurate forecasts than traditional statistical models and BP networks in [18]. The application of NN to time-series problems is extremely wide. The first application was to the 4

29 Mackey-Glass time series by Lapedes and Farber, where the feedforward neural networks that can accurately mimic and predict such non linear systems was designed. Also related with chaotic time-series, [30] propose a hierarchically trained NN model in which a dramatic improvement in accuracy is achieved for prediction of two chaotic systems. A series often used as a yardstick to evaluate and compare new forecasting methods is the sunspot series, which has long served as a benchmark and has been well studied in statistical literature, since the data are believed to be nonlinear, non-stationary and non-gaussian. While there are authors that focus on how to use NNs to improve accuracy in predicting sunspot activities over traditional methods, [60], [29], there are others using the data to illustrate a method [25],[90], [91],[92]. There is an extensive literature in financial applications of NNs [6],[86]. NNs have been used for foreign exchange rate [67], [73], [97], stock prices [10],[77], forecasting bankruptcy and business failure [24], [82]. Another major application of neural network forecasting is in electric load consumption study. Load forecasting is an area which requires high accuracy since the supply of electricity is highly dependent on load demand forecasting. A report by [70], states that simple NNs with inputs of temperature information alone perform much better than the currently used regression-based technique in forecasting hourly, peak and total load consumption. In [8], a discussion on the reason why NNs are suitable for load forecasting is presented and proposed a system of cascaded subnetworks. A four-layer MLP to predict the hourly load of a power system is used in [80]. Many other forecasting problems have been solved by NNs. Some of them include student grade point averages [38], ozone level [74], hydrologic streamflow data [47], commodity prices [53], helicopter component loads [40], international airline passenger traffic [69], personnel inventory [45], river flow [46], tool life [34], total industrial production [33], trajectory [71], transportation [32], macroeconomic indices [63], water demand [26], and wind pressure profile [87]. There are many different ways to construct and implement neural networks for forecasting. Most studies use the straightforward MLP networks [52], [84], while others employ some variants of MLP. It should be pointed out that recurrent networks also play an important role in forecasting. In [66], a NARX neural network is compared to the standard neural network based predictors, such as TDNN and Elman network, and successfully improves the predictive performance of the chaotic laser time series and the VBR video traffic time series. Diaconescu,[31], verified the performance of NARX models for several types of chaotic or fractal time series (Mackey-Glass, Fractal Weirstrass and BET). Concluded that NARX models capture efficiently the non-linear dynamic behavior of the system but not without problems. It is made a special reference to limitations in learning and also performance variations with architecture. NARX NN have also been applied to financial areas and in [50], NARX models successfully improved exchange rate prediction performance, compared to radial-based neural networks and ARIMA models. Also, in [78], a NARX model was developed to predict 5 currencies and the gold series as well. It was concluded that it was possible to earn profits with trading on different assets and the NARX outperformed the static approach. 5

30 There are not much single fuzzy expert systems applications to sales forecasting. A fuzzy expert system used for time series forecasting of electric load used by, [28], predicts loads fairly accurately. In [2], a fuzzy system is trained to predict electrical power demand on an hourly basis and is concluded that there is still space for improvement. 1.3 Main contributions In this work, the sales for a retailer in Portugal are forecasted. The approach to the forecasting problem consisted in defining three different forecasting periods and their most influencing features. By comparing different modeling soft computing techniques for each period it was selected the best one.the main contributions of the proposed work are: A comparison between five different types of intelligent models that forecast three different periods. The separation of the forecasting horizons may allow different models to perform better in different periods. The models are: Fuzzy classification models; Feedforward classification neural networks; NARX models of both fuzzy and neural type; ANFIS models. A slightly different approach is proposed by using classification fuzzy models. From literature review, there are only few applications of classification fuzzy systems to forecasting problems, which creates an opportunity to approach the problem differently and which may yield interesting results. The use of two different types of models with the same mapping functions, such as the fuzzy models and feedforward network, and the NARX neural network and NARX fuzzy model; allows a comparison between fuzzy systems and a neural networks of the same type for the forecasting problems. An extensive discussion on how to select and shape attributes for forecasting problems. This is a major part of this work, because the way the features are constructed have a great impact on the potential performance of the models. The understanding of the problem is extremely important to separate important from weak features and to approach the important s construction. Finally, for the retailer, this is a pioneer procedure, as sales forecasting has never been done. The company s closest approach to sales forecasting is the estimation of the effect of a certain type of promotion. In that sense, this work may provide relevant conclusions, or point interesting directions in order to improve business. 6

31 1.4 Outline In Chapter 2, an overview of the modeling techniques is presented. This chapter describes fuzzy systems, neural networks and ANFIS models as well. It presents the standard architectures such as fuzzy inference systems and feedforward networks, and also the corresponding NARX architectures. Chapter 3 will introduce all the preprocessing phase. The selection of the different forecasting periods is developed, as well as the definition of all the features considered in this approach. In Chapter 4, intelligent modeling is applied to the problem given. The different types of models (standard fuzzy classification model, NARX fuzzy model, feedforward neural network, NARX neural network and ANFIS model), are built for each forecasting horizon, and a comparison between each model training performance is made. The following Chapter, 5, the forecasted sales performance for the best models is presented. Then, for the best model in each period, the modeling approach is made for the different business units in the company. Those results are also presented. Finally, in Chapter 6, the results are summarized and conclusions are outlined. In addition, future improvements of this work are submitted. 7

32 8

33 Chapter 2 Modeling Some systems can be represented by white-box models, that are based on the knowledge of the nature of the system and approximate its almost linear behavior with linear models, or linear models presenting the system around a working point. Others, although nonlinear, can be described by mathematical laws and can be called white-box models due to their lack of complexity and computational effort, and are highly desirable. However, most of the real world systems are complex and it s virtually impossible to completely understand their underlying mechanisms. A different approach suggests that a process may be approximated by a sufficiently general blackbox structure. In that sense, the problem is reduced to the obtention of a model structure which is able to capture the system s dynamics and nonlinearity. The identification problem consists of estimating the model s parameters. A problem concerning this type of models is related to its interpretation and physical significance. Such models have the handicap of not being able to be used for analyzing the system s behavior in other way besides numerical simulation. In that sense, soft computing techniques and models can be used instead of the white-box and blackbox approach. These are also named gray-box techniques. This type of methodologies try to capture both white-box and black-box advantages, such that the known parts of the system are modeled using physical knowledge, and the unknown using black-box techniques, with suitable approximation properties. Fuzzy modeling is a gray-box methodology that employ techniques motivated by biological systems and human intelligence to model and control dynamic systems. It explores alternative representation schemes, using natural language, rules, and possess formal methods to incorporate extra relevant information. Fuzzy modeling methods are typical examples of techniques that make use of human knowledge and deductive processes. Artificial neural networks, on the other hand, realize learning by imitating the functioning of the biological neural system on a simplified level. They are robust and have good generalization properties. Although fuzzy systems represent their structured knowledge in the form of if-then rules, they lack the adaptability do deal with changing external environments. By incorporating neural network learning concepts in fuzzy inference systems, the result is a neuro-fuzzy modeling approach, which tries to create a synergy between the two models characteristics [49]. 9

34 In this chapter, the modeling techniques applied are described. Due to data analysis and feature selection needed in the application of intelligent modeling to the sales forecasting phase, an introductory section of such techniques is presented. Then, the chapter continues, by presenting fuzzy modeling. Further, neural networks are presented. The chapter ends with ANFIS models. 2.1 Data preprocessing Data preparation is a necessary step towards a successful modeling phase. It s possible to define data with a set of attributes or input variables, more correctly named as features [88]. Features can assume different types, such as binary, or continuous. In a computer, a feature might be brand, model, processor s speed, number of processors, number of cores, memory, and so on. Before we start modeling, we have to prepare our data set appropriately, that is, we are going to modify our dataset so that the modeling techniques are best supported but least biased. This phase can be divided into four others [11]: Data selection, Data cleaning, Feature construction, Data integration. Data selection Data selection is needed when there is a lot of data available, and not all the data is relevant for the problem proposed to solve. Inclusion of all the data to the database, besides not guaranteeing inclusion of information, it can be harmful for the representative attributes present. Besides, the computational performance is also important, and the exclusion of redundant features is preferable. Data selection consists, then, in two main steps: feature selection and dimensional analysis [11]. Despite feature selection is primarily performed to select the optimal subset of available features, it can have other motivations such as feature set reduction or performance improvement. The three main methods for feature selection consist in filter methods, wrapper methods and embedded methods [76]. Using filter methods, features are scored independently and the top features are used by the model. In univariate methods, as in multivariate methods, features are based on general characteristics of data to be evaluated, and are ranked according to some criteria, which might be, for example, correlation or Chi-square to target (if available). No model is involved. On the one side, an advantage is that are fast an scalable methods, independent of the model. On the other side, they ignore interaction with the classifier [76]. Wrapper methods, which can be deterministic or randomized, consist on an iterative approach, where many feature subsets are scored based on classification performance and the best is used. Some advantages are that by using the classifier as a black box, are universal an simple. On the other side, they are easy to 10

35 over-fit and more computationally expensive than filter methods. Some examples are sequential forward selection and sequential backward elimination [76]. Embedded methods, such as decision trees or weighted naive Bayes, do not retrain the model at every step and search feature selection space and model parameters simultaneously. These methods have better computational complexity than wrapper methods, but are classifier dependent [76]. Dimensional reduction is achieved by techniques that generate visual displays that transmit the idea of how many intrinsic dimensions there are in data and how much the variance can be preserved by a projection to a lower dimension. These techniques also construct a linear mapping of features, which can be very efficient but often difficult to interpret [76]. Data cleaning Data cleaning is used to remove noise from data. Noise are the simple errors, inconsistencies, abbreviations, spelling mistakes that distort data. In order to have consistent data, this values should be corrected [11]. It also consists of solving the missing data problem, that might happen. The reason for treating missing data, is, mainly, because some methods can t deal with empty fields. There are several suggestions for treating missing values, such as MCAR, MAR, selecting a single representation for all empty values, or constructing new attributes [11]. Data construction Data construction is the process of transforming the existing features and construction of new ones. The main reasons for the existence of this step have to due with tool implementations, that require transformation in order to work; absence of any background knowledge; and given some knowledge, the goal is to give helpful hints to the modeling techniques to improve expected results [11]. The main objectives with data construction are to achieve operability (scale conversion, problem reformulation, among others); assure impartiality; and maximizing efficiency. To achieve operability, the main criteria is the scale conversion. Scale conversion has to due with modeling techniques, as some techniques assume all features as numerical (e.g. regression or neural networks), other rather work with categorical features or perform more accurately when a discretization is carried beforehand. This means redefinition of boundaries and ranges for numerical values. Some techniques such as equi-witdth provide intervals of the same width. Equi-frequency assures all intervals contain the same number of objects whenever possible, among others [11]. Another reason for transforming input variables is to ensure that all feature has the same a priori influence. For this approach there are several functions to normalize or standardize data: min-max normalization; z-score standardization, robust-scores standardization, and decimal scaling. Min-max normalization and decimal scaling require careful data cleaning, as a single outlier can push the majority of data to a small interval [11]. The performance of many standard learning algorithms degrades if redundant features are supplied. Sometimes, features do contain all necessary information, but is hard for the modeling technique to extract 11

36 the knowledge from data. In that sense, the idea of construction overcomes these limits through offering new attributes that may have different features. Derived features from existing ones may be constructed, such as computation between features to the creation of new ones. Another interesting approach might be changing the baseline in a way that an important variable can stand out clearly. Grouping may also be useful if it is assumed that data form natural clusters. Defining hyperplanes might be of use, for example a binary feature that is true if some condition feature-related and false otherwise. For testing dependencies at various levels, may be useful to group values of a variable to meaningful aggregates, as a priori is not clear the level of granularity of data [11]. Model error can be decomposed into machine learning bias and variance. While variance is caused by sample errors and causes over-fitting, the bias is a systematic error that expresses the lack of fit between the model and data. Sample error can be reduced by acquiring a larger set and, in the sense of reducing bias error, feature construction can help to reduce the limitations of the learning error [11]. Data integration Data integration is the process of gathering data spread over different tables and databases together. It can be vertical, by concatenating to tables holding similar information, or horizontal, where different types of information in spread tables is also integrated in one [11]. Essentially, is a process of concatenating the entries from two databases and matching which entries should append to each other. 2.2 Fuzzy Models From the modeling techniques based on soft computing, fuzzy modeling is one of the most appealing [79]. Fuzzy models provide transparent, gray-box description of the process dynamics that reflects the nature of the process nonlinearity for the low-order nonlinear systems [7], [79]. Rule-based models describe relationships between variables by means of if then rules, which take the following general form, If antecedent proposition then consequent proposition. By relating the qualitative value of one variable to the qualitative value of the other variable, these rules establish relations between system s variables, and their logical structure contributes to a better understanding of the model, closer to the way humans reason about the real world. The fuzzy set theory behind the rules serves as an interface between the qualitative variables and the input and output numerical values [7]. The model s rule-base nature also enables a linguistic description of the system s knowledge, which is captured by the model [79]. Fuzzy rules are constituted by linguistic variables that have linguistic terms associated. For example, a linguistic variable, x, such as speed, can have three (in this case) linguistic terms associated: A 1 slow, A 2 medium, A 3 fast, each one with its membership function over the domain of the variable. 12

37 It is usually required that linguistic terms satisfy some properties. The strongest condition that may be subjected to is the fuzzy partition, which means that for each x, the sum of membership degrees equals one Rule-based fuzzy models A fuzzy system is constituted by the rule base, which contains a selection of fuzzy rules; a database, which defines the membership functions used in the fuzzy rules; and a reasoning mechanism, which performs the inference procedure (usually the fuzzy reasoning) upon the rules and given facts to derive a reasonable output or conclusion. The fuzzy system can take crisp or fuzzy inputs and produces fuzzy outputs. Sometimes is useful to have a crisp output, and in that sense is needed to have a defuzzification method to extract the crisp value that best represents the fuzzy set. By a number of if then rules that describe behavior of the system, the fuzzy inference system implements a non-linear mapping from its input space to output space. There are two major types of fuzzy models [15], Linguistic fuzzy model also known as Mamdani model, where both antecedent and consequent are fuzzy propositions. Takagi-Sugeno (TS) fuzzy model where the consequent is a crisp function of the antecedent variables, rather than a fuzzy proposition Linguistic models These models are constituted by fuzzy antecedents and consequents. The input-output mapping is realized by the fuzzy inference mechanism that, given the knowledge stored and the input value, provides the corresponding output value. These models represent static mappings of systems. As in most engineering applications is common to work with numeric data, a fuzzification and defuzzification block should be added to the model in order to convert the data to a convenient format. A general rule of a linguistic fuzzy model is given by: R i : If x is A i then y is B i, i = 1, 2,...K (2.1) where R i denotes the ith rule and K is the number of rules. The antecedent variable is given by x X R n (n is the number of inputs) and represents the input of the fuzzy system. Analogously, y Y R p (p is the number of outputs) is a consequent variable representing the output of the fuzzy system. A i and B i are fuzzy sets described by membership functions µ Ai (x) : X [0, 1] and µ Bi (y) : Y [0, 1], respectively. The process of deriving an output fuzzy set is called inference. The inference mechaninsm in the linguistic model is based on the compositional rule of inference, where for the above rule can be regarded as a fuzzy relation R : (X Y ) [0, 1], computed by µ R (x, y) = I(µ A (x), µ B (y)), (2.2) 13

38 where I can be a fuzzy implication or a conjunction operator. The generalized modus ponens rule bases the inference mechanism: If x is A then y is B x is A y is B The output fuzzy set is derived by the max t composition B = A R. (2.3) If it is desired to deffuzify the output, as explained previously, several methods can be applied such as the centre-of-gravity (COG) and the mean-of-maxima (MOM). For discrete domains Y, the COG computes the centre of gravity for the fuzzy set B as a weighted sum, Z cog y i (B ) = Nq q=1 Nq q=1 µ B (y q )y i,q µ B (y q ) where N q is the cardinality of the discretized domain Y and y q is the qth discrete point in the quantization of Y. (2.4) Takagi-Sugeno models The main difference from the linguistic models to the TS models is the consequent form, where in the TS model the output is computed as a crisp function rather than a fuzzy set: R i : If x is A i then y i = f i (x), i = 1, 2,..., K, (2.5) where x R n is the multidimensional input (antecedent) variable and y i R p is the also multidimensional output (consequent) variable. R i denotes the ith rule, and K is the number of rules in the rule base. A i is the antecedent fuzzy set of the ith rule, defined as in the linguistic model: µ Ai (x) : R n [0, 1]. (2.6) The general form of a consequent function, f i, is a first order polynomial, naming the model first-order TS model (or affine TS model), y i = a T i x + b i. (2.7) while when being just a constant is the zero-order TS model, which is a particular case of the linguistic model (singleton model). The inference in the TS model is reduced to a simple algebraic expression K β i (x)y i i=1 y = (2.8) K β i (x) i=1 where β i = µ Ai (x). 14

39 2.2.2 Clustering The partition of the available data into subsets and approximation of each subset by simpler modes is an effective approach to the identification of complex nonlinear systems. Fuzzy clustering is a tool that parts the data into subgroups different from each other but where changes between them are smooth and gradual. The objective of clustering is the grouping of similar data. It is an unsupervised (learning) method. The fact it doesn t rely on assumptions common to statistical methods, for example, makes them useful in situations where prior knowledge doesn t exist. A cluster may be defined as a group of objects that are more similar to one another than to data outside that group (or members of other clusters). Similarity should be defined as mathematical similarity. It can be defined in several ways but, generally, mathematical similarity is defined by means of a distance norm. Distance can be measured among the data or as a distance to some prototypical object (or center) of the cluster. Data clusters can have different geometrical shapes, sizes and densities. The performance of clustering algorithms depends not only on these previous parameters but also by the relations and distances among clusters. There are different clustering algorithms, but most of them are based on the minimization of the basic c-means functional, which is formulated as follows: c N J(X; U, V ) = (µ ik ) m x k v i 2 A (2.9) i=1 k=1 where U is the partition matrix containing the normalized membership values, U = [µ i k] M fc, X is the matrix containing the data matrix and m [1, ) is a weighting exponent which determines the fuzziness of the resulting clusters. V is the vector containing the cluster prototypes, or centres, given by V = [v 1, v 2,..., v c ], v i R n, (2.10) which have to be determined. In this work the algorithm used was the fuzzy c-means algorithm, so it will be developed on the following paragraphs. The cluster centers are determined by a squared inner-product distance norm [12], DikA 2 = x k v i 2 A = (x k v i ) T A(x k v i ) (2.11) The squared distance between each data point x k and the cluster centre v i in eq. (2.9) is weighted by the power of the membership degree of that point (µ ik ) m and the cost function may be regarded as a measure of the total variance between each point and cluster centre [7]. For U to represent a hard partition, there are some constraints that need to be respected. µ ik {0, 1}, 1 i c, 1 k N, (2.12) c µ ik = 1, 1 k N, (2.13) i=1 N 0 < µ ik < N, 1 i c. (2.14) k=1 15

40 The choice of A influences the shape of the clusters obtained. A possible approach would be to set A = I, which would induce the standard Euclidean norm. For this value of A the algorithm would obtain hyperspherical clusters. The minimization of the functional represents a nonlinear optimization problem that can be solved through several methods, being the most popular the one used in this work, the Picard iteration through the first-order conditions for stationary points of eq. (2.9), known as the fuzzy c-means (FCM) algorithm [12]. Using Lagrange multipliers is possible to obtain the stationary points of the objective function. ] c N N J(X; U, V, λ) = (µ ik ) m DikA 2 + µ ik 1, (2.15) i=1 k=1 k=1 λ k [ c i=1 and by setting the gradients of J with respect to U, V and λ to zero. If DikA 2 > 0, i, k and m > 1, then (U, V ) M fc R n c may minimize the functional only if µ ik = and 1 c, 1 i c, 1 k N, (2.16) (D ika /D ika ) 2/(m 1) j=1 v i = N (µ ik ) m x k k=1 ; 1 i c. (2.17) N (µ ik ) m This solution also satisfies the remaining constraints (2.12) and (2.14) [12]. k= Building models Fuzzy identification is the construction of fuzzy models from data. It can be regarded as a search for a decomposition of a nonlinear system, which gives a desired balance between the complexity and the accuracy of the model, effectively exploring the fact that the complexity of systems is usually not uniform. As it cannot be expected that the decomposition has enough a priori knowledge available, there are methods used to automate the generation of the decomposition, such as fuzzy clustering algorithms [7]. Fuzzy clustering algorithms enable the identification of relations between variables by partitioning data into different groups (clusters). By applying these algorithms on data obtained from measurements it is possible to obtain the fuzzy models. According to [79], the construction of fuzzy models through clustering consists in the following steps. 1. Determine the model structure suitable to the problem by identifying the relevant system variables. 2. Collect data from the system by measuring, computing or constructing the relevant system variables. 3. Select a clustering algorithm and determine the values of the parameters relevant to a clustering method used. 4. Select the number of required clusters. 5. Cluster the data with the selected clustering algorithm. 6. Obtain membership functions from clusters by projection or otherwise. 16

41 7. Determine a fuzzy rule from each cluster by using the membership functions obtained. 8. Validate the model. Typically, this procedure is flexible and the user often has to iterate through several steps in order to have a suitable model. When there is no prior knowledge about the system, the model will be constructed only based on measurements, and the clustering is in the product space variables of the system. The identification of this type of models is made in two steps [79]. 1. Structure identification. 2. Parameter estimation. Structure identification, which is what allows the transformation from the dynamic identification problem into a static nonlinear regression, can be executed in three steps [7]: (1) Choice of input and output variables, (2) Representation of the system s dynamics, (3) Choice of the fuzzy model s granularity. The selection of the input and output variables is based on the objective of the model, on the prior knowledge related to the process dynamics and on other variables that may contribute to the system s nonlinearity. Some statistical analysis, such as correlation, can be of aid in this process. To assess the best model, the performance of several models with different variables can be compared [7]. Concerning the representation of the system s dynamics, a common approach is to transform the identification of a dynamic system into a static regression problem. The choice of the transformation is based on prior knowledge, intuition, and understanding of the problem in hand. It can be regarded as a mapping from a domain of time signals into a space of variables, which are named regressors, that fully determine the state of the system. The choice of the regressors is an important step, as a poor structure may lead to inaccurate modeling, while a richer than necessary structure might lead to over-fitting the data [7]. The granularity of the model is related to the number of linguistic terms associated to each variable, and therefore to the number of rules. This step is intrinsically connected to clustering, and is one of the first parameters to be selected on modeling [7]. In this work two different fuzzy structures are used: a classification model, and a NARX fuzzy model. The classification model maps the input-output relation from the same time space. For instance, for the problem in hand, relates the sales in a certain day, as a consequence of the instances (features) happening that day as: ŷ(t + 1) = f(x(t + 1)) (2.18) In the case of the NARX model, for a multi-input single-output (MISO), the system can be described by equation (2.19) [79]. ŷ(t + 1) = f(x(t)) (2.19) 17

42 Product-space clustering is based on the data in the product space X Y of the regressor and the regressand. the inputs and outputs. For N samples, n states, N d the actual number points used and t h the highest order of regressands Υ, can be defined as follows[79],[7]. Then, the vector containing the regressor Φ, and the vector containing the x(t h ) T y(t h + 1) Φ =., Υ =. x(n 1) T y(n) (2.20) Having defined the model s structure, i.e., input and output variables, it follows parameter estimation, where the number of rules K, the antecedent fuzzy sets A i, and the consequent parameters a i and b i for i = 1,..., K are determined. As previously mentioned, fuzzy clustering in the Cartesian product space X Y is applied to the partition of data into subsets, which can be approximated by local linear models. A further analysis on clustering was developed on subsection [79]. The number of clusters defines the number of rules of the model. strategies have been developed to select the appropriate number of clusters: It influences accuracy and two Use different number of clusters to build the model and evaluate the goodness of partitions. Use large enough number of clusters and then reduce the number by combining clusters compatible with some predefined criteria. Each cluster represents one rule. The multidimensional membership functions A i are given analytically by computing the distance of x(t) from the projection of the cluster centre v k onto X, and then computing the membership degree in and inverse proportion to the distance. From the fuzzy c-means algorithm, see subsection 2.2.2, the distance given by if DikA 2 > 0 for 1 i c, 1 k N, then D 2 ika = x k v i 2 A = (x k v i ) T A(x k v i ) (2.21) µ ik = 1 c, (2.22) (D ika /D jka ) 2/(m 1) j=1 otherwise µ ik = 0 if D ika > 0, and µ ik [0, 1] with c i=1 µ ik = 1 and where m is the fuzziness parameter. The estimations of the consequent parameters is made through the least-squares method. For (θ i ) T = [(a i ) T, b i ], let Φ e denote the matrix [Φ, 1], and let Γ i denote a diagonal matrix in R N d N d having the membership degree µ Ai (x(t)) as its lth diagonal element. Denote Φ the matrix in R N d K(n+1) composed from matrices Γ i and Φ e as follows Denote θ the vector in R K(n+1) given by Φ = [(Γ 1 Φ e ), (Γ 2 Φ e ),..., (Γ K Φ e )]. (2.23) θ = [ (θ 1 ) T, (θ 2 ) T,..., (θ K ) T ] T. (2.24) 18

43 The resulting least squares problem, Υ = Φ θ + ɛ, has the solution θ = [ (Φ ) T Φ ] 1 (Φ ) T Υ. (2.25) The optimal parameters a i and b i are given by a k = [ θ s+1, θ s+2,..., θ s+n] T, (2.26) b k = [ θ s+n+1], where s = (k 1)(n + 1). (2.27) With the determination of the parameters a i and b i, the fuzzy model identification procedure is completed. 2.3 Neural Networks In its most general form, artificial neural networks, commonly referred to as neural networks, are machines designed to model the way in which the brain performs a particular task or function of interest [42]. One of the main reasons why NN were developed was to capture some of the advantages that biological neural networks have over computational systems. These networks can be most adequately characterized as computational models with particular properties such as the ability to adapt or learn, to generalize, to cluster or organize data, and which operation is based on parallel processing. They can be defined in the following way: A neural network is a massively parallel-distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects [42]: 1. Knowledge is acquired by the network through a learning process. 2. Interneuron connection strengths known as a synaptic weights are used to store the knowledge. As the name indicates, a neural network is a network consisting of a number of neurons (nodes) connected through directional links. Each node represents a process unit, and the links between nodes specify the causal relationship between the connected nodes. All, or part of the nodes are adaptive. Being adaptive means that the outputs of these nodes depend on modifiable parameters pertaining to these nodes. The modifiable parameters are updated according a learning rule, or learning algorithm, and the update should guarantee a minimization of a prescribed error measure, which can be a mathematical expression that measures the discrepancy between the networks actual output and a desired output [49]. The main advantages of neural networks are its computing power through its massively process parallel structure, its learning capacity, and therefore generalize, which means computing accurate output for inputs not provided in training [42]. They offer useful properties, being the following some of the interesting ones: 1. Nonlinearity A neuron is a nonlinear device, and by defining a neural network an interconnection of neurons is, itself, nonlinear. 19

44 2. Input-output mapping The network learns from examples where an input signal has a desired output. The learning algorithm changes the synaptic weights of the network through training until reaches steady state. So, then, the network learns from examples by constructing an input-output mapping of the given problem. 3. Adaptivity As previously said, the network may adapt its weights to changes in the environment. More over, it can be programmed to adapt its synaptic weights in real time or retrain to deal with minor changes. 4. Fault tolerance If a neuron, or some of the connecting weights, are damaged, the damage must be extensive before provokes considerable error in the response, due to the massively distributed computational capacity of the network. The remaining Chapter continues with a presentation of a model of a Neuron, and network architectures, where the two architectures proposed are briefly explained. It concludes with network learning algorithms Models of a neuron A neuron is a fundamental unit to the operation of a neural network. It is an information-processing unit that is constituted by three basic elements (Figure 2.1): 1. Synapses or connecting links; 2. An adder; 3. An activation function. It may also include a threshold. The synapses are characterized by having weights. When the input signal flows through 2 different neurons, it is multiplied by the synaptic weight connecting those neurons, w kj. The adder works as a linear combiner by summing the input signals. The activation function limits the amplitude of the output signal (a squashing function)[42]. neuron Inputs neuron weights neuron x1j x2j w1j w2j... z activation function ᵠ(z) neuron Output ᵠ(z) xkj wkj bj bias Figure 2.1: Model of a neuron Steps 1 and 2 can be described by the following equation: p z k = w kj x kj. (2.28) j=1 20

45 Sometimes is useful to add a bias, an extra weight from a constant input, to the sum of weighted. The effect of the bias is to amplify the net input of the activation function and its contribution results in: p z = w kj x kj + b j = [ w T ] j b j x j, (2.29) j=1 1 where [x j 1] T is the expanded input vector and [ w T j b j] is the weights vector. Concerning the activation function, there are several types of functions. Its possible to have threshold functions, piecewise-linear functions or sigmoid functions. In the most common case we have the sigmoid function, which is defined as a strictly increasing function that exhibits smoothness and asymptotic properties. One example of the sigmoid function is the logistic function, defined by [42]: ϕ(z) = exp( az) (2.30) Allowing the activation function to have negative values has analytic benefits, and has neurophysiological evidence of an experimental nature Architecture The organization of the neurons through the network determines the networks structure. Its possible to differentiate networks according to the number of layers, single-layer networks or multi-layer networks. Single-layer networks are the simplest form of layered networks. It consists of just one layer of source inputs that projects onto a output layer of neurons. Multi-layer networks can have multiple hidden-layers (HL). It is also possible to classify a NN as fully connected or partially connected networks. Fully connected networks are structures in which every node in a layer is connected to every node in the next layer, while in partially connected networks its possible to have neurons that connect only to a reduced number of the neurons in the next layer. Another way to classify NNs architectures can be concerning the direction information flows: Feedforward Networks the output of each node is connected to nodes in the proceeding layer. The information flows only in this direction. Recurrent Networks If there is a feedback link that forms a circular path in a network. Connections can be feed-backed to neurons in the same layer or to neurons in the preceding layer. In this work were used two types of neural networks. Firstly it was used a fully connected feedforward neural network which considers the time-series problem as a set of input-output pairs with causal relationship. The other network was a specific type of recurrent networks, NARX, described next. The feedforward network is, also called a multilayer perceptron (MLP), constituted by one or more hidden layers. In a dynamic problem, the MLP maps the input output relation as the classification fuzzy model: ŷ(t + 1) = f(x(t + 1)), (2.31) 21

46 where x is the input vector. The function of these layers and neurons in it is to intervene between the external input and the network input. By increasing the number of hidden layers, the network is able to extract higher-order statistics. Typically, the neurons in each layer have their inputs as outputs of the neurons in the preceding layer only. The output of the neurons in the output layer constitute the overall response of the model to the activation pattern supplied by the source nodes in the first layer. In Figure 2.2, a feedforward neural network is presented Input Layer First Hidden Layer... Nth Hidden Layer Output Layer Figure 2.2: Feedforward classification neural network The NARX network is a dynamical neural architecture commonly used for input- output modeling of nonlinear dynamical systems. When applied to time series prediction, the NARX network is designed as a feedforward Time Delay Neural Network (TDNN) [66]. Some qualities related to the use of NARX with gradient-descending learning algorithms are the more effective learning than other NNs, and a faster convergence and generalization [66]. The network consists of a MLP, and predicts the output through the computation of a window of past inputs and targets. The mathematical model is described in the following equation. ŷ(t + 1) = f[y(t),..., y(t d y + 1); x(t), x(t 1),..., x(t d x + 1)] (2.32) ŷ(t + 1) = f[y(t); x(t)] (2.33) The states of the NARX network correspond to a set of two tapped-delay lines (Figure??). One the d x taps on the input values, and the other the d y taps on the output values. The architecture of the NARX network changes from training to testing. So, when predicting a time series using NARX, there are two different structures used in two different phases: 1. During training, the network has a general form of a feedforward MLP, where the inputs are a window of past inputs and outputs, defined by two tapped delays, n u and n y, respectively. Also referred as Series-Parallel (SP) mode [66]. 2. After training, the loop is closed with the predicted outputs, and the network is simulated using the same size tapped delays, but instead of using real inputs and outputs, there are only used past real inputs and the past-predicted outputs, becoming a recurrent net. This mode is also known as the Parallel (P) mode [66]. 22

47 u(t) Z -1 u(t-1)... y(t+1) Z -1 u(t-d u+1) y(t-d y+1) Z -1 y(t) Z -1 y(t-1)... Z -1 Figure 2.3: NARX neural network Network learning One of the major interests concerning NNs is their learning capacity. The learning capacity results in improvements in performance over time. As referred before, the learning consists in an update of synaptic weights and bias (or thresholds). Ideally, the network becomes more knowledgeable about its environment after each iteration of the learning process (also known as training process) [42]. There are three basic learning classes [42] [49]: Unsupervised learning In this type of learning, the network is optimized concerning a taskindependent measure of the quality of representation that the network is required to learn. Reinforcement Learning Is the on-line learning of an input-output mapping through a process of trial and error designed to maximize a scalar performance. Supervised learning The network is provided with the desired (or target) output to train the input vector. The training stopping criteria is based on the output error. One of the most used algorithms is the back-propagation algorithm, where the error terms are propagated backwards to the flow of the network, i.e., form the output nodes, to the input nodes. The main steps of the back-propagation algorithm consist in: Step 1: Calculate the predicted output of the network, f(x), from the input vector; Step 2: Compute the error, generally the mean-squared error, between the real output and the predicted output: d ( ) 2 MSE = E ˆθj θ j (2.34) Step 3: Using a gradient method, such as Levenberg-Marquart or steepest descent, adjust the parameters of the network; Step 4: Repeat Step 1 until some criteria is reached (number of training epochs (iterations), gradient level reached). j=1 It can be stated that a neural network is capable of learning through weight adjustment and generalizing knowledge to unseen test samples. Concluding, it successfully mimics the human brain learning process. 23

48 2.4 Adaptive neuro-fuzzy inference systems (ANFIS) The fuzzy modeling approached in section 2.2, has found numerous applications [48]. Despite this fact, there are some aspects that need a better understanding, such as the lack of methods for transforming human knowledge or experience into the rule base of the system, or the need for effective methods for tuning the membership functions to minimize the output error measure or maximize performance index. In that sense, the adaptive neuro-fuzzy inference system presents itself as a system able to serve as a base for constructing a set of fuzzy if-then rules with appropriate membership functions to generate the stipulated input-output pairs [48]. Adaptive neuro-fuzzy inference systems are functionally equivalent to fuzzy inference systems [48]. It is like a Takagi-Sugeno system mapped into a neural network. There are several configurations possible, but the most common has 5 layers, and is the one being described further. Nodes in different layers have different structures. Next it will be presented ANFIS architecture and its hybrid learning rules, which are the most distinctive characteristics of this type of models Architecture Assuming a two input x and y and one output f first-order TS model, a general rule is of the form: R i : If x is A i and y is B i, then f i = p i x + q i y + r i The model can be observed in Figure 2.4. Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 x y x y A 1 B 1 A 2 Ν Ν 2 B 2 w 1 w 1 w 1 f 1 w w 2 w 2 f 2 x y f Figure 2.4: Model of a neuron The first layer is an adaptive node with a node function, which is a membership grade of a fuzzy set, and it specifies the degree to which the given input satisfies the quantifier. These are referred as the premise parameters. The second layer is a fixed node layer, whose output is the product of all the inputs of the node. For the given example let it be O 2,i = w i = µ Ai (x)µ Bi (y), i = 1, 2, (2.35) where each node output represents the firing strength of a rule. Any operator that performs the fuzzy AND can be used. Every node in the third layer performs the ratio of the i th rules firing strength to the 24

49 sum of all rules firing strength: O 3,i = w i = These are called the normalized firing strengths. In the forth layer there are adaptive nodes with node functions: w i w 1 + w 2, i = 1, 2. (2.36) O 4,i = w i f i = w i (p i x + q i y + r i ) (2.37) where w i is a normalized firing strength from layer 3 and {p i, q i, r i } is the parameter set of this node. In this layer, parameters are referred to as consequent parameters. The last node, on the last layer, is a fixed node which computes the overall output of the model as the sum of all incoming signals. overall output = O 5,i = w i f i i w i f i = (2.38) w i i The above description presents an adaptive network functionally equivalent to a Sugeno model. This structure is not unique, as it is possible to combine layers. The approach in this work consisted of developing an adaptive network as a classification model, where the input- output mapping is the same as equation (2.31) i Hybrid learning One of the distinguishing characteristics of the ANFIS model is its hybrid-learning algorithm, a combination of the steepest descent method and least-squares estimator. From Figure 2.4, it is observable that the overall output can be determined as a linear combination of the consequent parameters when the values of premise parameters are fixed. The output can be rewritten in the following way; f = w 1 w 1 + w 2 f 1 + w 2 w 1 + w 2 f 2 (2.39) = w 1 (p 1 x + q 1 y + r 1 ) + w 2 (p 2 x + q 2 y + r 2 ) (2.40) = ( w 1 x)p 1 + ( w 1 y)q 1 + ( w 1 )r 1 + ( w 2 x)p 2 + ( w 2 y)q 2 + ( w 2 )r 2 (2.41) which is linear in the consequent parameters p 1, q 1, r 1, p 2, q 2, r 2. Concerning the hybrid algorithm, lets assume that the output of a neural network is represented by o = F (i, S), (2.42) where i is the vector of input variables, S is the set of parameters, and F is the overall function implemented by the adaptive network. If there exists a function H such that the composite function H F is linear in some of the elements of S, then these elements can be identified by the least-square method. More formally if the parameter set S can be divided into two sets S = S 1 S 2, (2.43) 25

50 ( is the direct sum), such that H F is linear in the elements of S 2, and then applying H to equation (2.42) it s possible to obtain Now, given the consequent parameters p 1, q 1, r 1, p 2, q 2, r 2, we have S = set of total parameters, S 1 = set of premise (nonlinear) parameters, S 2 = set of consequent (linear) parameters. H(o) = H F (Bi, S). (2.44) In equation (2.43), and H( ) and F (, ) are the identity function and the function of the fuzzy inference system, respectively, in equation (2.44). Therefore the algorithm can be applied. So, given the values of S 1, its possible to plug P training data into equation (2.44) and obtain a matrix equation Aθ = y (2.45) where θ is an unknown vector whose elements are parameters in S 2. This is a standard linear least-squares problem and the best solution for theta which minimizes Aθ y 2 is the least-squares estimator (LSE) θ : θ = (A T A) 1 A T y, (2.46) where A T is the transpose of A and (A T A) 1 A T is the pseudoinverse of A if A T A is nonsingular. It is also possible to apply the recursive LSE formula. Its now possible to combine the steepest descent and the least-square estimator to update the parameters in an adaptive network. Each epoch is composed of a forward pass and a backward pass. In the forward pass, the node outputs are computed layer by layer until a corresponding row in the matrices A and y in 2.45 is obtained. Its repeated for every data pair to form A and y, and then the parameters in S 2 are identified, the error measure can be computed for each training pair. In the backward pass, the error signals are back propagated and the gradient vector is accumulated for each training data entry. At the end of the backward pass for all the training data, the parameters in S 1 are updated by the steepest descent in α = η δ+ E δα. (2.47) For given fixed values of the parameters in S 1, the parameters in S thus found are guaranteed to be the global optimum point in the S 2 parameter space because of the choice of the squared error measure. Not only this hybrid learning rule decrease the dimension of the search space explored by the original steepest descent method, but, in general, it will also substantially reduce the time speed needed to reach convergence. Concluding, ANFIS are equivalent to fuzzy inference systems that are highly flexible and thus can be generalized in a number of different ways. In addition, by employing the adaptive network as a common framework, is possible to construct other adaptive models that are tailored for applications such as data classification and feature extraction [48]. 26

51 Chapter 3 Preprocessing of sales data The objective of this chapter is to explain the approach to data preprocessing before applying intelligent modeling techniques to sales forecasting. The specific aspects in need of preprocessing are the following: Selection of forecasting horizon (Period selection), Variables that influence sales behavior (Feature construction). One of the goals when building prediction models was the ability to use them on the prediction of different periods (years). For that, they need to be sufficiently generic and its inputs robust enough so they can translate the system s dynamics effectively. For this purpose, it s necessary that the inputs are constructed in a coherent manner with a logic that sustains the numerical values that they have. At the end of this chapter, the reader should understand which, why, and how periods and inputs were defined. Concerning the existing data, it was only provided data related to sales itself. Weekly sales data was provided for a first approach, which helped the definition of the prediction periods. After defining these periods, daily data was provided for modeling and forecasting. Although it is known that sales depend on some variables, here there was no data provided for this effect, which means that all the features thought to influence sales had to be constructed according to ground knowledge. Firstly, it will be explained the logic behind period selection for forecasting, as it s important to have the prediction horizons defined before defining the inputs of the models. Afterwards, the logic behind input construction is presented. Due to confidentiality restrictions none of the timespan can be associated to real years, months or specific days (such as known holidays). 3.1 Period selection In this section, it ll be explained how the forecasting periods were chosen. As mentioned earlier, the data provided for this analysis was the weekly sales data. The timespan of the information in hand (see Fig. 3.1) comprised weekly sales data from week aa to aa + 44 (named week ff) in years A and A+1 (the latter named year B); and from week aa until the end of week aa + 22 (named week ee) of year A+2 (named year C ). As the data available concerns three different years, the methodology used for 27

52 forecasting was to train models with data from a certain period for years A and B, and then test the models for the same period of year C data. Thus, the first move had to be the selection of the forecasting periods. Year A Year B Year C Week aa Week ee Week ff Figure 3.1: Weekly sales from week aa of year A to the week ee of year C per year In Figure 3.1 the weekly data is presented, where is possible to observe some peaks. These peaks correspond to three types of effects, which are developed in Section 3.2, and are the major promotions, monthly seasonality and yearly seasonality. A first approach to the modeling problem was to select a period with negligible variations, such as seasonality or major promotions, i.e., the most stationary as possible. After having defined a model to accurately predict that period s sales, the goal was to develop models to predict sales from increasingly complex periods. From this dataset, it was possible to define three periods with increasing complexity which will be presented next and defined in the following subsections. Stationary period, Stationary period with disturbances, Non-stationary period Stationary period As a starting point, to select the stationary period, it was important to compile a sales vector without the major disturbances (named flat). Comparing the flat vector with the true sales vector it would be easier to confirm that a certain period from true sales would have negligible disturbances expressed, as it would have a similar trend as the flat vector. If the real and flat vectors are identical, it would mean that for that period no disturbances are significant. The seasonality was removed in 2 different ways. To remove the monthly seasonality, which is known to be an increase in sales in the beginning of each month, the point of the week containing this effect was averaged to the ones on its neighborhood. Concerning the yearly seasonality, such as events that create an annual trend, the methodology was to define sales in that period as a constant with value of the week before the seasonality to be effective. The major promotions effect was removed as the monthly seasonality, by averaging the value of sales in the week the promotion takes place to the points on its neighborhood. A comparison of both curves (true sales and sales without these previous effects) can be seen in Figure

53 Sales Flat Sales 0 Year A Year B Year C Figure 3.2: Real sales against sales without seasonality and major promotions effects Compiling the curves for the years of A to C in just a monthly window is better to see which periods sales have the same behavior in all the years, which can be seen in Figure "%12$3 4$5%6"2 )*/ 0"%1$3 0"%1$7 0"%1$4 )*. )*, )*+!""#$%%!""#$&&!""#$''!""#$(( 0"%12$3 4$86%9$2%6"2 )*/ 0"%1$3 0"%1$7 0"%1$4 )*. )*- )*- )*, )*+!""#$%%!""#$&&!""#$''!""#$(( Figure 3.3: Yearly sales and yearly sales without seasonality and major promotions effect The goal is to select a period in which sales have a similar stationary behavior during all years. From figure analysis it seems to be the case for the beginning of the year. In Figure 3.3 it can be seen that from weeks aa to bb this behavior is shown and in a zoomed analysis that is confirmed (the zoom can be observed in the Appendix A1). As Figure 3.3 shows, the sales curve is similar for the 3 different years, and even more importantly, all the curves have the same behavior as the flat curve (without promotions or seasonality effects), which shows that it is a period without these effects. In fact, it was a period of this type that was useful for the first approach modeling. The fact that the real curve has the same behavior as the flat curve means that seasonality and promotions don t have an effect during this period, or, putting it differently, all the variations in sales each year are not a consequence of these effects. It s important to make a remark at this point. The fact that real sales versus flat sales seem similar, doesn t mean that monthly seasonality doesn t exist in this period. What means is that monthly seasonality isn t as relevant to influence sales as in other periods, like, for instance, during year B in Figure 3.2, 29

54 there are some very well defined peaks and aligned to the end/beginning of each month. Concluding, the simplest model predicts sales from weeks aa to bb, named stationary period. The data used in modeling was daily and is presented in Figure 3.4. In daily data, the hole set is constituted by 126 points (days), 42 per year, that correspond to 6 weeks. Although this period has no promotions and negligible seasonality, one must refer that there is an effect of a holiday on data, in year A, making sales superior to what would be normal (the hump to the right on the last week - tuesday - named h 1 ). 1 Year A Year B Year C h1 Figure 3.4: Daily sales in weeks aa to bb for all years together As is possible to observe in Figure 3.4, there is a pattern on sales and it seems a stationary evolution, as the goal was Stationary period with disturbances A second approach would be to predict in a period where some stationarity is still observed, and where there were also some disturbances to the system, such as some holidays and promotions. This period would consist, for instance, from weeks aa to cc. In Figure 3.5 is presented this period, named stationary period with disturbances, and it consists of 9 weeks, 189 days, 63 per year. This period has 2 major types of promotions (p 1 and p 2 ) happening in this timespan. In addition, it also comprises a holiday (named holiday type 1, h 1 ) for all the years. These events are marked on Figure 3.5 as follows: Promotion type 1 on year B point p 1 -B on figure; Promotion type 2 on year C point p 2 -C on figure; Holiday type 1 on year A point h 1 -A; Holiday type 1 on year B point h 1 -B; Holiday type 1 on year C point h 1 -C. As is possible to observe in Figure 3.5, there is still a visible pattern, and sales have a stationary evolution until the end of the week bb, where disturbances start to be noticed Non-stationary period Finally, and in order to use almost all data available, the final approach would be to model a period with plenty of perturbations, being a non-stationary period. This period is influenced by several events such 30

55 1 Year A Year B Year C h1 A h1 C p2 C p1 B h1 B Figure 3.5: Daily sales in weeks aa to cc for all years together as holidays, and also promotions. It is presented in Figure 3.6 and consists of 19 weeks, 420 days, 120 per year, from week aa to dd. As can be see in the figure, this period shows a lot more disturbances than the previous, although all are contained in this one. There are the promotions previously mentioned, p 1 and p 2, there are also other promotions in year A and B, and a partial promotion and full promotion in year C. There is a drop in sales in the beginning of year A, and in the middle of the period in year B representing another holiday (named holiday type 2). There is also the effect of other two holidays combined with high promotional activity (named holiday type 3). All these events are detailed in the next section, and also listed next, which specifies their location in Figure 3.6. Fixed events: Promotion type 1 on year A point p 1 -A; Promotion type 1 on year B point p 1 -B; Promotion type 1 on year C point p 1 -C; Promotion type 2 on year C point p 2 -C; Promotion type 3 on year C point p 3 -C; Holiday type 1 on year A, B and C point h 1 -A, h 1 -B and h 1 -C; Holiday type 2 on year A, B and C point h 2 -A, h 2 -B and h 2 -C; Holiday type 3 on year A, B and C point h 3 -A, h 3 -B and h 3 -C; 1 Year A Year B Year C h1 A h1 C p2 C p1 B h1 B h2 A p1&h2 C h2 B p1 B&h3 A,B,Cp1 A&p3 C Figure 3.6: Daily sales in weeks 3 to 22 for all years together A final remark should be made. As it was possible to conclude during this section, in addition to the growing complexity of the datasets throughout the section, the sets also have more data, corresponding 31

56 to more days. It was chosen to proceed this way because, by increasing the information the set has, it is possible to capture more effects of the dynamics in the training phase, which, ultimately, might contribute to a greater performance by the model. 3.2 Feature construction The goal of this section is to explain all of the attributes that have an effect on sales and were accounted when modeling. Then, after enumerating and defining the effect they have, an approach to define that parameter as an input to the models is made. The inputs here are defined considering the training set only, as the goal is a prediction and, therefore, is as if the test set is unknown. In the presentation of the various attributes, the figures don t have year C data. As the goal is to predict, and when predicting there is no knowledge about the future, none of the year C data was considered. Sales (s) are influenced by: Weekly seasonality (ws), Monthly seasonality (ms), Promotions (p), Purchasing power of costumers (pw), Holidays or festive days (h). The major effects known to have an effect on sales are the seasonality (weekly, monthly and yearly), and promotions. In the following Subsections all of the attributes are defined Weekly seasonality The weekly seasonality could be compared to a distribution of sales during the week. It is a pattern that describes sales during the several days of the week. During the weekend, specially on saturdays, there is a peak on sales, which starts to raise from wednesday, and then declines until tuesday. The shape of this seasonality is similar to a positive parabola with both vertical asymptotes on saturday and global minimum on wednesday. The input started to be constructed as a normalized [0 1] distribution, being saturday 1 and wednesday 0, but after some adaptations, the input resulted as presented in Table 3.1. These adaptations were done based on a greedy heuristic, developed for this feature. Initially all the weights are linearly distributed from 1 to 0 from saturday to wednesday and 0 to 1 from wednesday to saturday. Then the following algorithm was applied from the lowest weight to the highest: 1. Increment the weight by 0.1 and evaluate training performance, (a) If training performance increased repeat Step 1. (b) If training performance decreased go to Step Decrement the weight by 0.1 and evaluate training performance, 32

57 (a) If training performance increased repeat Step 2. (b) If training performance decreased go to Step Increment the weight by 0.01 and evaluate training performance, (a) If training performance increased repeat Step 3. (b) If training performance decreased go to Step Decrement the weight by 0.01 and evaluate training performance, (a) If training performance increased repeat Step 4. (b) If training performance decreased End. As it is a pattern that will remain constant, unless significant political measures occur, such as saturdays become working days as well, which may happen; this type of seasonality will have minor changes through the years. The resulting input is presented on the following table. Table 3.1: Weekly seasonality Day of the week Monday Tuesday Wednesday Thursday Friday Saturday Sunday Seasonality value As shown, the values of the input vary from 1 on saturdays to -0.2 on wednesdays, and the shape can be observed in Figure Weekly seasonality final values Weekly seasonality initial values Sun Mon Tue Wed Thu Fri Sat Figure 3.7: Weekly seasonality Monthly seasonality The monthly seasonality refers to the way people spend money during a month period. Being used to receive wages at the end of the month, usually the last weekend of a month and the first weekend of the next are the weekends where customers are more willing to spend. There is also an addition to this trend, as in the weekend before customers receive their wages they have less money, which has the consequence of being less willing to spend money. So there is contention of expenses and sales on those weekends tend to be lower than on a regular weekend. This effect can also be seen in the weekly data provided for the first approach, presented in Figure 3.8. In this Figure, is possible to observe the weekly evolution of sales for some weeks in years A and B, and the increase in sales in the two weeks of the beginning of each month (indicated in figure) is clear. 33

58 0.42 Year A Year B End of Month End of Month End of Month Figure 3.8: Monthly seasonality with weekly data The values given to this type of seasonality are represented on Table 3.2, presented next. And on Figure 3.9 are the values presented graphically. The logic behind its construction was the increase in sales due to this effect, compared with the average sales of the same day of a regular week. For example, in case of a saturday on the beginning of a month, sales would increase in average 27% compared to the average saturday sales. Having the regular saturdays and sundays the value of 1, an increase would represent an addition to 1, while a decrease would be represented by a lower than 1 value (multiplicative factor). As can be seen in Table 3.2, which confirms the expected, on a weekend before the end of the month, sales decrease 6% and 24% on saturday and sunday, respectively. On the other hand, sales increase 27% and 5% on the the two saturdays and sundays, respectively. These later weekends represent the weekends where customers have more budget to spend. In order to have a better estimate, the average values were computed for the hole training set. Table 3.2: Monthly seasonality Type of effect Weekend before the end/beginning of the month End/Beginning of the month weekends Saturday Sunday Saturday Sunday Value In Figure 3.9, which shows a 4 week period, approximating a month of 28 days to make a simple example. This month begins on monday, therefore the firs weekend is on the 6 th and 7 th day. It s possible to see that sales increase in the first and last weekends, represented by higher than one values; in the third sales decrease, represented by lower than one values. 1.2 Monthly seasonality Sat Sun Sat Sun Sat Sun Sat Sun Figure 3.9: Monthly seasonality 34

59 The first group of numbers concerns to the weekend of contention of expenses, and the last to the expected increase in sales. The construction of this attribute had the logic presented before, thus the values given to the data points correspond to the weekends mentioned are the ones presented on Table Purchasing power of costumers During the analysis of the daily sales data, in a second phase, it was clear that the years A, B and C had a similar pattern during the same periods if removed the major promotions and variable holidays. Despite having the same pattern, there was also clear that sales had been decreasing along the years, see Figure 3.10, presented next again. 1 Year A Year B Year C h1 Figure 3.10: Vector of daily sales for weeks aa to bb from year A to C In that sense it was needed to construct an input that would account for this influence. It was named purchasing power, although it doesn t have a direct relation with the purchasing power definition, and the goal was to build a vector of inputs which values would be constant during each year, relating years B and C overall sales behavior with year A. In that sense, the input for year A was zero. Its construction was made through the analysis of internal documentation for the years A and B, and assumed for the forecasting year. Internal documents provided the relation between A and B sales, stating that sales in B dropped 10% from A. That was the value of the input for year B. Concerning the last year, C, after experts contribution on the subject, it was decided to assume that sales would drop 18%, which would mean a drop from year A of 26.2%. It was rounded to 26%, which corresponds to a drop of 17.8% from year B to C. This input is presented on Table 3.3 Table 3.3: Purchasing power Year Purchasing power A 0 B C

60 3.2.4 Promotions Concerning the promotions, there are of three types. Type 1 are transversal promotions discounts spendable in future sales. This means that when a customer goes to a store he pays without price reduction, but has the discount amount spent on that day available to spend on future purchases. Generally these promotions last 2 or 3 days. The announcement is made 2 days before the beginning of the promotion, which begins on a friday, if it s a 3 day promotion; or on a saturday, if it s a 2 day promotion. The objective of announcing only 2 days earlier is to avoid extending a usual drop in sales that happens between the two events. The promotions of this type are here referred as p 1. The weekly data provides low information on how it affects sales between the days of the announcement and the beginning of the promotion, and on the weeks were the amount of discount may be spent. In that sense, in Figure 3.11 all the major promotions in training data are presented. All the promotions except two in the hole data set are of the above type presented. The exceptions are included for two reasons. Firstly, they are a similar type of promotion, although it was only a test, which result was expected to be only slightly different. The second reason is due to the lack of prior information on this type of promotion. One is a promotion that was expected to have a similar type of effect as the p 1 promotion but there was no way of defining it as a new parameter as there was no data available in training. The other is a p 1 promotion but only applied to a category of the full company. The second type of promotions, named promotion of type 2 (p 2 ) only happens once in year C, and is in the test set. The difference is that, instead of the transversal discount, the customer has to spend multiples of a certain amount to have access to the same discount. The third type of promotions, named promotions of type 3 (p 3 ), are promotions equal to p 1 but only applied to a category (a part) of a business unit, instead of being transversal. There is also only one promotion of this type, in year C Year A Year B Figure 3.11: Promotions in the data Analyzing the figure, it is possible to confirm the decrease in sales expected to happen between the announcement of the campaign and its beginning. What is not possible to confirm in the analysis is the expected increase in sales throughout the month after promotions, consequence of the discount expense. Expert knowledge pointed that usually on the two following weekends that might be noticed, but that is only possible to observe in the first promotion of year B, but one of those weekends correspond to a holiday, which can mean that sales increase not only because of the discount expense but also because of this festive day. From analyzing promotions in a general form, it can be observed that, as expected, sales decrease 36

61 on the 2 days before the beginning of the promotion. If it s a 2 day promotion, sales have it s peak on saturday and on sunday decrease again. For a 3 days promotion there is no data available. The approach for the construction of this parameter was similar to the monthly seasonality. It was compiled the sales value of the peaks of the promotion compared to the average of the corresponding weekday (without the effect of promotions) sales. The value computed was the multiplicative factor in sales due to promotions compared to the same regular weekday. This procedure was also adopted in the 2 days before promotions and the 2 weekends after the promotion, but by comparing the training performance, the final input consisted only of the days of promotion effect. In that way, the input for the type 1 promotions, p 1, was constructed as a vector of ones, with the values of Table 3.4 whenever there was a promotion. For the promotions in the prediction period, year C, the approach changed a little. The definition of the type 2, p 2, the first in the prediction period, was slightly different. Although the type of discount was similar, this promotion lasted for a week, in opposition to the type 1 and 3, which lasts for two days, or, exceptionally, three. The approach taken was to compute the average increase in sales of all days of the type 1 promotion, i.e., the average of all the values of corresponding to the training set in Table 3.4. By doing this, the average effect of a p 1 was computed. Then the attribute was constructed with 40% of this value for all days of promotion. It was multiplied by 40% as expectations were that it would not have the same effect than the p 1. This value was assumed. It was also chosen to use the average effect for all the days, even in weekdays, due to the fact that this value represented a multiplicative increase in sales, and that would mean that if sales were lower, than the effect would also be lower than if sales were higher. This was assumed to be a correct reasoning, as a promotion in a weekday (lower sales) has a lower effect than on a weekend (higher sales). Concerning the p 1 existing in year C, the approach was similar to the definition of monthly seasonality, with a minor change. This promotion lasted three days, while in training were only available promotions lasting 2 days. The first day of promotion was a holiday on a friday; and the last (sunday) was a holiday too. As the sunday holiday was a religious holiday, and as it s a day that people tend to spend with family and stores would have lower affluence; and friday is also a holiday but not with the previous characteristics, it was decided to define this attribute as a 2 day promotion beginning on friday and ending on saturday. This makes the effect of the promotion on sunday having the value of 1, which meaning is the cancellation of the effect of the promotion with the effect of lower affluence generated by the sunday holiday. The type 3 promotion, p 3 was modeled in a slightly different way. Firstly, the weight of the full business unit (BU CC) was compiled, dividing the business unit sales of the training data, by the total sales of the training data. Business unit CC had a weight of 25.6% of the total sales of the training data. Then, an approach thought was to multiply the values of the p 1 by 0.256, but that would lead to values lower than one, which wouldn t make sense, as that would mean a decrease in sales. The increase in a p 1 is the value given in its input less 1. In that sense, what was done was using the average values for the p 1 promotion in year C, compiled as the average multiplicative factor of sales for that day of promotion, and subtract one, to represent the increase in sales instead of the multiplicative factor. Then, this value was 37

62 multiplied by the business unit weight, and added one. These values represent the multiplicative factor of total sales due to a p 1 on only a business unit with a determined weight in total sales, creating the p 3 input. A final approach was done, as the promotion was only applied to a part of the hole business unit. Instead of the value of corresponding to 25.6%, it was used the value of 0.2. This doesn t mean that Image has the sales corresponding to 20% of the business unit, it has less, approximately 18%. However, due to the promotion, BU CC has a bigger weight in total sales during those days, and the percentage corresponding to the weight of the category in promotion grows. Here, it was assumed that the weight of the BU would grow, and also the weight of the category up to 20%. In that sense, the resulting input of this promotion would be: p 3 = [1 + (p 1 1) 0.2] (3.1) Next is presented Table 3.4 which shows the input values. Table 3.4: Promotion input Promotion 2 days of active promotion p 1 year A [ ] p 1 year B [ ] p 1 year B [ ] p 2 year C [ ] 0.4 p 1 year C [ ] p 3 year C [ ] In Figure 3.12 is possible to visualize the effect of this promotion in Promotions Input p1 A p1 B p1 B p2 C p1 C p3 C Figure 3.12: Promotions input Holidays or festive days Finally, the last effect taken in account was the existence of holidays or festive days. These are of three types: Type 1 - h 1 - Holiday always on the same day of the week movable; 38

63 Type 2 - h 2 - Combination of two holidays, on friday and sunday, which affect the hole previous week movable; Type 3 - h 3 - Combination of two holidays always on the same day (different days of the week in different years). Type 1 In fact, this event is not a holiday. Some employees were allowed not to work on this day in year A and B by government (tolerance), although some part worked. These days have some consumption traditions associated, which might implicate going to a major commercial surface and being exposed to the retailer products. From the hole data available, Figure 3.13 focus on the effect the type 1 holidays had in the three years of data Year A Year B h1 h1 Figure 3.13: Type 1 holiday (h 1) from year A to C As is possible to observe, in years A and B the effect that this day had was avoiding sales to decrease as it would in a regular weekday. Actually, sales have, apparently, similar values for monday and tuesday (event day) as sunday s sales. That makes sense, as it s a day with more commercial activity associated than usual days. Concerning year C, the same behavior is expected to happen. In that sense, the reasoning for the construction of this input, as happened with other variables, was constructed by compiling the quotient between the value of sales on monday, tuesday and wednesday of the week of the event, and the average value of sales of the corresponding weekday. The reason for choosing wednesday as well is because the event seems to create a delay on the weekly seasonality curve, which would mean still an increase of sales on wednesday. The values given in this approach had a mathematical meaning, the percentage increase, or drop, in sales for that day, while being ones on the other days. The values are presented in Table 3.5. Table 3.5: Type 1 effect Year Monday Tuesday (event day) Wednesday A B C

64 Type 2 This type of holidays is a combination of 2 holidays but has an effect on a hole week. It is a religious period, marked by contention due to the facts associated with it. The week ends with holidays on friday and sunday. Specially on sunday, families tend to gather and spend the day together, even if not religious. Being an important holiday, in years A and B, stores were only partially opened (half day), which had the consequence of lowering sales in those days Year A Year B Beginning of h2 event End of h2 (H) Beginning of h2 event Enf of h2 (H) Figure 3.14: h 2 holiday and week before for each year Although sunday is an important day, the hole week has a different behavior, as can be seen by Figure 3.14, specially on year A. Figure 3.14 has the plots of the event week and week before for the two years. The goal of this figure is to show that sales during weekdays increase slightly when compared to weeks before. Due to stores being closed half day on sunday, and due to friday being a holiday, the behavior of the week is somehow similar to a week with increased sales on weekdays, and friday and saturday replacing the weekend. As before, the rational behind the definition of the hole week as input, was the quotient of sales between those days and average sales for the same weekday, without promotions effect. The prediction input was slightly different, as it wasn t the average of the effect on the previous years for one of the days. In year C there was a p 1 on the weekend, beginning on friday and ending on sunday, which means that stores weren t partially closed on sunday. In that sense, for this day, as the previous years shape a partially closed day, the input was as if there was no effect, as if it was a regular sunday. This approach was chosen for two reasons. Firstly, it s impossible to know the effect of a open sunday holiday, as the data doesn t show it. Secondly, there is a p 1 promotion on that weekend and sundays was the last day of promotion, which may be sufficient to counter the possible decrease that the sunday holiday of this type may have in sales. In that sense, the input is presented in Table 3.6. This approach was already mentioned for the promotions. Table 3.6: Type 2 effect Monday Tuesday Wednesday Thursday Friday (H) Saturday Sunday (H) A B C

65 As can be seen, specially in year B, the difference between expected sales and sales on the week of the events starts growing larger on thursday. The similarity of saturday sales with a regular week saturday sales is more visible in year A, where the value of the input is The almost 0.5 values of sundays can be seen and easily understood from the known fact that stores were partially closed. Type 3 This is an effect caused by a combination of two holidays. This holidays have 5 days in between, which means that they form a complete week if the days between are considered, which was the case.the approach used to build this attribute was similar to the remaining, having some minor changes. 0.8 Year A Year B h3 Beginning h3 Ending h3 Beginning h3 Ending Figure 3.15: Week of the type 3 holiday (h 3) Figure 3.15 represents the week of the event, the week before, and the after, for the two training years. As is possible to observe, in these years both holidays have a completely different behavior, which makes the definition of this attribute very difficult to accomplish. This has concerns to politics in the company as well because, as in the Type 2 holidays, where shops closed on sunday afternoon in year A and B and then were opened in year C; here, for instance, on the holiday of the end of the week of year A shops did close on the afternoon (sales break down), whereas in years B and C they were open. Is possible to observe another issue as well, as the first holiday of the week in year B was the day right after the Type 2 sunday (holiday), when shops closed in the afternoon, leading to an increase in sales. In addition to these facts, both holidays have a fixed day, which, in comparison with Type 1 and 2, is worse for input construction, because it means that they move around weekdays during the years. And, as it was possible to see throughout the chapter, sales have a high dependence on weekdays, and the fact that these holidays change their weekday depending on the year, makes defining its influence through training very difficult. The procedure to the construction of this input was to define a quotient between the sales value of the point in construction and the average sales of the points of same weekday in the train set, as it was on several previously defined inputs as well. This intended to provide a ponderation in how much sales were superior or inferior in relation to average sales to that day in the week. In that sense, the points respecting to promotions were removed for the average, but Type 1 and 2 holidays (the ones not coincident with the promotions) were not removed, as it was assumed that their weight in the average 41

66 of the hole set was marginal. This approach allowed to obtain the inputs for the week in analysis in the years A and B. For the prediction, the procedure was to average values of the previous years for each day. This may not seem a good procedure but it was the best solution possible, thus, thinking about it, the most important days in the week are the two extremes, the holidays. They might change their weekday during the week, but its effect will be similar in all weekdays, raising sales, because are days similar to a weekend. The only doubt is about the weekends, and will be explained why. Nowadays, stores don t close in holidays anymore, except in Christmas and New Year s day. And, unless first holiday of the event week is a sunday, which makes the second holiday a saturday, which makes it as if there were no holidays; there will always be at least one holiday on a weekday, which will always (expectably) increase sales even if the other is on a weekend. Averaging the remaining points between the two holidays might not be the best solution, however as there are only being used 2 years for training, in the position of the weekend on those points will not change drastically, and only will dilute slightly the power of the weekend. The only points that were not averaged, were the ones where in one of the years was a promotion (year B) and where the stores were partially closed (year A). Those values were chosen to be the same as the year A values, in the first case; and the same as the values in year B, in the second case. Finally, due to strong promotional activity on those days for the forecasting year, these values were multiplied by 2. The values of this attribute resulted in the ones presented in Table 3.7, presented next. Table 3.7: Type 3 holidays input Year Day 1 (H) Day 2 Day 3 Day 4 Day 5 Day 6 Day 7 (H) A B C Finally, to end the Subsection there is presented in Figure 3.16 the inputs of the models represented graphically for the largest set, non-stationary period. 3 Holidays Input h1 A h2 A h3 A h1 B h2 B h3 B h1 C h2 C h3 C Figure 3.16: Holidays input 42

67 3.3 Overall overview of prediction periods and inputs To compile all the information presented in this chapter, in order to make it clear before applying intelligent modeling to the problem in hand in Chapter 4, in this section is presented a brief review of all the chapter. In this chapter there were developed three different prediction periods, presented in Table 3.8. Table 3.8: Prediction periods Prediction periods Weeks Number of data points/year Total of data points Stationary From weeks aa to bb of years A, B and C Stationary w/ disturbances From weeks aa to cc of years A, B and C Non-stationary From weeks aa to dd of years A, B and C Concerning to the inputs, there were defined the following: Weekly seasonality (ws) describes sales distribution during weekdays. Monthly seasonality (ms) represents the end of the month effect on sales. Purchasing power (pw) macroeconomic consequence observed by an offset in different years sales. Promotions (p) describes sales multiplicative factor due to the major promotional activities Holidays or festive days (h) describes the effect of this type of days. In Figure 3.17 is possible to observe the input vector for the non-stationary period, the biggest period possible. The remaining periods are sub-periods of this one. 2 Weekly seasonality 1.1 Monthly seasonality Purchasing power Promotions Holidays Figure 3.17: All inputs 43

68 44

69 Chapter 4 Intelligent modeling for sales forecasting In this chapter, soft computing techniques were applied to sales forecasting for the periods defined in Chapter 3, using the inputs defined in it. The aim is to define the best parameters that constitute every model. Along the chapter, the methodology followed to obtain those parameters is exposed. Due to space restrictions, this methodology is more extensively presented for the stationary period, while for the remaining only the best model s performance is presented. As the models are of neural and fuzzy type, the model development consisted of training for different parameters. For the fuzzy models, the modifiable parameters are: Train/test set size; Fuzziness exponent; Termination tolerance; Seed for initialization; Type of antecedents; Consequent estimation; Clustering algorithm; Number of clusters. The train/test size depends on the forecasting period. The number of clusters is subject of evaluation for each model. The remaining parameters are the same for every fuzzy model. The fuzziness exponent was set to 2; the termination tolerance to 0.01; the seed to sum(100 clock); the antecedents were antecedent membership functions and the consequents estimated with locally weighted Least-Squares. The clustering algorithm was chosen to be the Fuzzy C-means. Concerning the neural networks, the modifiable parameters are: Train/test set size; Number of training epochs; Training algorithm; 45

70 Number of hidden-layers (HL); Number of neurons per HL (N); Concerning the train and test size, as in the fuzzy models, it depends on the forecasting period. However, the number of training epochs and training algorithm used were 1000 training epochs and Levenberg-Marquardt algorithm, respectively. The number of hidden-layers and neurons is evaluated for each model. For the ANFIS models, the modifiable parameters are: Train/test set size; Number of training epochs; Number of membership functions (MF); Shape of membership functions. Concerning the train/test size, the above reasoning is applied. The number of training epochs was set to 20 and the membership functions used were Gaussians. The number of membership functions is evaluated for each model. Concerning the NARX models, both of fuzzy and neural type, there is one more parameter to be defined which is the number of input delays. This parameter was also set independently for each model. Concerning the NARX fuzzy models, the definition of the inputs delay s was defined through feature forward selection, while the increase of the delays improved the training performance. This results are not provided on the modeling sections, as it only is possible to regard the final input definition from the mapping function at the beginning of each model description. In the NARX neural model, the definition of the delays was made executing a script that compared training performance for different delays which is presented in the modeling sections. In all the tests, four criteria were used to evaluate the models performance: ( V AF i = 1 var(yi ŷi) var(y i) ) 100%; RMSE = MSE = n i=1 (x1,i x2,i)2 n ; n i=1 (x1,i x2,i)2 n ; MAP E = n t=1 At Ft A t 100 n. In cases where the best parameter in evaluation was different for each performance criteria, VAF was used as the main criteria. The chapter is divided by period. The application of intelligent modeling to each forecasting period begins with fuzzy modeling, using classification and NARX models. Then, neural networks were developed, using feedforward and NARX networks. Finally, ANFIS was applied to sales forecasting, only using a classification structure. While each model may have different inputs, the output remains the same for each forecasting horizon. In that sense, the respective input is presented in every model development subsection. As already 46

71 mentioned, the methodology for prediction was using the years A and B for training and C for test. As the chapter concerns modeling, the models throughout the chapter are trained, with training data, and then executed with training data to evaluate training performance. 4.1 Stationary period In this section the modeling techniques used in the previous chapters will be applied. Reminding the period characteristics, it concerns to weeks aa to bb from years A, B, and C. The data set has 126 points, 84 being for training and the remaining 42 for validation/test Fuzzy modeling In this section fuzzy modeling is applied to the problem in hand, the forecasting of sales on the stationary period. There were two different approaches to fuzzy modeling. A classification modeling approach and a NARX modeling approach Classification model This type of model is a fuzzy classification model, that maps a input-output relation of the kind s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1)]). (4.1) Here, ws(t+1), ms(t+1) and pw(t+1) are the weekly seasonality, monthly seasonality, and purchasing power inputs, as defined in the previous chapter. To have the best number of clusters, it was executed a script that trained 10 times models from 2 to 20 clusters. The training results are presented in figure Average VAF per cluster nº VAF 8.7 x Average RMSE per cluster nº 104 RMSE x 1010 Average MSE per cluster nº MSE Average MAPE per cluster nº 5.45 MAPE Figure 4.1: Performance per number of clusters for the classification fuzzy model - stationary period 47

72 As can be seen by figure 4.1, the performance indicators point to a high cluster configuration, although the performance reaches elevate levels for every cluster number. Table 4.1 resumes the 5 best of the previous results. Table 4.1: Performance per number of clusters for the classification model - stationary period VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 4 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 8.43 ± ± % ± % % ± % 8.42 ± ± % ± % % ± % 8.47 ± ± % ± % % ± % 8.45 ± ± % ± % % ± % 8.46 ± ± % ± % The best model of this type has a 17 cluster structure, presenting a VAF performance of 91.38% NARX model Concerning this type of model, the input data is regressive, having past data, and also feedback from predicted outputs. This model maps a input-output relation such as, s(t + 1) = F ([s(t),..., s(t 6); ws(t); ms(t),...ms(t 2); pw(t),..., pw(t 3)]), (4.2) where {s(t),..., s(t 6)} are the sales on the previous 7 days, ws is the vector of weekly seasonality, which has 1 delay; ms is the monthly seasonality from the previous day until 3 days before (3 delays); and pw is the purchasing power input of the previous day until 4 days before (4 delays). To get the best number of clusters, a script was executed wich trained and tested the model 10 times, for 2 to 20 clusters. The average performance to each number of clusters is provided in Figure 4.2 As can be seen, the performance is better for the lower number of clusters. The best 5 results are summarized in Table

73 5 x 104 Average VAF per cluster nº VAF 2 x 107 Average RMSE per cluster nº RMSE x 1014 Average MSE per cluster nº MSE Average MAPE per cluster nº 1500 MAPE Figure 4.2: Performance per number of clusters for the NARX fuzzy model - stationary period Table 4.2: Performance per number of clusters for the NARX fuzzy model - stationary period VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 0.07 ± ± % ± % 3 N/S N/S 1.38 ± ± % ± % 13 N/S N/S 2.08 ± ± % ± % 16 N/S N/S 1.34 ± ± % ± % 17 N/S N/S 1.69 ± ± % ± % As is possible to observe from table, the best NARX fuzzy model has a VAF of 92.60% and a structure of 2 clusters Neural modeling In this section neural modeling is applied to the problem in hand, the prediction of the stationary period. There were two different approaches to neural networks. A feedforward network approach and a NARX network approach Feedforward classification network This type of model is a feedforward neural network. The relation between inputs and outputs is the same as in the classification fuzzy model, s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1)]), (4.3) where ws(t + 1), ms(t + 1) and pw(t + 1) are the weekly seasonality, monthly seasonality, and purchasing power inputs, respectively. 49

74 To find the best HL configuration, a script was executed which trained and tested models from 1 to 8 HL with 5 neurons in each layer for 10 times. The general performance results are presented in Figure 4.3, while in Table 4.3 there are summarized the 5 best. 80 Average VAF per nº of HL VAF 6 x 105 Average RMSE per nº of HL RMSE x 1011 Average MSE per nº of HL MSE Average MAPE per nº of HL 35 MAPE Figure 4.3: Performance per number of HL for the feedforward neural network - stationary period As it is possible to observe in figure 4.3, the network configuration that has the best average performance is a network with 1 HL. Table 4.3: Performance per number of HL for the feedforward network - stationary period VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 1.66 ± ± % ± % % ± % 3.50 ± ± % ± % % ± % 3.51 ± ± % ± % % ± % 3.78 ± ± % ± % % ± % 3.06 ± ± % ± % In order to proceed searching for the best configuration, another script was executed which trained a network with 1 hidden layer and the same number of neurons in each layer, going from 1 to 10. Every configuration was tested 10 times. The results are presented in Figure 4.4, and the best 5, are again summarized in Figure

75 Average VAF per nº of neurons VAF 4 x Average RMSE per nº of neurons 105 RMSE x 1011 Average MSE per nº of neurons MSE Average MAPE per nº of neurons 25 MAPE Figure 4.4: Performance per number of neurons in 1 HL for the feedforward neural network - stationary period Table 4.4: Performance per number of HL for the feedforward network - stationary period VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 2.07 ± ± % ± % % ± % 1.80 ± ± % ± % % ± % 1.33 ± ± % ± % % ± % 1.32 ± ± % ± % % ± % 1.28 ± ± % ± % As can be seen in table 4.4, the best configuration for the static model is a structure of 1 HL with 10 neurons, presenting a VAF of 91.79% NARX network This model is a NARX neural network. It means that a neural network uses past points to help predicting future ones. The model consisted of a defined function that relates the outputs with inputs such as, s(t + 1) = F ([s(t),..., s(t 10); ws(t),..., ws(t 10);... ms(t),..., ms(t 10); pw(t),..., pw(t 10)]), (4.4) where s, ws, ms and pw are the already mentioned ses, weekly seasonality, other effects, and purchasing power vectors, respectively. All of the inputs have 11 delays, which obtention is presented next. To achieve the best hidden-layer configuration, a model varying from 1 to 8 HL with 5 neurons in each layer was trained 10 times for each configuration. The results are presented next in Figure 4.5, and the 5 best are summarized in Table

76 50 Average VAF per nº of HL VAF 6 x 105 Average RMSE per nº of HL RMSE x 1011 Average MSE per nº of HL MSE Average MAPE per nº of HL 40 MAPE Figure 4.5: Performance per number of HL for the NARX network - stationary period Table 4.5: Performance per number of HL for the NARX network - stationary period VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 2.37 ± ± % ± % % ± % 2.61 ± ± % ± % 4 N/S 11.03% 3.85 ± ± % ± % 5 N/S 12.34% 5.75 ± ± % ± % 7 N/S 4.97% 4.45 ± ± % ± % As it is observable in figure 4.5, the best performance corresponds to 1 HL. In order to proceed searching for the best configuration, a script was executed which trained a network with 1 hidden layer with 1 to 10 neurons. Every configuration was trained 10 times. The results are presented in Figure 4.6, and the best 5, are again summarized in Figure

77 50 Average VAF per nº of neurons VAF 4.5 x 105 Average RMSE per nº of neurons RMSE x 1011 Average MSE per nº of neurons MSE Average MAPE per nº of neurons 30 MAPE Figure 4.6: Performance per number of neurons in 1 HL for the NARX neural network - stationary period Table 4.6: Performance per number of neurons in each HL for the NARX network - stationary period VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 2.63 ± ± % ± % % ± % 3.08 ± ± % ± % % ± % 3.20 ± ± % ± % % ± % 2.34 ± ± % ± % % ± % 2.43 ± ± % ± % The results point to a structure of 1 HL with 5 neurons in each HL. Recall that VAF is the main performance criteria. To finally determine the parameters of the best model, a script was executed to find the best number of delays that the inputs should have. A network was trained 10 times for different numbers of delays, from 1 to 14 (2 weeks). The results are presented in figure 4.7 and the 5 best summarized in table

78 200 Average VAF per nº of delays VAF 10 x 105 Average RMSE per nº of delays RMSE x 1011 Average MSE per nº of delays MSE Average MAPE per nº of delays 80 MAPE Figure 4.7: Performance per number of delays in a 7 single neuron HL NARX network - stationary period Table 4.7: Performance per number of neurons in a 7 single neuron HL NARX network - stationary period VAF RMSE MSE MAPE #d Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 2.39 ± ± % ± % % ± % 2.47 ± ± % ± % % ± % 1.97 ± ± % ± % % ± % 1.70 ± ± % ± % % ± % 1.81 ± ± % ± % The final configuration was, then, obtained. The neural NARX model for the stationary period has 1HL with 5 neurons, and its inputs have delays of 11 days to prediction. The best model presents a VAF of 67.67% ANFIS modeling Concerning this model, it was built a classification model, mapping the input-output relation as the previous, s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1)]), (4.5) where ws(t + 1), ms(t + 1) and pw(t + 1) are the weekly seasonality, monthly seasonality, and purchasing power inputs, respectively. In terms of the train/test size, it was used the same for the remaining models of the same period. To have the best number of membership functions, 9 models were trained from 2 to 10 membership functions. The performance per membership function can be observed in Figure 4.8, and in Table 4.8, 54

79 the 5 best results are summarized Average VAF per cluster nº VAF 8 x 104 Average RMSE per cluster nº RMSE x 1010 Average MSE per cluster nº MSE Average MAPE per cluster nº 5.1 MAPE Figure 4.8: Performance per number of membership functions for the ANFIS model - stationary period Table 4.8: Performance per number of membership functions for the ANFIS model - stationary period VAF RMSE MSE MAPE #MF Mean ± Std Best Mean ± Std ( 10 4 ) Best Mean ± Std ( ) Best Mean± Std Best 4 to % ± % 7.81 ± ± % ± % As is possible to observe, the 5 best performances were exactly the same. In that sense, the best structure should be the simplest, consisting of 3 membership functions. The best model has a VAF of 92.30%. 4.2 Stationary period with disturbances In this section, the modeling techniques application spoken in the previous chapters was continued. Reminding the period characteristics, it concerns from weeks aa to cc of years A, B and C. The he data sets has 189 points, 126 being for training and the remaining 63 for testing Fuzzy modeling In this section fuzzy modeling is applied to the problem in hand, the prediction of the stationary period with disturbances. There were two different approaches to fuzzy modeling. A classification modeling approach and a NARX modeling approach. 55

80 Classification model As for the previous period, this type of model is a fuzzy classification model, but this one maps an input-output relation of the kind s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1); h(t + 1); p(t + 1)]). (4.6) Here, ws(t + 1), ms(t + 1), pw(t + 1), h(t + 1) and p(t + 1) are the weekly seasonality, monthly seasonality, purchasing power, holidays and discounts inputs, respectively. To have the best cluster number, a script which trained 10 times models from 2 to 20 clusters was executed. The best model s performance is presented in Table 4.9. Table 4.9: Performance per number of clusters for the classification model - stationary period with disturbances VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 1.07 ± ± % ± % The best classification fuzzy model for the stationary period with disturbances has a 19 cluster structure, presenting a performance of 95.64% (VAF) NARX model As for the previous period, the NARX model works with inputs from past instants. The model maps the input-output relation as, s(t + 1) = F ([s(t),..., s(t 8); ws(t),..., ws(t 2);... ms(t),..., ms(t 4); pw(t); h(k),..., h(t 2); p(t)]), (4.7) where {s(t),..., s(t 8)} are the sales on the previous 9 days, ws is the input of weekly seasonality which has inputs of the 3 previous days, ms is the monthly seasonality input of the 5 previous days; pw is the purchasing power input of the previous day. Finally, h and p are the promotions and holidays inputs which have 4 and one day of delay, respectively. To get the best number of clusters, a script was executed that trained the model 10 times, for 2 to 20 clusters. The best model performance is presented in Table Table 4.10: Performance per number of clusters for the NARX fuzzy model - stationary period with disturbances VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 4 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 9.10 ± ± % ± % 56

81 The best fuzzy NARX model for the stationary period with disturbances has a structure of 3 clusters and a VAF of 96.33% Neural modeling In this subsection the development of neural networks for the stationary period with disturbances is presented. Firstly the feedforward model is presented, followed by the NARX network Feedforward classification network This type of model is a feedforward neural network. The relation between inputs and outputs is the same as in the classification fuzzy model, s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1); h(t + 1); p(t + 1)]). (4.8) To find the best HL configuration, a script which trained and tested models from 1 to 8 HL with 5 neurons in each layer for 10 times was executed. In Table 4.11 is presented the best model s performance. Table 4.11: Performance per number of HL for the feedforward network - stationary period with disturbances VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 8 N/S 20.58% 1.48 ± ± % ± % As it s possible to observe on the previous Table (4.11), the best HL configuration is with 8 HL. To evaluate the number of neurons in each HL a script was executed, where models going from 1 neuron to 10 in each HL were trained. The best training performance of the models is presented in Table Table 4.12: Performance per number of neurons per HL for feedforward networks - stationary period with disturbances VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 9 N/S 60.42% 1.29 ± ± % ± % From Table 4.12, it s then possible to conclude that the best model has 8 HL with 9 neurons, presenting a VAF of 60.42% NARX network The NARX model, developed in this sub-sub-section, maps an input-output relation of the kind, s(t + 1) = F ([s(t),..., s(t 10); ws(t),..., ws(t 10); ms(t),..., ms(t 10);... pw(t),..., pw(t 10); h(t),..., h(t 10); p(t),..., p(t 10)]), (4.9) 57

82 which means that both output, s(t + 1), and all the inputs feed the model with data from the 11 previous days until the day before prediction. To achieve the best hidden-layer configuration, a model varying from 1 to 8 HL with 5 neurons in each layer was trained 10 times for each configuration. The best model is presented in Table Table 4.13: Performance per number of HL for the NARX networks - stationary period with disturbances VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 6 N/S 6.89% 1.12 ± ± % ± % The best average performance points to a 6 HL structure. To define the best neuron configuration, models with 1 to 10 neurons in each of the 6 HL were trained 10 times. The average performance is presented in Table 4.14 Table 4.14: Performance per number of neurons in each HL for the NARX network - stationary period with disturbances VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 5.57 ± ± % ± % By Table 4.14, the best architecture for this network is a 6 HL with one neuron structure. To conclude the development of this model it only remains to choose the best delay configuration. For that purpose models with one to 14 delays were trained. The result is presented in Table Table 4.15: Performance per number of delays for the NARX network - stationary period with disturbances VAF RMSE MSE MAPE #d Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 3.40 ± ± % ± % The model developed has a 6 HL with one neuron structure. The model also has 11 delays on its data. The best model presents a VAF of 19.05% ANFIS Concerning this model, it was built a classification model, mapping the input-output relation as the previous, s(k + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1); h(t + 1); p(t + 1)]). (4.10) Due to computational limitations, and as the data set is much bigger than the previous one (more points and more inputs), it was only possible to train the model with 2 membership functions. The performance is presented in Table

83 Table 4.16: Performance per number of membership functions for the ANFIS model - stationary period with disturbances VAF RMSE MSE MAPE # MF Mean ± Std Best Mean ± Std ( 10 4 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 9.55 ± ± % ± % The 2 membership function ANFIS has a training VAF of 96.13% 4.3 Non-stationary period This is period is the one that shows more disturbances, and was, therefore, named non-stationary period. It consists of a total of 420 points. The training part, years A and B, represent 280 of those points, being the remaining 140 used as test. The period concerns from weeks aa to dd Fuzzy modeling In this section fuzzy modeling is applied to the problem in hand, the forecasting of sales on the nonstationary period. Again, there were two different approaches to fuzzy modeling. A classification modeling approach and a NARX modeling approach Classification model This type of model is a fuzzy classification model, that maps a input-output relation of the kind s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1)h(t + 1)p(t + 1)]). (4.11) To have the best cluster number, it was executed a script which tested 10 times models from 2 to 20 clusters. The performance criteria point to a 7 cluster configuration, and Table 4.17 presents the best model performance. Table 4.17: Performance per number of clusters for the classification model - non-stationary period VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 1.81 ± ± % ± % The best fuzzy classification model for the non-stationary period has 8 clusters with a VAF of 92.20% NARX model The NARX model developed here, builds an input-output relation of the type, s(t + 1) = F ([s(t),..., s(t 6); ws(t); ms(t),..., ms(t 4); pw(t); h(t),..., h(t 13); p(t)]). (4.12) 59

84 The target is fed-back with the previous 8 predictions, and the data from 8, 12, 1, 14, and 1 previous days for ws, ms, pw, h and p, respectively, is also provided to the model. To get the best number of clusters, a script was executed which trained and tested the model 10 times, for 2 to 20 clusters. The best performance is presented in Table Table 4.18: Performance per number of clusters for the NARX model - non-stationary period VAF RMSE MSE MAPE # Clusters Mean ± Std Best Mean ± Std ( 10 4 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 1.47 ± ± % ± % As it is possible to observe in Table 4.18, the best model has 2 clusters and a training VAF of 92.35% Neural modeling In this section there were developed the last two feedforward and NARX models, to forecast the nonstationary period Feedforward classification network This type of model is a feedforward neural network. The relation between inputs and outputs is the same as in the classification fuzzy model, s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1); h(t + 1); p(t + 1)]), (4.13) To find the best HL configuration, a script was executed which trained and tested models from 1 to 8 HL with 5 neurons in each layer for 10 times. In Table 4.19 is presented the best. Table 4.19: Performance per number of HL for the feedforward network - non-stationary period VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 8 N/S 35.70% 2.12 ± ± % ± % The best HL configuration points to a 8HL model and, in that sense, the following step was to train different models with 8 HL with one to 10 neurons in each and evaluate which was the best configuration. Each model was tested 10 times and the best model is presented in Table Table 4.20: Performance per number of neurons in each HL for the feedforward network - non-stationary period VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 2 N/S 47.97% 1.21 ± ± % ± % 60

85 Concluding, the feedforward network that presents the best performance is a network with 2 neurons in each of its 8 HL, presenting an VAF of 47.97% NARX network The NARX model, developed in this sub-sub-section, maps an input-output relation of the kind, s(t + 1) = F ([s(t),..., s(t 9); ws(t),..., ws(t 9);... ms(t),..., ms(t 9); pw(t),..., pw(t 9); h(t),..., h(t 9); p(t),..., p(t 9)]), (4.14) which means that both output, s, and all the inputs feed the model with data from the 10 previous days to the one being predicted. To achieve the best hidden-layer configuration, a model varying from 1 to 8 HL with 5 neurons in each layer was trained 10 times for each configuration. The result is presented in Table Table 4.21: Performance per number of HL for the NARX network - non-stationary period VAF RMSE MSE MAPE # HL Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 6 N/S 12.10% 1.42 ± ± % ± % As it is possible to observe, the best model should have 6 HL. To evaluate the number of neurons each HL should have, a script that trained models with 1 to 10 neurons in each of the 6 HL was executed. The best result is summarized in Table Table 4.22: Performance per number of neurons in each HL for the NARX network - non-stationary period VAF RMSE MSE MAPE # N Mean ± Std Best Mean ± Std ( 10 6 ) Best Mean ± Std ( ) Best Mean± Std Best 2 N/S 0.64% 1.34 ± ± % ± % Finally, to evaluate the best number of tap delays into the inputs and outputs, models with 1 to 14 delays were trained 10 times each, and the performance is presented in Table Table 4.23: Performance per number of delays for the NARX network - non-stationary period VAF RMSE MSE MAPE #d Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 5.16 ± ± % ± % Concluding, the average best neural NARX has 6 HL with 2 neurons and 10 delays on data, presenting a VAF of 5.48%. 61

86 4.3.3 ANFIS Concerning this model, it was built a classification model, mapping the input-output relation as the previous, s(t + 1) = F ([ws(t + 1); ms(t + 1); pw(t + 1); h(t + 1); p(t + 1)]). (4.15) Due to computational limitations, and as the data set is much bigger than the first set (more points and more inputs), it was only possible to train the model with 2 membership function. The performance is presented in Table Table 4.24: Performance per number of membership functions for the ANFIS model - stationary period with disturbances VAF RMSE MSE MAPE # MF Mean ± Std Best Mean ± Std ( 10 5 ) Best Mean ± Std ( ) Best Mean± Std Best % ± % 1.61 ± ± % ± % Concluding, the best ANFIS model for this period has a VAF of 93.28% 4.4 Summary In this Chapter it was presented the development of intelligent models to forecast sales in three different periods: stationary period, stationary period with disturbances and non-stationary period. For each period, it was developed a fuzzy classification and a NARX model, a feedforward and a NARX neural network, and an ANFIS model. In Table 4.25, 4.26 and 4.27, the best models performance an configuration of each type are presented. In addition there is also presented the average VAF of each best configuration. In the next chapter, all of these models are tested for 2012 forecasting. Table 4.25: Performance of each model in the stationary period VAF RMSE MSE MAPE Type of model Structure Best Average Best Average Best Average Best Average Fuzzy Classification 17 clusters 91.38% 91.16% % 5.21% Fuzzy NARX 2 Clusters 92.60% 92.60% % 4.73% NN Feedforward 1 HL - 10 N 91.79% 80.17% %% 7.98% NN NARX 1 HL - 5 N 67.67% 67.67% % 11.43% ANFIS 4 MF 92.30% 92.30% % 4.93% 62

87 Table 4.26: Performance of each model in the stationary period with disturbances VAF RMSE MSE MAPE Type of model Structure Best Average Best Average Best Average Best Average Fuzzy Classification 19 clusters 95.64% % 6.55% Fuzzy NARX 3 Clusters 96.33% 96.26% % 5.49% NN Feedforward 8 HL - 9 N 60.42% 38.25% % 80.47% NN NARX 6 HL - 1 N 19.05% 19.05% % 18.64% ANFIS Classification 2 MF s 96.13% 96.13% % 5.93% Table 4.27: Performance of each model in the non-stationary period VAF RMSE MSE MAPE Type of model Structure Best Average Best Average Best Average Best Average Fuzzy Classification 7 clusters 92.20% 91.79% % Fuzzy NARX 2 Clusters 92.35% 92.35% % 9.94% NN Feedforward 8 HL - 2 N 47.97% % 83.94% NN NARX 6 HL - 2 N 5.48% 5.48% % 27.36% ANFIS Classification 2 MF s 93.28% % 11.17% 63

88 64

89 Chapter 5 Results and discussion In this chapter, the previously developed models are tested. After forecasting sales, a comparison between the forecasts is made. As in Chapter 4, the comparisons are separate for each forecasting horizon. At the end of each forecasting horizon it is also presented the application of the type of model that performed best in testing to the sales forecast of all the business units. Here, are presented the best training performances, which were used to select the best structure, and its corresponding test performances. The reason for presenting the training performances here and not in Chapter 4 has to due with the lack of time and space to train and test each business unit (5 in total, named AA to EE), for each type of model and forecasting horizon. In that case, instead of defining the parameters for 15 models, as it was done here, it would have to be done for 75 models (5 types of models for 5 business units and 3 forecasting horizons). 5.1 Stationary period This period s forecasting performances are presented in Table 5.1. In the table, both training and test performance per model is presented; and in Figures 5.2 and 5.1 are presented the forecasting curves compared with each other, and the best model s performance for a better visualization. At this point a remark should be made. Due to some misunderstandings, only in a final part of this work was possible to know that p 2 promotion was part of the 5 last days of this period. As the goal of this period is to forecast sales for a stationary period and a promotion of this type is expected to disturb the system, the forecasting horizon has less 5 days than what explained in Chapter 3, in order to exclude it. 65

90 Table 5.1: Performance per number of clusters for the classification model - stationary period VAF RMSE MSE MAPE Model Train Test Train Test Train Test Train Test Fuzzy Class % 84.90% % 16.83% Fuzzy NARX 92.60% N/S % 34.51% Feedforward NN 91.79% 48.83% % 28.50% NARX NN 67.67% 59.93% % 12.69% ANFIS Class % 81.31% % 10.77% As it is possible to observe, although the fuzzy NARX presents the best training performance, it doesn t present the best test performance. In fact, the fuzzy classification model, which had the worse performance of the four best models in training, presents the best test results. The ANFIS model, which had one of the best training performances, had a poor test performance, not being able to capture the system dynamics. Concerning the feedforward network, as it was expected, its test performance was lower than training. However, its performance is much lower than the performance of its most similar model, the fuzzy classification system, which maps the input-output relation in the same way. In fact, except for this period, neural networks already had a worse performance in training when compared with the fuzzy systems of the same type. The NARX NN has a similar training and test performance, being far from the values presented by the fuzzy classification model Target Fuzzy Class. 0 Figure 5.1: Sales forecasting with the fuzzy classification model - Stationary period a b As can be seen in Figure 5.1, where the sales forecast by the fuzzy classification model is presented, the model is able to capture the system dynamics. The major discrepancies between the prediction and the target are an elevate difference between the target and the prediction in the first week, a general offset between the two curves, some unconsidered effects during weekdays. Indeed, in the first week, the model only manages to predict 70% of total sales, with a N/S VAF in those days, which means that the model wasn t able to capture the dynamics of the first days. This behavior of sales may be explained by the fact that this period is usually a discounts season (not specified due to confidentiality), and in a period with lower consumption due to deteriorated macroeconomic environment customers might wait for lower prices to consume, instead of buying in a season when they would usually buy, such as Christmas. This trend could be assumed for testing, but the macroeconomic 66

91 environment has been deteriorating for more than a year and, as can be seen in Figure 3.4, this trend was not observable neither in year A nor B. As a consequence, it would be very difficult presume it would happen. In addition to that, by not having any indicators of this effect on training, it would be also difficult to shape this effect even if it was desired. The other main difference between the two curves is a general offset. This offset may be explained by overestimating the purchasing power decrease for year C. It might also be a consequence of the previously mentioned effect of the discounts season. This can be observed in the forecasting figure (5.1), where in this season the offset seems bigger than in rest of the period. Most of the unconsidered effects during weekdays were unpredictable. There are slight increases in sales on wednesdays that seem random. The increase on the point a of the figure may be explained by being the first day of a month and customers have more money available to spend. However, analyzing again Figure 3.4, it can be observed that this behavior only happened in year C and, therefore, it would be difficult firstly to know that it would happen, and secondly to correctly shape it, as there is no similar training behavior. Another special day that had a different effect than the general trend is point b. This is a consume-related day (lets name it b day) but, from previous years (Figure 3.4), it is possible to observe that on the same day of year A there was a similar effect, while in year B there was no effect. In year A, the type 1 moving holiday (h 1 ) coincided with this day. As in year B there was no effect at all on that day s sales, this led to the assumption that b day had no effect on sales. And by the training data, due to the fact that the effect seen in year A is also due to h 1, the effect that b day would have in year C was not so predictable as might seem by being a special day. In Figure 5.2, all the models forecast is presented. The ANFIS model, which has a similar performance to the fuzzy classification model, presents a similar dynamic to the sales curve. The offset seems slightly bigger, and by sales overestimation, instead of underestimation. The feedforward model has a lower deviation on the weekends but, dynamically, misses the system s behavior during weekdays. Finally, concerning the fuzzy NARX model, it completely fails to capture the system s dynamics. It may be related to the fact that the beginning of the prediction period has a different behavior that a general week has. 1 Target Fuzzy class. Fuzzy NARX NN class. NN NARX ANFIS Figure 5.2: Comparison of all the results - Stationary period a b Application of the best model to Business Units In this part, the application of the model that yielded the best test results to the 5 business units is presented. The procedure for training was the same 67

92 as in the aggregate sales. The models were trained from 2 to 20 clusters 10 times each. The remaining parameters were the same as in Chapter 4. In Table 5.2 the best cluster configuration, training and test results for each business unit are presented. Table 5.2: Application of the fuzzy classification model to the different business units - stationary period VAF RMSE MSE MAPE B. U. Clusters Train Test Train Test Train Test Train Test BA % 84.77% % 14.75% BB % 45.89% % 17.14% BC % 49.62% % 32.06% BD % 84.46% % 15.79% BE % 66.22% % 8.19% From table analysis, there are some interesting results. Firstly, because for BU BA the test results outperform the training results. Secondly, for the first four BU, which have all similar results in training, their test results are quite different in groups of two. Taking in account that the inputs used for each business unit were the ones used for the aggregate sales, and each business unit may have a significantly different behavior, it was expected that the results for training were lower than the training results of the aggregate model. Why BU test performance is higher for BU BA than training performance might have to due with a better fitting of the model to the prediction data, i.e., the test data may be more similar to the aggregate training data than the training data is. The reason for BU BB and BC have poorer test results comparing with BU BA and BD may have to due with the differences explained above, for the aggregate model. In fact, the general offset and the peaks during weekdays are much more present in these BUs than in the remaining. Unfortunately, due to space restrictions it s impossible to present here figures from these cases. Concerning BU BE, as the training performance were already lower, it was also expected that test performance would be lower. 5.2 stationary period with disturbances For this period, the performance of all the models is shown in Table 5.3. Then, in Figures 5.4 and 5.3 is possible to observe all the forecasting curves compared with each other, and the best model individually for a better visualization. 68

93 Table 5.3: Performance per number of clusters for the classification model - stationary period with disturbances VAF RMSE MSE MAPE Model Train Test Train Test Train Test Train Test Fuzzy Class % 54.42% % 14.45% Fuzzy NARX 96.26% N/S % 18.42% Feedforward NN 60.42% N/S % % NARX NN 19.05% N/S % 28.21% ANFIS Class % N/S % 22.31% By analyzing the table, is possible to state that the model that best predicts sales is the fuzzy classification model. None of the remaining models are able to capture the test data dynamics. In Figure 5.3 is possible to observe the forecast by the best model Target Fuzzy Class. 0 a b h1 Figure 5.3: Sales forecasting with the fuzzy classification model - stationary period with disturbances Observing the figure in detail, is possible to state that the model can capture some part of the sales dynamics. The major discrepancies between the models prediction and target are an elevate difference between target and prediction in the first week, some unconsidered effects during weekdays, an inaccurate forecast of pioneer promotions and an unpredictable behavior not taken in account at the end of the forecasting period. As in the previous forecasting period, the sales behavior on the first week of the period, which may be a consequence of the discounts season, certainly contributes for the inaccuracy of the performance. However, as this week is only one out of nine weeks of forecast, in opposition to being one out of five on the previous period, this effect is much more diluted here. The same reasoning can be made for the unconsidered effects during the weekdays. The effects happening on points a and b are diluted in a longer forecasting horizon. Although they contribute to the lower performance, these weren t certainly the effects that most led to the presented performance. In addition to that, and as explained on the previous forecasting horizon, these are unpredictable effects due to the fact that on the same days of previous years none of these effects had an impact on sales. The days added to this forecasting horizon are the ones that had the biggest contribution to the lower performance of this period. For instance, the VAF of the prediction corresponding to the previous period 69

94 is of 81.3%, whereas the VAF for the rest of the prediction is of 48.0%. In fact, just by analyzing the Figure is possible to conclude it. The first discrepancy begins with the new promotion made, the p 2 promotion. This promotion is represented from points b to c. It was a pioneer event, which resulted on not having similar events for training the model. In that sense, it is normal to fail to fully capture the dynamics of the system. Even though its estimation is not accurate there are some similarities with what happened, specially with the combined effect of the p 2 promotion with the h 1 holiday (point c). The biggest discrepancies of the model are on the weekend and on the days before, where there is an increase estimated that doesn t happen in practice. Concerning the rest of the period, the only effect taken in account for these days was the monthly seasonality. It was not expected any effect of the kind presented. As can be seen in the figure, sales on the 3 last weekends of prediction are much higher than the ones before. That effect can be seen specially on the first one. In addition to that, there is also an elevate increase in sales on the days nearer the weekends. From a discussion with sales experts, the only possible predictable effect happening on those days is a type of promotion, which happens every year on the same dates. This promotion, as other minor promotions weren t taken in account due to the lack of impact that they showed on a first approach analysis. From observation of Figure 3.5, it is not clear to conclude about the increase shown. Firstly, because in year A, the sales increase is only presented on the two weekends corresponding to the end of the month. Secondly, in year B, in one of the weekends there is a p 1 promotion, but there is also an increase on the following 2 weekends in comparison with the rest. This increase might be explained by the spending of the amount of discount and the fact that it s the end of the month (explaining why sales on the second weekend are higher than on the third). When modeling it was not allowed to take year C in consideration but, now, observing the same figure, it can be seen that on the two last weekends, sales in year C have similar values than sales on the same period in years A and B. When comparing this values with the rest of the prediction period, where there is an offset for each year, it is easier to realize that this values are not expectable for a period where no other effect besides monthly seasonality was taken in account. For that reason, in a future approach, this specific subject must be addressed more carefully. In Figure 5.4 can be seen the forecasting for all the models. For this period, during the weeks concerning the stationary period all the models but the NN have an accurate performance. Even the fuzzy NARX, which wasn t able to capture the system dynamics on the previous period, manages to be close to the target curve, although with some larger errors than fuzzy classification and ANFIS. For the disturbances part of the prediction, the fuzzy NARX forecast has much more disturbances that the real curve has, and for that fraction of the curve, it fails to capture the dynamics of the system. The NN s completely fail to capture the system s dynamics. Concerning the ANFIS model, which also has the ability to accurately forecast sales on the first part of the period, in the disturbances part, it also completely fails to capture the system s dynamics but only during the p 2 promotion. For the rest of prediction it has a similar behavior to the classification model. 70

95 1 Target Fuzzy class. Fuzzy NARX NN class. NN NARX ANFIS a b h1 Figure 5.4: Comparison of all the results - stationary period with disturbances Application of the best model to Business Units In this part, the application of the model that yielded the best test results to the 5 business units is presented. The procedure for training was the same as in the aggregate sales. The models were trained from 2 to 20 clusters 10 times each. The remaining parameters were the same as in Chapter 4. In Table 5.4 the best cluster configuration, training and test results for each business unit are presented. Table 5.4: Application of the fuzzy classification model to the different business units - stationary period with disturbances VAF RMSE MSE MAPE B. U. Clusters Train Test Train Test Train Test Train Test BA % N/S % 14.75% BB % N/S % % BC % N/S % % BD % 10.71% % 78.48% BE % N/S % % From the table, is possible to observe that the models applied to the business units perform poorly. Due to space restrictions it s not possible to make an exhaustive discussion about the problem. Possibly, due to the values of the forecast, it may be some problem in clustering, a singularity problem, although the clusters chosen were the best in training. It might also suggest over fitting to training data, being the test data significantly different from training. This latter reason is possible, although not probable, due to the fact that for aggregate sales there is a main trend always present in the sales curve, despite the presence of some unpredictable and pioneer effects. 5.3 Non-stationary period For this period, there were only three models that performed accurately enough in training: the fuzzy classification, fuzzy NARX and the ANFIS. Consequently, these were the models selected for testing. In Table 5.5, the training and test performance per model is presented and Figure 5.5 shows all the forecasts 71

96 compared with each other. Table 5.5: Comparison of the best models in testing - non-stationary period VAF RMSE MSE MAPE Model Train Test Train Test Train Test Train Test Fuzzy Class % N/S % 33.26% Fuzzy NARX 92.35% N/S % 34.81% Feedforward NN 47.97% N/S % 36.07% NARX NN 5.48% N/S % % ANFIS Class % N/S % 48.47% From the table, is possible to state that the none of the models has reasonable performances, although all have accurate training performances. The fuzzy NARX has been presenting poor results but the classification model has been presenting accurate performances and this poor performance was not expected. In order to search for a possible reason for these performances, in Figure 5.5 is possible to observe the forecast for all the models. 1 Target Fuzzy class. Fuzzy NARX NN class. NN NARX ANFIS a b h1 h2 & p1 h3 p3 Figure 5.5: Sales forecasting with the fuzzy classification and NARX model - Non-stationary period Analyzing the figure, the NARX fuzzy model has problems in capturing the sales dynamics in the non-stationary periods, as it did on previous periods. Both NN completely miss the system s dynamics. The ANFIS model can capture some parts dynamics but has very elevate errors on other parts of the forecast. On the contrary, the fuzzy classification model is able to capture the dynamics of the major parts of the curve, but in some periods, such as the p 2 promotion, the h 2 and h 3 holiday weeks, it has elevate errors. In a more general point of view, anyone with the minimal business knowledge would know that the prediction is wrong for that period. Specially an expert that would implement the model and for those periods would have estimated a sales increase (as there is on the test inputs), and the estimation results in a sales decrease. The reason for this error might be an error due with clustering or due to matrix singularities. In that sense, and specially due to the fact that is quite obvious that the forecast is wrong, a different model was chosen. The approach in choosing the model was to choose the next best training model that had a different cluster structure. This was an iterative procedure, as only on the 72

97 second next best model, no errors of this kind were present. The new model has a 20 clusters structure, and its training and testing performances are shown of Table 5.6 and the sales forecast in Figure 5.6 Table 5.6: Performance per number of clusters for the classification model - non-stationary period VAF RMSE MSE MAPE Model Train Test Train Test Train Test Train Test Fuzzy Class % 64.41% % 19.67% As can be seen from table analysis, the training performance is similar to the best model, that presented a VAF of 92.20, only differing about 0.2%. From figure analysis, the major discrepancies are an elevate difference between target and prediction in the first weeks, unconsidered effects during weekdays, an inaccurate forecast of some pioneer promotions, an unpredictable behavior not taken in account in the middle of the forecasting period and an inaccurate forecast for the week of the h 3 holidays. 1 Target Fuzzy Class a b h1 h2 & p1 h3 p3 Figure 5.6: Sales forecasting with the fuzzy classification model - Non-stationary period Most of this discrepancies were already present in the previous forecasting horizons. Thus, for more detailed analysis is recommended to the reader to read Sections 5.1 and 5.2. As in the previous periods, the difference of the first week forecast has a contribution for the lower VAF. However, as already mentioned on Section 5.2, as the forecasting horizon is longer, this effect has a lower impact on worsening VAF, not being, certainly, one of the major influences in the system. Concerning the p 2 promotion it was already discussed that the model didn t forecast accurately this promotion due to the fact of being a pioneer promotion and not being present in the training data. The reason for the inaccuracy on the following weeks forecast was also discussed on the previous section. It was suggested a careful approach for future predictions due to the fact of not being clear if it was an isolated event or if it was, in fact, a consequence of other types of promotions combined with the monthly seasonality effect. For this period, there is another trend that the model had difficulties in capturing. The week of the h 3 holidays was a period difficult to forecast due to several factors. Firstly, because these are two fixed holidays, which mean that they move on weekdays, and in training there was no previous effect of having 73

98 these holidays in the same weekday as in year C. Secondly, because on previous years stores were only partially opened and for the forecasting year they were all opened and with an increase in promotional activity. For these reasons it was difficult to know exactly what would happen and, more importantly, it was even more difficult for the model to know due to the lack of information about this holidays in its database. In a more interesting analysis, this period allows the evaluation of well defined features in training. This is the case of the promotions, where there is a p 1 and p 3 promotion in testing. For the p 1 promotion is possible to observe that, dynamically, the model has some difficulties capturing its trends. The VAF for this three days of promotion has no significance (N/S), which is consistent with the dynamic behavior of the prediction because the dynamic behavior of the promotion has the peaks in different days. This has to due with the mixed effect of the p 1 with the h 2 and, although performance seems poor, analyzing the comparison of total sales in promotion and the forecast, the results are much more accurate. In fact, the forecast overestimates the total sales by only 1.7%, which is an extremely accurate result. As mentioned earlier, this was the retailer s first time in the use of forecasting methods, as before there wasn t any type of forecasting approach. The only type of prediction, or estimation, usually done was trying to estimate the effect of p 1 promotions by raw methods. The approach in that estimation was to assume that the promotions would triple sales. Thus, having that effect, they d calculate a average weekend sales and would multiply it by three, having the promotions days sales. The evaluation of the this estimation was made by MAPE, considering the promotion a single point containing the sum of sales during the promotion. For the p 1 promotion in the test set, this of 30% considering all days of promotion. Comparing it the best model results, it yielded an over estimation of a remarkable 1.7%. In fact, the model outperforms the company s estimation by far, providing and accurate forecast of the effect that this event has on sales. Concerning the partial promotion p 3, and by observing Figure 5.6 again, it can be seen that the dynamics of the forecast is significantly more accurate, having a VAF of 98.98%. Comparing the total sales in the same period, the forecast underestimates sales on 2.2%, which is, once again, an extremely accurate prediction, specially taking in account that there were some assumptions made. In that sense, although the overall performance of 64%, the model presents accurate performances for some periods, and its estimation of well defined in training is particularly useful. Application of the best model to Business Units In this Subsection, the application of the model that yielded the best test results to the 5 business units is presented. The procedure for training was the same as before. The models were trained from 2 to 20 clusters 10 times each. The remaining parameters were the same as in Chapter 4. In Table 5.7 the best cluster configuration, training and test results for each business unit are presented. 74

99 Table 5.7: Application of the fuzzy classification model to the different business units - non-stationary period VAF RMSE MSE MAPE B. U. Clusters Train Test Train Test Train Test Train Test BA % N/S % % BB % N/S % % BC % N/S % % BD % N/S % % BE % N/S % % The reasoning from the analysis of the results from the BU s models is similar to the one presented on Subsection 5.2. Due to space and subject restrictions, it cannot be done an exhaustive analysis to the problem. One of the reasons that led the models to present these performances may have to due with clustering (singularities). Sustaining this idea might be the approach presented on this Section earlier, for the fuzzy model. In a future approach this part of the work should be addressed in the same way as that approach, in order to try to obtain better performances. 75

100 76

101 Chapter 6 Conclusions This work addressed the problem of sales forecasting, by applying soft, or intelligent, computing techniques to a retailer. Firstly, the approach to the problem was defined, selecting the periods to forecast and which features to use in each period. Then, different modeling techniques were applied and the models for each technique were obtained. Finally, the forecasting results were achieved for each horizon of prediction. This chapter concludes about each part of the development of this work, finishing with suggestions for future improvements. There were five different models developed in this work. The applicability of fuzzy classification models, fuzzy NARX models, multilayer perceptrons (or feedforward neural networks), NARX neural networks and adaptive neuro-fuzzy models was tested over three different forecasting periods. It s possible to conclude that the fuzzy classification and ANFIS provide accurate forecasts for the stationary period. For this period, the feedforward network, NARX network and NARX fuzzy model are models that don t manage to forecast accurately. The fuzzy classification model is capable of producing very accurate forecasts for some parts or events in the stationary with disturbances and non-stationary periods, although presenting poor forecasts for specific events of the hole periods. In that sense, the classification model is applicable for the stationary period, while for the remaining can be applicable for the prediction of well defined events. For the two longer periods, the remaining models also fail to capture the system s dynamics and present a poor performance. Therefore are not applicable to the forecasting problem. The fuzzy classification model, as mentioned previously, also has the ability of obtaining accurate forecasts for the stationary evolution in all periods. It would be expected that the neural networks would perform better, due to their good generalizing capacities and, specially, due to their wide successful application, presented in Chapter 1. On one side, a reason for their poor performance may have to due with the feature construction, which might have led the networks to perform forecasts with elevate variance, and when developing the HL numbers and N numbers, it may have driven the network development in a wrong direction. On the other side, on the stationary period, the feedforward network performed well in training, even better than the fuzzy classification model, so it was expected to perform better in test as well. 77

102 Comparing the neural networks with fuzzy inference systems, it can be concluded that the fuzzy systems outperform the neural networks in this problem. In fact, in every period except one, the fuzzy models not only have better training performance, as they maintain the advantage in testing. Only the feedforward network manages to outperform the classification model in training. However, in testing, this model achieves worse results than any of the two fuzzy models. When comparing neural networks with ANFIS, similar conclusions can be made. The adaptive neurofuzzy model outperforms neural networks in every period in test and in training. Finally comparing ANFIS with fuzzy systems, it can be concluded that although ANFIS has a better training performance in all the periods, except the stationary, it is always a worse predictor compared to the fuzzy classification model in all the periods but achieves a better test performance than the fuzzy NARX in the stationary period.. Analyzing the forecasts in a more general way, it s possible to conclude that the best models in each period (fuzzy classification) predict accurately the stationarity in the periods. This means that, when there are no major events such as holidays or promotions, it s possible to have an accurate forecast. Concerning events that had to be assumed, as it was already expected, the models fail to accurately predict its effects on sales. In both periods where the p 2 promotion is present, the model doesn t manage to capture it. Although in the non-stationary period the model forecasts almost as if there was no major event happening, for the stationary period with disturbances the model captures some part of its dynamics and only overestimates the sales on saturday. Due to the fact of having the p 2 event on the same day as h 1, none of the models had to forecast the effect of h 1 holiday by itself, and in that sense, it s impossible to conclude anything about it. The other assumed event was the week of h 3 holidays. It was a feature built based on some assumptions, because there was no data about the behavior of sales when this events occurred in the same weekdays. It was also assumed because there was a strong promotional activity on that week, and there was no data either to provide to the training about strong campaigns on those days. The forecast of this week was not as accurate as the stationarity weeks, as expected. Although there is some response as sales are much higher than on a regular week, it is concluded that the forecast isn t accurate enough to provide a good estimation of sales pattern in that event. Regarding events with previous similar data, such as the h 2 holiday, as there is a p 1 promotion influencing the weekend of the h 2, it is not possible to conclude anything about the forecast of this effect, as the p 1 changes any pattern that would be visible due to the h 2 holiday. One major conclusion that can be drawn is the extremely accurate forecast of the p 1 and p 3 promotions. On the two promotions of this type in testing, the sales value estimations were very accurate. Although the dynamic of the first one was not accurately captured, the sales error was inferior to 2% (considering all the promotions day), which is a very accurate performance, specially comparing to the error of 30% from the expert in the retailer, although his prediction is a raw estimate. The other promotion was a partial promotion whose effect had to be assumed due to the fact that no other partial promotions were present in training data. Unfortunately for the p 3 promotion there is no comparison available to the way the expert estimated. An important conclusion that can be drawn form this performances is that when 78

103 there is enough data to correctly construct features, it can be a valuable tool that might allow to reach competitive advantages. Concerning the application of these techniques to the different business units, it can be concluded that for the stationary period is possible to obtain accurate performances. For the remaining periods, the models present poor performances, and the models are not suitable to be applied to those periods. In a general conclusion, it may be stated that has been made a contribution for the problem of sales forecasting in retail. Specially due to the fact that it was a first approach to the company in which this work was developed. The models developed perform an accurate sales forecast of the stationary trend and some well defined events. Especially for the stationary period, the performance was superior than for the remaining periods. The reasons for not having performances as high as on the first period have to due with exceptional events that either weren t present in training data and, therefore the model couldn t learn its impact; or were present in training but not in the way as it happened in the test year. So, although the model had knowledge about the impacts of these events, it didn t had knowledge about the precise circumstances in which these events would affect the system. The first examples are, for instance, the three weeks after the p 2 promotion, where there is an increase in sales and, on previous years, that didn t happen and it wasn t predictable to happen in year C. It is also the case of the p 2 promotion which is a promotion never made before. For the second case there is the week of the h 3 holidays, when although there was knowledge about past years, the fact that these holidays are movable, in addition with a strong promotional activity made this events unpredictable. What sustains this conclusions is the fact that when is possible to shape the effect of an event consistently, as it happened with the p 1, the models accurately forecast these effects. In that sense, there is space for improvements, which are suggested next. 6.1 Future improvements In this work of sales forecasting in retail there is still space for improvement. The following suggestions aim to guide a further development of this thesis: One of the major faults in this work is the ability to forecast the effect of events that are not expressed in training data. For this work the provided sales data concern almost 2 full years, although in this approach there was only used data concerning the same periods as the forecasting horizon (much less than a year). When a determined pioneer event is going to be forecasted and there is a certain effect expected by the sales expert but there is no past information in the training data, a suggestion might be to search in the unused data for any behavior of the expected type, and include it in the training data in order to check for improved forecasts. As it is a forecast, there is the risk that the expert thinks he knows the effect some event is going to happen, and the result is completely different. But, as doing with past data in cases where is not a clear relation of the past year events with present year events was proved not to be accurate, the addition of expert knowledge here might improve the forecast. Concerning a certain trend visible in longer periods, an approach may be to introduce a new input 79

104 related with a moving average of sales as a yearly seasonality. As this trend is visible in all the training year it can be easily computed. An interesting approach would be to compare different scenarios using, for example, different purchasing power values to see how it would influence the sales forecast. Another research that may be done in order to improve forecasting accuracy would be to confirm the impact of any promotions or not, in the weeks following the p 2 promotion. It would be important to understand why there is such an increase and why it happens only in year C in order to construct a feature that allows to forecast more accurately on that period in future years. Other improvement might be the use of different types of models, such as Support Vector Machines, for instance, in order to check for some improvements. Finally, concerning the business units, the approach used for constructing features for the aggregates sales models may be applied to each business unit, in order to confirm that this approach would improve the sales forecast. 80

105 Bibliography [1] Deepak Agrawal and Christopher Schorling. Market share forecasting: An empirical comparison of artificial neural networks and multinomial logit model. Journal of Retailing, 72(4): , [2] A. Al-Anbuky, S. Bataineh, and S. Al-Aqtash. Power demand prediction using fuzzy logic. Control Engineering Practice, 3(9): , [3] Ilan Alon, Min Qi, and Robert J. Sadowski. Forecasting aggregate retail sales: A comparison of artifcial neural networks and traditional methods. Journal of Retailing and Consumer Services, pages , [4] Angela P. Ansuj, M. E. Camargo, R. Radharamanan, and D. G. Petry. Sales forecasting using time series and neural networks. Comput. Ind. Eng., 31(1-2): , October [5] A. Azadeh, S.M. Asadzadeh, and A. Ghanbari. An adaptive network-based fuzzy inference system for short-term natural gas demand estimation: Uncertain and complex environments. Energy Policy, 38(3): , Security, Prosperity and Community Towards a Common European Energy Policy? Special Section with Regular Papers. [6] E. Michael Azoff. Neural Network Time Series Forecasting of Financial Markets. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, [7] Robert Babuska. Fuzzy Modeling for Control. Kluwer Academic Publishers, Norwell, MA, USA, 1st edition, [8] H. Bacha and W. Meyer. A neural network architecture for load forecasting. In Neural Networks, IJCNN., International Joint Conference on, volume 2, pages vol.2, jun [9] Erkan Bayraktar, S.C. Lenny Koh, A. Gunasekaran, Kazim Sari, and Ekrem Tatoglu. The role of forecasting on bullwhip effect for e-scm applications. International Journal of Production Economics, 113(1): , Research and Applications in E-Commerce and Third-Party Logistics Management - Special Section on Meta-standards in Operations Management: Crossdisciplinary perspectives. [10] K. Bergerson and II Wunsch, D.C. A commodity trading model based on a neural network-expert system hybrid. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume i, pages vol.1, jul

106 [11] Michael R. Berthold, Christian Borgelt, Frank Hppner, and Frank Klawonn. Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer Publishing Company, Incorporated, 1st edition, [12] J. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3): , [13] James B. Boulden. Fitting the sales forecast to your firm. Business Horizons, 1(1):65 72, [14] George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, [15] Gonçalo Jorge da Silva Farinha. Feature selection methods in large databases: Application to sepsis outcome prediction. Master s thesis, Instituto Superior Técnico, Av. Rovisco Pais, 1, June [16] Pei-Chann Chang, Chen-Hao Liu, and Robert K. Lai. A fuzzy case-based reasoning model for sales forecasting in print circuit board industries. Expert Systems with Applications, 34(3): , [17] Pei-Chann Chang, Yen-Wen Wang, and Chen-Hao Liu. The development of a weighted evolving fuzzy neural network for pcb sales forecasting. Expert Systems with Applications, 32(1):86 96, [18] Pei-Chann Chang, Yen-Wen Wang, and Chi-Yang Tsai. Evolving neural network for printed circuit board sales forecasting. Expert Systems with Applications, 29(1):83 92, [19] F.L. Chen and T.Y. Ou. Sales forecasting system based on gray extreme learning machine with taguchi method in retail industry. Expert Systems with Applications, 38(3): , [20] Miao-Sheng Chen, Li-Chih Ying, and Mei-Chiu Pan. Forecasting tourist arrivals by using the adaptive network-based fuzzy inference system. Expert Syst. Appl., 37(2): , March [21] Toly Chen and Mao-Jiun J. Wang. Forecasting methods using fuzzy concepts. Fuzzy Sets and Systems, 105(3): , [22] An-Chin Cheng, Chung-Jen Chen, and Chia-Yon Chen. A fuzzy multiple criteria comparison of technology forecasting methods for predicting the new materials development. Technological Forecasting and Social Change, 75(1): , [23] Ching-Wu Chu and Guoqiang Peter Zhang. A comparative study of linear and nonlinear models for aggregate retail sales forecasting. International Journal of Production Economics, 86(3): , [24] Kevin G. Coleman, Timothy J. Graettinger, and William F. Lawrence. Neural networks for bankruptcy prediction: the power to solve financial problems. AI Rev., 4:48 50, July

107 [25] M. Cottrell, B. Girard, Y. Girard, M. Mangeas, and C. Muller. Neural modeling for time series: A statistical stepwise method for weight elimination. Neural Networks, IEEE Transactions on, 6(6): , nov [26] Rorbert Griñó Cubero. Neural networks for water demand time series forecasting. In Proceedings of the International Workshop on Artificial Neural Networks, IWANN 91, pages , London, UK, UK, Springer-Verlag. [27] Douglas J. Dalrymple. Using box-jenkins techniques in sales forecasting. Journal of Business Research, 6(2): , [28] P.K. Dash, A.C. Liew, S. Rahman, and S. Dash. Fuzzy and neuro-fuzzy computing models for electric load forecasting. Engineering Applications of Artificial Intelligence, 8(4): , [29] C. de Groot and D. Wurtz. Analysis of univariate time series with connectionist nets: A case study of two classical examples. Neurocomputing, 3(4): , [30] J. Deppisch, H.-U. Bauer, and T. Geisel. Hierarchical training of neural networks and prediction of chaotic time series. Physics Letters A, 158(1-2):57 62, [31] Eugen Diaconescu. The use of narx neural networks to predict chaotic time series. WSEAS Trans. Comp. Res., 3(3): , March [32] K.A. Duliba. Contrasting neural nets with regression in predicting performance in the transportation industry. In System Sciences, Proceedings of the Twenty-Fourth Annual Hawaii International Conference on, volume iv, pages , jan [33] Aiken et al. A neural network for predicting total industrial production. Journal of End User Computing, 7(2):19 23, [34] E.O. Ezugwu, S.J. Arthur, and E.L. Hines. Tool-wear prediction using artificial neural networks. Journal of Materials Processing Technology, 49(3-4): , [35] R Fildes and R Hastings. The organization and improvement of market forecasting. Journal of the Operational Research Society, (45):1 16, [36] Michael D. Geurts and J. Patrick Kelly. Forecasting retail sales using alternative models. International Journal of Forecasting, 2(3): , [37] Carey Goh and Rob Law. Modeling and forecasting tourism demand for arrivals with stochastic nonstationary seasonality and intervention. Tourism Management, 23(5): , [38] Wilpen L. Gorr, Daniel Nagin, and Janusz Szczypula. Comparative study of artificial neural network and statistical models for predicting student grade point averages. International Journal of Forecasting, 10(1):17 34, [39] Clive W. J. Granger. Can we improve the perceived quality of economic forecasts? Journal of Applied Econometrics, 11(5): ,

108 [40] David J. Haas, Joel Milano, and Lance Flitter. Prediction of helicopter component loads using neural networks. Journal of the American Helicopter Society, 40(1):72 82, [41] Esmaeil Hadavandi, Hassan Shavandi, and Arash Ghanbari. Integration of genetic fuzzy systems and artificial neural networks for stock price forecasting. Knowledge-Based Systems, 23(8): , [42] Simon Haykin. Neural Networks: A Comprehensive Foundation (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, [43] Tim Hill, Leorey Marquez, Marcus O Connor, and William Remus. Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10(1):5 15, [44] M J C Hu. Application of the adaline system to weather forecasting. Technical report, E. E. Degree Thesis, Stanford Electronic Lab, [45] David G. Huntley. Neural nets: An approach to the forecasting of time series. Social Science Computer Review, 9(1):27 38, [46] C.E. Imrie, S. Durucan, and A. Korre. River flow prediction using artificial neural networks: generalization beyond the calibration range. Journal of Hydrology, 233(1-4): , [47] Ashu Jain and Avadhnam Madhav Kumar. Hybrid neural network models for hydrologic time series forecasting. Appl. Soft Comput., 7(2): , March [48] J.-S.R. Jang. Anfis: adaptive-network-based fuzzy inference system. Systems, Man and Cybernetics, IEEE Transactions on, 23(3): , may/jun [49] Jyh-Shing Roger Jang and Chuen-Tsai Sun. Neuro-fuzzy and soft computing: a computational approach to learning and machine intelligence. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, [50] Chuanjin Jiang and Jing Zhang. Nonlinear autoregressive model with exogenous input for exchange rate forecasting. In Computer Engineering and Technology (ICCET), nd International Conference on, volume 4, pages V4 69 V4 72, april [51] Everette S. Gardner Jr. Exponential smoothing: The state of the art: Part ii. International Journal of Forecasting, 22(4): , [52] Suh Young Kang. An investigation of the use of feedforward neural networks for forecasting. PhD thesis, Kent, OH, USA, UMI Order No. GAX [53] Nowrouz Kohzadi, Milton S. Boyd, Bahman Kermanshahi, and Iebeling Kaastra. A comparison of artificial neural network and time series models for forecasting commodity prices. Neurocomputing- Financial Applications, Part I, 10(2): , [54] R.J. Kuo. A sales forecasting system based on fuzzy neural network with initial weights generated by genetic algorithm. European Journal of Operational Research, 129(3): ,

109 [55] R.J. Kuo, P. Wu, and C.P. Wang. An intelligent sales forecasting system through integration of artificial neural networks and fuzzy neural networks with fuzzy weight elimination. Neural Networks, 15(7): , [56] R.J. Kuo and K.C. Xue. A decision support system for sales forecasting through fuzzy neural networks with asymmetric fuzzy weights. Decision Support Systems, 24(2): , [57] R.J. Kuo and K.C. Xue. An intelligent sales forecasting system through integration of artificial neural network and fuzzy neural network. Computers in Industry, 37(1):1 15, [58] Winston Lasschuit and Nort Thijssen. Supporting supply chain planning and scheduling decisions in the oil and chemical industry. Computers and Chemical Engineering, 28(67): , FOCAPO 2003 Special issue. [59] Aaron A. Levis and Lazaros G. Papageorgiou. A hierarchical solution approach for multi-site capacity planning under uncertainty in the pharmaceutical industry. Computers and Chemical Engineering, 28(5): , ESCAPE 13. [60] M. Li, K. Mehrotra, C. Mohan, and S. Ranka. Sunspot numbers forecasting using neural networks. In Intelligent Control, Proceedings., 5th IEEE International Symposium on, pages vol.1, sep [61] Tammy Lo. An expert system for choosing demand forecasting techniques. International Journal of Production Economics, 33(13):5 15, Special Issue on Pacific Manufacturing. [62] Ivette Luna and Rosangela Ballini. Top-down strategies based on adaptive fuzzy rule-based systems for daily time series forecasting. International Journal of Forecasting, 27(3): , Special Section 1: Forecasting with Artificial Neural Networks and Computational Intelligence Special Section 2: Tourism Forecasting. [63] E. Maasoumi, A. Khotanzed, and A. Abaye. Artificial neural networks for some macroeconomic series: A first report. Econometric Reviews, 13(1): , [64] Christos T. Maravelias and Ignacio E. Grossmann. Simultaneous planning for new product development and batch manufacturing facilities. Industrial & Engineering Chemistry Research, 40(26): , [65] L. Marquez, T. Hill, M. O Connor, and W. Remus. Neural network models for forecast: a review. In System Sciences, Proceedings of the Twenty-Fifth Hawaii International Conference on, volume iv, pages , jan [66] José Maria P. Menezes, Jr. and Guilherme A. Barreto. Long-term time series prediction with the narx network: An empirical evaluation. Neurocomput., 71(16-18): , October [67] Chung ming Kuan and Tung Liu. Forecasting exchange rates using feedforward and recurrent neural networks,

110 [68] S. Mitra and Y. Hayashi. Neuro-fuzzy rule generation: survey in soft computing framework. Neural Networks, IEEE Transactions on, 11(3): , may [69] K Nam and T Schaefer. Forecasting international airline passenger traffic using neural networks. Logistics and Transportation Review, (3), [70] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Comput., 3(2): , June [71] P. Payeur, Hoang Le-Huy, and C.M. Gosselin. Trajectory prediction for moving objects using artificial neural networks. IEEE Transactions on Industrial Electronics, 42(2): , apr [72] Roger A. Pielke. 7 the role of models in prediction for decision. [73] A. N. Refenes, M. Azema-Barac, L. Chen, and S. A. Karoussos. Currency exchange rate prediction and neural network design strategies. Neural Computing & Applications, 1:46 58, /BF [74] J.C. Ruiz-Suárez, O.A. Mayora-Ibarra, J. Torres-Jiménez, and L.G. Ruiz-Suárez. Short-term ozone forecasting by artificial neural networks. Advances in Engineering Software, 23(3): , [75] S.C. Wheelwright S. Makridakis and V.E. McGee. Forecasting: Methods and Applications, 2nd edn. John Wiley & Sons, Chichester, New York, NY, USA, [76] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19): , September [77] E. Schoneburg. Stock price prediction using neural networks: A project report. Neurocomputing, 2(1):17 27, [78] P. C. Soman. An adaptive narx neural network approach for financial time series prediction. Master s thesis, New Brunswick Rutgers Graduate School, State University of New Jersey, [79] J M Sousa and U Kaymak. Fuzzy decision making in modeling and control. world scientific pub, [80] Dipti Srinivasan, A.C. Liew, and C.S. Chang. A neural network short-term load forecaster. Electric Power Systems Research, 28(3): , [81] Zhan-Li Sun, Tsan-Ming Choi, Kin-Fan Au, and Yong Yu. Sales forecasting using extreme learning machine with applications in fashion retailing. Decision Support Systems, 46(1): , [82] Kar Yan Tam and Melody Y. Kiang. Managerial applications of neural networks: the case of bank failure predictions. Manage. Sci., 38(7): , July [83] Zaiyong Tang, C. de Almeida, and P. A. Fishwick. Time series forecasting using neural networks vs. Box Jenkins methodology. Transactions of The Society for Modeling and Simulation International, 57: ,

111 [84] Zaiyong Tang and Paul A. Fishwick. Feedforward neural nets as models for time series forecasting. INFORMS Journal on Computing, pages , [85] F.M. Thiesing and O. Vornberger. Sales forecasting using neural networks. In Neural Networks,1997., International Conference on, volume 4, pages , jun [86] Robert R. Trippi and Efraim Turban, editors. Neural Networks in Finance and Investing: Using Artificial Intelligence to Improve Real World Performance. McGraw-Hill, Inc., New York, NY, USA, [87] N. Turkkan and N. K. Srivastava. Prediction of wind load distribution for air-supported structures using neural networks. Canadian Journal of Civil Engineering, 22(3): , [88] Susana Vieira. Soft computing techniques applied to feature selection. PhD thesis, Universidade Tcnica de Lisboa, Lisbon, Portugal, March [89] Fu-Kwun Wang and Ku-Kuang Chang. Adaptive neuro-fuzzy inference system for combined forecasts in a panel manufacturer. Expert Systems with Applications, 37(12): , [90] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting Sunspots and Exchange Rates with Connectionist Networks. In M. Casdagli and S. Eubank, editors, Nonlinear modeling and forecasting, pages Addison-Wesley, [91] Andreas S. Weigend, Bernardo A. Huberman, and David E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 01(03): , [92] Andreas S. Weigend, David E. Rumelhart, and Barnardo A. Huberman. Generalization by weightelimination with application to forecasting. In Proceedings of the 1990 conference on Advances in neural information processing systems 3, NIPS-3, pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [93] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, [94] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4): , [95] Heidi Winklhofer, Adamantios Diamantopoulos, and Stephen F. Witt. Forecasting practice: A review of the empirical literature and an agenda for future research. International Journal of Forecasting, 12(2): , [96] W.K. Wong and Z.X. Guo. A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. International Journal of Production Economics, 128(2): , Supply Chain Forecasting Systems. [97] Berlin Wu. Model-free forecasting for nonlinear time series (with application to exchange rates). Computational Statistics and Data Analysis, 19(4): ,

112 [98] Tiaojun Xiao and Xiangtong Qi. Price competition, cost and demand disruptions and coordination of a supply chain with one manufacturer and two competing retailers. Omega, 36(5): , [99] Tiaojun Xiao and Danqin Yang. Price and service competition of supply chains with risk-averse retailers under demand uncertainty. International Journal of Production Economics, 114(1): , [100] G.Peter Zhang and Min Qi. Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2): , [101] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. Forecasting with artificial neural networks:: The state of the art. International Journal of Forecasting, 14(1):35 62, [102] Xiande Zhao, Jinxing Xie, and Janny Leung. The impact of forecasting model selection on the value of information sharing in a supply chain. European Journal of Operational Research, 142(2): ,

113 Appendix 89

114

115 Appendix A Extended results A.1 Modeling data preprocessing In this section, it is presented the zoom from weeks aa to bb as mentioned in Chapter 3. This figure zooms the weeks concerning the stationary period, presenting the curves for the real sales and flat sales vector, showing that both curves are very similar. Years A C Sales 0.6 Year A Year B Year C Week aa Week bb Years A C Flat sales 0.6 Year A Year B Year C Week aa Week bb Figure A.1: Yearly sales and yearly sales without seasonality and major promotions effect - zoom in weeks aa to bb 91

116 A.2 Intelligent modeling for sales forecasting In this section are presented the figures and tables for the stationary period with disturbances and non-stationary period, which weren t presented in the body of the document due to space restrictions. A.2.1 A Stationary period with disturbances Fuzzy classification model 95.6 Average VAF per cluster nº VAF 1.12 x 105 Average RMSE per cluster nº RMSE x 1010 Average MSE per cluster nº MSE Average MAPE per cluster nº 6.9 MAPE Figure A.2: Performance per number of clusters for the classification fuzzy model Table A.1: Performance per number of clusters for the classification model - stationary period with disturbances VAF RMSE MSE MAPE Number of Clusters Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A Fuzzy NARX model 92

117 x 105 Average VAF per cluster nº VAF x 1015 Average MSE per cluster nº MSE x 107 Average RMSE per cluster nº RMSE Average MAPE per cluster nº 8000 MAPE Figure A.3: Performance per number of clusters for the narx fuzzy model Table A.2: Performance per number of clusters for the NARX fuzzy model - stationary period with disturbances VAF RMSE MSE MAPE Number of Clusters Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A Feedforward classification network 93

118 0 Average VAF per nº of HL VAF 2 x 106 Average RMSE per nº of HL RMSE x 1012 Average MSE per nº of HL MSE Average MAPE per nº of HL 140 MAPE Figure A.4: Performance per number of HL for the feedforward NN Table A.3: Performance per number of HL for the feedforward network - stationary period with disturbances VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 0 Average VAF per nº of neurons VAF 2 x 106 Average RMSE per nº of neurons RMSE x 1012 Average MSE per nº of neurons MSE Average MAPE per nº of neurons 120 MAPE Figure A.5: Performance per number of neurons per HL for the feedforward NN 94

119 Table A.4: Performance per number of neurons per HL for feedforward networks - stationary period with disturbances VAF RMSE MSE MAPE Number of neurons Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A NARX network 0 Average VAF per nº of HL VAF 2.5 x 106 Average RMSE per nº of HL RMSE x 1012 Average MSE per nº of HL MSE Average MAPE per nº of HL 140 MAPE Figure A.6: Performance per number of HL for the NARX NNl Table A.5: Performance per number of HL for the NARX networks - stationary period with disturbances VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 95

120 200 Average VAF per nº of neurons VAF 2 x 106 Average RMSE per nº of neurons RMSE x 1012 Average MSE per nº of neurons MSE Average MAPE per nº of neurons 120 MAPE Figure A.7: Performance per number of neurons in each HL for the NARX NNl Table A.6: Performance per number of neurons in each HL for the NARX network - stationary period with disturbances VAF RMSE MSE MAPE Number of neurons Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 50 Average VAF per nº of delays VAF 12 x 105 Average RMSE per nº of delays RMSE x 1011 Average MSE per nº of delays MSE Average MAPE per nº of delays 80 MAPE Figure A.8: Performance per number of delays in the NARX NNl 96

121 Table A.7: Performance per number of delays for the NARX network - stationary period with disturbances VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A.2.2 Non-stationary period A Fuzzy classification model Average VAF per cluster nº VAF 2 x Average RMSE per cluster nº 105 RMSE x 1010 Average MSE per cluster nº MSE Average MAPE per cluster nº 13.5 MAPE Figure A.9: Performance per number of clusters for the classification fuzzy model Table A.8: Performance per number of clusters for the classification model - non-stationary period VAF RMSE MSE MAPE Number of Clusters Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A Fuzzy NARX model 97

122 2 x 104 Average VAF per cluster nº VAF 2 x 107 Average RMSE per cluster nº RMSE x 1014 Average MSE per cluster nº MSE Average MAPE per cluster nº 1500 MAPE Figure A.10: Performance per number of clusters in the fuzzy NARX Table A.9: Performance per number of clusters for the NARX model - non-stationary period VAF RMSE MSE MAPE Number of clusters Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A Feedforward classification network 0 Average VAF per nº of HL VAF 3 x 106 Average RMSE per nº of HL RMSE x 1013 Average MSE per nº of HL MSE Average MAPE per nº of HL 200 MAPE Figure A.11: Performance per number of HL in the feedforward NN 98

123 Table A.10: Performance per number of HL for the feedforward network - non-stationary period VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 100 Average VAF per nº of HL VAF 2.5 x 106 Average RMSE per nº of HL RMSE x 1012 Average MSE per nº of HL MSE Average MAPE per nº of HL 160 MAPE Figure A.12: Performance per number of neurons in the feedforward NN Table A.11: Performance per number of neurons in each HL for the feedforward network - non-stationary period VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % A NARX network 99

124 0 Average VAF per nº of HL VAF 3 x 106 Average RMSE per nº of HL RMSE x 1012 Average MSE per nº of HL MSE Average MAPE per nº of HL 250 MAPE Figure A.13: Performance per number of HL in the feedforward NN Table A.12: Performance per number of HL for the NARX network - non-stationary period VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 0 Average VAF per nº of HL VAF 3 x 106 Average VAF per nº of HL RMSE x 1012 Average VAF per nº of HL MSE Average VAF per nº of HL 250 MAPE Figure A.14: Performance per number of neurons in each HL for the NARX NN 100

125 Table A.13: Performance per number of neurons in each HL for the NARX network - non-stationary period VAF RMSE MSE MAPE Number of neurons Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 6 Average VAF per nº of delays VAF 4 x 106 Average RMSE per nº of delays RMSE x 1012 Average MSE per nº of delays MSE Average MAPE per nº of delays 300 MAPE Figure A.15: Performance per number of delays for the NARX NN Table A.14: Performance per number of delays for the NARX network - non-stationary period VAF RMSE MSE MAPE Number of HL Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best Average Std. Deviation Best % % % % % % % % % % % % % % % % % % % % 101

126 102