Data Management in Forecasting Systems: Case Study Performance Problems and Preliminary Results

Transcription

1 Data Management in Forecasting Systems: Case Study Performance Problems and Preliminary Results Haitang Feng 1,2, Nicolas Lumineau 1, Mohand-Saïd Hacid 1, and Richard Domps 2 1 Université de Lyon, CNRS Université Lyon 1, LIRIS, UMR5205, F-69622, France {firstname.lastname}@liris.cnrs.fr 2 Anticipeo, 4 bis impasse Courteline, 94800, Villejuif, France {firstname.lastname}@anticipeo.com Résumé. La production de prévisionnels de ventes est un enjeu important dans la mise en place de stratégies commerciales pour les entreprises. Les applications permettant de calculer des prévisionnels, tels que Anticipeo dans notre cas, exploitent de grandes quantités de données issues des réalisations passées et des modèles statistiques souvent complexes. Les performances de ces applications dépendent de leur capacité à traiter et visualiser avec des latences acceptables pour un usage via le web des masses de données mises à jour périodiquement. La spécificité des données et les choix stratégiques de l entreprise font qu une migration vers des entrepôts de données n est pas toujours possible. Dans cet article, nous proposons une caractérisation des applications dédiées au calcul de prévisionnels et nous offrons une feuille de route permettant d optimiser un système de gestion de bases de données aux niveaux physique et logique pour réduire les latences liées à l accès aux données. Nous montrons que certaines solutions théoriques ne sont pas applicables dans le contexte du calcul de prévisionnels et que dans le cas concret de l application Anticipeo des gains réels ont été obtenus. Keywords: Query optimization, performance, DBMS Tuning, OLAP 1 Introduction Forecasting systems are based on historical information and use specific methods (typically statistical models) to derive decisional information and present them in an easily exploitable way. These systems are an important issue in an industrial context because strategic decisions partly depend on forecasting. The features of a good forecasting system are: economy, availability, plausibility, simplicity and accuracy[1]. Therefore, a forecasting system has several singularities that the well-known solutions of data warehousing[7] do not consider. These singularities concerns the limited life duration of data and its non-reusability in future forecasting. Moreover, data updates could occur which require to update derived analyses[8]. Finally, its role in the business intelligence implies data access latency has to be acceptable. Our concrete case study is Anticipeo 3. Currently Anticipeo application has a number of performance limits. Performance issues exist both in the forecasts calculation stage and in the exploration stage on which data are visualized and updates are performed. This paper mainly focuses on the exploration stage performed via a web browser. The contributions of this paper are the following: we first characterize forecasting systems w.r.t. requirements, data and queries. Then, key elements at the logical and physical level of the database have been investigated in order to provide some recommendations regarding optimization strategies. These recommendations could also be useful for other forecasting applications. Finally, preliminary solutions are evaluated with the focus on the benefits we got and the bounds of existing solutions stemming from the data warehouse domain in this specific case. The paper is organized as follows : Section 2 presents the features of Anticipeo and forecasting systems in general, in terms of process, data and queries, Section 3 describes some related works and the optimization solutions we consider. In Section 4, we show the experimental results. We conclude in Section 5. 2 Forecasting Application Features In this section, we discuss the case of Anticipeo with more details and then define generic features of forecasting systems. Research partially supported by the French Agency ANRT ( and Anticipeo ( 3 Anticipeo provides software and service for the forecasting and monitoring of sales activities.

2 Processing approach used in Anticipeo Anticipeo is one of the forecasting systems which uses statistical models to derive forecasting sales from historical sales. It helps decision makers or executives to adapt their business plan to future trends of the market. Every month, customers provide Anticipeo with their new achieved sales. Anticipeo integrates the data into the system, calculates forecasting data and prepares information for analysis. Customers can then navigate the forecasting sales via a devoted interface. If they perform some modifications on the forecasts, the visualization generator will be relaunched to produce updated presentations. Anticipeo uses relational databases to store information. Issues regarding the manipulation of very large tables at the calculation stage and at the visualization stage constitute the main problems that we tackle. To achieve a reliable result 4, the calculation engine deals with each sale individually with a corresponding statistical model (e.g., standard, seasonal, occasional, etc.). The treatment relies on 36-month historical data in addition to tens of parameters (e.g., weighting coefficient, serial correlation, seasonal coefficient, etc.) used in statistical models. As a consequence, the calculation stage is quite time-consuming. Regarding the exploration part, results are displayed in hierarchies. To show the sales trend, we display both 36-month historical data and 24-month predictive data. This means 60-month aggregations should be generated. In addition, we need some meta-data to understand/interpret the displayed data. As the number of sales could be very high, the tables used to display the results can easily reach gigabyte scale. Data in Anticipeo To analyze the results, three dimensions are defined: Customer, Product and Time. Since the sales are grouped by month, the time dimension is implicit. Only two dimensions are explored: Customer and Product. Each of these two dimensions is composed of several hierarchies such as geographical distribution for the customer dimension. Regarding the navigation of sales, materialized views containing all information about customer, product, hierarchy, purchase date and sales volume are created (called MV example in the rest of this paper) to guarantee a quick data access. The main schema of MV example is shown in Table 1. Attribute Name Attribute Type sales key customer key product key customer name varchar(25) product name varchar(25) hierarchy key parent hierarchy key dimension char(2) hierarchy number tinyint(3) hierarchy level tinyint(3) hierarchy name varchar(25) achieved sale 35 decimal(18,8)... (achieved sales) decimal(18,8) achieved sale 0 decimal(18,8) forecasting sale 1 decimal(18,8)... (forecasting sales) decimal(18,8) forecasting sale 24 decimal(18,8) fiscal year sales cumulation 1 decimal(18,8)... (fiscal year sales cumulation) decimal(18,8) fiscal year sales cumulation 4 decimal(18,8) Table 1. Schema for one of the materialized views used for efficient data display Queries in Anticipeo We gather and analyze the workload of users and we notice that there are in total four categories of typical queries performed by users. Most of the time, users require information on sales at different levels and according to different hierarchies. The data have been precomputed and stored in materialized views such as MV example, so that the response to queries is fast. The first category of queries consists of simple data retrieval from materialized views. In some cases, users need to modify forecasting results. The modification can occur on any level of any given hierarchy. When it happens, the system distributes the modification over the sales level and recomputes all the superior levels of hierarchies. Of course, the heavy work is performed by resorting to batch tasks. 4 Here, reliable stands for acceptable quality and acceptable approximation.

3 However, at the time a modification occurs, the system should provide users with updated information for the level of hierarchy where users perform the modification. Two other categories of typical queries are respectively updates on sales level and construction of a certain level of a certain hierarchy on the fly. Other manipulations on the interface consist of retrieving data across hierarchies. Unfortunately, unlike the simple browsing in the first case, it is almost impossible to build all the aggregates across hierarchies, because the number of combinations is tremendously high even if we consider only two hierarchies from two different dimensions at a time. So, the last category of queries consists of calculation of across-hierarchy information on the fly. Generic features of forecasting systems Sales forecasting systems often use multidimensional design structure to analyze resulting data so as to support better business decision-making. Queries on the data cubes are not only simple data retrieval, but also updates on part of the data cube. Compared to other information systems, the design of forecasting systems should comply with some requirements: Limited life duration: Forecasting calculation is periodically triggered. Once the period passed, the forecasts are no longer exploitable. For example, the weather or the traffic forecasts predicted for yesterday are not necessarily useful today. Non-reusable: The forecasting is a global calculation and it is almost impossible to reuse the former results to reduce future forecasting calculation. This means partial recomputation of precomputed results is not easy. Updates on built data cubes: As forecasting information results from a computation process, there is often no need for updates. However, in some cases, the system needs some corrections for some unpredictable situations: e.g., the impact of a car crash for a traffic forecasting system. Timeliness: Forecasting systems should provide just in time responses for users. Decision makers devise corresponding plans of purchasing, manufacturying, logistics, etc., according to sales forecasting. The delay between achieved data entering and new data forecasting should be as short as possible. 3 Optimization Guidelines Here, the problem we face is a performance problem with regards to the exploration of a multidimensional database using a relational database. Since the application is already operational, we should consider solutions that require less effort for their implementation. 3.1 Hardware and Operating System The first question we consider is whether our application is working in the appropriate environment. The main technical characteristics of the server in use are: two Intel Quad core Xeon-based 2.4GHz, 16GB RAM and one SAS disk of 500 GB. The operating system is a 64-bit Linux Debian system using EXT3 file system. Our focus in this audit is to inspect the hardware for three criteria: CPU, memory and I/O. 3.2 DBMS Configuration Anticipeo uses MySQL to implement and manage forecasting data. We would like to know if the DBMS is well tuned to support the workloads of the application. Since InnoDB is the main storage engine used here, the main MySQL system variables 5 manipulated during the tuning are innodb buffer pool size, innodb log file size, query cache size, innodb flush log at trx commit, key buffer size, etc. Some blind tuning has been done based on existing experimental results on different web services 6. The actual configuration is 8 GB, 800 MB, 64 MB, 0, and 512 MB, respectively. Additional benchmarking will be discussed in the following to determine whether the adopted configuration is efficient. If not, we will propose an optimal configuration. 3.3 Additional Materialized views The visualization of achieved and forecasting results by considering different dimensions is a typical Data Warehouse problem (using relational databases). In this domain, one of the most used solutions is to select useful intermediate results and store them as materialized views. Many approaches have been proposed for the selection of materialized views. The main idea is to use the greedy approaches[4, 6, 10]. These solutions pre-process the most beneficial intermediate results in a limited-space hypothesis to avoid complex computations so as to enhance data access. Extensions of these solutions consider also the maintenance cost[2] or large scale databases[5, 9], or make the set of materialized views dynamic according to the workload[3]. They have already been proved to bring significant improvements to data access. For this, we implemented the basic Greedy Algorithm. 5 See for variable definition 6

4 3.4 Database Design This application works on a large materialized view for result visualization. The advantage is obvious: we can omit time-consuming joins over tables containing millions of rows. However, it creates other problems, e.g., for queries which need to make several joins on this view to derive aggregations for superior levels. Our idea is to find a medium solution that can both avoid costly joins on different tables and reduce the time of self join of this table as well. The star schema or snowflake schema[11] (normalized version of a star schema) could be a solution. The first test is to simply break MV example into two, otherwise, we create two materialized views instead of one: one view contains only the hierarchical information and the other one contains all remaining information. The second step is to build a real star schema on existing data to study the impact on performance issues. 4 Experimental Results All the following evaluations were carried out in the environment presented in Section 3.1 and Observations on Current Implementation of the Application First of all, some experiments on different databases have been conducted to estimate the execution time of the application for both the calculation and the navigation processes. The information of different databases is given in Table 2 with the execution time on the calculation part and a query evaluation which refers to information retrieval across hierarchies. DB size Sales number MV size Calculation time(ct) CT/sale Query evaluation time(qet) QET/sale DB1 55.8GB GB 89.23min 8.02ms 18.28s 0.027ms DB2 61.7GB GB min 6.18ms 21.89s 0.021ms DB3 68.9GB GB min 6.37ms 43.31s 0.031ms DB4 81.7GB GB min 6.37ms 41.65s 0.020ms Table 2. Experimental databases characteristics and results The result shows that the calculation time and the query evaluation time is polynomial in respect to the sales number. This conclusion helps us to estimate the execution on a new database if we know the number of sales. The previous results also show one brake of the enterprise: in the case of a 55-GB database, it takes more than 18 seconds to answer an online user query. If the application considers larger databases, the response time can hardly be acceptable by the user. 4.2 Diagnosis of Latency Provenance Two diagnoses are performed in this part to determine the latency provenance. Program level: We first conduct an analysis on time distribution. For the mathematical calculation, time spent on the application server represents 30% of the total time and the remaining 70% is used to access the database. During the navigation, the part of the time spent on the application server represents less than 10% of the total time. Hardware level: In a second time, we are interested in better knowing the system behavior. We use SAR, one of Linux performance monitoring tools to collect and analyze system activity information. CPU is idle during on average 89.42% of time. Memory is normally used without swapping activities (0% during all the process time). Disk I/O also shows a normal activity with rare occurrences of device saturation according to the average number of read and written sectors. The diagnosis shows that the latency observed on the application is not due to hardware, neither to software. The performance issues of the database might be the source of the performance problem. 4.3 DBMS configuration We set values above and below the current settings of main variables presented in 3.2 to run some tests. The result is, for most of variables, the current value is already the optimal one. Only two variables innodb buffer pool size and innodb log file size could do better if defined as 12 GB and 1 GB, respectively. We then run a complete test with new DBMS settings on the whole system and we got a better performance of 7.32% with innodb log file size set to 1 GB. However, the current value of innodb buffer pool size is already optimal for a machine that serves at the same time as a database and a web server. So, the system is running with an almost optimal configuration.

5 Number of selected views Total evaluating cost Total gain no view materialized /4 tuples materialized % 1/2 tuples materialized % 3/4 tuples materialized % all views materialized % Table 3. Result of theoretical gain of implementing Greedy Algorithm 4.4 Selection of Materialized Views After the implementation of the classical greedy algorithm on a 55-GB database, the theoretical benefit can be estimated (see Table 3). We have done some adaptation during the implementation. We have noticed that the cardinality of nodes (# tuples of views) is highly skewed among the nodes in the lattice. Instead of limiting the number of selected views, we limit the number of tuples to be materialized. The result reveals a significant interest in precomputing materialized view candidates for this application. With only one quarter of materialized tuples, more precisely, rows, approximately 87.91% performance improvement can be achieved. Unfortunately, MySQL, the DBMS used in this case, supports neither materialized views nor automatic query rewriting by materialized views. Our results show the possibility of performance improvement if we had used another DBMS that supports automatic query rewriting. 4.5 Database Schema Modification We instantiate every query type presented in Section 2 and evaluate them on both the actual data schema and the new data schemas. Creating two separate materialized views Table 4 shows the results using the actual schema and two separate materialized views. We have an interesting improvement: an average gain of 8.08%, 4.03%, 17.13% and 12.09% respectively for the four query types. Regarding the response time of a data retrieval from precomputed data, there is some degradation Response time 1st try 2nd try 3rd try 4th try 5th try Average Gain Query type 1 Before After % Query type 2 Before After % Query type 3 Before After % Query type 4 Before After % Table 4. Evaluation of slow queries on actual and new data schema using two materialized views (in seconds) because an extra join has to be added to put the hierarchical information and the rest together. However, the evaluation time is still on the scale of tens of milliseconds. So, this newly created difference could be ignored for a web service. Using star schema In a second time, we make some more important changes on the data schema. We consider a star schema and implement data in fact table(s) and dimension tables. In this case, there are three dimensions Customer, Product and Time. Fact table stores foreign keys of the dimension tables and sales measures, i.e. turnover, quantity and price. This means that the former implicit time dimension becomes explicit. Due to some technical difficulties, we have to build these tables on two steps: create two dimension tables for customer and product and one fact table within the implicit time dimension and then create the time dimension table and the final fact table. We have conducted benchmarking on the two schemas and we have obtained some unexpected results shown Table 5 (here, DIM stands for dimension). The percentage of gain in comparison to the actual data schema shows that all other queries got a significant improvement using the star schema without time dimension instead of the one with time dimension except the query type 1. The particularity of this application is at least 36 month of sales are displayed in the interface to show the sales trend. In this case, the time dimension is rather an aggregation criteria than a condition of

6 Response time 1st try 2nd try 3rd try 4th try 5th try Average Gain Actual schema Query type 1 Without time DIM % With time DIM % Actual schema Query type 2 Without time DIM % With time DIM % Actual schema Query type 3 Without time DIM % With time DIM % Actual schema Query type 4 Without time DIM % With time DIM % Table 5. Evaluation of slow queries on actual and star data schemas with and without time dimension (in seconds) selection, which makes the query take more time to execute. For the query type 1, we define the month to be manipulated. This condition gives the query optimizer a chance to eliminate many tuples before entering the heavy part of the execution tree. So our conclusion is to design the star schema with adaption (here with only two dimensions). 5 Conclusion and Future Work We have presented four main optimization approaches that we consider in order to improve the performance of a forecasting system: hardware, DBMS configuration, DB design and selection of materialized views. In our context, we consider a real world application, Anticipeo, and we prove that with our solutions, the system presents a better performance. Further research work will include indexing strategies and parallelization. So far, we have not considered the improvement of the calculation performance. As mentioned in the introduction, former forecasting results are not used for future calculations. This means, at time t i, the database DB is at state S i. Once new forecasting information at time t i+δ has been derived, the database DB changes to S i+δ, which is completely different from S i. There could be some repeating intermediate results during the calculation and these intermediate results should be kept and used for future needs. We will also investigate this issue. References 1. T. K. BAN. Features of a good forecasting system, Jan E. Baralis, S. Paraboschi, and E. Teniente. Materialized views selection in a multidimensional database. In VLDB, pages , S. Biren, R. Karthik, and R. Vijay. A hybrid approach for data warehouse view selection. International Journal of Data Warehousing and Mining(IJDWM), 2:1 37, H. Gupta. Selection of views to materialize in a data warehouse. In ICDT, pages , H. Gupta and I. S. Mumick. Selection of views to materialize under a maintenance cost constraint. In ICDT, pages , V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In SIGMOD Conference, pages , W. H. Inmon. Building the Data Warehouse (3rd Edition). Wiley, R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition). Wiley, Y. Kotidis and N. Roussopoulos. Dynamat: A dynamic view management system for data warehouses. In SIGMOD Conference, pages , A. Shukla, P. Deshpande, and J. F. Naughton. Materialized view selection for multidimensional datasets. In VLDB, pages , P. Vassiliadis and T. K. Sellis. A survey of logical models for olap databases. SIGMOD Record, 28(4):64 69, 1999.