Using Materialized Views To Speed Up Data Warehousing


Michael Teschke, Achim Ulbrich
IMMD 6, University of Erlangen-Nuremberg, Martensstr. 3, Erlangen, Germany

Abstract

Running analytical queries directly against the huge raw data volume of a data warehouse results in unacceptable query performance. The solution to this problem is storing materialized views in the warehouse, which pre-aggregate the data and thus avoid raw data access and speed up queries. In this paper, we first discuss the problems concerning the selection of the right pre-aggregations and their utilization. The main focus then is the maintenance of the pre-aggregates due to changes of the raw data. We emphasize the specific aspects of warehouse maintenance compared to common view maintenance and present a warehouse model optimized for speeding up view maintenance.

1 Motivation

Lately, the notion of a data warehouse (DW) has become extremely popular. A DW is an integrated collection of data that is extracted from distributed and heterogeneous (legacy) database systems. Various kinds of applications such as on-line analytical processing (OLAP), knowledge discovery or data mining can use the DW as a consistent data basis that contains information which could not be accessed together before. What distinguishes a DW from a federated database system? In the latter, distributed and heterogeneous database systems are integrated as well, but in a DW the data is redundantly stored and not only integrated in a view-based manner. Beyond that, the data from the sources is not just copied: it is cleaned from inconsistencies and transformed for optimal access and performance. Additionally, the raw data is aggregated according to application specifications, because analytical queries perform badly when they have to scan the enormous raw data volume.

For example, imagine a DW for market research in which sales data such as the number of sold products or price figures are stored. Let the products be organized into product groups and product groups into product areas. If the sum of the sales belonging to the same product group is aggregated and stored (materialized) in the warehouse, a query about the market share of product areas can be calculated much faster, because it does not have to access the raw data but can use the higher classificatory level of product groups.

The market research example comes from various modeling and performance case studies we have performed with the GfK, Europe's largest market research company ([LeRT95], [LeRT96]). In [LeRT96] we have shown that the performance gain achieved by storing redundant pre-aggregations outweighs the additional storage overhead by far. These materialized views (MVs) or pre-aggregations are the main focus of this paper. We give an overview of all the aspects that must be considered to speed up data warehousing with MVs, together with a summary of related work.

The remainder of the paper is organized as follows. In section 2 we discuss ways to decide which views should be materialized in the DW. In section 3 we show how the selected set of MVs can be used in query evaluation. Once these views have been computed, they have to be maintained to keep the degree of actuality the users or applications have specified. An overview of the related work concerning view maintenance aspects is given in section 4. In section 5 we present our approach to speeding up warehouse maintenance. The paper concludes with a summary and some ideas we want to work on in the near future.

2 Selection of Views

The most essential issue in speeding up data warehousing with MVs is to select which views should be materialized. Of course, there is a trade-off between time and space requirements: the more views are materialized, the more likely it is to find an appropriate view for answering a query. Generally, the following approaches have been brought up recently:

- materialize all possible aggregations
- algorithmic selection
- empirical selection

In the following, we discuss these points in more detail. The basis for selection and analysis is always the notion of modeling the data in a multidimensional way, with dimensions for accessing single cells (or slices) of the cube. Furthermore, different hierarchies can be defined on top of the dimensions to ease selection. Figure 2.1 sketches an example in which sales data is stored in a 3-dimensional array (cube). On top of the product dimension, a hierarchy is defined which groups single products up to product areas, as mentioned in section 1. Candidates for pre-aggregations are on the one hand all possible combinations of dimensional elements; on the other hand, all kinds of SPJ views (views defined by a combination of selections, projections and joins) together with aggregate functions can be used.

Fig. 2.1: 3-dimensional sales cube with the dimensions time (year, quarter, month, ...), location (country, region, shop, ...) and product (product areas, main groups, product groups, items)
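To give a feeling for the size of this candidate space, the following sketch enumerates the granularities that result from combining one hierarchy level per dimension of the cube in Figure 2.1. It is only an illustration under our own assumptions: the level names are taken from the figure, the Python representation is ours, and feature-oriented combinations (section 2.1) are not considered.

from itertools import product

# Hierarchy levels per dimension of the sales cube (Fig. 2.1),
# from finest to coarsest granularity.
hierarchies = {
    "time":     ["month", "quarter", "year", "ALL"],
    "location": ["shop", "region", "country", "ALL"],
    "product":  ["item", "product group", "main group", "product area", "ALL"],
}

# Every combination of one level per dimension is a candidate pre-aggregation.
candidates = list(product(*hierarchies.values()))
print(len(candidates))   # 4 * 4 * 5 = 80 candidate granularities
print(candidates[0])     # ('month', 'shop', 'item') -- the raw data itself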

2.1 The Cube-Operator

[GBLP96] proposed a cube operator which calculates all possible combinations between the elements of the existing dimensions. For three dimensions, this results in 2^3 = 8 different aggregations. This number sounds rather small, but it does not take hierarchies into account. The problem gets even worse if, besides this classification-oriented analysis, features (attributes) are also taken into account. In feature-oriented analysis, data is aggregated according to some features of the raw data. For example, the average price of TVs with remote control (feature 1) and Hi8 stereo sound (feature 2) would be an interesting question in the context of market research. Even for a small number of features per dimension, say 7, pre-aggregating all feature combinations would blow up the cube and result in about 2^21, i.e. roughly two million, combinations. Products which stick to this paradigm cannot solve the real-world problems of the GfK, where each product has up to 20 features ([LTAK97]).

2.2 Algorithmic Selection

Several algorithms for selecting views have been proposed in the literature lately. To the best of our knowledge, most of them are based on the same greedy approach: given a certain amount of space S, find the best set of views ([HaRU96]); a simplified sketch of this scheme is given at the end of this section. It is shown that the performance gain achievable with these greedy algorithms is at least 67% of the optimal solution, which is NP-hard to compute. In [GHRU97], this approach is extended to include the selection of indices, so that S is shared between views and indices. In [Gupt97] candidate views are weighted by a frequency factor stemming from monitoring the users' access behavior.

2.3 Empirical Selection

An empirical way to select the views to pre-aggregate is monitoring the queries submitted by the users. Based on factors like query frequency or data volume, a number of views can be calculated and materialized which supports the most important queries. This approach is used in the Informix MetaCube Aggregator, where user queries are monitored and views to materialize are proposed to the database administrator. To make our point clear, we do not propose to materialize one support view per query. On the contrary, we believe that views can be selected which support a whole set of queries when the classification hierarchy of the qualifying data is taken into account. The performance gain has been validated in a case study ([LeRT96]). In this study, an aggregation hierarchy has been built which supports a typical set of market research queries provided by the GfK. One of the aggregations in the hierarchy is not used directly by any query at all, but supports about seven other aggregations. Depending on their position in the aggregation hierarchy (the higher the position, the larger the data volume of the aggregations), query response times have been improved from hours or minutes to minutes and seconds.
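As a recap of the greedy scheme sketched in section 2.2, the following fragment selects views under a space budget. It is a simplified sketch, not the algorithm of [HaRU96] itself: size and benefit are placeholder functions that a caller would have to supply, and the benefit metric is an assumption of ours, not taken from the paper.

def greedy_select(candidates, space_budget, size, benefit):
    """Greedily pick views until the space budget is exhausted.

    candidates           -- iterable of candidate view identifiers
    space_budget         -- total space available for materialization
    size(v)              -- estimated size of view v
    benefit(v, selected) -- estimated benefit of adding v given the views
                            already selected (placeholder for a real metric)
    """
    selected, used = [], 0
    remaining = set(candidates)
    while remaining:
        # Consider only views that still fit into the remaining space.
        affordable = [v for v in remaining if used + size(v) <= space_budget]
        if not affordable:
            break
        # Pick the view with the highest benefit per unit of space.
        best = max(affordable, key=lambda v: benefit(v, selected) / size(v))
        if benefit(best, selected) <= 0:
            break
        selected.append(best)
        used += size(best)
        remaining.remove(best)
    return selected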

3 Using Views in Queries

To achieve better query performance, queries have to be transformed to access the MVs instead of the raw data. We can distinguish two ways of using MVs in queries:

- Direct match: The query can be answered completely with one MV; other relations or views are not necessary.
- Partial match: Only a part of the query can be computed with a MV. This view is not sufficient to answer the whole query; other relations or views are necessary.

Direct match means using a single MV to process the query. Of course it is not possible to materialize a view for every query; this technique can only be used for standard queries. To reduce the number of views, they should be defined as universally as possible, so that they satisfy several queries without decreasing efficiency ([LeRT96]). For example, in [Hans87] it is shown that projecting out some attributes does not decrease the cost of querying. For this reason no projections are required in any MV, unless some attributes are never used. Selections, too, can be performed very fast using index structures, so selections are only useful if some data is never required. Especially for ad-hoc queries, a direct match is not always possible. Partial-match usage of MVs can be applied in more cases, because only parts of queries are calculated with MVs; the non-matching parts of the queries can (and must) access the raw data.

There are two methods to replace the relations in the original query by MVs:

- Explicit use: The user is aware of the views and uses them explicitly when formulating the query.
- Transparent use: Instead of hard-coding the view in the query, the query optimizer is used to answer the query as efficiently as possible with the help of MVs.

Explicit use is not very user-friendly. The user has to know all relevant views, and if the views change, all users must be informed. Furthermore, as shown in section 4.4, if multiple maintenance policies are supported, only querying a single viewgroup is allowed; otherwise inconsistencies are possible, which could lead to incorrect answers. Thus the user would be responsible for data consistency. To avoid these drawbacks, the query optimizer has to be extended to detect and use the MVs that yield the best query performance. Then the user does not have to know anything about the MVs, and the query modification is transparent to him. [GuHQ95] have introduced the generalized projection approach, which explores the query tree with respect to whether a given MV can be integrated in place of a part of the tree. For that, the query tree and the tree of the MV have to be normalized. To implement transparent use of MVs for speeding up queries in DWs, a lot of information has to be stored as meta data: beside the view definitions, the normalized query tree of each view is required. By implementing the transparent-use extension of the query optimizer, administration capabilities for the MVs can also be enhanced: it becomes possible to analyze the queries submitted to the warehouse and to detect whether a new MV would increase query performance (section 2.3).
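Whether a MV matches a query depends, among other things, on whether the grouping of the query can be derived from the grouping of the view. The following single-dimension sketch illustrates this derivability test for the product hierarchy of Figure 2.1; it is a deliberate simplification of what a query optimizer based on normalized query trees would do, and the level ranks are our own encoding.

# Coarseness rank of each hierarchy level on the product dimension
# (higher = coarser); names follow Fig. 2.1.
LEVELS = {"item": 0, "product group": 1, "main group": 2, "product area": 3, "ALL": 4}

def derivable(query_level, view_level):
    """A query grouped at query_level can be answered from a MV grouped at
    view_level iff the view is at least as fine-grained, because coarser
    groups can be obtained by further aggregation."""
    return LEVELS[view_level] <= LEVELS[query_level]

# A query on product areas can be answered from a MV on product groups ...
assert derivable("product area", "product group")
# ... but a MV on product areas cannot answer a query on product groups.
assert not derivable("product group", "product area")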

4 View Maintenance

One of the major performance problems in DWs is the maintenance process. Whenever source data changes, the warehouse no longer reflects the correct state of this source and has to be maintained. Fortunately, only parts of the warehouse are affected by a modification to a single source, so the obvious solution would be to reload these parts. But the huge data inventory and the great number of changes forbid reloading even parts of the DW from scratch. For this reason, only the changes resulting from the source modifications are propagated to the warehouse.

Maintaining a DW is based on the theory of maintaining MVs. The process of updating MVs in response to changes to the underlying base relations is called view maintenance. Instead of recomputing the view, incremental view maintenance can be used: the view is computed incrementally from its old state and its changes Δ(V) due to the changes in the base relations. The maintenance process can be written as V_new = (V ⊖ Δ⁻(V)) ⊎ Δ⁺(V), where Δ⁻(V) and Δ⁺(V) contain the tuples that have to be deleted from or inserted into the old materialization, respectively (Δ(V) = Δ⁻(V) ∪ Δ⁺(V)); ⊖ denotes the contained difference and ⊎ the disjoint union ([QiWi91]). In [QiWi91] it is shown how to calculate the changes to the MVs with algebraic differencing. In [GrLi95] this approach is extended to views with duplicates. The technique for computing Δ(V) depends on several influence factors listed below (an extension of [GuMu95]):

- Type of modification: For maintenance, at least insertions and deletions have to be handled. Updates can be handled directly or modeled as deletions followed by insertions. Another sort of modification are changes to the view definition, which we omit due to space constraints ([GuMR95]).
- Query language expressions: The expressions used in the definition of the MVs influence the maintenance techniques. In this paper we focus on MVs with aggregate functions to stress their importance for speeding up query performance.
- Information available for maintenance: There are four different information sources: the view definition, the modifications, the contents of the MV (often called the materialization) and the base relations. The definition of the view and the modifications are at least necessary for incremental maintenance. Furthermore, additional information sources can be used, as shown in section 4.3.
- System environment: We can distinguish between local view maintenance, where the base relations are local to the views, and derived view maintenance, where they reside on heterogeneous, distributed sources. For the latter, maintenance is more difficult and expensive.
- Moment of update: In [CKL+97] three different maintenance policies are distinguished:
  - Immediate views: These views are updated immediately within the same transaction as the changes to the base relations. Immediate views allow fast querying with maximum actuality, at the expense of slowing down the update transactions.

  - Deferred views: Deferred views are maintained asynchronously to the modifications. They are typically refreshed when the view is queried. This leads to faster updates of the base data, with the disadvantage of slowing down querying.
  - Snapshot views: If staleness can be tolerated at the moment of access, views can be maintained asynchronously both to the modifications and to the moment of querying. Thus it is possible for queries to access old data.

In the following section we present techniques to detect the modifications applied to the base relations of a MV. In section 4.2 we analyze which expressions in MVs can be updated without access to the base relations in order to speed up maintenance. After that we describe techniques to reduce the access to the base relations during the maintenance of MVs that are not self-maintainable. In the last section we explain possible consistency problems.

4.1 Detecting Changes

A prerequisite for maintenance is the detection of the changes made to the base relations. If the database system supports triggers, the approach of Ceri and Widom ([CeWi91]) can be used to detect modifications with production rules. Production rules in database systems allow the specification of data manipulation operations that are executed automatically when certain events occur. For every base relation, three triggers have to be implemented (one each for insertions, deletions and, if supported, updates). The benefit of this method is that changes are detected at the moment they occur, so immediate maintenance is possible.

Another technique for detecting changes is the use of log tables, as shown in [KäRi87]. The system log file is parsed to obtain the relevant modifications. Since log files are maintained for recovery anyway, this approach may not require any modification to the applications. Because the log file can only be parsed periodically, only deferred maintenance is possible.

The source databases in most DW environments are heterogeneous, autonomous and often legacy systems. For this reason, logging or trigger mechanisms cannot generally be expected. The modifications then have to be extracted by monitor programs that compare a current source snapshot with an earlier one. The problem of detecting the differences between two snapshots is called the snapshot differential problem ([LaGa96]); a small sketch is given at the end of this subsection. Due to the periodic nature of the snapshots, immediate maintenance is not possible.

To reduce the number of modifications, irrelevant updates can be detected and removed. Irrelevant updates are modifications which do not affect the materialization. The cost of filtering such modifications is small, because neither access to the materialization nor to the base relations is required. Previous work has been done on detecting irrelevant updates ([BlCL89], [LeSa93]). If a tuple of one of the base relations of a MV is inserted or deleted and at least one of the attributes of the tuple does not fulfill the restriction in the selection condition, this tuple has no effect on the materialization and can be ignored.
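The following naive sketch illustrates the snapshot differential problem: two snapshots, keyed by primary key, are compared to derive the insertions, deletions and updates. The relation contents are invented, and [LaGa96] present considerably more efficient algorithms for large snapshots.

def snapshot_diff(old, new):
    """Naive snapshot differential: old and new map primary keys to tuples.
    Returns the insertions, deletions and updates needed to turn old into new."""
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return inserts, deletes, updates

old = {1: ("TV", 499), 2: ("VCR", 199)}
new = {1: ("TV", 479), 3: ("Camcorder", 899)}
print(snapshot_diff(old, new))
# ({3: ('Camcorder', 899)}, {2: ('VCR', 199)}, {1: (('TV', 499), ('TV', 479))})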

4.2 Self-Maintainable Views

Most important for the performance of MV maintenance is the information required for refresh. Depending on the types of modification and on the expressions used in the view definition, maintenance may or may not require access to the base relations. Such access considerably reduces maintenance performance, because huge base tables have to be scanned. For this reason it is very important to use MVs which can be maintained without accessing the base relations. A MV that can be maintained without access to the base relations is called self-maintainable ([BlCL89], [GuJM96], [Huyn96]). This property depends on the expressions in the definition of the MV and on the types of modification. Views that are self-maintainable for all types of modifications are called fully self-maintainable; otherwise they are called partially self-maintainable. In the following we investigate the property of self-maintainability for aggregate functions. For a detailed discussion of other view expressions, see [GuJM96] or [Huyn96].

Self-maintainability of aggregate functions

An aggregate function is self-maintainable if the new value of the function can be computed solely from the old value of the aggregation and from the changes to the base relations. In [ChMM88] such functions are called additive. With respect to this definition we can classify the different aggregate functions as follows:

- Additive functions: The new value of the aggregation can be computed from its old value and the change to the base relations. Examples are the COUNT and SUM functions. To maintain COUNT with respect to insertions, only the number of modifications with the same group-by attributes has to be added to the count. For deletions it has to be determined whether the value of the COUNT aggregation just has to be decreased or whether a tuple has to be deleted from the materialization (if COUNT = 0). The same applies to updates of group-by attributes, because they correspond to a deletion of the old group-by values and an insertion of the new ones. Maintaining SUM aggregations is likewise easy; problems resulting from null values can be handled with the counting-algorithm (section 4.3.1).
- Additive-computable functions: These are functions that cannot directly be computed from the old value and the update, but can be transformed into additive functions. For example, the aggregation AVG is not additive, but it can be replaced by the additive functions SUM and COUNT (AVG = SUM / COUNT).
- Partly-additive functions: These functions are additive only for some kinds of updates and cannot be transformed into additive functions. For example, MIN and MAX are partly-additive aggregations. Consider the case when a new tuple is inserted into a MV with the aggregation MIN. If the new value is higher than or equal to the stored minimum, the materialization remains unchanged; otherwise the inserted tuple is the new minimum. Thus, for insertions, MIN and MAX are self-maintainable. If a tuple is deleted from the base tables and its value is higher than the stored minimum, the update can be dropped; otherwise the deleted tuple is the minimum and access to the base relations is necessary to get the new minimum. Thus MIN and MAX are not self-maintainable with respect to deletions.

- Non-additive functions: If no type of modification can be applied to an aggregation without access to the base relations and the aggregation cannot be transformed into an additive function, it is called non-additive. One example is the median, which bisects the set of all values ordered by size.

4.3 Making Views Self-Maintainable

As shown in [GuJM96], only a small subset of view definitions is self-maintainable without transformation. For this reason, techniques to make views self-maintainable have been developed. The first approach, presented in section 4.3.1, is representative of a class of approaches in which additional attributes are added to the view definition. The second class uses auxiliary MVs to make a set of views self-maintainable.

4.3.1 The Counting-Algorithm

If a tuple is to be deleted from the materialization, it has to be determined whether the deleted base tuple is the only derivation in the base relations for the tuple in the view or not. In the first case the tuple has to be deleted from the materialization; in the second case the materialization remains unchanged. Querying the base relations to get the number of derivations can be avoided by changing the definition of the MV. With the counting-algorithm ([GuMS93], [MuQM97]), the number of derivations for a tuple in the materialization (count) is added as an extra attribute to the view definition. If a tuple is inserted into the materialization and there is no other derivation for this tuple, the count is initialized with 1; otherwise there are other derivations for this tuple and only the count has to be increased. If a derivation for a tuple in the materialization is deleted, the count has to be decreased, and if the count equals zero the tuple has to be deleted from the materialization. A short sketch follows at the end of section 4.3.

4.3.2 Auxiliary Materialized Views

To achieve self-maintainability, the use of full outer joins has been proposed (cf. section 4.2). Instead, we can also use other MVs to make a set of views self-maintainable; these views are called auxiliary materialized views. In [QGMW96] an algorithm is presented for detecting suitable auxiliary views. Querying these views instead of the base relations increases the maintenance performance for local as well as for derived view maintenance. Changing the view definitions to reference other views instead of the base relations requires that the MVs be maintained in the correct hierarchical order. For this reason a dependency graph ([QGMW96], [CKL+97]) is needed. The dependency graph G(V) is a directed graph with a node for every base relation and view used in the definition of the view V, either directly or through other views. There is an edge from a node X to a node Y if X is used to derive Y. All nodes in G(V) have to be maintained before V.
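To make the counting-algorithm of section 4.3.1 concrete, the following sketch maintains a small SUM view whose groups carry a derivation count, so that deletions never require access to the base relation. It is our own illustration, not code from the cited papers, and the group-by key and values are invented.

class SumCountView:
    """Self-maintainable SUM view using the counting-algorithm: each group
    stores (sum, count), so deletions can be handled without querying the
    base relation."""

    def __init__(self):
        self.groups = {}   # group-by key -> (sum, count)

    def insert(self, key, value):
        s, c = self.groups.get(key, (0, 0))
        self.groups[key] = (s + value, c + 1)

    def delete(self, key, value):
        s, c = self.groups[key]
        if c == 1:
            del self.groups[key]             # last derivation: drop the group
        else:
            self.groups[key] = (s - value, c - 1)

view = SumCountView()
view.insert("product group A", 100)
view.insert("product group A", 50)
view.delete("product group A", 100)
print(view.groups)    # {'product group A': (50, 1)}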

4.4 Consistency Considerations

If the update of a base relation is done separately from the update transaction of the MV (deferred and snapshot refresh) and maintaining the view requires access to the base relations, distributed incremental view maintenance anomalies ([ZhGW96]) can occur. This happens when one base relation is queried to maintain the view while another base relation is modified again. These anomalies can be avoided either by compensating the new modifications (Strobe algorithms, described in [ZhGW96]) or by using only self-maintainable MVs (section 4.2). As the first alternative considerably reduces maintenance performance, it is preferable to use only self-maintainable deferred and snapshot views whenever possible.

Different maintenance policies like immediate, deferred and periodic refresh can cause consistency problems for hierarchical maintenance structures ([CKL+97]). The maintenance policy of a view cannot be chosen independently of the policies of the related views or relations. For example, assume two views V1 and V2 which both have snapshot maintenance policies but different refresh cycles. If they are used to derive a third view V3, then V3 can also only be maintained periodically, and even then it is possible that this view reflects an inconsistent state of the raw data because of the different refresh cycles of V1 and V2. In order to provide consistent views in a system allowing multiple maintenance policies, [CKL+97] developed a model based on the notion of viewgroups. A viewgroup is a collection of views that are required to be mutually consistent, that is, every MV in a viewgroup relates to the same state of the underlying base relations. Viewgroups have to be isolated in the sense that maintenance of a view in a viewgroup must not cause changes to views in other viewgroups. Furthermore, it must be possible to answer a query without looking outside the queried viewgroup. For this reason it is not only necessary to search for other views to speed up maintenance, but also to check whether the auxiliary views and the new MV have consistency-preserving maintenance policies.

5 Warehouse Maintenance

In this section we present our approach to data warehousing. First we explain some basic design decisions, before a number of measures to speed up the maintenance process are discussed.

5.1 The Model

One basic assumption in our warehouse model is to store the normalized and filtered source relations in the warehouse. Thus we are able to answer any query and avoid accessing the sources for view maintenance. This leads to a distinction between two kinds of warehouse maintenance: external maintenance stands for the maintenance process between the sources and the warehouse, internal maintenance describes the process of maintaining the pre-aggregations in the warehouse. Because in most cases changes to the source data are transmitted periodically to the warehouse, external maintenance can only be done periodically. After the transmission, the modifications are stored in update tables to separate the moment of transmission from the moment of maintenance. There is one update table for each source relation, with an extra attribute indicating the modification type.

Beside the update tables, delta tables are used for deferred and snapshot views. These tables contain the modifications to be applied to the respective views. The main aspects of our warehouse model are sketched in Figure 5.1.

Fig. 5.1: Data warehouse model (analytical systems and a query transformer on top of the warehouse, which stores a hierarchy of pre-aggregations, the filtered and normalized raw data, meta data, delta tables and update tables; internal maintenance takes place inside the warehouse, external maintenance transmits the modifications from the sources)

5.2 Speeding up Warehouse Maintenance

Most important for speeding up warehouse maintenance is to use only self-maintainable MVs, because they avoid querying the base relations for maintenance. Using self-maintainable views also avoids the distributed incremental view maintenance anomalies discussed in section 4.4. From the set of self-maintainable MVs, we consider only those with aggregations (pre-aggregations); compared to simple SPJ views, they provide a much larger performance benefit. As only a subset of MVs is self-maintainable (section 4.2), a generation component is needed to transform the materializations selected by algorithms or empirical methods (section 2) with the techniques described in section 4.3. If querying the base relations is not avoidable (e.g. for deletions in MVs with MIN), the cost of querying is reduced thanks to the mirrored source relations. But deletions are an exception in DWs due to their characteristic as non-volatile storage.

Another approach to increase maintenance performance is to use other MVs as the basis of a MV (section 4.3.2). The generation component then has to extract the best maintenance path in the dependency graph. Due to varying requirements, the set of MVs changes frequently; for this reason, a dynamic dependency graph concept has to be developed. The best dependency graph has to be found not only when a new MV is generated; for already stored views it also has to be detected whether the new view could improve their maintenance performance. Furthermore, the elimination of a MV can cause changes to the maintenance structure of other views.
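Maintaining views in the correct hierarchical order amounts to processing the dependency graph in topological order, as the following minimal sketch shows. The graph of view names is invented; Python's standard graphlib module (available since Python 3.9) is used purely for illustration.

from graphlib import TopologicalSorter

# Dependency graph: each view maps to the relations/views it is derived from.
# All predecessors of a view must be maintained before the view itself.
depends_on = {
    "sales_by_group": ["sales_raw"],
    "sales_by_area":  ["sales_by_group"],
    "market_share":   ["sales_by_area", "sales_by_group"],
}

order = list(TopologicalSorter(depends_on).static_order())
print(order)   # ['sales_raw', 'sales_by_group', 'sales_by_area', 'market_share']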

For internal maintenance, immediate, deferred or snapshot refresh can be used. To avoid consistency problems, querying the mirrored base relations is not allowed during their refresh. For the same reason, access to the immediate MVs based on these relations is also forbidden during maintenance. To shorten refresh time, deferred and snapshot views can be used. Snapshot maintenance allows fast querying and updates, but queries can read data that is not up to date. Unfortunately, in most papers snapshots are refreshed only periodically, in a time-based manner. In our opinion, value-based maintenance is even more interesting (e.g. the value of the aggregate must not deviate by more than three percent from the actual value). Especially for aggregations, a deviation from the base relations is tolerable if the difference between the old value and the new value is insignificant. This insignificance threshold must be defined by the user and stored as part of the meta data. Only if the threshold is exceeded does the view have to be maintained.

Since the number of views to be maintained is often large, further methods to improve performance have to be applied. The techniques we consider most useful are discussed in the remainder of the paper.

5.2.1 Modification Compression

Because mostly heterogeneous and autonomous sources are used in DWs, the modifications of the source data are transmitted periodically, in batches, to relieve the network. The number of updates sent to the warehouse can be reduced by modification compression, i.e. by reducing all modifications of the same tuple (with the same primary key) to a single one; a short sketch follows below. For example, an insertion followed by an update of the same tuple can be replaced by a modified insertion, and two different updates can be combined into a single update.

5.2.2 Premaintenance

For deferred and snapshot views, the relevant modifications have to be stored in the delta tables. When all relevant immediate views and delta tables have been updated, a modification can be deleted from the update table. In this section we present a technique to reduce the number of modifications in the delta tables in order to speed up maintenance. First of all, we can eliminate the updates that are irrelevant to the MVs by checking the view definitions, as described in section 4.1. In [MuQM97] a premaintenance technique is described which increases maintenance performance by dividing the update process into two separate functions: propagate and refresh. The propagate function can be processed without locking the MVs, so querying is not disturbed. The MVs are not locked until the refresh function is processed, which applies the modifications in the delta tables to the corresponding views. Therefore, the goal of the propagate function is to do as much work as possible in order to minimize the time required by the refresh function. With the delta tables and parallel processing, every self-maintainable view can be refreshed at the same time.
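As an illustration of the modification compression described in section 5.2.1, the following sketch collapses a batch of modifications so that at most one net modification per primary key remains. The operation encoding and the small rule set are simplifying assumptions of ours, not a complete treatment of all operation combinations.

def compress(modifications):
    """Collapse a batch of (operation, key, tuple) modifications so that at
    most one net modification per primary key is propagated to the warehouse."""
    net = {}                                  # key -> (operation, tuple)
    for op, key, row in modifications:
        prev = net.get(key)
        if prev is None:
            net[key] = (op, row)
        elif prev[0] == "insert" and op == "update":
            net[key] = ("insert", row)        # insert + update -> modified insert
        elif prev[0] == "insert" and op == "delete":
            del net[key]                      # insert + delete cancel out
        elif prev[0] == "update" and op == "update":
            net[key] = ("update", row)        # two updates -> one update
        else:
            net[key] = (op, row)              # fall back to the latest operation
    return [(op, key, row) for key, (op, row) in net.items()]

batch = [("insert", 7, {"price": 499}), ("update", 7, {"price": 479})]
print(compress(batch))   # [('insert', 7, {'price': 479})]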

5.2.3 Fully Self-Maintainable Aggregations

In this section we consider only fully self-maintainable aggregations and the counting-algorithm (section 4.3.1) for maintenance, to show how the refresh process can be made more efficient. The propagate function combines all modifications with the same group-by attributes into one tuple, so the refresh function only has to handle one modification per group-by combination instead of many. This is done by pre-computing the aggregate functions, which we call aggregate compression. Let t be a new modification for a given MV. Either there is already a tuple in the delta table with the same group-by attributes or not. In the latter case, the modification is inserted into the delta table without any compression. Otherwise, the new modification can be combined with the already stored entry, as shown in Table 5.1. Here v is the value of the aggregation, computed over the expression exp, that is already stored in the delta table, Δv denotes the aggregation value of the new modification (over the same expression and with the same group-by attributes), and v' is the value of the aggregation after maintenance. As mentioned, the counting-algorithm is needed to maintain SUM without accessing the base relations for deletions; for this reason the count is represented by c (and c' after maintenance). Of course the table can be extended to other aggregate functions.

Aggregation        insert(Δv)                              delete(Δv)
v = COUNT(*)       v' = v + 1                              v' = v - 1
v = COUNT(exp)     if Δv = 0: v' = v, else v' = v + 1      if Δv = 0: v' = v, else v' = v - 1
v = SUM(exp)       v' = v + Δv; c' = c + 1                 v' = v - Δv; c' = c - 1

Tab. 5.1: Aggregation compression

For the refresh process, the following alternatives result from the values of v and c stored in the delta table: if v = 0 and c = 0 (only for SUM), no modification has to be applied to the tuple in the view with the same group-by attributes; otherwise the values have to be updated with the values from the delta table. We can see that this aggregation compression is possible even without reading the values in the views. We considered only insertions and deletions here, because direct updates to MVs have to be transformed into insertions and deletions anyway: either the value of the aggregated attribute or one of the group-by attributes changes. In the first case, the difference between the old and the new value has to be added to or subtracted from the MV. If the group-by attributes change, the tuple with the old values has to be deleted from the MV and the tuple with the new values has to be inserted.

5.2.4 Partly Self-Maintainable Aggregations

For MIN and MAX, aggregation compression is not possible; the deletion problem makes it necessary to store every modification in the delta table. As an example, consider a MV storing the total minimum of an attribute (with no group-by attributes) of a relation R.

We could imagine storing only the minimum of the insertions and the minimum of the deletions in the delta table, because only the minimum is required to modify the view. There is no problem when the value of the deletion is higher than the value of the insertion: either the minimum in the view is smaller than the insertion (no modification is required during the refresh procedure) or the insertion is the new minimum. Problems occur when the value of the deletion is equal to or smaller than the value of the insertion. If they are equal, we have to delete both the deletion and the insertion from the delta table, because they neutralize each other. But unfortunately there could be another insertion (not the minimum of all insertions), already removed from the delta table, that could be the new minimum. If the value of the deletion is smaller, the insertion is irrelevant: either the value of the deletion is higher than the minimum in the view (no modification of the view is required) or it is equal, and in the second case the new minimum has to be found by querying R. So even with access to the materialization during the propagate process, we have to store every modification in the delta table; only modifications that neutralize each other can be removed from it.

It can be useful, however, to create an auxiliary view V_aux containing a number of the smallest values (say five). Any modification can be applied to V_aux without interfering with warehouse queries. Whenever a tuple that has a derivation in V_aux is deleted from R, we delete this derivation; but instead of querying R for the fifth-smallest value, V_aux remains unchanged (now containing only the four smallest values). Only when the last minimum is deleted does R have to be queried for the five smallest values during the refresh process. Whenever an insertion is made to R, V_aux has to be adapted if the value of the insertion is smaller than the highest value in V_aux; if more than five values are stored, we can delete the highest one. If the new value is smaller than the absolute minimum, V itself has to be maintained as well, which is marked by an extra attribute. Thus V_aux is used like a kind of stack to reduce the cost of querying.

6 Conclusion and Future Work

In this paper we discussed the most important issues for speeding up data warehouse performance with materialized views: view selection, view usage and view maintenance. The latter field has been widely covered by the research community in recent years. We have shown that many approaches dealing with the maintenance of materialized views can be applied to data warehouses. Nevertheless, a lot of work still has to be done, particularly when considering aggregate views and view consistency problems. Whereas the view usage issue is discussed in some papers, the problem of view selection still remains unsolved. Cube approaches, which compute the full set of possible aggregations, became very popular but have been shown to be unrealistic in [LeTW97].

In the future, we will try to extend our work on all three issues. Using the classification hierarchy to reduce the number of possible combinations is discussed in [WLTA97]. Part of the Cube-Star project of our institute is the construction of a query optimizer which is able to determine which materialized view (or views) currently stored in the system is best suited to answer a certain query ([Cube97]). If no such view can be found, the query is computed from the raw data and the result is stored as a materialized view for later queries.
In our View-Star project, we investigate the possibilities of adjusting the view dependency paths dynamically. This enables us to optimize the maintenance of existing views when new materialized views are added to the data warehouse.

References

[BlCL89] Blakeley, J.; Coburn, N.; Larson, P.: Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates, in: ACM Transactions on Database Systems 14(3), 1989
[CeWi91] Ceri, S.; Widom, J.: Deriving Production Rules for Incremental View Maintenance, in: Proc. of the 17th Int. Conf. on Very Large Data Bases (VLDB 91, Barcelona, Spain, September 3-6), 1991
[ChMM88] Chen, M.; McNamee, L.; Melkanoff, M.: A Model of Summary Data and its Applications in Statistical Databases, in: Rafanelli, M.; Klensin, J. C.; Svensson, P. (eds.): Proc. of the 4th Int. Working Conf. on Statistical and Scientific Database Management (4SSDBM, Rome, Italy, June 21-23), Lecture Notes in Computer Science 339, Berlin et al.: Springer-Verlag, 1988
[CKL+97] Colby, L.; Kawaguchi, A.; Lieuwen, D.; Mumick, I.; Ross, K.: Supporting Multiple View Maintenance Policies, to appear in: SIGMOD 1997
[Cube97] Cubestar: the Cube-Star project of our institute
[GBLP96] Gray, J.; Bosworth, A.; Layman, A.; Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, in: Proc. of the 12th IEEE Int. Conf. on Data Engineering (ICDE 96, New Orleans, LA, Feb. 26 - March 1), 1996
[GHRU97] Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.: Index Selection for OLAP, in: Proc. of the 13th Int. Conf. on Data Engineering (ICDE 97, Birmingham, UK, April 7-11), 1997
[GrLi95] Griffin, T.; Libkin, L.: Incremental Maintenance of Views with Duplicates, in: Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 95, San Jose, USA, May 23-25), SIGMOD Record 24(2), 1995
[GuHQ95] Gupta, A.; Harinarayan, V.; Quass, D.: Aggregate-Query Processing in Data Warehousing Environments, in: Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB 95, Zurich, Switzerland, September 11-15), 1995
[GuJM96] Gupta, A.; Jagadish, H.; Mumick, I.: Data Integration using Self-Maintainable Views, Technical Report, Dept. of CS, Stanford University, 1996
[GuMR95] Gupta, A.; Mumick, I.; Ross, K.: Adapting Materialized Views after Redefinition, in: Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 95, San Jose, USA, May 23-25), SIGMOD Record 24(2), 1995

[GuMS93] Gupta, A.; Mumick, I.; Subrahmanian, V.: Maintaining Views Incrementally, in: Proc. of the 1993 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 93, Washington, USA, May 26-28), SIGMOD Record 22(2), 1993
[GuMu95] Gupta, A.; Mumick, I.: Maintenance of Materialized Views: Problems, Techniques and Applications, in: IEEE Data Engineering Bulletin, Special Issue on Materialized Views & Data Warehousing 18(2), June 1995
[Gupt97] Gupta, H.: Selection of Views to Materialize in a Data Warehouse, in: Proc. of the 6th Int. Conf. on Database Theory (ICDT 97, Delphi, Greece, Jan 8-10), 1997
[Hans87] Hanson, E.: A Performance Analysis of View Materialization Strategies, in: Proc. of the 1987 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 87, San Francisco, USA, May 27-29), SIGMOD Record 16(3), 1987
[HaRU96] Harinarayan, V.; Rajaraman, A.; Ullman, J.: Implementing Data Cubes Efficiently, in: Proc. of the 1996 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 96, Montreal, Quebec, June), 1996
[Huyn96] Huyn, N.: Efficient View Self-Maintenance, Technical Report, Dept. of CS, Stanford University, 1996
[KäRi87] Kähler, B.; Risnes, O.: Extending Logging for Database Snapshot Refresh, in: Proc. of the 13th Int. Conf. on Very Large Data Bases (VLDB 87, Brighton, Great Britain, September 1-4), 1987
[KLM+97] Kawaguchi, A.; Lieuwen, D.; Mumick, I.; Quass, D.; Ross, K.: Concurrency Control Theory for Deferred Materialized Views, in: Proc. of the 6th Int. Conf. on Database Theory (ICDT 97, Delphi, Greece, Jan 8-10), 1997
[LaGa96] Labio, W. J.; Garcia-Molina, H.: Efficient Snapshot Differential Algorithms for Data Warehousing, Technical Report, Dept. of CS, Stanford University, 1996
[LeRT95] Lehner, W.; Ruf, T.; Teschke, M.: Data Management in Scientific Computing: A Study in Market Research, in: Proc. of the Int. Conf. on Applications of Databases (ADB 95, Santa Clara, California, Dec. 1995)
[LeRT96] Lehner, W.; Ruf, T.; Teschke, M.: Improving Query Response Time in Scientific Databases Using Data Aggregation, in: Proc. of the 7th Int. Conf. and Workshop on Database and Expert Systems Applications (DEXA 96, Zurich, Switzerland, September 9-13), 1996
[LeSa93] Levy, A. Y.; Sagiv, Y.: Queries Independent of Updates, in: Proc. of the 19th Int. Conf. on Very Large Data Bases (VLDB 93, Dublin, Ireland, August 24-27), 1993
[LeTW97] Lehner, W.; Teschke, M.; Wedekind, H.: Über Aufbau und Auswertung multidimensionaler Daten (On the Construction and Analysis of Multidimensional Data), in: Proc. of the Conf. on Datenbanksysteme in Büro, Technik und Wissenschaft (Databases in Office, Engineering and Science) (BTW 97, Ulm, Germany, March 5-7), 1997

[LTAK97] Lehner, W.; Teschke, M.; Albrecht, J.; Kirsche, T.: Building a Real Data Warehouse for Market Research, submitted to DEXA 97
[MuQM97] Mumick, I.; Quass, D.; Mumick, B.: Maintenance of Data Cubes and Summary Tables in a Warehouse, to appear in: SIGMOD 97
[QGMW96] Quass, D.; Gupta, A.; Mumick, I.; Widom, J.: Making Views Self-Maintainable for Data Warehousing, Technical Report, Dept. of CS, Stanford University, 1996
[QiWi91] Qian, X.; Wiederhold, G.: Incremental Recomputation of Active Relational Expressions, in: IEEE Transactions on Knowledge and Data Engineering 3(3), September 1991
[Quas96] Quass, D.: Maintenance Expressions for Views with Aggregation, Technical Report, Dept. of CS, Stanford University, 1996
[WLTA97] Wedekind, H.; Lehner, W.; Teschke, M.; Albrecht, J.: Preaggregation in Multidimensional Data Warehouse Environments, submitted for publication
[ZhGW96] Zhuge, Y.; Garcia-Molina, H.; Wiener, J.: The Strobe Algorithms for Multi-Source Warehouse Consistency, Technical Report, Dept. of CS, Stanford University, 1996


More information

The Benefits of Data Modeling in Business Intelligence

The Benefits of Data Modeling in Business Intelligence WHITE PAPER: THE BENEFITS OF DATA MODELING IN BUSINESS INTELLIGENCE The Benefits of Data Modeling in Business Intelligence DECEMBER 2008 Table of Contents Executive Summary 1 SECTION 1 2 Introduction 2

More information

New Approach of Computing Data Cubes in Data Warehousing

New Approach of Computing Data Cubes in Data Warehousing International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 14 (2014), pp. 1411-1417 International Research Publications House http://www. irphouse.com New Approach of

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes Final Exam Overview Open books and open notes No laptops and no other mobile devices

More information

<Insert Picture Here> Enhancing the Performance and Analytic Content of the Data Warehouse Using Oracle OLAP Option

<Insert Picture Here> Enhancing the Performance and Analytic Content of the Data Warehouse Using Oracle OLAP Option Enhancing the Performance and Analytic Content of the Data Warehouse Using Oracle OLAP Option The following is intended to outline our general product direction. It is intended for

More information

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Journal of Advances in Information Technology Vol. 6, No. 4, November 2015 Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Jiangping Wang and Janet L. Kourik Walker

More information

Vertica Live Aggregate Projections

Vertica Live Aggregate Projections Vertica Live Aggregate Projections Modern Materialized Views for Big Data Nga Tran - HPE Vertica - Nga.Tran@hpe.com November 2015 Outline What is Big Data? How Vertica provides Big Data Solutions? What

More information

Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies

Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies Review Data Warehousing CPS 216 Advanced Database Systems Data warehousing: integrating data for OLAP OLAP versus OLTP Warehousing versus mediation Warehouse maintenance Warehouse data as materialized

More information

Oracle OLAP 11g and Oracle Essbase

Oracle OLAP 11g and Oracle Essbase Oracle OLAP 11g and Oracle Essbase Mark Rittman, Director, Rittman Mead Consulting Who Am I? Oracle BI&W Architecture and Development Specialist Co-Founder of Rittman Mead Consulting Oracle BI&W Project

More information

When to consider OLAP?

When to consider OLAP? When to consider OLAP? Author: Prakash Kewalramani Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 03/10/08 Email: erg@evaltech.com Abstract: Do you need an OLAP

More information

Data Warehousing & OLAP

Data Warehousing & OLAP Data Warehousing & OLAP What is Data Warehouse? A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management s decisionmaking process. W.

More information

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina Data Warehousing Read chapter 13 of Riguzzi et al Sistemi Informativi Slides derived from those by Hector Garcia-Molina What is a Warehouse? Collection of diverse data subject oriented aimed at executive,

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Date Warehousing: Current, Future? 20 April 2012 Prof. Chris Clifton Data Warehousing: Goals OLAP vs OLTP On Line Analytical Processing (vs. Transaction) Optimize for read, not

More information

Data Warehouse: Introduction

Data Warehouse: Introduction Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,

More information

How To Model Data For Business Intelligence (Bi)

How To Model Data For Business Intelligence (Bi) WHITE PAPER: THE BENEFITS OF DATA MODELING IN BUSINESS INTELLIGENCE The Benefits of Data Modeling in Business Intelligence DECEMBER 2008 Table of Contents Executive Summary 1 SECTION 1 2 Introduction 2

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

More information

Search and Data Mining Techniques. OLAP Anna Yarygina Boris Novikov

Search and Data Mining Techniques. OLAP Anna Yarygina Boris Novikov Search and Data Mining Techniques OLAP Anna Yarygina Boris Novikov The Database: Shared Data Store? A dream from database textbooks: Sharing data between applications This NEVER happened. Applications

More information

Consistent Query Answering in Data Warehouses

Consistent Query Answering in Data Warehouses Consistent Query Answering in Data Warehouses Leopoldo Bertossi 1, Loreto Bravo 2 and Mónica Caniupán 3 1 Carleton University, Canada 2 Universidad de Concepción, Chile 3 Universidad de Bio-Bio, Chile

More information

ESSBASE ASO TUNING AND OPTIMIZATION FOR MERE MORTALS

ESSBASE ASO TUNING AND OPTIMIZATION FOR MERE MORTALS ESSBASE ASO TUNING AND OPTIMIZATION FOR MERE MORTALS Tracy, interrel Consulting Essbase aggregate storage databases are fast. Really fast. That is until you build a 25+ dimension database with millions

More information

Turkish Journal of Engineering, Science and Technology

Turkish Journal of Engineering, Science and Technology Turkish Journal of Engineering, Science and Technology 03 (2014) 106-110 Turkish Journal of Engineering, Science and Technology journal homepage: www.tujest.com Integrating Data Warehouse with OLAP Server

More information

Optimized Cost Effective Approach for Selection of Materialized Views in Data Warehousing

Optimized Cost Effective Approach for Selection of Materialized Views in Data Warehousing Optimized Cost Effective Approach for Selection of aterialized Views in Data Warehousing B.Ashadevi Assistant Professor, Department of CA Velalar College of Engineering and Technology Erode, Tamil Nadu,

More information

Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

More information

Data warehouse Architectures and processes

Data warehouse Architectures and processes Database and data mining group, Data warehouse Architectures and processes DATA WAREHOUSE: ARCHITECTURES AND PROCESSES - 1 Database and data mining group, Data warehouse architectures Separation between

More information

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases Survey On: Nearest Neighbour Search With Keywords In Spatial Databases SayaliBorse 1, Prof. P. M. Chawan 2, Prof. VishwanathChikaraddi 3, Prof. Manish Jansari 4 P.G. Student, Dept. of Computer Engineering&

More information

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days Three Days Prerequisites Students should have at least some experience with any relational database management system. Who Should Attend This course is targeted at technical staff, team leaders and project

More information

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing Data Warehousing Outline Overview of data warehousing Dimensional Modeling Online Analytical Processing From OLTP to the Data Warehouse Traditionally, database systems stored data relevant to current business

More information

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA OLAP and OLTP AMIT KUMAR BINDAL Associate Professor Databases Databases are developed on the IDEA that DATA is one of the critical materials of the Information Age Information, which is created by data,

More information

Aggregates Caching in Columnar In-Memory Databases

Aggregates Caching in Columnar In-Memory Databases Aggregates Caching in Columnar In-Memory Databases Stephan Müller and Hasso Plattner Hasso Plattner Institute University of Potsdam, Germany {stephan.mueller, hasso.plattner}@hpi.uni-potsdam.de Abstract.

More information

1960s 1970s 1980s 1990s. Slow access to

1960s 1970s 1980s 1990s. Slow access to Principles of Knowledge Discovery in Fall 2002 Chapter 2: Warehousing and Dr. Osmar R. Zaïane University of Alberta Dr. Osmar R. Zaïane, 1999-2002 Principles of Knowledge Discovery in University of Alberta

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2015/16 Unit 2 J. Gamper 1/44 Advanced Data Management Technologies Unit 2 Basic Concepts of BI and Data Warehousing J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:

More information

OLAP Online Privacy Control

OLAP Online Privacy Control OLAP Online Privacy Control M. Ragul Vignesh and C. Senthil Kumar Abstract--- The major issue related to the protection of private information in online analytical processing system (OLAP), is the privacy

More information

Lection 3-4 WAREHOUSING

Lection 3-4 WAREHOUSING Lection 3-4 DATA WAREHOUSING Learning Objectives Understand d the basic definitions iti and concepts of data warehouses Understand data warehousing architectures Describe the processes used in developing

More information

Integrating Pattern Mining in Relational Databases

Integrating Pattern Mining in Relational Databases Integrating Pattern Mining in Relational Databases Toon Calders, Bart Goethals, and Adriana Prado University of Antwerp, Belgium {toon.calders, bart.goethals, adriana.prado}@ua.ac.be Abstract. Almost a

More information

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 asistithod@gmail.com

More information

Namrata 1, Dr. Saket Bihari Singh 2 Research scholar (PhD), Professor Computer Science, Magadh University, Gaya, Bihar

Namrata 1, Dr. Saket Bihari Singh 2 Research scholar (PhD), Professor Computer Science, Magadh University, Gaya, Bihar A Comprehensive Study on Data Warehouse, OLAP and OLTP Technology Namrata 1, Dr. Saket Bihari Singh 2 Research scholar (PhD), Professor Computer Science, Magadh University, Gaya, Bihar Abstract: Data warehouse

More information

The Cubetree Storage Organization

The Cubetree Storage Organization The Cubetree Storage Organization Nick Roussopoulos & Yannis Kotidis Advanced Communication Technology, Inc. Silver Spring, MD 20905 Tel: 301-384-3759 Fax: 301-384-3679 {nick,kotidis}@act-us.com 1. Introduction

More information

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved CHAPTER SIX DATA Business Intelligence 2011 The McGraw-Hill Companies, All Rights Reserved 2 CHAPTER OVERVIEW SECTION 6.1 Data, Information, Databases The Business Benefits of High-Quality Information

More information

Snapshots in the Data Warehouse BY W. H. Inmon

Snapshots in the Data Warehouse BY W. H. Inmon Snapshots in the Data Warehouse BY W. H. Inmon There are three types of modes that a data warehouse is loaded in: loads from archival data loads of data from existing systems loads of data into the warehouse

More information

Oracle Database In-Memory The Next Big Thing

Oracle Database In-Memory The Next Big Thing Oracle Database In-Memory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle do this Oracle Database In-Memory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes

More information

Tracking System for GPS Devices and Mining of Spatial Data

Tracking System for GPS Devices and Mining of Spatial Data Tracking System for GPS Devices and Mining of Spatial Data AIDA ALISPAHIC, DZENANA DONKO Department for Computer Science and Informatics Faculty of Electrical Engineering, University of Sarajevo Zmaja

More information

Framework for Data warehouse architectural components

Framework for Data warehouse architectural components Framework for Data warehouse architectural components Author: Jim Wendt Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 04/08/11 Email: erg@evaltech.com Abstract:

More information

Portable Bushy Processing Trees for Join Queries

Portable Bushy Processing Trees for Join Queries Reihe Informatik 11 / 1996 Constructing Optimal Bushy Processing Trees for Join Queries is NP-hard Wolfgang Scheufele Guido Moerkotte 1 Constructing Optimal Bushy Processing Trees for Join Queries is NP-hard

More information

VALLIAMMAI ENGNIEERING COLLEGE SRM Nagar, Kattankulathur 603203.

VALLIAMMAI ENGNIEERING COLLEGE SRM Nagar, Kattankulathur 603203. VALLIAMMAI ENGNIEERING COLLEGE SRM Nagar, Kattankulathur 603203. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : II / III Section : CSE - 1 & 2 Subject Code : CS 6302 Subject Name : Database

More information

Application Tool for Experiments on SQL Server 2005 Transactions

Application Tool for Experiments on SQL Server 2005 Transactions Proceedings of the 5th WSEAS Int. Conf. on DATA NETWORKS, COMMUNICATIONS & COMPUTERS, Bucharest, Romania, October 16-17, 2006 30 Application Tool for Experiments on SQL Server 2005 Transactions ŞERBAN

More information

Budgeting and Planning with Microsoft Excel and Oracle OLAP

Budgeting and Planning with Microsoft Excel and Oracle OLAP Copyright 2009, Vlamis Software Solutions, Inc. Budgeting and Planning with Microsoft Excel and Oracle OLAP Dan Vlamis and Cathye Pendley dvlamis@vlamis.com cpendley@vlamis.com Vlamis Software Solutions,

More information

Web Log Data Sparsity Analysis and Performance Evaluation for OLAP

Web Log Data Sparsity Analysis and Performance Evaluation for OLAP Web Log Data Sparsity Analysis and Performance Evaluation for OLAP Ji-Hyun Kim, Hwan-Seung Yong Department of Computer Science and Engineering Ewha Womans University 11-1 Daehyun-dong, Seodaemun-gu, Seoul,

More information

The Benefits of Data Modeling in Business Intelligence. www.erwin.com

The Benefits of Data Modeling in Business Intelligence. www.erwin.com The Benefits of Data Modeling in Business Intelligence Table of Contents Executive Summary...... 3 Introduction.... 3 Why Data Modeling for BI Is Unique...... 4 Understanding the Meaning of Information.....

More information

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,

More information

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Warehousing and OLAP Technology for Knowledge Discovery 542 Data Warehousing and OLAP Technology for Knowledge Discovery Aparajita Suman Abstract Since time immemorial, libraries have been generating services using the knowledge stored in various repositories

More information

Part 22. Data Warehousing

Part 22. Data Warehousing Part 22 Data Warehousing The Decision Support System (DSS) Tools to assist decision-making Used at all levels in the organization Sometimes focused on a single area Sometimes focused on a single problem

More information

Data Warehousing. Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de. Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1

Data Warehousing. Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de. Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1 Jens Teubner Data Warehousing Winter 2014/15 1 Data Warehousing Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Winter 2014/15 Jens Teubner Data Warehousing Winter 2014/15 152 Part VI ETL Process

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Content Problems of managing data resources in a traditional file environment Capabilities and value of a database management

More information

A Model For Revelation Of Data Leakage In Data Distribution

A Model For Revelation Of Data Leakage In Data Distribution A Model For Revelation Of Data Leakage In Data Distribution Saranya.R Assistant Professor, Department Of Computer Science and Engineering Lord Jegannath college of Engineering and Technology Nagercoil,

More information

Andreas Rauber and Philipp Tomsich Institute of Software Technology Vienna University of Technology, Austria {andi,phil}@ifs.tuwien.ac.

Andreas Rauber and Philipp Tomsich Institute of Software Technology Vienna University of Technology, Austria {andi,phil}@ifs.tuwien.ac. An Architecture for Modular On-Line Analytical Processing Systems: Supporting Distributed and Parallel Query Processing Using Co-operating CORBA Objects Andreas Rauber and Philipp Tomsich Institute of

More information

Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses

Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Thiago Luís Lopes Siqueira Ricardo Rodrigues Ciferri Valéria Cesário Times Cristina Dutra de

More information