Performance of KDB-Trees with Query-Based Splitting*

Yves Lépouchard, Dept. of Computer Science, University of Virginia, Charlottesville, VA 2293
Ratko Orlandic, Dept. of Computer Science, Illinois Institute of Technology, Chicago, IL 6616
John L. Pfaltz, Dept. of Computer Science, University of Virginia, Charlottesville, VA

* Partly funded by the DOE grant no. DE-FG295-ER25254.

Abstract

While the persistent data of many advanced database applications, such as OLAP and scientific studies, are characterized by very high dimensionality, typical queries posed on these data appeal to a small number of relevant dimensions. Unfortunately, the multidimensional access methods designed for high-dimensional data perform rather poorly for these partially specified queries. A potentially very appealing idea, frequently suggested in the literature, is to adopt a node-splitting policy that takes into account the importance of individual dimensions, which could be determined either a priori or through a statistical sampling of actual queries. This paper presents the results of some carefully controlled experiments conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy is compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations. Based on the results, query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

Keywords: information databases, multi-dimensional databases, access methods, data dimensionality.

1. Introduction

Typical retrieval mechanisms are based on a subdivision of the search space into finer and finer subspaces organized into a tree structure. Each subspace is represented by an index node (page) of the tree structure. The leaf nodes correspond to the smallest regions in which the desired items are to be found. With large amounts of data, the structure must reside on secondary storage. While an exact-match search typically follows a single path of the tree, range-search queries usually require access to a possibly large number of nodes.

To search a 2- or 3-dimensional space based on region queries, one would divide the space into index regions (typically rectangular ones) through a splitting process that partitions individual dimensions. How this process takes place may have a significant impact on the retrieval performance. In part because of this, we have today a variety of different structures for spatial (2- or 3-dimensional) retrieval [2].

The real problem arises when we consider data in higher-dimensional spaces. These spaces naturally occur when data is regarded with respect to a multi-dimensional parameter space. Examples include information retrieval, data mining, OLAP, multimedia systems and numerous scientific simulations, such as high-energy physics, long-term environmental observations, as well as genome and protein studies. Unfortunately, the traditional multi-dimensional access methods do not scale well to spaces with many dimensions. Their performance rapidly deteriorates as the number of dimensions grows. As a result, they impose a practical limit on the number of dimensions, which is typically quite low. The limitations of contemporary multi-dimensional access methods in spaces with many dimensions have attracted considerable attention lately [1,3,5,9].

Further complicating the matter, the multidimensional queries of many applications tend to specify only a relatively small subset of parameters of interest (dimensions). For example, the study reported in [8] cites large data spaces with more than 2 data dimensions. However, the dimensionality of queries is typically about 2 to 4. While the events of interest for experimental high-energy physics may have more than 1 dimensional properties, the number of properties specified by the queries is much smaller, typically about 1 to 8 [4]. Since the traditional multi-dimensional access methods are generally designed assuming fully specified search predicates, they perform poorly for these partial queries.

Given this situation, a potentially very appealing approach is to take into account the importance of individual parameters/dimensions in the process of node splitting, so as to minimize the probability of overlap between the index regions and the queries and, thereby, reduce the number of page accesses. The importance of the dimensions could be determined either a priori or through a statistical sampling of actual queries. This idea is frequently suggested in the literature as a way of increasing the retrieval performance of multi-dimensional access methods. For example, in [7], Robinson suggested query-based splitting as a possible space-partitioning strategy that could improve the performance of KDB-trees; but the idea was never pursued.

The KDB-tree has become a point access method of choice in many applications. While the structure was originally designed for low-dimensional data, in [5], we have shown that it can serve as a basis for an effective retrieval mechanism in high-dimensional spaces as well. The topic of this paper arose from a larger investigation, in which we studied the effects of various splitting policies on the performance of KDB-trees in high-dimensional situations.

The goal of this paper is to present the results of some carefully controlled experiments conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy is compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations [5].
Based on the results, we will argue that query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

The rest of the paper is organized as follows. In Section 2, we review the structure of KDB-trees and its policy of cyclic splitting. Section 3 discusses the idea of query-based splitting, introducing some relevant terminology. Section 4 presents the results of the experimental study conducted to compare the performance of KDB-trees with query-based and cyclic splitting. Section 5 concludes the paper by summarizing the results.

2. Cyclic Splitting of KDB-Trees

A KDB-tree is a height-balanced hierarchy of nodes (pages), in which each node represents a portion of space. At every level of the structure, the d-dimensional universe is recursively divided into hyper-rectangles by means of (d-1)-dimensional hyper-planes, each of which is perpendicular to one of the axes. The root node represents the entire universe, which itself is a multidimensional rectangle. Figure 1 illustrates a portion of a KDB-tree and its partition of a 2-dimensional space. Observe that each rectangular subspace has been split first with respect to one, and then another dimension. This is the characteristic of cyclic splitting.

Figure 1. A 2-dimensional KDB-tree.

The leaf nodes of KDB-trees, also called point pages, contain actual data objects, i.e. points in space. In the conceptual subdivision of the space corresponding to this level of the structure, the directions of the dividing hyper-planes alternate among individual dimensions. Every interior node, called a region page, maintains index entries, each representing a child node at the level below.

With cyclic splitting of point pages, the splitting dimension of a point page is determined as (splitting dimension of the old page + 1) MOD d, where d is the dimensionality of the universe. We enhance this policy with a splitting strategy for region pages, called first-division splitting [5].
According to this strategy, a region page R is split along the dividing hyper-plane by which the index region of R was split for the first time (the first-division plane). As a result, this policy follows the partitioning sequence at the leaf level, selecting the splitting dimensions of the region pages in a cyclic fashion. As shown in [5], this strategy significantly improves the performance of KDB-trees in high-dimensional spaces.

3. Query-Based Splitting

In [7], Robinson suggested a finer splitting policy that can take advantage of the actual query patterns. In the following, we say that a query is partially specified if it restricts only a subset of dimensions, leaving the rest of the dimensions unspecified. Of particular interest for our analysis will be mono-specified queries, whose result set can be formally defined as

$R = \{x \in S \mid x_{\min} \le x.a_i \le x_{\max}\}$,

where $S$ is the given set of points, $a_i$ is the coordinate of a point object along the $i$-th dimension, and $x_{\min}$ and $x_{\max}$ are two scalar values. Figure 2 shows a mono-specified query (gray volume) defined on a 3-dimensional space.

Figure 2. A mono-specified query.

We say that a query is fully specified when all dimensions are restricted by the query predicate. Reusing the above notation, the result set can be defined as

$R = \{x \in S \mid \forall i \in [0, d),\ x_{\min}.a_i \le x.a_i \le x_{\max}.a_i\}$.

Note that now $x_{\min}$ and $x_{\max}$ represent vectors, not scalar values as in the previous formula.

Since queries of typical multi-dimensional applications tend to be partially specified, we can perform a statistical analysis of the dimensions that are specified most often. If such analysis is possible, one can compute the probability of specifying each dimension and build the tree structure accordingly. Whenever a split of a point page occurs, we pick the splitting dimension in relation to the probability distribution resulting from the anticipated query pattern. This is the underlying idea of query-based (QB-) splitting. Note that this policy applies to point pages. The splitting of region pages follows the first-division splitting strategy described earlier.

4. Experimental Evidence

In order to compare cyclic and query-based splitting policies, we implemented two versions of KDB-trees that differ in the way the splitting dimensions are selected. We also constructed three different test cases.
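The two ways of selecting the splitting dimension of a point page, cyclic (Section 2) and query-based (Section 3), can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and drawing the QB dimension at random is an assumption, since the text only says that the dimension is picked in relation to the probability distribution.

```python
import random


def cyclic_split_dim(old_dim, d):
    """Cyclic policy: (splitting dimension of the old page + 1) MOD d."""
    return (old_dim + 1) % d


def query_based_split_dim(probs, rng):
    """QB policy (assumed form): draw dimension i with probability probs[i].

    probs[i] is the anticipated probability that a query specifies
    dimension i; rng is a random.Random instance.
    """
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    d = 3
    print([cyclic_split_dim(dim, d) for dim in range(d)])

    rng = random.Random(42)
    # All queries specify dimension 0, so every split uses dimension 0.
    print({query_based_split_dim([1.0, 0.0, 0.0], rng) for _ in range(20)})
```

With a distribution that puts all probability on one dimension, the QB sketch always returns that dimension, which corresponds to the situation of the first test case.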
The first test case compares the two variants of KDB-trees for queries that are always specified with respect to one particular dimension. In the second test case, the queries are mono-specified but with respect to different dimensions. The third case compares cyclic splitting with QB-splitting for fully specified queries. For all test cases, the input was the same set of 1, randomly generated points. In each test case, we performed 1, queries and measured the average number of page accesses per query as dimensionality increases from 2 to 16.

In the first two test cases, the queries were mono-specified and the probability distribution used to determine the split axes of the KDB-tree with QB-splitting was the same as the importance of the dimensions implied by the queries. In the first mono-specified case, the query dimension was constant. In other words, all queries specified only this single dimension. In the second case, one dimension was clearly dominant and it was specified by most queries, but some queries specified other dimensions instead. The third case was constructed to observe the performance of QB-splitting when the actual queries do not behave as anticipated. In this test case, the queries were fully specified.

Obviously, these experiments were constructed for some extreme scenarios. For mono-specified queries, a traditional B-tree index would be far superior to any multi-dimensional structure. However, the test cases were constructed so that they can reveal the promise of query-based splitting. Certainly, if the policy does not perform well with mono-specified queries, for which the distribution of splits can be selected to match perfectly the implied relevance of the dimensions, it is unlikely that it can perform well for partial queries that specify more than one dimension.

In the first test case, whose results are shown in Figure 3, the same dimension was specified by all queries.
Thus, the probability of specifying the predominant dimension was 1.0, whereas the probability of specifying any other dimension was 0.0. In this case, query-based splitting guarantees that all splits occur along the predominant dimension. As one can see from Figure 3, for this scenario, QB-splitting is clearly superior to cyclic splitting, especially in high-dimensional spaces.

Figure 3. Test Case 1: Mono-specified queries along one exclusively predominant dimension.

It is unrealistic to have all queries specify only one dimension. A more realistic scenario might have 60% of all queries specify one dimension, 30% specify a second, and only 10% specify a third. However, an arbitrary distribution need not scale with data dimensionality. Here, we need a density distribution that remains self-similar as dimensionality grows. Thus, in the second test case, we applied a continuous square-root function to compute the probability distribution for any number of dimensions. The square-root function $F: x \mapsto x^{1/\alpha}$, where $\alpha = 2$, was used on the real interval [0, 1]. For example, in a 5-dimensional space, we calculate the probability $p(i)$ that each dimension $i$ is specified as follows, where $F(i)$ abbreviates $F$ evaluated at the point $(i+1)/5$:

$F(0) \approx 0.447$, $p(0) = F(0) \approx 0.447$;
$F(1) \approx 0.632$, $p(1) = F(1) - F(0) \approx 0.185$;
$F(2) \approx 0.775$, $p(2) = F(2) - F(1) \approx 0.143$;
$F(3) \approx 0.894$, $p(3) = F(3) - F(2) \approx 0.119$;
$F(4) = 1$, $p(4) = F(4) - F(3) \approx 0.106$.

Observe that the first dimension is still clearly predominant, as it is specified in more than 40% of all queries. The second dimension is specified in nearly 20% of queries, with each of the remaining dimensions appearing in about 10-15% of queries. The strong bias toward the first dimension should still make QB-splitting desirable. Nevertheless, Figure 4 reveals little difference between QB- and cyclic splitting, even though the KDB-tree structure has a distribution of splits that perfectly matches the priorities of the dimensions. This is because, whenever a split occurs along a certain dimension, all queries that discriminate along other dimensions are penalized by this choice. Thus, even though the QB-splitting policy pays off in some extreme cases, for more general cases, it does not appear to bring any improvement that could justify the effort of forecasting the actual queries.

Figure 4. Test Case 2: Skewed importance of the dimensions.

The queries used for test case 3, shown in Figure 5, were fully specified. Therefore, all dimensions are important. However, even though other dimensions may appear in the queries, this test case forces splits only along a single dimension. While this test case is somewhat contrived, its purpose is to show what could happen when the actual queries do not behave as expected. Clearly, the QB-splitting policy adapts very poorly to unexpected query patterns. Cyclic splitting is a clear winner in this situation.

In summary, the idea of adapting the splitting policy to suit the collected statistics about queries is intuitively a fine concept. It is backed up by at least one scenario investigated in this paper. However, for more general cases, query-based splitting does not bring an improvement significant enough to justify the effort of forecasting the actual queries. As appealing as it may sound, QB-splitting does not appear to be an effective splitting strategy for KDB-trees.
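The square-root distribution of the second test case can be reproduced with a few lines of code. The sketch below assumes that the cumulative value for dimension i is F evaluated at (i+1)/d, which is the reading consistent with the 5-dimensional figures quoted in the text; the function name is hypothetical.

```python
def sqrt_split_probabilities(d, alpha=2.0):
    """Return [p(0), ..., p(d-1)] for F(x) = x**(1/alpha) on [0, 1].

    The cumulative value for dimension i is F((i + 1) / d), and p(i) is
    the difference between consecutive cumulative values.
    """
    probs = []
    prev = 0.0  # F(0) = 0 at the left end of the interval
    for i in range(d):
        cum = ((i + 1) / d) ** (1.0 / alpha)
        probs.append(cum - prev)
        prev = cum
    return probs


if __name__ == "__main__":
    for i, p in enumerate(sqrt_split_probabilities(5)):
        print(f"p({i}) = {p:.3f}")
```

For d = 5 the cumulative values round to 0.447, 0.632, 0.775, 0.894 and 1, as in the worked example. The individual p(i) computed at full precision can differ from the quoted ones in the last digit, since the quoted differences appear to have been taken between already rounded values.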

Figure 5. Test Case 3: Fully specified queries and only one splitting dimension.

5. Summary and Discussion

The problem of accessing data in high-dimensional spaces has attracted considerable attention. The proposed solutions generally assume well-defined queries that restrict the values of all dimensions in the universe. However, in advanced applications such as scientific studies and OLAP, typical queries posed on their high-dimensional data restrict a relatively small subset of dimensions, leaving the rest of the dimensions unspecified. The contemporary multi-dimensional access methods tend to perform poorly for these partially specified queries.

In this paper, we investigated the idea of adopting a splitting policy that takes into account the priorities of individual dimensions, which we call query-based splitting. We presented the results of an experimental study conducted to observe the effects of query-based splitting on the performance of KDB-trees. The strategy was compared to a splitting policy that selects the split dimensions in a cyclic fashion, which has been shown to be very effective, especially in high-dimensional situations. Based on the results, query-based splitting does not appear to be a very appealing splitting strategy for KDB-trees.

In [6], we have proposed a much more effective retrieval technique for partial queries. The idea is to apply an elaborate storage organization, called the inverted space (IS), which assigns to a high-dimensional universe one data store and a number of multidimensional indexes, each supporting efficient selections on a subset of dimensions. This organization allows the system administrator to control the size of individual indexes and avoid the negative impact of very high data dimensionality on the retrieval performance. To support the IS storage organization, we have also developed a new point access method, called the KDB_HD-tree [6].
Together, the two solutions can enable efficient access to persistent data of high dimensionality based on partially specified queries.

6. References

[1] S. Berchtold, C. Bohm, and H.P. Kriegel, "The Pyramid-Technique: Towards Breaking the Curse of Dimensionality," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp ,
[2] V. Gaede and O. Gunther, "Multidimensional Access Methods," ACM Computing Surveys 3(2):17-231,
[3] K.I. Lin, H.V. Jagadish, and C. Faloutsos, "The TV-Tree: An Index Structure for High-Dimensional Data," VLDB Journal 3(4): ,
[4] R. Orlandic, J. Lukaszuk, and C. Swietlik, "The Design of a Retrieval Technique for High-Dimensional Data on Tertiary Storage," SIGMOD Record, 22 (in press).
[5] R. Orlandic and B. Yu, "Implementing KDB-trees to Support High-Dimensional Data," Proc. Int. Database Engineering and Applications Symposium IDEAS 1, pp , 21.
[6] R. Orlandic and B. Yu, "A Retrieval Technique for High-Dimensional Data and Partially Specified Queries," Data and Knowledge Eng., 22 (in press).
[7] J.T. Robinson, "The K-D-B Tree: A Search Structure for Large Multidimensional Dynamic Indexes," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 1-18,
[8] K.A. Ross and K.A. Zaman, "Optimizing Selections over Databases," Proc. 12th Int. Conf. on Scientific and Statistical Database Management, pp , 2.
[9] R. Weber, H.-J. Schek, and S. Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces," Proc. 24th Int. Conf. on VLDB, , 1998.