CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB


Badal K. Kothari (1), Prof. Ashok R. Patel (2)
(1) Research Scholar, Mewar University, Chittorgadh, Rajasthan, India
(2) Department of Computer Science, Hemchandracharya North Gujarat University, Gujarat, India

ABSTRACT: Data warehousing is increasingly important to decision makers who need to make fast and efficient decisions. On-line decision making needs short response times, and many indexing techniques have been created to achieve this goal. The indexing techniques that have attracted the most attention in multidimensional databases are Bitmap Indexing and Cube Indexing. In this paper our primary focus is on how to improve the existing storage engine and cube indexing techniques in a way that allows us to explore query resolution and optimization. To that end we will try to integrate the existing Sidera cube indexing with Berkeley DB.

Keywords: Data Warehouse, OLAP, Cube Indexing, Query Optimization, Hilbert R-tree, Berkeley R-tree.

[1] INTRODUCTION

One of the more important problems in the area of Decision Support Systems is the efficient access and querying of multi-dimensional data stored in the data warehouse [7,8,13,14,15]. An OLAP DBMS is the OLAP storage engine that is responsible for how the data of the data warehouse is stored effectively on disk and accessed quickly. The data cube is a multidimensional data model that supports OLAP processing and allows aggregated data to be viewed from a number of perspectives. A cube consists of dimensions and measures. For a d-dimensional space {A1, A2, ..., Ad} we have O(2^d) attribute combinations known as views (also known as cuboids or group-bys). The existing storage engine (Sidera server [9]) employs several components, such as indexing, a hierarchy manager and caching, that are used to answer multi-dimensional OLAP queries in a very efficient way.
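To make the count of views concrete, the following minimal sketch enumerates every group-by of a small cube. The dimension names P, Q and R are taken from the 3-dimensional example cube used later in the paper; the function name `enumerate_cuboids` is an illustrative assumption and is not part of Sidera.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every attribute combination (cuboid) of the given dimensions.

    The empty tuple stands for the fully aggregated view and the full
    tuple for the base cuboid, so d dimensions yield 2**d views.
    """
    views = []
    for k in range(len(dimensions) + 1):
        views.extend(combinations(dimensions, k))
    return views


if __name__ == "__main__":
    for view in enumerate_cuboids(["P", "Q", "R"]):
        # Print the cuboid name, using ALL for the empty group-by.
        print("".join(view) or "ALL")
```

For three dimensions this lists eight views, which is the 2^d growth referred to above.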
For a fact table associated with d dimensions, the current Sidera storage engine creates the R-tree indexed data cube as 2 * 2^d separate standard disk files: two separate files represent the R-tree index for each cuboid in the d-dimensional cube. These simple files are not databases in any sense and cannot efficiently support DBMS / OLAP queries. Berkeley DB combines the power of a relational storage engine with the simplicity of file-system-based data storage. Therefore, we will try to integrate the functionality of Berkeley DB into the existing Sidera server. Our primary focus is to integrate the cube indexing and access component of the existing Sidera server with Berkeley DB.

[2] HILBERT R-TREE

There are two types of Hilbert R-tree, one for static databases and one for dynamic databases. The performance of R-trees depends on the quality of the algorithm that clusters the data rectangles on a node. Hilbert R-trees use space-filling curves, specifically the Hilbert curve, to impose a linear ordering on the data rectangles and so achieve a better ordering of multidimensional objects in the node. This ordering has to be good, in the sense that it should group similar data rectangles together, to minimize the area and perimeter of the resulting Minimum Bounding Rectangles (MBRs).

Packed Hilbert R-trees are suitable for static databases in which updates are very rare or in which there are no updates at all. The clustering of data is done by the K-means algorithm.

The dynamic Hilbert R-trees employ a flexible deferred splitting mechanism to increase space utilization. This is done by proposing an ordering on the R-tree nodes. The Hilbert R-tree sorts rectangles according to the Hilbert value of the center of the rectangles (i.e., MBR). (The Hilbert value of a point is the length of the Hilbert curve from the origin to the point.) Given the ordering, every node has a well-defined set of sibling nodes; thus, deferred splitting can be used. By adjusting the split policy, the Hilbert R-tree can achieve a degree of space utilization as high as desired. In contrast, other R-tree variants have no control over the space utilization.

The Hilbert R-tree has the following structure.
A leaf node contains at most Cl entries, each of the form (R, obj_id), where Cl is the capacity of the leaf, R is the MBR of the real object (xlow, xhigh, ylow, yhigh) and obj_id is a pointer to the object description record. The main difference between the Hilbert R-tree and the R*-tree [5] is that non-leaf nodes also contain information about the LHVs (Largest Hilbert Values). Thus, a non-leaf node in the Hilbert R-tree contains at most Cn entries of the form (R, ptr, LHV), where Cn is the capacity of a non-leaf node, R is the MBR that encloses all the children of that node, ptr is a pointer to the child node, and LHV is the largest Hilbert value among the data rectangles enclosed by R. Notice that since the non-leaf node picks one of the Hilbert values of the children to be the value of its own LHV, there is no extra cost for calculating the Hilbert values of the MBRs of non-leaf nodes.

Figure 1 illustrates some rectangles organized in a Hilbert R-tree. The Hilbert values of the centers are the numbers near the x symbols (shown only for the parent node II). The LHVs are in [brackets]. Figure 4 shows how the tree of Figure 3 is stored on the disk; the contents of the parent node II are shown in more detail. Every data rectangle in node I has a Hilbert value v ≤ 33; similarly, every rectangle in node II has a Hilbert value greater than 33 and ≤ 107, etc.
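The entry layouts described above can be sketched as follows. This is an illustrative model only, under two assumptions that are not in the paper: the space is a discrete two-dimensional grid whose side is a power of two, and the Hilbert value is computed with the widely used iterative grid-to-curve conversion. The names `hilbert_value`, `LeafEntry`, `NonLeafEntry` and `make_parent` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (xlow, xhigh, ylow, yhigh)


def hilbert_value(side: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve over a side x side grid."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve is oriented correctly
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s //= 2
    return d


@dataclass
class LeafEntry:
    mbr: Rect    # R: MBR of the real object
    obj_id: int  # pointer to the object description record


@dataclass
class NonLeafEntry:
    mbr: Rect             # R: MBR enclosing all children
    ptr: List[LeafEntry]  # pointer to the child node (here: its entries)
    lhv: int              # largest Hilbert value among the enclosed rectangles


def center_hilbert(entry: LeafEntry, side: int) -> int:
    """Hilbert value of the center of an entry's MBR."""
    xlow, xhigh, ylow, yhigh = entry.mbr
    return hilbert_value(side, (xlow + xhigh) // 2, (ylow + yhigh) // 2)


def make_parent(children: List[LeafEntry], side: int) -> NonLeafEntry:
    """Build the (R, ptr, LHV) entry that describes a leaf node."""
    mbr = (min(c.mbr[0] for c in children), max(c.mbr[1] for c in children),
           min(c.mbr[2] for c in children), max(c.mbr[3] for c in children))
    lhv = max(center_hilbert(c, side) for c in children)
    return NonLeafEntry(mbr, children, lhv)


if __name__ == "__main__":
    leaves = [LeafEntry((0, 1, 0, 1), 1), LeafEntry((2, 3, 2, 3), 2)]
    print(make_parent(leaves, side=4))
```

The parent reuses a Hilbert value already computed for one of its children as its LHV, which matches the observation above that non-leaf nodes incur no extra Hilbert computation.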

Figure 1. Data rectangles organized in a Hilbert R-tree (Hilbert values and LHVs are in brackets).

[3] SIDERA SERVER: HILBERT R-TREE INDEXED CUBE

In Sidera, explicit multi-dimensional indexing is provided by parallelized R-trees. The R-tree indexes are packed using a Hilbert space-filling curve [3,15], so that arbitrary k-attribute range queries map more closely to the physical ordering of records on disk. For each cuboid fragment on a node, the basic process is as follows:

1. Sort the data based upon the Hilbert sort order. Associate each of the n point records with one of m pages of size n/m. Write the base level to disk as (cuboid name).hil.
2. Associate each of the m leaf node pages with an ID that will be used as a file offset by parent bounding boxes.
3. Construct the remainder of the R-tree index by recursively packing the bounding boxes of lower levels until the root node is reached.

The end result is a Hilbert-packed R-tree for each cuboid fragment in the system. The Hilbert-packed R-tree [15] is stored on disk as two physical files: a .hil file that houses the data in Hilbert sort order, and a .ind file that houses the R-tree metadata and the bounding boxes that represent the index tree.

[4] CUBE INDEXING INTEGRATION

The Sidera server uses the encoded fact table to generate the full cube as 2^d views in a d-dimensional space, and the Sidera indexing component describes how the full cube can be materialized for that space. In other words, the indexed R-tree [12] for each cuboid is stored on disk as two physical files: a .hil file that contains the data in Hilbert sort order [15], and a .ind file that contains the R-tree index metadata and the bounding boxes that represent the index tree. These files (.hil and .ind) are not databases and are not particularly efficient for OLAP queries.
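The three packing steps listed in Section [3] can be sketched in memory as follows. This is a simplified illustration, not the Sidera implementation: it assumes two-dimensional integer points on a grid whose side is a power of two, keeps the leaf pages (the .hil content) and the index levels (the .ind content) in Python lists instead of files, and uses page indexes in place of file offsets. All function names are hypothetical.

```python
def hilbert_value(side, x, y):
    """Position of grid cell (x, y) along a Hilbert curve over a side x side grid."""
    d, s = 0, side // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s //= 2
    return d


def bounding_box(boxes):
    """Smallest (xlow, xhigh, ylow, yhigh) box that encloses all given boxes."""
    return (min(b[0] for b in boxes), max(b[1] for b in boxes),
            min(b[2] for b in boxes), max(b[3] for b in boxes))


def pack_rtree(points, page_size, side):
    """Return (leaf_pages, levels) for a Hilbert-packed R-tree over the points."""
    if not points or page_size < 2:
        raise ValueError("need at least one point and a page size of 2 or more")
    # Step 1: sort in Hilbert order and cut the sorted run into leaf pages.
    ordered = sorted(points, key=lambda p: hilbert_value(side, p[0], p[1]))
    leaf_pages = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
    # Step 2: give each leaf page an ID that parent bounding boxes refer to.
    entries = [(bounding_box([(x, x, y, y) for x, y in page]), page_id)
               for page_id, page in enumerate(leaf_pages)]
    # Step 3: recursively pack bounding boxes until a single root node remains.
    levels = []
    while True:
        nodes = [entries[i:i + page_size] for i in range(0, len(entries), page_size)]
        levels.append(nodes)
        if len(nodes) == 1:
            return leaf_pages, levels
        entries = [(bounding_box([box for box, _ in node]), node_id)
                   for node_id, node in enumerate(nodes)]


def range_query(leaf_pages, levels, query):
    """Return the points inside query = (xlow, xhigh, ylow, yhigh)."""
    qx0, qx1, qy0, qy1 = query

    def overlaps(box):
        return box[0] <= qx1 and box[1] >= qx0 and box[2] <= qy1 and box[3] >= qy0

    node_ids = [0]  # the top level holds exactly one node, the root
    for nodes in reversed(levels):
        node_ids = [child for nid in node_ids
                    for box, child in nodes[nid] if overlaps(box)]
    # After the last level, node_ids are leaf page IDs.
    return [p for pid in node_ids for p in leaf_pages[pid]
            if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]


if __name__ == "__main__":
    pts = [(x, y) for x in range(8) for y in range(8)]
    pages, index = pack_rtree(pts, page_size=4, side=8)
    print(len(pages), "leaf pages,", len(index), "index levels")
    print(sorted(range_query(pages, index, (2, 3, 1, 2))))
```

Because neighbouring points on the Hilbert curve tend to share a page, a range query usually touches only a few leaf pages, which is the motivation for the Hilbert ordering given above.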

Berkeley DB provides four access methods: B-Tree, Hash, Recno and Queue. Although these perform very well in the context of the primary index, Berkeley DB in its current form is not sufficient to efficiently support the multi-dimensional queries that are executed by the Sidera server.

Figure 2. 3-dimensional cube (PQR).

Although Berkeley DB understands the notion of index / data combinations, it has no mechanism to directly support multi-dimensional R-trees. The integration process (Berkeley DB and Sidera server) consists of combining the source code for the cube indexing with the code of Berkeley DB. After this integration, Berkeley DB can be used to create a Berkeley DB database with the Hilbert R-tree access method.

[5] BERKELEY R-TREE MODEL

Berkeley DB supports the storage of many databases, i.e. Berkeley database objects, in one physical file. This physical file contains one master database, supported by the B-Tree access method, that has references to all the databases stored in the same file. The keys in this primary B-Tree are the names of the databases stored in the physical file, and the data of the primary B-Tree consists of the meta-data page number for each database name.

After the integration of Berkeley DB and Sidera, instead of 2 * 2^d physical files, the indexed cube is built as O(2^d) Berkeley database objects [10] for a d-dimensional fact table: one Berkeley database with the R-tree access method for each materialized cuboid. The Berkeley DB physical file contains a master database that has references to the Hilbert R-tree indexed cuboids stored as Berkeley databases in the same file.

Figure 3. Berkeley B-tree index that has references to all indexed cuboids stored in one physical file.
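As a worked count for the two storage layouts compared in this section, the following small sketch evaluates the 2 * 2^d file count of the current Sidera server against the single physical file of the integrated design. The function names are illustrative assumptions; the sketch counts one database object per cuboid and leaves the master database out of that count.

```python
def files_before_integration(d):
    """Current Sidera layout: one .hil and one .ind file per cuboid."""
    return 2 * 2 ** d


def layout_after_integration(d):
    """Integrated layout: one physical file with one database object per cuboid."""
    return {"physical_files": 1, "cuboid_databases": 2 ** d}


if __name__ == "__main__":
    for d in (3, 10):
        print(d, files_before_integration(d), layout_after_integration(d))
```

For the 3-dimensional cube PQR, for example, the current layout needs 16 files, while the integrated layout needs one file that holds 8 cuboid databases.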

The keys in the master database are the indexed cuboid names, and the data contains two values: the meta-data page number of the indexed cuboid and its size in terms of the number of blocks it requires.

[6] BERKELEY DATABASE: HILBERT R-TREE INDEXED CUBE

The algorithm below describes how the Hilbert R-tree indexed cube is stored as Berkeley databases in one physical Berkeley database file.

Input: A set S of cuboids of a cube named C.

Output: Hilbert R-tree indexed cuboids stored as Berkeley database objects in one physical file.

Algorithm: At the beginning, create and open a Berkeley database environment handle that encapsulates one or more Berkeley database objects, one for each indexed cuboid in the cube. Then, for each cuboid X in the cube, create the Hilbert R-tree indexed view as a Berkeley DB database by passing the database type DB-RTREE [10] to the open method. Calling the Berkeley DB open method with the database type DB-RTREE and a view X means that if the Berkeley database X does not exist, it is first created and the Hilbert R-tree index corresponding to cuboid X is stored in it. In the open method, the name of the physical file that backs the one or more Berkeley DB R-tree databases is the name of the cube.
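The open-or-create behaviour of the algorithm above can be modelled with a small in-memory stand-in. This sketch deliberately does not call the Berkeley DB API, since the DB-RTREE type comes from the integration described in [10] and is not a stock Berkeley DB access method. The class `CubeFile`, its page-numbering scheme and the `build_index` callback are assumptions made only for illustration.

```python
class CubeFile:
    """In-memory stand-in for the one physical file that backs a whole cube."""

    def __init__(self, cube_name):
        self.name = cube_name  # the physical file is named after the cube
        self.master = {}       # cuboid name -> (meta-data page number, size in blocks)
        self.databases = {}    # cuboid name -> blocks of its Hilbert R-tree index
        self.next_page = 1     # assumption: page 0 is reserved for the master database

    def open_rtree(self, cuboid, build_index):
        """Open the database for a cuboid, creating and filling it on first use."""
        if cuboid not in self.master:
            blocks = build_index(cuboid)  # hypothetical index builder
            self.master[cuboid] = (self.next_page, len(blocks))
            self.databases[cuboid] = blocks
            self.next_page += len(blocks)
        return self.databases[cuboid]


def build_cube(cube_name, cuboids, build_index):
    """Store one indexed database per cuboid in a single cube file."""
    cube_file = CubeFile(cube_name)
    for cuboid in cuboids:
        cube_file.open_rtree(cuboid, build_index)
    return cube_file


if __name__ == "__main__":
    # Placeholder builder: pretends each index occupies len(name) + 1 blocks.
    fake_index = lambda name: [f"{name}-block-{i}" for i in range(len(name) + 1)]
    cube = build_cube("PQR", ["P", "Q", "R", "PQ", "PR", "QR", "PQR"], fake_index)
    for cuboid, (page, size) in cube.master.items():
        print(cuboid, "meta page", page, "blocks", size)
```

The `master` dictionary plays the role of the master B-Tree database: its keys are cuboid names and its values are the two fields described above, so a cuboid can be located inside the single file without a separate file per index.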

Figure 4. Hilbert R-tree indexing of the 3-dimensional cube (PQR) before and after integration of Berkeley DB and Sidera server.

When we open a Berkeley DB database handle [10] with a database type equivalent to DB-RTREE, a Hilbert R-tree indexed cuboid is created and stored in one Berkeley DB physical file that has the same name as the cube.

[7] INDEX CONSTRUCTION

As noted, the indexed cube in the Berkeley-supported Sidera is represented in one single physical file, whereas in the current Sidera server it is represented in 2 * (number of views) files. This will reduce the index cube construction time. The primary reason for this reduction is that only one physical file is needed to store the indexed cube in the Berkeley-supported Sidera, whereas storing the indexed cube as 2 * 2^d physical files produces a significant amount of disk thrashing.

[8] CONCLUSION

In this paper, we have described the integration of Berkeley DB components into the Sidera server. This integration significantly enhances the existing Sidera storage engine. Specifically, Sidera now stores the Hilbert R-tree index in one Berkeley DB physical file. We have conceptually tested the integration of Sidera with Berkeley DB in terms of index cube construction and query resolution. We have also compared the index construction for the cube in a single backend node before and after the integration of the code into Berkeley DB. Conceptually, our results support the integration approach that we have taken. The Berkeley-supported Sidera server is better than the old Sidera server, and it significantly boosts run-time query performance.

REFERENCES

[1] Microsoft Analysis Services. analysisservices.aspx.
[2] Oracle Essbase. Business intelligence/ essbase/index.html.
[3]
[4] XML for Analysis Specification v1.1.
[5] CWM, Common Warehouse Metamodel.
[6] JSR-69 Java(TM) OLAP Interface (JOLAP), JSR-69 (JOLAP) Expert Group.
[7] L. Cabibbo and R. Torlone. Querying multidimensional databases. Proceedings of the 6th DBLP Workshop.
[8] E. Zimanyi and E. Malinowski. Hierarchies in a conceptual model: From conceptual modeling to logical representation. Data & Knowledge Engineering.
[9] Todd Eavis, George Dimitrov, Ivan Dimitrov, David Cueva, Alex Lopez, and Ahmad Taleb. Sidera: A cluster-based server for online analytical processing. International Conference on Grid computing, high-performance, and Distributed Applications (GADA).
[10] Ahmad Taleb. Query Optimization and Execution for MOLAP, pages 34-44.
[11] Arnaud Giacometti, Dominique Laurent, Patrick Marcel, and Hassina Mouloudi. A new way of optimizing OLAP queries. BDA.
[12] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proceedings of the 1984 ACM SIGMOD Conference, pages 47-57.
[13] Ralph Kimball and J. Caserta. The Data Warehouse ETL Toolkit. John Wiley and Sons.
[14] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. John Wiley and Sons.
[15] T. Pedersen, C. Jensen, and C. Dyreson. A foundation for capturing and querying complex multidimensional data. Information Systems Journal, 26(5).