CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB




Badal K. Kothari 1, Prof. Ashok R. Patel 2
1 Research Scholar, Mewar University, Chittorgadh, Rajasthan, India
2 Department of Computer Science, Hemchandracharya North Gujarat University, Gujarat, India

ABSTRACT: The growing interest in data warehousing reflects how crucial fast and efficient decision making has become. On-line decision making requires short response times, and many indexing techniques have been developed to achieve this goal. Two indexing techniques that have attracted attention in multidimensional databases are bitmap indexing and cube indexing. In this paper our primary focus is on improving the existing storage engine and cube indexing techniques so that we can explore query resolution and optimization. To that end, we try to integrate the existing Sidera cube indexing with Berkeley DB.

Keywords: Data Warehouse, OLAP, Cube Indexing, Query Optimization, Hilbert R-tree, Berkeley R-tree.

[1] INTRODUCTION

One of the more important problems in the area of Decision Support Systems is the efficient access and querying of multi-dimensional data stored in the data warehouse [7,8,13,14,15]. An OLAP DBMS is the OLAP storage engine responsible for storing the data of the data warehouse effectively on disk, as well as accessing it quickly. The data cube is the multidimensional data model that supports OLAP processing; it allows aggregated data to be viewed from a number of perspectives. A cube consists of dimensions and measures. For a d-dimensional space {A1, A2, ..., Ad} we have 2^d attribute combinations known as views (also known as cuboids or group-bys).

The existing storage engine (the Sidera server [9]) employs several components, such as indexing, a hierarchy manager, and caching, that are used to answer multi-dimensional OLAP queries very efficiently. For a fact table associated with d dimensions, the current Sidera storage engine creates the R-tree indexed data cube as 2*2^d separate standard disk files; two separate files represent the R-tree index for each cuboid in a d-dimensional cube. These simple files are not databases in any sense and cannot efficiently support DBMS/OLAP queries. Berkeley DB combines the power of a relational storage engine with the simplicity of file-system-based data storage. Therefore, we will try to integrate the functionality of Berkeley DB into the existing Sidera server.
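To make the 2^d view count concrete, the following small C program (illustrative only, not part of Sidera; the dimension names A1..A3 are placeholders) enumerates every group-by of a 3-dimensional cube:

/* Illustrative only: enumerate the 2^d group-bys (cuboids) of a cube with
 * dimensions {A1, ..., Ad}; the dimension names are assumptions. */
#include <stdio.h>

int main(void) {
    const char *dims[] = { "A1", "A2", "A3" };        /* d = 3 */
    int d = 3;
    for (unsigned mask = 0; mask < (1u << d); ++mask) {   /* 2^d combinations */
        printf("cuboid:");
        for (int i = 0; i < d; ++i)
            if (mask & (1u << i)) printf(" %s", dims[i]);
        if (mask == 0) printf(" (ALL)");              /* empty group-by = apex view */
        printf("\n");
    }
    return 0;
}

For d = 3 this prints the 8 views (ALL), A1, A2, A3, A1A2, A1A3, A2A3, and A1A2A3.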

Our primary focus is on integrating the cube indexing and access component of the existing Sidera server with the Berkeley database.

[2] HILBERT R-TREE

There are two types of Hilbert R-trees: one for static databases and one for dynamic databases. The performance of R-trees depends on the quality of the algorithm that clusters the data rectangles on a node. Hilbert R-trees use space-filling curves, specifically the Hilbert curve, to impose a linear ordering on the data rectangles and thereby achieve a better ordering of multidimensional objects in a node. This ordering has to be good, in the sense that it should group similar data rectangles together, to minimize the area and perimeter of the resulting Minimum Bounding Rectangles (MBRs).

Packed Hilbert R-trees are suitable for static databases in which updates are very rare or in which there are no updates at all; the clustering of the data is done by a K-means algorithm. The dynamic Hilbert R-tree employs a flexible deferred-splitting mechanism to increase space utilization. The Hilbert R-tree sorts rectangles according to the Hilbert value of the center of each rectangle (i.e., of its MBR); the Hilbert value of a point is the length of the Hilbert curve from the origin to that point. Given this ordering, every node has a well-defined set of sibling nodes, so deferred splitting can be used. By adjusting the split policy, the Hilbert R-tree can achieve space utilization as high as desired; in contrast, other R-tree variants have no control over space utilization.

The Hilbert R-tree has the following structure. A leaf node contains at most Cl entries, each of the form (R, obj_id), where Cl is the capacity of the leaf, R is the MBR of the real object (xlow, xhigh, ylow, yhigh), and obj_id is a pointer to the object description record. The main difference between the Hilbert R-tree and the R*-tree [5] is that non-leaf nodes also contain information about the Largest Hilbert Values (LHVs). Thus, a non-leaf node in the Hilbert R-tree contains at most Cn entries of the form (R, ptr, LHV), where Cn is the capacity of a non-leaf node, R is the MBR that encloses all the children of that node, ptr is a pointer to the child node, and LHV is the largest Hilbert value among the data rectangles enclosed by R. Notice that since a non-leaf node picks one of the Hilbert values of its children as its own LHV, there is no extra cost for calculating the Hilbert values of the MBRs of non-leaf nodes.

Figure 1 illustrates some rectangles organized in a Hilbert R-tree, with the contents of the parent node II shown in more detail. The Hilbert values of the centers are the numbers near the "x" symbols (shown only for the parent node II), and the LHVs are in [brackets]. Every data rectangle in node I has a Hilbert value of at most 33; similarly, every rectangle in node II has a Hilbert value greater than 33 and at most 107, and so on.
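As a minimal illustration of the Hilbert value used for this ordering (a sketch, not Sidera code; it assumes a 2-D grid of side n, where n is a power of two), the standard bit-manipulation routine below maps a point (x, y) to its distance along the Hilbert curve:

/* Hilbert value of the point (x, y) on an n x n grid (n a power of two):
 * the length of the Hilbert curve from the origin to the point.
 * Higher-dimensional keys follow the same idea. */
unsigned long hilbert_value(unsigned n, unsigned x, unsigned y) {
    unsigned long d = 0;
    for (unsigned s = n / 2; s > 0; s /= 2) {
        unsigned rx = (x & s) > 0;
        unsigned ry = (y & s) > 0;
        d += (unsigned long)s * s * ((3 * rx) ^ ry);
        if (ry == 0) {                              /* rotate the quadrant */
            if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
            unsigned t = x; x = y; y = t;           /* swap x and y */
        }
    }
    return d;
}

Sorting rectangle centers by this value produces the linear ordering used to pack nodes and to define the LHVs.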

Figure 1. Data rectangles organized in a Hilbert R-tree (Hilbert values and LHVs are in brackets).

[3] SIDERA SERVER: HILBERT R-TREE INDEXED CUBE

In Sidera, explicit multi-dimensional indexing is provided by parallelized R-trees. The R-tree indexes are packed using a Hilbert space-filling curve [3,15], so that arbitrary k-attribute range queries map more closely to the physical ordering of records on disk. For each cuboid fragment on a node, the basic process is as follows (a code sketch appears at the end of this section):
1. Sort the data based upon the Hilbert sort order and associate each of the n point records with m pages of size n/m. Write this base level to disk as <cuboid name>.hil.
2. Associate each of the m leaf-node pages with an ID that will be used as a file offset by parent bounding boxes.
3. Construct the remainder of the R-tree index by recursively packing the bounding boxes of lower levels until the root node is reached.
The end result is a Hilbert-packed R-tree for each cuboid fragment in the system. The Hilbert-packed R-tree [15] is stored on disk as two physical files: a .hil file that houses the data in Hilbert sort order, and a .ind file that houses the R-tree metadata and the bounding boxes that represent the index tree.
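The following C sketch illustrates steps 1 and 2 of this packing process under stated assumptions: record_t, node_t, and PAGE_CAP are invented names, the Hilbert keys are assumed to be precomputed, and only a 2-dimensional MBR is handled. It is not the Sidera implementation.

#include <stdlib.h>

#define PAGE_CAP 64   /* assumed leaf-page capacity, so m = n / PAGE_CAP pages */

typedef struct { double lo[2], hi[2]; unsigned long hkey; } record_t;  /* MBR + Hilbert key */
typedef struct { double lo[2], hi[2]; long first_child; } node_t;      /* bounding box + page ID */

static int by_hilbert(const void *a, const void *b) {
    unsigned long ka = ((const record_t *)a)->hkey, kb = ((const record_t *)b)->hkey;
    return (ka > kb) - (ka < kb);
}

/* Step 1: sort the n records by Hilbert value; Step 2: emit one bounding box
 * per page of PAGE_CAP records, remembering the page's offset as its ID.
 * The returned boxes form the next level of the index. */
static size_t pack_level(record_t *recs, size_t n, node_t *out) {
    qsort(recs, n, sizeof recs[0], by_hilbert);
    size_t pages = 0;
    for (size_t i = 0; i < n; i += PAGE_CAP, ++pages) {
        node_t box = { {recs[i].lo[0], recs[i].lo[1]},
                       {recs[i].hi[0], recs[i].hi[1]},
                       (long)i };
        for (size_t j = i; j < n && j < i + PAGE_CAP; ++j)
            for (int d = 0; d < 2; ++d) {
                if (recs[j].lo[d] < box.lo[d]) box.lo[d] = recs[j].lo[d];
                if (recs[j].hi[d] > box.hi[d]) box.hi[d] = recs[j].hi[d];
            }
        out[pages] = box;
    }
    return pages;   /* step 3 repeats this bottom-up until one root box remains */
}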

[4] CUBE INDEXING INTEGRATION

The Sidera server uses the encoded fact table to generate the full cube as 2^d views in a d-dimensional space. The Sidera indexing component described above materializes the full cube for a d-dimensional space; in other words, the R-tree index [12] for each cuboid is stored on disk as two physical files: a .hil file that contains the data in Hilbert sort order [15], and a .ind file that contains the R-tree index metadata and the bounding boxes that represent the index tree. These files (.hil and .ind) are not databases and are not particularly efficient for OLAP queries.

Berkeley DB provides four access methods (B-tree, Hash, Recno, and Queue) that perform very well in the context of a primary index, but in its current form it is not sufficient to efficiently support the multi-dimensional queries executed by the Sidera server.

Figure 2. 3-dimensional cube (PQR).

Although Berkeley DB understands the notion of index/data combinations, it has no mechanism to directly support multi-dimensional R-trees. The integration process (Berkeley DB and Sidera server) consists of combining the source code for the cube indexing with the code of Berkeley DB. After this integration, Berkeley DB can be used to create a Berkeley DB database with the Hilbert R-tree access method.

[5] BERKELEY R-TREE MODEL

Berkeley DB supports the storage of many databases (i.e., Berkeley database objects) in one physical file. This physical file contains one master database, supported by the B-tree access method, that has references to all the databases stored in the same file. The keys of this primary B-tree are the names of the databases stored in the physical file, and its data consists of the metadata page number for each database name. After the integration of Berkeley DB and Sidera, instead of 2*2^d physical files, the indexed cube is built as 2^d Berkeley database objects for a d-dimensional fact table [10]: one Berkeley database with the R-tree access method for each materialized cuboid. The Berkeley DB physical file contains a master database that has references to the Hilbert R-tree indexed cuboids stored as Berkeley databases in the same file. Keys in the master database are the names of the indexed cuboids, and the data contains two values: the metadata page number of the indexed cuboid and its size in terms of the number of blocks it requires.

Figure 3. Berkeley DB B-tree master index that has references to all indexed cuboids stored in one physical file.
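A minimal sketch of this one-file/many-databases layout using the stock Berkeley DB C API is shown below: opening the file with a NULL database name exposes the master B-tree, whose keys are the subdatabase (here, cuboid) names. The function name list_cuboids is our own, and error handling is abbreviated.

#include <db.h>
#include <stdio.h>
#include <string.h>

/* Print the names of all cuboids (subdatabases) stored in one physical file. */
int list_cuboids(const char *file) {
    DB *dbp; DBC *cur; DBT key, data; int ret;
    if ((ret = db_create(&dbp, NULL, 0)) != 0) return ret;
    /* database == NULL: open the master database that names every subdatabase */
    if ((ret = dbp->open(dbp, NULL, file, NULL, DB_UNKNOWN, DB_RDONLY, 0)) != 0)
        goto done;
    if ((ret = dbp->cursor(dbp, NULL, &cur, 0)) != 0) goto done;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    while ((ret = cur->get(cur, &key, &data, DB_NEXT)) == 0)
        printf("cuboid: %.*s\n", (int)key.size, (char *)key.data);
    cur->close(cur);
done:
    dbp->close(dbp, 0);
    return ret == DB_NOTFOUND ? 0 : ret;
}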

[6] BERKELEY DATABASE: HILBERT R-TREE INDEXED CUBE

The algorithm below describes how the Hilbert R-tree indexed cube is stored as Berkeley databases in one physical Berkeley database file.

Input: a set S of cuboids and a cube name C.
Output: Hilbert R-tree indexed cuboids stored as Berkeley database objects in one physical file.
Algorithm: At the beginning, create and open a Berkeley database environment handle that encapsulates one or more Berkeley database objects, one for each indexed cuboid in the cube. Then, for each cuboid X in the cube, create the Hilbert R-tree indexed view as a Berkeley DB database with the database type DB_RTREE [10] in the open method. Opening with database type DB_RTREE and a view X means that if the Berkeley database X does not exist, it is first created and the Hilbert R-tree index corresponding to cuboid X is stored in it. In the open method, the name of the physical file that backs one or more Berkeley DB R-tree databases is the name of the cube (see the sketch below).
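A sketch of this algorithm against the Berkeley DB C API follows. The DB_RTREE access method is the extension proposed in [10] and does not exist in stock Berkeley DB, so the sketch passes DB_BTREE purely so that it compiles; the point is the structure (one environment, one physical file named after the cube, one database per cuboid). The names build_indexed_cube and cube_env are assumptions.

#include <db.h>

int build_indexed_cube(const char *cube_name, const char **cuboids, int n) {
    DB_ENV *env; DB *dbp; int ret, i;
    /* environment handle encapsulating all Berkeley database objects of the cube */
    if ((ret = db_env_create(&env, 0)) != 0) return ret;
    if ((ret = env->open(env, "./cube_env", DB_CREATE | DB_INIT_MPOOL, 0)) != 0)
        return ret;
    for (i = 0; i < n; i++) {                    /* one Berkeley database per cuboid */
        if ((ret = db_create(&dbp, env, 0)) != 0) break;
        /* file = cube name, database = cuboid name; the integration of [10]
         * would pass DB_RTREE here to create the Hilbert R-tree indexed cuboid */
        ret = dbp->open(dbp, NULL, cube_name, cuboids[i],
                        DB_BTREE /* DB_RTREE in [10] */, DB_CREATE, 0664);
        dbp->close(dbp, 0);
        if (ret != 0) break;
    }
    env->close(env, 0);
    return ret;
}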

Figure 4. Hilbert R-tree indexing of the 3-dimensional cube (PQR) before and after the integration of Berkeley DB and the Sidera server.

When we open a Berkeley DB database handle [10] with a database type equivalent to DB_RTREE, a Hilbert R-tree indexed cuboid is created and stored in one Berkeley DB physical file that has the same name as the cube.

[7] INDEX CONSTRUCTION

As noted, the indexed cube in Berkeley DB is represented in one single physical file, whereas in the current Sidera server it is represented in twice as many files as there are views. This reduces the index cube construction time; the primary reason for the reduction is that only one physical file is needed to store the indexed cube in the Berkeley DB-supported Sidera, while storing the indexed cube as 2*2^d physical files produces a significant amount of disk thrashing.

[8] CONCLUSION

In this paper, we have described the integration of Berkeley DB components into Sidera. This integration significantly enhances the existing Sidera storage engine; specifically, Sidera now stores the Hilbert R-tree index in one Berkeley DB physical file. We have conceptually tested the integration of Sidera with Berkeley DB in terms of index cube construction and query resolution, and we have compared the index construction for the cube on a single backend node before and after the integration of the code into Berkeley DB. Conceptually, our results support the integration approach we have taken: the Berkeley DB-supported Sidera server is better than the old Sidera server and significantly boosts run-time query performance.

REFERENCES
[1] Microsoft Analysis Services. http://www.microsoft.com/sqlserver/2008/en/us/analysisservices.aspx.
[2] Oracle Essbase. http://www.oracle.com/us/solutions/ent-performancebi/business intelligence/essbase/index.html.
[3] Hilbert R-tree. http://en.wikipedia.org/wiki/hilbert_r-tree.
[4] XML for Analysis Specification v1.1, 2002. http://www.xmla.org/index.htm.
[5] CWM, Common Warehouse Metamodel, 2003. http://www.cwmforum.org.
[6] JSR-69 Java OLAP Interface (JOLAP), JSR-69 (JOLAP) Expert Group. http://jcp.org/aboutjava/communityprocess/first/jsr069/index.html.
[7] L. Cabibbo and R. Torlone. Querying multidimensional databases. Proceedings of the 6th DBPL Workshop, pages 253-269, 1997.
[8] E. Malinowski and E. Zimanyi. Hierarchies in a conceptual model: From conceptual modeling to logical representation. Data & Knowledge Engineering, 2005.
[9] Todd Eavis, George Dimitrov, Ivan Dimitrov, David Cueva, Alex Lopez, and Ahmad Taleb. Sidera: A cluster-based server for online analytical processing. International Conference on Grid Computing, High-Performance, and Distributed Applications (GADA), 2007.
[10] Ahmad Taleb. Query Optimization and Execution for MOLAP, pages 34-44, 78-87, 2011.
[11] Arnaud Giacometti, Dominique Laurent, Patrick Marcel, and Hassina Mouloudi. A new way of optimizing OLAP queries. BDA, pages 513-534, 2004.
[12] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proceedings of the 1984 ACM SIGMOD Conference, pages 47-57, 1984.
[13] Ralph Kimball and J. Caserta. The Data Warehouse ETL Toolkit. John Wiley and Sons, 2004.
[14] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. John Wiley and Sons, 2002.
[15] T. Pedersen, C. Jensen, and C. Dyreson. A foundation for capturing and querying complex multidimensional data. Information Systems Journal, 26(5):383-423, 2001.