Optimized Data Indexing Algorithms for OLAP Systems

Database Systems Journal vol. I, no. 2/200 7 Optimized Data Indexing Algoritms for OLAP Systems Lucian BORNAZ Faculty of Cybernetics, Statistics and Economic Informatics Academy of Economic Studies, Bucarest lucianbor@otmail.com Te need to process and analyze large data volumes, as well as to convey te information contained terein to decision makers naturally led to te development of OLAP systems. Similarly to SGBDs, OLAP systems must ensure optimum access to te storage environment. Altoug tere are several ways to optimize database systems, implementing a correct data indexing solution is te most effective and less costly. Tus, OLAP uses indexing algoritms for relational data and n-dimensional summarized data stored in cubes. Today database systems implement derived indexing algoritms based on well-known Tree, Bitmap and Has indexing algoritms. Tis is because no indexing algoritm provides te best performance for any particular situation (type, structure, data volume, application). Tis paper presents a new n-dimensional cube indexing algoritm, derived from te well known B-Tree index, wic indexes data stored in data wareouses taking in consideration teir multidimensional nature and provides better performance in comparison to te already implemented Tree-like index types. Keywords: data wareouse; indexing algoritm; OLAP, n-tree. Introduction Data wareouses represented a natural solution towards increasing te availability of data and information, as well as teir accessibility to decision makers. Te wareouses store important data coming from different sources for later processing and are an integrant part of analytical processing systems (OLAP). Unlike OLTP systems, OLAP systems must execute complex interrogations and large data volume analyses. To optimize, analytical processing systems analyze data and store aggregated information in special analytic structures, called cubes. Similarly to OLTP systems, OLAP systems use indexing algoritms to optimize access to data stored in data wareouses, i.e. cubes. 2. General information about cubes Wen stored in an OLAP system, te source data may be indexed to reduce te time necessary for teir processing. To index source data, OLAP systems use indexing algoritms similar to OLTP (B-Tree, Bitmap, R-Tree etc.). Processed data are stored in n-dimensional structures called cubes. Te elements of a cube are te dimensions, members, cells, ierarcies and properties [] (fig. ).

8 Optimized Data Indexing Algoritms for OLAP Systems Dimension z Cel y 0 x Fig. - Structure of a tridimensional cube Te dimensions contain descriptive information about te data tat is to be summarized. Tey are essential for data analysis and represent an axis of te cube [2]. Eac dimension corresponds to a measure of te data source and uniquely contains eac value stored in tat position. During queries, te dimensions are used to reduce te searc area and usually occur in te WHERE clause. Hierarcies describe te ierarcical relationsips between two or more members of te same dimension. A dimension can be part of multiple ierarcies. For example, in addition to te ierarcy of dimensions Quarter-Mont-Year, Time dimension can belong to te ierarcy Day-Mont-Year. Te cells of te cube contain summarized data based on dimensions values. Cells store summarized data based on te cube dimension number, dimensions values, metod of analysis and is usually, te result returned by te queries. Properties describe common features of all members of te same dimension. Properties allow selecting data based on similar caracteristics. For example, te size of product volume may ave an attribute wic allows a certain volume products. Analysis and data processing is based on te metod cosen. Te same data can be analyzed using different metods (clustering, neural nets, regression, Bayesian, Decision trees, etc.) accordingly to te user's needs. Altoug using different metods of analysis may result in different aggregate data and ordering different logic cells, te logical structure of te cube is te same. In order to optimize performance, OLAP systems implement cube indexing algoritms. Te indexes created troug tis process use te data contained in a cube s dimensions to quickly access te cells containing te data required by te user. Hence, cubes are indexed using a B- Tree type of algoritm. 3. Te B-Tree Index in OLAP Systems As te values of te cube dimensions are unique and tey are stored in te index blocks and used to locate te leaf blocks wic contain references to te pysical location of te cube cells, implementing a B- Tree index represents an effective solution for indexing cubes. A B-Tree index used in OLAP systems contains sub-trees corresponding to eac dimension. Te sub-trees are connected in suc a way tat eac pat to go troug te tree from te root node to te final level index blocks (te ones storing te references to te cube s cells) is crossed by a sub-tree corresponding to eac dimension. Tus, a tree dimensional cube contains tree levels. Te first level represents a matrix of planes (bi-dimensional space), te second level represents a matrix of lines (one dimensional space), and te tird level represents a matrix of points in space (0 dimensional space) (fig. 2).

Database Systems Journal vol. I, no. 2/200 9 Level Level 2 Level 3 Storage Root Nod Cube cells Dimension X Dimension Y Dimension Z Fig. 2 - Te structure of a B-Tree index for a tree-dimensional cube [3] Tus, eac cube size will correspond to one sub-tree of te B-Tree index and eac sub-tree will ave on cild sub-tree for eac cild dimension. Summarized data are stored inside te cube, on pages separate from te cells. As summarized data creation and storage consumes CPU resources and storage space environment, te developer and/or system administrator can coose as tey are available only at certain levels, depending on teir usage, summarized data obtained from te upper levels being calculated by processing te data summarized in te lower levels. Looking at te structure of te B-Tree index, it is easily noticeable tat te cost of locating one of te cube s cells represents te sum of te costs associated wit locating te last level index block of eac sub-tree. Te eigt of suc an index is (f.) [4]: r + logd (f.) 2 were N, d is te number of stored values witin an index block, r is te total number of values corresponding to te dimension. Te number of index blocks of a sub-tree containing d elements is: d l d = l= 0 2d were d=(m-) and is te tree eigt. (f.2) Considering tat te number of items stored in index blocks at te top level of te subtree is r, we can calculate te maximum eigt () of te index portion corresponding to a specific dimension, as follows: 2d( d ) r d => 2 d r + r + logd 2 (f.3) => were N.

20 Optimized Data Indexing Algoritms for OLAP Systems Te total cost of a searc operation witin a sub-tree is: N si =+E a +E b +2E a E b (f.7) c i =+ (f.4) Since to locate a cell of te cube is necessary to cross every sub-tree, corresponding to eac dimension of te cube, we can calculate te total cost of a queries based on (f.4): n = cti cij + ( d ) (f.5) j= were n is te number of cube dimensions, c ij is te cost of query sub-tree corresponding to te dimension j and d- is te total number of root nodes used as connecting elements of te leaf blocks of sub-trees tat ave several subordinated subtree. Since OLAP systems incorporate very large data volumes, teir performance is affected not only by te query operations cost but also by te index storage space. Indexes tend to occupy te storage space of te cube and sometimes teir size can be larger tan te data stored in te cube. If an index occupies a large memory space, it means tat te structure is ig (number of elements, elements tat store too muc data, etc.), wic increases te index creation time and query execution times. Using formulas (f.2) and (f.3) it can be calculated te storage space (SS) for a subtree indexed using a B-Tree index: S s d = S 2d p (f.6) were d is te number of items stored in a block index, is te index eigt and SS is te page size. Total storage space (St) needed to store a B-Tree index used for indexing cubes is equal to storage space for all its sub-trees. For a cube wit tree dimensions, te number of sub-trees of a B-Tree index is: were N si is te number of sub-trees of dimension i, E a and E b represents te elements number of dimension index leaf block corresponding to te oter two dimensions. Based on te (f.2)-(f.7), wole B-Tree index size can be calculated as: S t = j ( Ssi Nsi ) i= +S rn (f.8) were j is te cube dimensions number, S si is a storage space needed to store te subtree for te dimension i, te N si is te number of sub-trees coresponding to te dimension i and S rn is te size of all nodes connecting te sub-trees. Te query cost of te B-Tree index is te sum of te cost of all sub-trees between te root node and te index leaf block wic store te pysical address of te cell. 4. Te n-tree Indexing Algoritm Given te caracteristics of cubes, as well as te structure of a B-Tree index, it becomes obvious tat tis indexing algoritm is not optimized for n-dimensional data structures. Tus, te number of subtrees witin te index is directly proportional wit te number of dimensions. As a consequence, te cube is over-indexed resulting in an overconsumption of processing time and storage space. Te proposed n-dimensional indexing algoritm pays attention to te n- dimensional structure of te data. Instead of creating sub-trees corresponding to eac dimension and subsequently linking tem, it creates only one tree wic indexes data simultaneously on all dimensions. As a result, te n-dimensional space is gradually divided into ever smaller n-dimensional subdivisions, until te smallest sub-divisions represent te cells of te cube. Te resulting index as te following caracteristics:

Database Systems Journal vol. I, no. 2/200 2 - no NULL values are indexed; - te root node contains at least two subordinated index blocks if it does not coincide wit te last level index block; - eac index block contains: - values from eac dimension of te cube; te combination of suc values represents a reference point in te n- dimensional space; Te index maintains an ordered list containing unique values corresponding to eac dimension of te cube. Te values in eac list represent a subgroup of te values of te respective dimension. Combining values from eac list at a time, we can obtain te data needed to identify te reference points in te n-dimensional space simultaneously minimizing te space required to store tem. Any value corresponding to a dimension from te subordinated index block is smaller tan te value of te respective dimension corresponding to te reference point from te upper level index block. - references to te subordinated index blocks (r bs ), corresponding to te reference points (f.9): n r bs = a i i= (f.9) were a..n represents te number of values from dimensions..n stored in te index block; - n references to te index blocks tat contain larger values in a dimension tan te reference point (one for eac dimension). Tus, an index block contains a total of r references (f.0): r=r bs +n (f.0) were n equals te number of te cube s dimensions. - te last level index blocks do not contain any references to oter index blocks; instead tey store te reference to te pysical location of te cube s cells; - te pysical size of an index block is approximately one page. Eac index block stores an ordered list of unique values corresponding to eac cube dimensions. Values from eac list is a subset of tese dimensions. Combining values from eac list, one by one, points from te n- dimensional space can be identified, minimizing te needed storage space. Any value from te dimensions, stored into an index block, is lower tan te value belonging to te respective dimension from te reference point of te iger rank index block. Fig. 3 - Te structure of an n- dimensional index corresponding to a tree dimensional cube Because te dimensions values of te reference point are uniquely stored, te n- dimensional space is always a regular space. For a cube wit tree dimensions, tis space is a rectangular parallelepiped, and ideally is a cube. If some cells do not contain data units (containing null values), tey are not indexed, tus reducing te size of te index. Te lack of aggregated values corresponding to a cell does not affect te form of te space described by te values stored into a index block. Every n-tree index contains a sub-tree tat indexes te summarized data wic as te following caracteristics: - it contains a sub-tree for eac dimension of te cube;

22 Optimized Data Indexing Algoritms for OLAP Systems - eac sub-tree corresponding to a dimension as a structure similar to a B- Tree index and indexes all te values pertaining to te respective dimension; - te index blocks of te sub-trees contain references to te lower level index blocks; - te index leaf blocks do not contain references to oter indexing blocks; instead tey are te sole elements containing references to te pages were te summarized data are stored; - eac element of te index leaf blocks contains references to parts of te n- dimensional space to wic te respective value is assigned Sub-tree wic indexes te cube cells Sub-tree wic indexes te cells wic stores te summarized data Storage Root node Cell Summarized data References (pointers) Dimension X Dimension Y Dimension Z References to te summarized data Fig. 4 - Te structure of an n-tree index for to a tree dimensional cube Tus te number of referenced contained by eac element of a leaf index block is: rs=n- (f.) were n is equal to te number of te cube s dimensions. Summarized data relating to eac value stored in te sub-tree corresponding to one dimension are equivalent to a to (n-) dimensional sub-space. Te sub-spaces are distributed among te respective sub-trees, as to avoid storing multiple references to te same summarized data. Te dimensions of te index are tus reduced. For a 3-dimensional cube, te index leaf blocks will contain te following references: - te elements of te index leaf block of te X dimension sub-tree contain: - references to data summarized te value of te X dimension, all values of te Y dimension and te first value of te Z dimension (one dimensional space); - references to te data summarized te value of te X dimension, all values of te Y dimensions and all values of te Z dimension (two dimensional space) - te elements of te index leaf blocks of te sub-tree corresponding to te Y dimension contain: - references to te data summarized te value of te Y dimension, all values of te Z dimension and te first value of te X dimension (one dimensional space); - references to te data summarized te value of te Y dimension, all values of te X dimension and all te values of

Database Systems Journal vol. I, no. 2/200 23 te Z dimensions (two dimensional space); - te elements of te index leaf blocks of te sub-tree corresponding to te Z dimension contain: - references to te data summarized te value of te Z dimension, all values of te X dimension and te first value of te Y dimension (one dimensional space); - references to data summarized te value of te Z dimension, all values of te X dimensions and all values of te Y dimension (two dimensional space). 5. Creating an n-tree Index To create an n-dimensional index, all data in every index is read and n-dimensional points are created. For eac of tese points, te following operations are carried out: - an index block corresponding to an n- dimensional sub-space wose reference point as only values larger tan tat of te processed point is identified; te index block must also ave enoug free space to store te values of te corresponding dimensions of te processed point plus a reference; If suc an index block is identified, te values are added to te dimensions corresponding lists and te reference to te pysical location of te cube s cell is stored. Oterwise, a new index block is created by dividing one of te neigboring index blocks. - wen a new index block is created, te values of te reference point, as well as te reference to te parent index block are added, togeter wit references to te neigboring index blocks; S e = j i= S vi + S ref (f.2) were S e is te element size, S vi is te data type size for te dimension i and S ref is te size of a reference. Tis process could propagate itself to te root node. Generally, OLAP systems contain istorical data wit a low frequency of updating operations but wit a large volume of updates. Updating tese data also triggers an updating of te cube, and tus, of its index. It sould be noted tat te space required for inserting a new element into a block index is not always te same. If some values of n-dimensional point corresponding to a specific dimension were previously inserted te necessary free space is smaller tan te element size. 6. Te Performance of te n-tree Index Te performance of an index depends on its eigt. A larger eigt means more pysical read operations are needed to identify te cell containing te required data. Te eigt of te index depends on te size of te index block, te size of te type of indexed data, te number of references to subordinated index blocks stored in eac index block, and te number of cells. By analyzing te structure of an index block (fig. 4), we can compute te number of references it can store. Fig. 5 - Structure of an index block in an n- Tree index Te volume of te stored data in an index block may be written as (f.): S D... D a D D...... D a... D D n D n D n..... D n... D n z D... D n... D... D n z D... D n... D... D n z D a... D n... D a... D n z... n n d = a i i Sv + = i= D n ai + n S ref Values Pointers to te lowest index blocks Pointers to te neigboring index blocks (f.3)

24 Optimized Data Indexing Algoritms for OLAP Systems were S d represents te volume of te stored data, S v represents te size of te indexed value, n is te number of dimensions and S ref te size of a reference. Te blocks number of te n-tree index wic contains d elements is: d l d = l= 0 2d were is te index eigt. (f.4) Fig. 6 - Te cost of a searc operation in a two-dimensional cube Ideally, d is te maximum number of te elements wic can be stored inside an n- Tree index and its value is up to a i + n. n i= Te query cost for te n-tree index is: c i =+ (f.5) were c i is te query cost, is te index eigt and is for te index root block. Since te data in an OLAP system is rarely modified, te best performance is obtained wen te index blocks contain a volume of data equal to teir size. Terefore, we can approximate te value of S d to be equal to tat of a page. We assume tat: - te size of an index block is of 8kB (tis is te most common size in current database systems [5]); - te size of a reference is 6B (te most common size for a local index [6]); - te size of te data type is 8B (tis is te size of te datetime type of data); - eac dimension contains te same number of unique values. Using te formulas (f.3) and (f.5), we can compare te performance of a n-tree index to tat of a B-Tree index (fig 6-). Fig. 7 - Te cost of a searc operation in a tree dimensional cube Analyzing fig. 6 and fig 7, it can be seen tat te n-tree index query cost is lower te te B-Tree index query cost event for.000 cells. Te performance difference is event iger wen te cells number or te dimensions number increase. Fig. 8 - Te size [in MB] of an index corresponding to a two dimensional cube

Database Systems Journal vol. I, no. 2/200 25 Anyway, in fig. 0 and fig. it can be observed tat te n-tree index query cost is lower tan te B-Tree index query cost in any situation, excepting wen te cells number is very low. Fig. 9 - Te size [in MB] of an index corresponding to a tree dimensional cube Te same situation can be observed for te index size in fig. 8 and fig. 9. Te n-tree index is muc smaller tan te B-Tree index. Te difference comes from te lower number of te index blocks and from te flexibility of creating te summarized data. Wen te data summarized data is to be query, te result depends on te location of te location of te summarized data. Fig. 0 - Te summarized data query cost for a two-dimensional cube Fig. - Te summarized data query cost for a tree-dimensional cube 7. Conclusions Implementing a new indexing algoritm, wit muc wider scope and increased flexibility, could be te database systems optimization solution, especially wen oter indexing algoritms do not provide te desired results. Te n-tree index could be considered a more generalized B-tree index. If B-Tree index can index only uni-dimesional data, te n-tree index is optimized for any n- dimensional data. Moreover, te n-tree index will be more suitable for indexing spatial data. As sown in figures 5-9, te n-tree index outperforms te B-Tree index in locating te cells of te cube. Moreover, te difference in performance increases as te number of te cube s cells rises. In addition, te space occupied by te n-tree index in muc smaller tan tat needed for a B-Tree index. Again te superiority of te n-tree index is all te more evident wen te number of te cube s cells increases. References [] Revista Informatica Economica, nr. 4 (24), 2002; [2] Revista Informatica Economica, nr. (7), 200; [3] Computing Partial Data Cubes for Parallel Data Wareousing Applications, Frank Dene, Andrew Rau-caplin, Computational Science - ICCS 200; [4] Ubiquitous B-Tree, Douglas Corner, ACM, 979; [5] MCTS 70-43: Implementing and Maintaining Microsoft SQL Server 2005, Que, 2006; [6] Index Internals, Julian Dyke, 2005.

26 Optimized Data Indexing Algoritms for OLAP Systems Lucian Bornaz is PD candidate at Academy of Economic Studies since 2006. His researc domain is database systems wit focus on data indexing algoritms. He graduated Airforce Academy in 998 and master course at Academy of Economic Studies in 2006. Lucian Bornaz is certified by Microsoft as MCP, MCSA, MCSE, MCDBA, MCAD and MCTS.