Bitmap Index Design and Evaluation


 Ruby Miles
 1 years ago
 Views:
Transcription
1 Bitmap Index esign Evaluation CheeYong Chan Yannis E Ioannidis epartment of Computer Sciences epartment of Computer Sciences University of WisconsinMadison University of WisconsinMadison Bitmap indexing has been touted as a promising approach for processing complex adhoc queries in readmostly environments, like those of decision support systems Nevertheless, only few possible bitmap schemes have been proposed in the past very little is known about the spacetime tradeoff that they offer In this paper, we present a general framework to study the design space of bitmap indexes for selection queries examine the diskspace time characteristics that the various alternative index choices offer In particular, we draw a parallel between bitmap indexing number representation in different number systems, define a space of two orthogonal dimensions that captures a wide array of bitmap indexes, both old new Within that space, we identify (analytically or experimentally the following interesting points: (1 the timeoptimal bitmap index ( the spaceoptimal bitmap index (3 the bitmap index with the optimal spacetime tradeoff (knee ( the timeoptimal bitmap index under a given diskspace constraint Finally, we examine the impact of bitmap compression bitmap buffering on the spacetime tradeoffs among those indexes As part of this work, we also describe a bitmapindexbased evaluation algorithm for selection queries that represents an improvement over earlier proposals We believe that this study offers a useful first set of guidelines for physical database design using bitmap indexes While the query performance issues of online transaction processing (OLTP systems have been extensively studied [7] are pretty much wellunderstood, the stateoftheart for ecision Support Systems (SS is still evolving as indicated by the growing active research in this area [3] Current database systems, which are optimized mainly for OLTP applications, are not suitable for SS applications due to their different requirements workload [] In particular, SS operate in readmostly environments, which are Partially supported by the National Science Foundation under Grant IRI9173 (PYI Award by grants from EC, IBM, HP, ATT, Oracle, Informix Author s present address: epartment of Informatics, University of Athens, Hellas (Greece dominated by complex adhoc queries that have high selectivity factors (ie, large foundsets A promising approach to process complex queries in SS is the use of bitmap indexing [, 9, ] Bitmap manipulation techniques have already been used in some commercial products [1] to speed up query processing: a notable example is Model, a prerelational BMS from Computer Corporation of America [] More recently, various BMS vendors, including Oracle, Red Brick, Sybase, have introduced bitmap indexes into their products to meet the performance requirements of SS applications [, ] In its simplest form, a bitmap index on an indexed attribute consists of one vector of bits (ie, bitmap per attribute value, where the size of each bitmap is equal to the cardinality of the indexed relation The bitmaps are encoded such that the "!# record has a value of in the indexed attribute if only if the!# bit in the bitmap associated with the attribute value is set to, the!# bit in each of the other bitmaps is set to ( This is called a ValueList index [] An example of a ValueList index for a * record relation + is shown in Figure 1, where each column in Figure 1(b represents a bitmap, associated with an attribute value / (a (b Figure 1: Example of a ValueList Index (a Projection of indexed attribute values with duplicates preserved (b ValueList Index Consider a highselectivityfactor query with selection predicates on two different attributes A conventional OLTP optimizer would generate one of the following three query plans: (P1 a full relation scan, (P an index scan (using the predicate with lower selectivity factor followed by a partial relation scan to filter out the nonqualifying tuples, or (P3 an index scan for each selection predicate, followed by a merge of the results from the two index scans ue to the compact sizes of bitmaps (especially for attributes with low cardinality the efficient hardware support for bitmap operations (AN, OR, XOR, NOT, plan (P3 using bitmap indexes is likely to be more efficient than a plan that re
2 quires a partial or full relation scan (plans (P1 (P A simple cost analysis shows that evaluating plan (P3 with bitmap indexes is more efficient than using the conventional RIlist based indexes for queries with selectivity factor above some threshold Let be the relation query result cardinalities, respectively Assume that each RI is bytes long that one bitmap is scanned per predicate In terms of the number of bytes read, using bitmap indexes for plan (P3 is more efficient than using RIlist based indexes if ie, Furthermore, operations on bitmaps are more CPUefficient than merging RIlists Various bitmap indexes [, 9,, 13, 1] have been designed for different query types, including range queries, aggregation queries, OLAPstyle queries However, as there is no overall best bitmap index over all kinds of queries, maintaining multiple types of bitmap indexes for an attribute may be necessary in order to achieve the desired level of performance While the gains in query performance using a multipleindex approach might be offset by the high update cost in OLTP applications, this is not an issue in the readmostly environment of SS applications Indeed, the multipleindex approach is adopted by Sybase IQ, a BMS specifically designed for data warehousing applications, which supports five different types of bitmap indexes, requires the database to be fully inverted [1] Maintaining multiple indexes for an attribute, however, further increases the disk space requirement of data warehouse applications Understing the spacetime tradeoff of the various bitmap indexes is therefore essential for a good physical database design In this paper, we study the spacetime tradeoff of bitmap indexes for the class of selection queries, ie, queries with predicates of the form, where refers to the indexed!#" attribute, is one of six comparison operators, is the predicate!" constant We refer to selection predicates where ( as equality, predicates, to selection predicates where *+ as range predicates Information on bitmap indexes for other types of queries (eg, starjoin groupby queries can be found elsewhere [9, ] The main contributions in this paper are as follows: The presentation of a general framework to study the design space of bitmap indexes for selection queries The framework not only captures the existing proposed ideas but reveals several other design alternatives as well An analysis of the spacetime tradeoff of bitmap indexes in particular, we identify analytically or experimentally four interesting points in a typical spacetime tradeoff graph, as shown in Figure : (A The spaceoptimal bitmap index (B The timeoptimal bitmap index under a given space constraint (C The bitmap index with the optimal spacetime tradeoff, ie, the knee of the spacetime tradeoff graph ( The timeoptimal bitmap index An experimental study of the effects of bitmap compression on the spacetime tradeoff issues An analytical study of the effects of bitmap buffering on the spacetime tradeoff issues The proposal of a more efficient evaluation algorithm for bitmap indexes The new algorithm reduces the number of bitmap operations by about (/ incurs one less bitmap scan for a range predicate evaluation The rest of this paper is organized as follows In Section, we present a framework to study the design space of bitmap indexes for Time M (A SpaceOptimal (B TimeOptimal under Space Constraint M Infeasible Region (C Optimal SpaceTime Tradeoff (knee Figure : SpaceTime Tradeoff Issues ( TimeOptimal Space selection queries Section 3 presents an improved evaluation algorithm for bitmap indexes, including both analytical experimental performance evaluations Section presents an analytical cost model for the spacetime tradeoff study Section compares the spacetime tradeoff of two basic bitmap encoding schemes In Section, we examine the spaceoptimal (point (A timeoptimal (point ( bitmap indexes In Section 7, we characterize the knee of the spacetime tradeoff graph of bitmap indexes (point (C Section presents an optimal as well as a heuristic approach to find the timeoptimal bitmap index under space constraint (point (B In Section 9, we present an experimental study of the impact of bitmap compression on the spacetime tradeoff issues Section presents an analytical study of the effect of bitmap buffering on the spacetime tradeoff issues Finally, we summarize our results in Section 11 For the rest of this paper, we use the term index to refer to a bitmap index ue to lack of space, the derivations of analytical results the proofs of theorems algorithms correctness are omitted from this paper full details are given elsewhere [] 13 7 :9< = 7 B In this section, we present a framework to examine the design space of indexes for selection queries The framework has been inspired by the work of Wong et al [13, 1] Let C denote the attribute cardinality ie, the number of distinct actual values of the indexed attribute The attribute cardinality is generally smaller than the cardinality of the attribute domain ie, the number of all possible values of the indexed attribute Without loss of generality to keep the presentation simple, we assume in this paper that the actual attribute values are consecutive integer values from ( to CE For the general case where the actual attribute values are not necessarily consecutive, a bitmap index can be built either on the entire attribute domain, or on C consecutive values by mapping each actual attribute value to its rank via a lookup table As described in Section 1, the ValueList index is a set of bitmaps, one per attribute value In other words, if one views this set as a twodimensional bit matrix (Figure 1(b, the focus is on the columns If the focus moves on the rows, however, then the index can be seen as the list of attribute values represented in some particular way As there is a large number of possible representations for attribute values (integers in this case, this clearly opens up the way for defining a whole host of bitmap indexing schemes In classifying all these representations, we identify two major orthogonal parameters that define the overall space of alternatives In particular, considering the ValueList index again, we observe that each attribute value is represented as a single digit (in basec
3 C E E E a sequence of numbers arithmetic, this digit being encoded in bits by turning exactly one out of C bits on The arithmetic we choose for the value representation, ie, the decomposition of the value in digits according to some base, the encoding scheme of each digit in bits are the two dimensions of the space are analyzed below (1 Attribute Value ecomposition Consider an attribute value Fur thermore, define into a sequence of digits Then can be decomposed as follows: where! ", #( * ( E E+,, each digit is in the range ( Based on the above, each choice of sequence gives a different representation of attribute values, therefore a different index An index is welldefined if,, The sequence is the base of the index, which is in / turn called a base index If, then the base is called uniform the index is called base for short The index consists of components, ie, one component per digit Each component individually is now a collection of bitmaps, constituting essentially a base index Let denote the number of bitmaps in the!# component of an index *,,,3 denote the collection of bitmaps that form the "!# component Figure 3 shows a base ValueList index (based on the same 1record relation in Figure 1 By decomposing a singlecomponent index into a component index, the number of bitmaps has been reduced from 7 to / 3 7A 7<E (a 7AH 7A E 7<E =<= = E9:<H = => 1 1 =<= = Ḧ 9:< = => 1 1 =<= = Ḧ 9:< = => E 1 1 =<= = Ḧ 9:< = => 1 1 =<= = 9:< = => 1 1 =<= = Ḧ 9:< = => 1 1 =<= = Ḧ 9:< = => 1 1 =<= = Ḧ 9:<H = => 1 1 =<= = 9:< = => E 1 1 =<= = E9:< = => 1 1 =<= = 9:<H = => 1 1 =<= = E9:< = => E 1 1 Figure 3: Example of a Component ValueList Index (a Projection of indexed attribute values with duplicates preserved (b Base? ValueList Index (b 7AH ( Bitmap Encoding Scheme Consider the!# component of an index with attribute cardinality There are essentially two major schemes to directly encode the corresponding values (( in bits : Equality Encoding: There are bits, one for each possible value The representation of value has all bits set to (, except for the bit corresponding to, which is set to Clearly, an equalityencoded component consists of bitmaps Range Encoding: There are bits again, one for each possible value The representation of value has the rightmost bits set to ( the remaining bits (starting from the one cor Intuitively, each responding to to the left set to bitmap, has in all the records whose!# component value is less than or equal to * Since the bitmap, has all bits set to, it does not need to be stored, so a rangeencoded component consists of bitmaps Figures (b (c show the rangeencoded indexes corresponding to the equalityencoded indexes in Figure 1 Figure 3, respectively 7 / 3 79:7A> 7AB7FEG79H E 7 H 7 E 7 H (a (b (c Figure : Examples of RangeEncoded Bitmap Indexes (a Projection of indexed attribute values with duplicates preserved (b Component,! Base9 RangeEncoded Bitmap Index (c Base RangeEncoded Bitmap Index Figure shows the twodimensional design space of indexes for selection queries, with the two existing alternatives, namely, the ValueList index the BitSliced index [], classified as shown The ValueList index, which is the simplest index is the most commonly implemented, has a single component is equalityencoded The BitSliced index [] has a uniform base is rangeencoded A base BitSliced index has been used in MOEL for evaluating range predicates [] The BitSliced index is also used in Sybase IQ for evaluating range predicates E We have excluded the third basic encoding scheme (the binary encoding scheme discussed in [13, 1], as it can be characterized in our framework as a base decomposition with one of the two encodings we do present Clearly, a symmetric scheme exists as well, where the roles of right left are exchanged 3
4 b < b, b,, b> ATTRIBUTE log b times VALUE ECOMPOSITION < b, b 1 > BITMAP ENCOING SCHEME Equality Range ValueList Index BitSliced Index bitmap operations bitmap scans, compared to Algorithm RangeEval, which incurs ( bitmap operations bitmap scans Moreover, while efficient processing of Algorithm BSRange Opt requires buffering for only one bitmap (the intermediate/final result, efficient processing of Algorithm RangeEval requires buffering for two bitmaps (for,,, Result < b 3, b, b 1 > Figure : esign Space of Bitmap Indexes for Selection Queries Result B 3 1 B 1 3 B 1 performing aggregation [] In this paper, we consider all welldefined decompositions, just the two encoding schemes presented above, which are used in practice = 7 A A = 9 * = 7 > B 7 3 B B B B 7 3 B 3 An evaluation algorithm for selection queries based on rangeencoded indexes has been proposed by O Neil Quass (Algorithm 3 in [], which we referred to as Algorithm RangeEval In this section, we present an improved evaluation algorithm (Algorithm RangeEvalOpt that reduces the number of bitmap operations by about (#/ requires one less bitmap scan for a range predicate evaluation Given this, in the rest of the paper, the improved evaluation algorithm is used as the basis for the time analysis of rangeencoded indexes Both evaluation algorithms are shown in Figure Let denote bitmaps with all bits set to (, respectively Also let,, denote the logical AN, OR, XOR operations, respectively,, denote the complement of a bitmap, epending on the input predicate operator (, Algorithm RangeEval evaluates each range predicate by computing two bitmaps: the, bitmap either the, or, bitmap For example, if the predicate is, then the result bitmap, is obtained by computing the bitmaps, involved,,,, or,, steps that are not required The final result bitmap that is returned is either,,,,,,,,,, or,, corresponding to the predicate operator,,,,, or ", respectively As we show below, such an evaluation strategy not only costs more bitmap operations to incrementally build the two intermediate bitmaps, but also incurs more bitmap scans since the evaluation of the bitmap,, which corresponds to an equality predicate evaluation, is the most expensive predicate in terms of bitmap scans Algorithm RangeEvalOpt avoids the intermediate equality predicate evaluation by evaluating each range query in terms of only the predicate based on the following three identities: (1, (, (3 Furthermore, the evaluation of the predicate itself does not require an equality predicate evaluation Consequently, Algorithm RangeEvalOpt requires the computation of only one bitmap, for any predicate evaluation, as opposed to two bitmaps (, either, or, required in Algorithm RangeEval Figure 7 compares the two algorithms for evaluating the predicate? using a component base ( index Clearly, Algorithm RangeEvalOpt is more efficient as it requires only B 3 B 7 3 (a Figure 7: Comparison of Evaluation Strategies for Selection Predicate? using a Component, Base ( RangeEncoded Bitmap Index (a Algorithm RangeEval (b Algorithm RangeEvalOpt! A " * A# = A A = 9 = 7 > Evaluation Number of Bitmap Operations Num of Predicate AN OR NOT XOR Total Scans RangeEval (*+, /1 /1, (*3+, /1 /1, (*7+,, (*+,, (:9+,, (=< 9+ 1,/1, (*+ RangeEvalOpt 1,/1, = 1 (*3+ 1,/1, = 1 (*7+,, = 1 (*+,, = 1 (:9+,, (=< 9+ 1,/1, Table 1: Worst Case Analysis of Evaluation Algorithms is the number of index components Table 1 shows the worst case analysis of the performance of the evaluation algorithms in terms of the number of bitmap operations scans For Algorithm RangeEval, since each range predicate evaluation entails an equality predicate evaluation (for,> bitmap, the number of bitmap operations of a range predicate evaluation is at least twice that of the predicate evaluation By using a more direct evaluation strategy, Algorithm RangeEvalOpt B (b B B 1
5 ( C Evaluation Algorithms for Selection Queries Using RangeEncoded Bitmap Indexes is the number of components in the rangeencoded index is the base of the index!#" is the predicate operator, is the predicate value, # is a bitmap representing the set of records with nonnull values for the indexed attribute Output: A bitmap representation of the set of records that satisfies the predicate Algorithm RangeEval 1,,,, # 3 let for i = downto do if ( ( then,,,, 7 if ( then,,,, 9,,:,, else 11,,:, 1 else * 13,,,, 1,,, 1,,:, # 1,,,,,, 17 return, Algorithm RangeEvalOpt 1, if ( then 3 let if ( then if ( then,, E for i = to do 7 if ( " then,,, if ( " ( then,,, 9 else for i = 1 to do 11 if ( ( then,,, 1 else if ( then,,, 13 else, ",,, 1 if ( then 1 return,, 1 else 17 return,, * Figure : Algorithms RangeEval RangeEvalOpt reduces the number of bitmap operations for range predicate evaluations by about half the number of bitmap scans for range predicates is reduced by one Both algorithms have the same cost for an equality predicate evaluation For both evaluation algorithms, the worst case occurs when (,, which corresponds also to the most probable case! >#7 = A # = 7 1 :9 A A " = 9 * = 7 #> In this section, we compare the performance of the evaluation algorithms in terms of the average number of bitmap scans the average number of bitmap operations Indexes with uniform base are generated by varying the two main parameters: the attribute cardinality C the base number We experimented with C ( ( (, ( ( ( for each value of C, the entire space of baserangeencoded indexes are generated by varying from to C For each index, we generated a total of C selection queries!" of the form by varying the predicate ( ( The graphs the predicate constant ( C Each query is evaluated using Algorithm RangeEval Algorithm RangeEvalOpt For a given rangeencoded index an evaluation algorithm, the average number of bitmap scans (operations is computed by taking the average of the number of bitmap scans (operations over all #C queries Figure compares the performance of the two evaluation algorithms as a function of the base number with C clearly demonstrate that Algorithm RangeEvalOpt is more efficient than Algorithm RangeEval Similar trends are obtained for other values of C # A 9 7 = A " Our cost model for the spacetime tradeoff analysis uses the following two metrics: The space metric is in terms of the number of bitmaps stored The time metric is in terms of the expected number of bitmap scans for a selection query evaluation, is based on the assumption that the queries in the query space are uniformly!" distributed, : where Note that our time metric incorporates only the I/O cost (ie, number of bitmap scans does not include the CPU cost (ie, number of bitmap operations as the number of bitmap scans the number of bitmap operations for a query evaluation are correlated We denote the space time metric of an index by : :, respectively # = 7 " :9< = 7 * = This section compares the performance of the two basic encoding schemes, namely, equality encoding range encoding, for selection queries Our result shows that range encoding provides better spacetime tradeoff than equality encoding in most cases Let denote an component index with base Based on the cost model, we have the following result : Theorem 1 If is equalityencoded, then where if "!#( (1
6 3 Average Number of Bitmap Scans RangeEval RangeEvalOpt Base Number (a Average Number of Bitmap Scans as a Function of Base Number 1 RangeEval RangeEvalOpt Given the overall superiority of rangeencoding, in the remainder of this paper, we focus on the spacetime tradeoff of rangeencoded indexes Henceforth, we use the term index as an abbreviation for a rangeencoded index 7 <7 = A =,7 = A = 7 #>? In this section, we identify both the spaceoptimal timeoptimal indexes (ie, points (A ( in Figure We refer to the most spaceefficient (respectively, timeefficient index among all indexes with components as the ncomponent spaceoptimal index (respectively, ncomponent timeoptimal index Note that the component spaceoptimal index might not be unique for example, if C ( ( (, then the base base indexes are both component spaceoptimal indexes Based on Equations (3 (, we have the following result Average Number of Bitmap Operations Base Number (b Average Number of Bitmap Operations as a Function of Base Number Figure : RangeEval vs RangeEvalOpt for Uniform Base RangeEncoded Bitmap Indexes, C ( ( :! * * "! * where ( * if (!# : : If is rangeencoded, then (3 ( The time analysis for rangeencoded indexes is based on Algorithm RangeEvalOpt The evaluation algorithm for equalityencoded indexes is not shown here due to space constraint details can be found elsewhere [] A rangeencoded index requires one or two bitmap scans per component for a selection predicate evaluation On the other h, an equalityencoded index requires only one bitmap scan per component for an equality predicate evaluation, but for a range predicate evaluation, the number of bitmap scans required per component is between two half the number of bitmaps in that component Figure 9 compares the spacetime tradeoff of range equalityencoded indexes for C (, ( (, ( ( ( The results show that rangeencoding in general provides better spacetime tradeoff than equalityencoding in most cases Theorem 1 (1 The number of bitmaps in an component, where spaceoptimal index is given by C is the smallest positive integer such that C The index with base is an component spaceoptimal index ( The spaceefficiency of spaceoptimal indexes is a nondecreasing function of the number of components (3 The base of the component timeoptimal index is "! ( E ( The timeefficiency of timeoptimal indexes is a nonincreasing function of the number of components By Theorem 1, it follows that the spaceoptimal index corresponds to the index with the maximum number of components (ie, # C ( with each base number equal to, the timeoptimal index corresponds to the index with the minimum number of components (ie, These are probably immediate corollaries of information theory /or number theory results as well, but they are easy enough to prove that doing so seemed easier than tracking down the existing literature * = 7 #>,+ <7 = A 7 = / 1 In this section, we characterize the knee of the spacetime tradeoff graph, referred to as the knee index (point (C in Figure, which corresponds to the index with the best spacetime tradeoff Note that our characterization is an approximate one as finding a precise analytical characterization is generally a difficult problem However, it turns out that our approximate characterization has very good accuracy We first give a definition of the knee index that will serve as a basis for comparing with our approximate characterization Let " denote the set of optimal indexes corresponding to the points in the spacetime tradeoff graph such that : : 7 7 For, let :9 +9 denote the gradients of the line segments formed by C The set of optimal indexes < is the maximal subset of all possible indexes <>= such that for there does not exist another index in <>= that is more efficient than? in both time space,
7 ! + RangeEncoded Index EqualityEncoded Index RangeEncoded Index EqualityEncoded Index RangeEncoded Index EqualityEncoded Index Time (Expected Number of Bitmap Scans Time (Expected Number of Bitmap Scans Time (Expected Number of Bitmap Scans Space (Number of Bitmaps Space (Number of Bitmaps Space (Number of Bitmaps (a C ( (b C ( ( (c C ( ( ( Figure 9: Comparison of SpaceTime Tradeoff of Range EqualityEncoded Bitmap Indexes with, respectively are defined as follows:! "#! where : : is a normalizing factor The index " with the maximum ratio is the knee index Time (Expected Number of Bitmap Scans TimeOptimal Index SpaceOptimal Index All Index Space (Number of Bitmaps Figure : SpaceTime Tradeoff of Bitmap Indexes, C ( ( ( Time (Expected Number of Bitmap Scans Space (Number of Bitmaps Figure 11: SpaceTime Tradeoff of SpaceOptimal Bitmap Indexes, C ( ( ( 1 We now motivate our approximate characterization, which is based on the results of Theorem 1 Figure compares the spacetime tradeoff graphs for three classes of indexes: the class of spaceoptimal indexes, the class of timeoptimal indexes, the entire class of indexes, for C C ( Note that since the spaceoptimal index is generally not unique, each point in the spaceoptimal graph shown corresponds to the most timeefficient index among all equally spaceefficient indexes with the same number of components Figure shows that the tradeoff graph for spaceoptimal indexes provides a good approximation to that for all indexes In particular, the set of points on the graph for spaceoptimal indexes is a subset of the set of points on the graph for all indexes Figure 11 shows the same spaceoptimal tradeoff graph as in Figure but with each point labelled with the number of components of the corresponding spaceoptimal index We observe that the knee of the spacetime tradeoff graph for the spaceoptimal index corresponds to a component index, something that was consistent throughout our experimentation Hence, we characterize the knee index as the most timeefficient component spaceoptimal index, which is obtained from the following result ( ( ( similar results are obtained for other values of C The graph for spaceoptimal # (respectively, timeoptimal indexes consist of at most C ( points, where each point corresponds to an component spaceoptimal (respec # tively, timeoptimal index, Theorem 71 The base of the most timeefficient component spaceoptimal index is given by (, where C,! *,+ # * * E * * E/ * E, *( We have compared the knee index based on our approximate characterization with that based on the definition for various values of attribute cardinality the results show that both knee indexes match exactly for all the cases that we compared 1 = <7 = A = 7 #>39 7 # In this section, we consider the following practical optimization problem (point (B in Figure : Given a constraint on the available disk space to store an index, say at most bitmaps (or equivalently, at most bits, where is the number of records, determine the most timeefficient index We first present an algorithm that finds the optimal solution, then present a more efficient heuristic approach, which is nearoptimal Both algorithms are shown in Figure 1 In the following, let denote an component index 79: =<! : denote the component space timeoptimal indexes, respectively 7
8 Algorithms to Find TimeOptimal Bitmap Index Under Space Constraint Input: is the space constraint in terms of the maximum number of bitmaps C is the attribute cardinality Output: The timeoptimal bitmap index with no more than Algorithm TimeOptAlg 1 let be the smallest number such that : 97: =< if ( :! : =< then return! : =< : 3 let be the smallest number such that :! let : "! = return such that :, bitmaps =< : = Algorithm TimeOptHeur 1 = FindSmallestN( <,C if ( :! : < then return! : 3 return RefineIndex(I Figure 1: Algorithms TimeOptAlg TimeOptHeur 1! <7 = A SpaceOptimal Indexes 77 : Algorithm TimeOptAlg finds the actual optimal solution By result ( of Theorem 1, the solution must have at least components, where is the smallest number of components such that If the component timeoptimal index satisfies the space constraint (step, then by result ( of Theorem 1, it is the solution otherwise, by result ( of Theorem 1, the solution index must have no more than components where is the smallest number of components such that the component timeoptimal index satisfies the space constraint Figure 13 shows two cases to illustrate the bounds on the number of components of the solution index each point on the spacetime tradeoff graphs shown is labelled with the number of components of the corresponding index The time complexity of the algorithm is determined by the size of the set of cidate indexes defined by (step This search space is large as we need to enumerate for each value of,, all possible component indexes such that C Figure 1 shows the size of as a function of for C ( ( ( 1! " To avoid the costly exhaustive search in the optimal approach (steps, we present in this section a heuristic approach (Algorithm TimeOptHeur in Figure 1 to find an approximate optimal solution Our experimental results show that the heuristic approach is nearoptimal with the optimal index being selected at least 7#/ of the time The main idea is to first select a seed index that satisfies the space constraint, then to improve the timeefficiency of the seed index by adjusting its base numbers For the seed index, our approach uses an component index with :, where is the least number of components such that : 77 : Both are determined < by Algorithm FindSmallestN (Figure 1 Note that if! : satisfies the space constraint =< (step, then, as in step of Algorithm TimeOptAlg,! : is the optimal solution index Algorithm RefineIndex (Figure 1 improves the timeefficiency of an index its correctness is based on the following result Theorem 1 Let be an component index with base Suppose there exists two base numbers,, some integer constant (, such that ( if C, where if otherwise Time (Expected Number of Bitmap Scans Time (Expected Number of Bitmap Scans n = 3 n = 3 M Space (Number of Bitmaps n = (a n = 3 M Space (Number of Bitmaps (b TimeOptimal Indexes 3 1 SpaceOptimal Indexes TimeOptimal Indexes Figure 13: Bounds on Number of Components of Solution Index in Algorithm TimeOptAlg Let be another component index with base as defined above Then : : Based on Theorem 1, Algorithm RefineIndex essentially tries to maximize the number of components with small base numbers in particular, step determines the largest possible value for adjusting a pair of base numbers To illustrate Algorithm Refine Index, consider a component index with base C ( ( ( In the first iteration, with, 7 the base is refined to 1( In the second iteration, with 1(, the base is refined to ( After the final refinement step, the base of the refined index is ( ( By Equation (3, : 1
9 * Algorithm FindSmallestN: Algorithm to Find the Bitmap Index with Least Number of Components Under Space Constraint Input: is the space constraint in terms of the maximum number of bitmaps C is the attribute cardinality Output:, the smallest number of components such that : 97:, An component index with : 1 ( // is the number of components repeat 3 " // is the number of components with base ( until ( C 7 let be the component bitmap index with base return is a component bitmap index Algorithm RefineIndex: Algorithm to Improve the TimeEfficiency of a Bitmap Index Input: C is the attribute cardinality Output: A component bitmap index such that 1 let,! be the base of,? " 3 for = downto do let be the smallest base number from, delete from, if ( then 7 let be the smallest base number from, * * * * let " 9 if ((, then * " * / * * * /, # ( : ?!? +, + = 1 return component bitmap index with base : Figure 1: Algorithms FindSmallestN RefineIndex : : ( 7 By Equation (, : ( The time complexities of Algorithms FindSmallestN RefineIndex are C, respectively, where # is at most C ( Therefore, the time complexity of Algorithm TimeOptHeur is C C Attribute Percentage of Max ifference in Cardinality, Solutions Expected Number that are Optimal of Bitmap Scans / , ,, 1,1 Table : Effectiveness of Heuristic Approach to Select Time Optimal Bitmap Index under Space Constraint Table shows the effectiveness of the heuristic approach for various values of attribute cardinality The second column shows the percentage of solutions selected by the heuristic approach that are optimal for the attribute cardinality value indicated in the first column For those cases where the optimal heuristic approaches differ, the third column shows the maximum difference in the expected number of bitmap scans of their selected indexes The result indicates that the proposed heuristic approach is nearoptimal 3 * :9 = 7# = 7 7 = This section examines the effect of bitmap compression on the spacetime tradeoff of bitmap indexes 3! = 7 > = We first investigate different physical organizations for bitmap indexes to identify storage schemes that have good spacetime trade 9
10 ( 7 Number of Cidate Bitmap Indexes ata Set 1 ata Set Relation Lineitem Order Relation Cardinality ( ( ( ( ( ( ( Attribute Quantity Order ate Attribute Cardinality, C ( Table 3: Characteristics of TPC Benchmark ata used in Experiments 1( Space (Number of Bitmaps Figure 1: Size of Set of Cidate Bitmap Indexes,, as a Function of the Space Constraint for C ( ( ( off Consider an component bitmap index for a record relation, where the!# index component comprises of bitmaps, Let The entire bitmap index is essentially a bitmatrix, where the "!# component is a bitmatrix, each bitmap is a bitvector Based on the above view of a bitmap index, we compare the following three storage schemes: BitmapLevel Storage (BS Each bitmap is stored individually in a single bitmap file of bits Thus, the bitmap index is stored in bit bitmap files ComponentLevel Storage (CS Each index component is stored individually in a rowmajor order in a single bitmap file of bits, Thus, the bitmap index is stored in bitmap files IndexLevel Storage (IS The entire index is stored in a rowmajor order in a single bitmap file of bits We refer to a bitmap index that is organized using storage scheme (without any compression as a index, refer to a index with all its bitmaps compressed as a index (ie, for compressed For a component bitmap index, both a CSindex an ISindex are equivalent For a bitmap index with the maximum number of components ie, each component is of base, (1 both a BSindex a CSindex are equivalent, ( an ISindex is a projection index In a CSindex, each row of a component bitmatrix has either a consecutive stream of bits followed by a consecutive stream of ( bits if the index is rangeencoded, or exactly one bit set to if the index is equalityencoded In contrast, in a BSindex, the distribution of the bits in each bitmap is dependent on the distribution of the attribute values Thus, we expect a CSindex to be more compressible than a BSindex 3! >#7 = A 7 Our experimental study uses two data sets extracted from the TPC Benchmark []: data set 1 is for small attribute cardinality, data set is for large attribute cardinality Table 3 shows the key characteristics of our experimental data To limit the number of experiments for each data set, our comparison of the spacetime tradeoff is restricted to the class of spaceoptimal indexes (as a function of the number of index components, with being varied A projection index [] on an attribute ( in a relation is simply the projection of ( on with duplicates preserved stored in RIorder to this is motivated by our results in Section 7 which show that the spacetime tradeoff of spaceoptimal indexes provides a good approximation to the spacetime tradeoff of all indexes The data compression code used is from the zlib library Table compares the compressibility of the three storage schemes for both data sets For each component index each com pressed storage scheme, where, C, we compute the percentage of the size of under to the size of under BS For both data sets, the results show that CS Base Size of I Compressibility of of under BS Storage Scheme ( Index I (in bytes cbs ccs cis! "# #! # " " (a ata Set 1 Base Size of I Compressibility of of under BS Storage Scheme ( Index I (in bytes cbs ccs cis ( " " " # (b ata Set Table : Comparison of Compressibility of ifferent Storage Schemes indexes give the best compression In terms of query evaluation cost, a BSindex requires at most bitmap file scans per index component On the other h, both a CSindex an ISindex require all the bitmap files to be scanned moreover, they also incur additional CPU overhead to extract the bits of each relevant bitmap from the component bitmap files (which are stored in rowmajor order Thus, we expect cbsindexes to be more timeefficient than ccs cisindexes this is indeed supported by our experimental results For example, consider the base!+*,!!,*(!/ cbsindex,!, the average size of each compressed bitmap is + bytes Even when two bitmaps (an average of? bytes are scanned to evaluate an equality query, using the cbsindex is still significantly cheaper than using the ccsindex (or cisindex, which requires # 7 7 (1 bytes to be scanned Since cbsindexes are the most timeefficient ccsindexes are the most spaceefficient, we shall compare the spacetime trade? The zlib library is written by Jeanloup Gailly Mark Adler (http://wwwcdromcom/pub/infozip/zlib the compression method is based on a LZ77 variant called deflation
11 ( C ( < " C < offs of only cbs ccsindexes against BSindexes for the rest of this section Each bitmap index is used to evaluate a set of #C selection queries (using Algorithm RangeEvalOpt, where? : The predicate operator is restricted to (for range queries (for equality queries to limit the number of queries The performance of each index is measured in terms of its space timeefficiency The space metric is in terms of the total space of all its bitmaps (in MBytes, the time metric is in terms of the average predicate evaluation time (in secs which includes (1 the time to read the bitmaps, ( the time for inmemory bitmap decompression (for compressed bitmaps, (3 the time for bitmap operations The average predicate evaluation time,, is computed! as follows: C time to evaluate the predicate 3! >#7 = A A, where denote the This section presents experimental results for indexes under the storage schemes BS, cbs, ccs for data set 1 results for data set are not shown due to space limitation Figure 1(a shows the timeefficiency of the indexes as a function of the number of index components Both BS cbsindexes outperform ccsindexes significantly in particular, for ccsindexes, over (/ of the total evaluation time is due to bitmap decompression The decompression cost for ccsindexes generally increases with the number of bitmap components since the number of bits to be extracted is an increasing function of the number of components For ccsindexes, since all the compressed bitmap files need to be scanned for a query evaluation, their I/O cost is a function of the total size of all compressed bitmap files, which generally decreases as the number of components increases In contrast, for both BS cbsindexes, since the number of bitmaps to be scanned increases with the number of components, their I/O cost is an increasing function of the number of components Comparing BS cbsindexes, while cbsindexes incur a lower I/O cost, this is often offset by their bitmap decompression cost our results show that both BS cbsindexes are almost comparable in their timeefficiency Figure 1(b compares the spaceefficiency of the indexes as a function of the number of index components There are two main observations First, as we already know from Table, ccsindexes are the most spaceefficient Second, the results show that the effectiveness of bitmap compression decreases as the number of components increases This can be explained by the fact that the number of bitmaps is a decreasing function of the number of components (result ( of Theorem 1 in particular, there is a significant space reduction when an index is decomposed from one to two components Consequently, the gain of bitmap compression, with respect to space reduction, is only marginal once an index has been decomposed (ie, Thus, in terms of improving space efficiency, decomposing an component BSindex to an component BSindex could be more effective than compressing the component index Figure 1(c compares the spacetime tradeoff of the indexes The result shows that BS cbsindexes have comparable spacetime tradeoff, which is better than that of ccsindexes * :9< 7 1 = This section considers the effect of bitmap buffering on the spacetime tradeoff issues that we have addressed As the typical size of buffer space is at least GB in database systems running on SMP systems for large data warehouse applications [11], it is likely that a good number of bitmaps can remain memoryresident By taking into account the amount of mainmemory allocated for buffering bitmaps, more optimal indexes with better spacetime tradeoff can be designed The unit of buffering that we consider here is the number of bitmaps Consider an component with base Let denote the number of bitmaps that can be buffered in mainmemory, where bitmap buffer assignment for by, where : We denote a is the number of bitmaps in the "!# component of that are buffered A bitmap buffer assignment for is welldefined if In addition to the uniform query distribution assumption stated in Section, we further assume that the buffer hit rate for each referenced bitmap is uniformly distributed (ie, the probability that a reference to any bitmap in the!# component is a buffer hit is given by * Then, the expected number of bitmaps scans for a selection query evaluation using an component index with a bitmap buffer assignment is given by : ( We now present an optimal bitmap buffering policy for indexes From result (3 of Theorem 1 Theorem 1, we know that an index is more timeefficient if it has more components with smaller base numbers Since buffering the bitmaps in an index component effectively reduces the base of that component, it follows that it is more beneficial (in terms of reducing the expected number of bitmap scans to buffer a bitmap from a component with a smaller base number than from one with a larger base number The following optimal bitmap buffering policy is based on a prioritization of the index components using their base numbers Theorem 1 Consider the following priority assignment to bitmap components: 1 Components in have higher priority than those in 3 Within each set ( or 1 Partition the components of an index into two disjoint sets such that 1, components with smaller base numbers have higher priority Based on the above priority assignment, a bitmap buffering policy that favors bitmaps from a higher priority component over those from a lower priority component is optimal Based on Equation ( the above optimal bitmap buffering policy, we have the following result Theorem If index with base (, the timeoptimal index is an component Figure 17 compares the spacetime tradeoff of indexes (based on the optimal bitmap buffering policy as a function of the number of buffered bitmaps for C ( ( ( As expected, the spacetime tradeoff improves as increases Based on our experimental observations, we have the following approximate characterization of the knee: 11
12 < BS cbs ccs 3 BS cbs ccs 3 BS cbs ccs 3 Time (secs 1 Space (MB Time (secs Number of Components 1 3 Number of Components Space (MB (a Time vs Number of Components (b Space vs Number of Components (c Time vs Space Figure 1: Comparison of SpaceTime Tradeoff of Compressed Bitmap Indexes under ifferent Storage Schemes for ata Set 1 The knee of the spacetime tradeoff graph corresponds approximately to an component index with, where C!, base C,!* = E, *,+ # * ( * E * * E / Note that the approximate knee characterization in Section 7 is a special case of the above result Our experimental results show that the approximate knee characterization has very good accuracy Time (Expected Number of Bitmap Scans 1 Space (Number of Bitmaps Figure 17: SpaceTime Tradeoff as a Function of the Number of Buffered Bitmaps, C ( ( ( # A In this paper, we have presented a general framework to study the design space of bitmap indexes for selection queries, have examined several spacetime tradeoff issues To the best of our knowledge, this study represents a first set of guidelines for physical database design using bitmap indexes Our results indicate that rangeencoded bitmap indexes offer, in most cases, better spacetime tradeoff than equalityencoded bitmap indexes Concentrating on these, we have identified the timeoptimal index, the spaceoptimal index, the index with the optimal spacetime tradeoff, the timeoptimal index under diskspace constraints In addition, we have proposed an optimal bitmap buffering policy examined its impact on the spacetime tradeoff issues All these results are based on an improved evaluation algorithm for! = m= m=1 m= m=3 m= m= m= selection queries that we have proposed for rangeencoded bitmap indexes For future work, we plan to explore bitmap indexes for the more general class of selection queries that is based on the membership operator 9 [1] AIP Technical Publications Sybase IQ Indexes In Sybase IQ Administration Guide, Sybase IQ Release 11 Collection, chapter Sybase Inc, March admin/1toc [] CY Chan YE Ioannidis Bitmap Index esign Evaluation Computer Sciences ept, University of WisconsinMadison, cychan/paper1ps [3] S Chaudhuri U ayal An Overview of ata Warehousing OLAP Technology ACM SIGMO Record, (1: 7, March 1997 [] Transaction Processing Performance Council, May 199 [] H Edelstein Faster ata Warehouses Information Week, pages 77, ecember 199 [] C French One Size Fits All atabase Architectures do not work for SS In SIGMO 9, pages 9, San Jose, California, May 199 [7] G Graefe Query Evaluation Techniques for Large atabases Computing Surveys, (:73 17, 1993 [] P O Neil Model Architecture Performance In Proceedings of the nd International Workshop on High Performance Transactions Systems, pages 9, Asilomar, CA, 197 SpringerVerlag In Lecture Notes in Computer Science 39 [9] P O Neil G Graefe MultiTable Joins Through Bitmapped Join Indices ACM SIGMO Record, pages 11, September 199 [] P O Neil Quass Improved Query Performance with Variant Indexes In SIGMO 97, pages 3 9, Tucson, Arizona, May 1997 [11] TF Tsuei, AN Packer, KT Ko atabase Buffer Size Investigation for OLTP Workloads In SIGMO 97, pages 11 1, Tucson, Arizona, May 1997 [1] J Winchell Rushmore s Bald Spot BMS, (:, September 1991 [13] HKT Wong, JZ Li, F Olken, Rotem, L Wong Bit Tranposition for Very Large Scientific Statistical atabases Algorithmica, 1(3:9 39, 19 [1] HKT Wong, HF Liu, F Olken, Rotem, L Wong Bit Tranposed Files In VLB, pages 7, Stockholm, 19 1
Indexing Techniques for Data Warehouses Queries. Abstract
Indexing Techniques for Data Warehouses Queries Sirirut Vanichayobon Le Gruenwald The University of Oklahoma School of Computer Science Norman, OK, 739 sirirut@cs.ou.edu gruenwal@cs.ou.edu Abstract Recently,
More informationPartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches ColumnStores Horizontal/Vertical Partitioning Horizontal Partitions Master Table Vertical Partitions Primary Key 3 Motivation
More informationInMemory Data Management for Enterprise Applications
InMemory Data Management for Enterprise Applications Jens Krueger Senior Researcher and Chair Representative Research Group of Prof. Hasso Plattner Hasso Plattner Institute for Software Engineering University
More informationBitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries
Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries Kale Sarika Prakash 1, P. M. Joe Prathap 2 1 Research Scholar, Department of Computer Science and Engineering, St. Peters
More informationIndex Selection Techniques in Data Warehouse Systems
Index Selection Techniques in Data Warehouse Systems Aliaksei Holubeu as a part of a Seminar Databases and Data Warehouses. Implementation and usage. Konstanz, June 3, 2005 2 Contents 1 DATA WAREHOUSES
More information2.3 Scheduling jobs on identical parallel machines
2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed
More informationInMemory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
InMemory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
More informationlowlevel storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: lowlevel storage structures e.g. partitions underpinning the warehouse logical table structures lowlevel structures
More informationEvaluation of Bitmap Index Compression using Data Pump in Oracle Database
IOSR Journal of Computer Engineering (IOSRJCE) eissn: 22780661, p ISSN: 22788727Volume 16, Issue 3, Ver. III (MayJun. 2014), PP 4348 Evaluation of Bitmap Index Compression using Data Pump in Oracle
More informationBitmap Indices for Data Warehouses
Bitmap Indices for Data Warehouses Kurt Stockinger and Kesheng Wu Computational Research Division Lawrence Berkeley National Laboratory University of California Mail Stop 50B3238 1 Cyclotron Road Berkeley,
More informationBig Data Technology MapReduce Motivation: Indexing in Search Engines
Big Data Technology MapReduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationAn Efficient Bitmap Encoding Scheme for Selection Queries
An Efficient Bitmap Encoding Scheme for Selection Queries CheeYong Chan Department of Computer Sciences University of WisconsinMadison cychan@cs.wisc.edu Yannis E. Ioannidis* t Department of Computer Sciences
More informationEfficient Iceberg Query Evaluation for Structured Data using Bitmap Indices
Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil
More informationLecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
More informationRCFile: A Fast and Spaceefficient Data Placement Structure in MapReducebased Warehouse Systems CLOUD COMPUTING GROUP  LITAO DENG
1 RCFile: A Fast and Spaceefficient Data Placement Structure in MapReducebased Warehouse Systems CLOUD COMPUTING GROUP  LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates
More informationDATA WAREHOUSING II. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23
DATA WAREHOUSING II CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23 Last Time: Data Warehousing 2 Last time introduced the topic of decision support systems (DSS) and data warehousing
More informationOracle Database InMemory The Next Big Thing
Oracle Database InMemory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle do this Oracle Database InMemory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes
More informationBitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse
Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Zainab Qays Abdulhadi* * Ministry of Higher Education & Scientific Research Baghdad, Iraq Zhang Zuping Hamed Ibrahim Housien**
More informationResearch Statement Immanuel Trummer www.itrummer.org
Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses
More informationENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771
ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced
More informationInvestigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses
Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Thiago Luís Lopes Siqueira Ricardo Rodrigues Ciferri Valéria Cesário Times Cristina Dutra de
More informationLinear Codes. Chapter 3. 3.1 Basics
Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length
More informationInformation management software solutions White paper. Powerful data warehousing performance with IBM Red Brick Warehouse
Information management software solutions White paper Powerful data warehousing performance with IBM Red Brick Warehouse April 2004 Page 1 Contents 1 Data warehousing for the masses 2 Single step load
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling JingChiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationLoad Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More informationImproved Query Performance with Variant Indexes
Improved Query Performance with Variant Indexes Patrick O Neil Dallan Quass Department of Mathematics and Computer Science Department of Computer Science University of Massachusetts at Boston Stanford
More informationMultiDimensional Database Allocation for Parallel Data Warehouses
MultiDimensional Database Allocation for Parallel Data Warehouses Thomas Stöhr Holger Märtens Erhard Rahm University of Leipzig, Germany {stoehr maertens rahm}@informatik.unileipzig.de Abstract Data
More informationPerformance Tuning for the Teradata Database
Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting  i  Document Changes Rev. Date Section Comment 1.0 20101026 All Initial document
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, Inbase Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT Inmemory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationA Constraint Programming Application for Rotating Workforce Scheduling
A Constraint Programming Application for Rotating Workforce Scheduling Markus Triska and Nysret Musliu Database and Artificial Intelligence Group Vienna University of Technology {triska,musliu}@dbai.tuwien.ac.at
More informationEfficient Processing of Joins on Setvalued Attributes
Efficient Processing of Joins on Setvalued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong nikos@csis.hku.hk Abstract Objectoriented
More informationChapter 12 File Management. Roadmap
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access
More informationChapter 12 File Management
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access
More informationOracle EXAM  1Z0117. Oracle Database 11g Release 2: SQL Tuning. Buy Full Product. http://www.examskey.com/1z0117.html
Oracle EXAM  1Z0117 Oracle Database 11g Release 2: SQL Tuning Buy Full Product http://www.examskey.com/1z0117.html Examskey Oracle 1Z0117 exam demo product is here for you to test the quality of the
More informationEFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
More information1Z0117 Oracle Database 11g Release 2: SQL Tuning. Oracle
1Z0117 Oracle Database 11g Release 2: SQL Tuning Oracle To purchase Full version of Practice exam click below; http://www.certshome.com/1z0117practicetest.html FOR Oracle 1Z0117 Exam Candidates We
More informationData mining knowledge representation
Data mining knowledge representation 1 What Defines a Data Mining Task? Task relevant data: where and how to retrieve the data to be used for mining Background knowledge: Concept hierarchies Interestingness
More informationLecture 3: Linear Programming Relaxations and Rounding
Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can
More informationDistributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu
Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.
More informationThe Cubetree Storage Organization
The Cubetree Storage Organization Nick Roussopoulos & Yannis Kotidis Advanced Communication Technology, Inc. Silver Spring, MD 20905 Tel: 3013843759 Fax: 3013843679 {nick,kotidis}@actus.com 1. Introduction
More informationProTrack: A Simple Provenancetracking Filesystem
ProTrack: A Simple Provenancetracking Filesystem Somak Das Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology das@mit.edu Abstract Provenance describes a file
More informationData Warehouse: Introduction
Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,
More informationRethinking SIMD Vectorization for InMemory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for InMemory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationA COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES
A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the
More informationGTC 2014 San Jose, California
GTC 2014 San Jose, California An Approach to Parallel Processing of Big Data in Finance for Alpha Generation and Risk Management Yigal Jhirad and Blay Tarnoff March 26, 2014 GTC 2014: Table of Contents
More informationOPRE 6201 : 2. Simplex Method
OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2
More informationStorage Layout and I/O Performance in Data Warehouses
Storage Layout and I/O Performance in Data Warehouses Matthias Nicola 1, Haider Rizvi 2 1 IBM Silicon Valley Lab 2 IBM Toronto Lab mnicola@us.ibm.com haider@ca.ibm.com Abstract. Defining data placement
More informationMultidimensional index structures Part I: motivation
Multidimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for
More informationStorage Management for Files of Dynamic Records
Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science
More informationLecture Notes on Data Warehousing by examples
Lecture Notes on Data Warehousing by examples Daniel Lemire and Owen Kaser October 25, 2004 1 Introduction Data Warehousing is an applicationdriven field. The Data Warehousing techniques can be applied
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationRoots of Polynomials
Roots of Polynomials (Com S 477/577 Notes) YanBin Jia Sep 24, 2015 A direct corollary of the fundamental theorem of algebra is that p(x) can be factorized over the complex domain into a product a n (x
More informationWho am I? Copyright 2014, Oracle and/or its affiliates. All rights reserved. 3
Oracle Database InMemory Power the RealTime Enterprise Saurabh K. Gupta Principal Technologist, Database Product Management Who am I? Principal Technologist, Database Product Management at Oracle Author
More informationFCE: A Fast Content Expression for Serverbased Computing
FCE: A Fast Content Expression for Serverbased Computing Qiao Li Mentor Graphics Corporation 11 Ridder Park Drive San Jose, CA 95131, U.S.A. Email: qiao li@mentor.com Fei Li Department of Computer Science
More informationArithmetic Coding: Introduction
Data Compression Arithmetic coding Arithmetic Coding: Introduction Allows using fractional parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation
More informationarxiv:1112.0829v1 [math.pr] 5 Dec 2011
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly
More informationData Management, Analysis Tools, and Analysis Mechanics
Chapter 2 Data Management, Analysis Tools, and Analysis Mechanics This chapter explores different tools and techniques for handling data for research purposes. This chapter assumes that a research problem
More informationA Catalogue of the Steiner Triple Systems of Order 19
A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of
More informationSAP HANA  Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA  Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why InMemory? Information at the Speed of Thought Imagine access to business data,
More informationFairness in Routing and Load Balancing
Fairness in Routing and Load Balancing Jon Kleinberg Yuval Rabani Éva Tardos Abstract We consider the issue of network routing subject to explicit fairness conditions. The optimization of fairness criteria
More informationOptimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC 10.1.3.4.1
Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC 10.1.3.4.1 Mark Rittman, Director, Rittman Mead Consulting for Collaborate 09, Florida, USA,
More informationImproved Query Performance with Variant Indexes
Improved Query Performance with Variant Indexes Patrick O Neil Dallan Quass Department of Mathematics and Computer Science Department of Computer Science University of Massachusetts at Boston Stanford
More informationBenchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flashbased Storage Arrays Version 1.0 Abstract
More informationRandom vs. StructureBased Testing of AnswerSet Programs: An Experimental Comparison
Random vs. StructureBased Testing of AnswerSet Programs: An Experimental Comparison Tomi Janhunen 1, Ilkka Niemelä 1, Johannes Oetsch 2, Jörg Pührer 2, and Hans Tompits 2 1 Aalto University, Department
More informationIntroduction to Decision Support, Data Warehousing, Business Intelligence, and Analytical Load Testing for all Databases
Introduction to Decision Support, Data Warehousing, Business Intelligence, and Analytical Load Testing for all Databases This guide gives you an introduction to conducting DSS (Decision Support System)
More informationMS SQL Performance (Tuning) Best Practices:
MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware
More informationMINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
More informationAdaptive Tolerance Algorithm for Distributed TopK Monitoring with Bandwidth Constraints
Adaptive Tolerance Algorithm for Distributed TopK Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of WisconsinMadison Department of Computer Sciences {bauer, srini}@cs.wisc.edu
More informationJUSTINTIME SCHEDULING WITH PERIODIC TIME SLOTS. Received December May 12, 2003; revised February 5, 2004
Scientiae Mathematicae Japonicae Online, Vol. 10, (2004), 431 437 431 JUSTINTIME SCHEDULING WITH PERIODIC TIME SLOTS Ondřej Čepeka and Shao Chin Sung b Received December May 12, 2003; revised February
More informationOracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.
Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse
More informationAmadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator
WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion
More informationIBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop
IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop Frank C. Fillmore, Jr. The Fillmore Group, Inc. Session Code: E13 Wed, May 06, 2015 (02:15 PM  03:15 PM) Platform: Crossplatform Objectives
More informationEstimating Deduplication Ratios in Large Data Sets
IBM Research labs  Haifa Estimating Deduplication Ratios in Large Data Sets Danny Harnik, Oded Margalit, Dalit Naor, Dmitry Sotnikov Gil Vernik Estimating dedupe and compression ratios some motivation
More informationAn OnLine Algorithm for Checkpoint Placement
An OnLine Algorithm for Checkpoint Placement Avi Ziv IBM Israel, Science and Technology Center MATAM  Advanced Technology Center Haifa 3905, Israel avi@haifa.vnat.ibm.com Jehoshua Bruck California Institute
More information2. Basic Relational Data Model
2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that
More information8. Query Processing. Query Processing & Optimization
ECS165A WQ 11 136 8. Query Processing Goals: Understand the basic concepts underlying the steps in query processing and optimization and estimating query processing cost; apply query optimization techniques;
More informationThe Curious Case of Database Deduplication. PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle
The Curious Case of Database Deduplication PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle Agenda Introduction Deduplication Databases and Deduplication All Flash Arrays and Deduplication 2 Quick Show
More informationData Warehousing Concepts
Data Warehousing Concepts JB Software and Consulting Inc 1333 McDermott Drive, Suite 200 Allen, TX 75013. [[[[[ DATA WAREHOUSING What is a Data Warehouse? Decision Support Systems (DSS), provides an analysis
More informationAn Architecture for Using Tertiary Storage in a Data Warehouse
An Architecture for Using Tertiary Storage in a Data Warehouse Theodore Johnson Database Research Dept. AT&T Labs  Research johnsont@research.att.com Motivation AT&T has huge data warehouses. Data from
More informationSingleLink Failure Detection in AllOptical Networks Using Monitoring Cycles and Paths
SingleLink Failure Detection in AllOptical Networks Using Monitoring Cycles and Paths Satyajeet S. Ahuja, Srinivasan Ramasubramanian, and Marwan Krunz Department of ECE, University of Arizona, Tucson,
More informationComputer Networks and Internets, 5e Chapter 6 Information Sources and Signals. Introduction
Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals Modified from the lecture slides of Lami Kaya (LKaya@ieee.org) for use CECS 474, Fall 2008. 2009 Pearson Education Inc., Upper
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationEfficient Algorithms for Masking and Finding QuasiIdentifiers
Efficient Algorithms for Masking and Finding QuasiIdentifiers Rajeev Motwani Stanford University rajeev@cs.stanford.edu Ying Xu Stanford University xuying@cs.stanford.edu ABSTRACT A quasiidentifier refers
More informationDesigning an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract)
Designing an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract) Johann Eder 1, Heinz Frank 1, Tadeusz Morzy 2, Robert Wrembel 2, Maciej Zakrzewicz 2 1 Institut für Informatik
More informationParquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various
More informationKey Attributes for Analytics in an IBM i environment
Key Attributes for Analytics in an IBM i environment Companies worldwide invest millions of dollars in operational applications to improve the way they conduct business. While these systems provide significant
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 3448 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationEncoded Bitmap Indexes and Their Use for Data Warehouse Optimization
Encoded Bitmap Indexes and Their Use for Data Warehouse Optimization Fachbereich Informatik der Technischen Universität Darmstadt Dissertation zur Erlangung des akademischen Grades eines Doktors der Ingenieurswissenschaften
More informationA set is a Many that allows itself to be thought of as a One. (Georg Cantor)
Chapter 4 Set Theory A set is a Many that allows itself to be thought of as a One. (Georg Cantor) In the previous chapters, we have often encountered sets, for example, prime numbers form a set, domains
More informationData Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A
Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical
More informationData Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing GridFiles Kdtrees Ulf Leser: Data
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationLecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()
CS61: Systems Programming and Machine Organization Harvard University, Fall 2009 Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc() Prof. Matt Welsh October 6, 2009 Topics for today Dynamic
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 291
Slide 291 Chapter 29 Overview of Data Warehousing and OLAP Chapter 29 Outline Purpose of Data Warehousing Introduction, Definitions, and Terminology Comparison with Traditional Databases Characteristics
More informationWeek 5 Integral Polyhedra
Week 5 Integral Polyhedra We have seen some examples 1 of linear programming formulation that are integral, meaning that every basic feasible solution is an integral vector. This week we develop a theory
More informationMain Memory Data Warehouses
Main Memory Data Warehouses Robert Wrembel Poznan University of Technology Institute of Computing Science Robert.Wrembel@cs.put.poznan.pl www.cs.put.poznan.pl/rwrembel Lecture outline Teradata Data Warehouse
More information