Institut für Informatik der Technischen Universität München. MISTRAL: Processing Relational Queries using a Multidimensional Access Technique


Institut für Informatik der Technischen Universität München

MISTRAL: Processing Relational Queries using a Multidimensional Access Technique

Volker Markl


Preface

Classical one-dimensional B-trees have been the standard access method of all commercial database systems for many years. This dissertation is a very promising effort to introduce universal B-trees (UB-trees), the multidimensional variant of B-trees, as a basic access method into the core of fundamental DB technology. The universal relevance of UB-trees is a consequence of the fact that every relation can be considered as a set of points in multidimensional space. UB-trees organize this space for efficient processing of the data, resulting for many types of queries in orders of magnitude improvement over classical methods.

The thesis lays the theoretical foundations of UB-trees, predicts their performance by analytical models and validates these models by experiments using large real-world databases and real-life applications and queries from the field of data warehousing.

The need to sort data and intermediate results frequently is an annoying drawback of today's query processing methods. In combination with the Tetris algorithm, UB-trees make it possible to avoid sorting in most cases, leading to dramatic improvements of response time, storage requirement and overall query processing time.

Adding a new access method requires considering all aspects of database systems:
- architecture of subsystems
- query optimization
- query processing
- multiuser operation and synchronization
- bulk loading
- storage requirement
- parallelism, etc.

Since UB-trees rely on classical B-trees for their implementation, all of these issues can be solved in a satisfactory way and can be dealt with elegantly.

The performance experiments reported in this thesis were carried out with an implementation of UB-trees as a middleware layer on top of SQL. Additional performance improvements can be gained by integrating the UB-tree technology into the kernel of database systems.

This thesis is a cornerstone of the MISTRAL project. MISTRAL has the goal of introducing UB-trees as a new access method into database systems, with the fascinating vision of extending fundamental database technology in an essential way.

MISTRAL is financially supported by SAP, Teijin, NEC, Hitachi, the European Commission, Project MDA, TAS, GfK and Microsoft.

Munich, July 5, 1999
Prof. Rudolf Bayer, Ph.D.


Institut für Informatik der Technischen Universität München

MISTRAL: Processing Relational Queries using a Multidimensional Access Technique

Volker Markl

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Doctor of Natural Sciences).

Chair: Univ.-Prof. B. Brügge, Ph.D.
Examiners of the dissertation:
1. Univ.-Prof. R. Bayer, Ph.D.
2. Univ.-Prof. J.-C. Freytag, Ph.D., Humboldt-Universität zu Berlin

The dissertation was submitted to the Technische Universität München and accepted by the Fakultät für Informatik.


Title of the Thesis: MISTRAL: Processing Relational Queries using a Multidimensional Access Technique

Author: Dipl. Inform. Volker Markl

Supervisors:
1. Prof. Rudolf Bayer, Ph.D., Technische Universität München
2. Prof. Johann Christoph Freytag, Ph.D., Humboldt Universität Berlin

Keywords: relational database management systems, query processing, query optimization, multidimensional access methods, indexing data structures, B-Tree, UB-Tree, data warehousing, relational OLAP, OLTP, benchmarking, TPC-D

Abstract: A multidimensional access method offering significant performance increases by intelligently partitioning the query space is applied to relational database management systems (RDBMS). We introduce a formal model for multidimensional partitioned relations and discuss several typical query patterns. The model identifies the significance of multidimensional range queries and sort operations. The discussion of current access methods gives rise to the need for a multidimensional partitioning of relations. A detailed analysis of space partitioning, focusing especially on Z-ordering, illustrates the principal benefits of multidimensional indexes. After describing the UB-Tree and its standard algorithms for insertion, deletion, point queries, and range queries, we introduce the spiral algorithm for nearest neighbor queries with UB-Trees and the Tetris algorithm for efficient access to a table in arbitrary sort order. We then describe the complexity of the involved algorithms and give solutions to selected algorithmic problems for a prototype implementation of UB-Trees on top of several RDBMSs. A cost model for sort operations with and without range restrictions is used both for analyzing our algorithms and for comparing UB-Trees with state-of-the-art query processing. Performance comparisons with traditional access methods practically confirm the theoretically expected superiority of UB-Trees and our algorithms over traditional access methods: Query processing in RDBMS is accelerated by several orders of magnitude, while the resource requirements in main memory space and disk space are substantially reduced. Benchmarks on some queries of the TPC-D benchmark as well as the data warehousing scenario of a fruit juice company illustrate the potential impact of our work on relational algebra, SQL, and commercial applications.

The results of this thesis were developed by the author managing the MISTRAL project, a joint research and development project with SAP AG (Germany), Teijin Systems Technology Ltd. (Japan), NEC (Japan), Hitachi (Japan), Gesellschaft für Konsumforschung (Germany), and TransAction Software GmbH (Germany).


Acknowledgements

I thank my supervisor, Prof. Rudolf Bayer, Ph.D., for his support and the many fruitful discussions and inspirations. He helped me a lot with his ideas and his confidence, especially in the beginning, when the feasibility of the work and the road ahead were still unclear.

My master students and interns did a large portion of the prototype implementation. In addition they helped with analysis, and carried out many of the performance measurements. I especially thank Nils Frielinghaus, who was the first student who dared to be supervised by me. Our discussions produced ideas which were crucial for the success of the MISTRAL project. The same holds for the work of Roland Pieringer, who helped to prove the practical benefits of our approach by porting our pilot implementation to Oracle and doing performance measurements at SAP in Walldorf.

When the MISTRAL project grew, I was not alone anymore: My team members Robert Fenk, Frank Ramsak, Stefan Sixl, and Martin Zirkel worked on the project with the same enthusiasm as I. We had numerous inspiring discussions. For this thesis, my colleagues also did a lot of proofreading, which certainly improved the quality of the work.

For funding the research work I thank Karl-Heinz Hess and Klaus Majenz from SAP AG, Ichiro Arima, Shuichi Osaki and Toshishiro Satomi from Teijin Systems Technology, Japan, Kunitoshi Tsuruoka from NEC's C&C Research Lab, Japan, Kazuo Masai and Shunichi Torii from Hitachi Ltd., Japan, Dr. Klaus Elhardt and Dr. Christian Roth from TransAction Software, and Dr. habil. Thomas Ruf from GfK. I especially thank Klaus Majenz from SAP, Dr. Klaus Elhardt and Dr. habil. Thomas Ruf for the discussions at our project meetings, which raised and answered many theoretical and practical questions. In addition, these discussions provided insight into database implementation and the practical demands of database users.

I also thank many further students, colleagues and friends for discussions, ideas and proofreading of the thesis. Here I especially thank Markus Blaschka, Bernd Deifel, Prof. Dr. Wolfgang Kowarschick, Peter Müller, Carsten Sapia, Michael Walther, and Wolfgang Wohner.

Many thanks also to Prof. Johann Christoph Freytag, Ph.D., whose comments and ideas were a valuable contribution to my work.

Thanks also to my family and friends for helping me to forget my work whenever I was stuck and needed some time to relax. I also thank Andrea for the 0.5l tea mug which was a good companion during some long nights in the office.

Last but not least I thank the Deutsche Bahn (German Rail), since the solutions to many algorithmic problems (like the range query algorithm, multidimensional hierarchical clustering) and many new ideas (like the names MISTRAL and Tetris) were invented on the tracks.


Table of Contents

INTRODUCTION
MOTIVATION
    Data Warehousing Applications
    Non-Standard DBMS
    Relational DBMS and DBMS in General
OBJECTIVE
RELATED WORK
    RDBMS and Query Processing
    Multidimensional Access Methods
OUTLINE
HOW TO READ THE THESIS
TERMINOLOGY AND BASIC CONCEPTS
THE MULTIDIMENSIONAL SPACE: DUALITY OF POINTS AND TUPLES
    Relations, Tuples, Attributes and the Multidimensional Space
    Intervals
    Volumes
    Statistical Functions
    Partitioned Relations
QUERY TYPES
    Partial Match Query
    Range Queries
    Range Query Sets and Arbitrary Query Spaces
    Nearest Neighbor Queries
    Further Queries
    Query Processing
ACCESS METHODS
    Characteristics of Secondary Storage
    Clustering
    Non-Clustering Indexes
ANSWERING RANGE QUERIES IN PRESENT RDBMS
    Compound B-Trees
    Multiple Secondary B-Trees or Bitmap Indexes
    Indexes for Processing Range Queries used in Present RDBMS
MULTIDIMENSIONAL SPACE PARTITIONING
SPACE FILLING CURVES AND TOTAL MULTIDIMENSIONAL ORDERINGS
PROPERTIES OF SPACE FILLING CURVES
SYMMETRY OF SPACE FILLING CURVES
REGIONS COVERED BY SPACE FILLING CURVES
DISCONNECTED Z-REGIONS
GEOMETRIC VIEW OF Z-REGION PARTITIONING
    Address Representation
    Volumes of Z-Areas and Z-Regions
INDEPENDENT DIMENSIONS
    Uniformly Distributed Data
    Gaussian Distributed Data
    Idealized Uniform Partitioning
DEPENDENT DIMENSIONS
    Linear Dependency
    Further Dependencies
HIGH DIMENSIONALITIES
UTILIZATION OF THE MULTIDIMENSIONAL SPACE
THE UB-TREE
CONCEPT OF THE UB-TREE
INSERTION INTO UB-TREES
THE POINT QUERY ALGORITHM
DELETION FROM UB-TREES
THE RANGE QUERY ALGORITHM
THE SPIRAL ALGORITHM FOR NEAREST-NEIGHBOR QUERIES
THE TETRIS ALGORITHM FOR SORTED PROCESSING OF QUERY BOXES
PROTOTYPE IMPLEMENTATION
OVERVIEW OF THE PROTOTYPE
BASIC IMPLEMENTATION CONCEPTS
IMPLEMENTING THE ADDRESS CALCULATION
    Bit-Interleaving
    Address Calculation for Arbitrary Data Types
    Dealing with low Cardinality Domains: Enumeration Types
    Multidimensional Hierarchical Clustering
    Dealing with non-uniformly distributed Data with Quantiles
IMPLEMENTING THE INSERTION ALGORITHM
IMPLEMENTING THE DELETION ALGORITHM
IMPLEMENTING THE POINT-QUERY ALGORITHM
IMPLEMENTATION OF THE RANGE QUERY ALGORITHM
    Determining the next Intersection
    Dealing with Sets of Query Boxes
THE UB-TREE LIBRARY
PERFORMANCE ANALYSIS
THE COST OF UB-TREE RANGE QUERIES
    A Cost-Function for Perfect Idealized Uniform Partitioning
    A Cost Function for Semi-perfect Idealized Uniform Partitioning
    A Cost Function for Probabilistic Idealized Uniform Partitioning
    Cost Function and Selectivity
    The Cost Function: Theory and Practice
A COST MODEL
THE COST OF RANGE QUERIES
    Cost Functions
    Simulation Results
THE COST OF RANGE QUERIES WITH SORT OPERATIONS
    Cost Functions for Secondary Storage
    Cost Functions for Response Time
    Cost Functions for Interactive Response Times
    Simulation Results
SUMMARY: COST ANALYSIS
PERFORMANCE MEASUREMENTS
INSERT PERFORMANCE
EXACT MATCH QUERY PERFORMANCE
STORAGE REQUIREMENTS
RANGE QUERY PERFORMANCE
    Types of Measurements
    Comparative Performance Measurements
    Partial Range Queries
IMPACTS ON RELATIONAL QUERY PROCESSING
RELATIONAL OPERATIONS WITH UB-TREES AND THE TETRIS-ALGORITHM
    Asymptotic complexity of the UB-Tree Range Query Algorithm
    Asymptotic complexity of the Tetris Algorithm
    Asymptotic complexity of Operations of Relational Query Processing
COMPLEX QUERIES ON GENERATED DATA: THE TPC-D BENCHMARK
    Joins and Restrictions
    Joins, Grouping, and Restrictions
    Multi-attribute Restrictions
    Summary: TPC-D Performance Measurements
A REAL WORLD DATA WAREHOUSE: THE JUICE & MORE BENCHMARK
    Queries on the Juice & More Schema
    Data Distribution
    Multidimensional Hierarchical Clustering of Juice & More
    Performance Measurements
    Summary: Data Warehouse Performance Measurements
SUMMARY
CONCLUSIONS
FUTURE WORK
REFERENCES
INDEX

Part I Preliminaries


Chapter 1: Introduction

Of course a film should have a beginning, a middle and an end. But not necessarily in that order. (Jean-Luc Godard)

Multidimensional access methods are useful for a broad range of database applications, e.g., data warehousing, geographical information systems, data mining, archiving systems, and lifecycle management databases. In general these access methods are beneficial if queries define (and thus process) some part of a large data set by restrictions with respect to several categories (e.g., the sales for one year for a certain product, or all cities near Munich with a population larger than a given number of people). We focus on using multidimensional indexes for query processing in relational database management systems (DBMS).

This introduction defines the scope and objective of the thesis. The applicability of multidimensional access methods for query processing is motivated by examples from data warehousing, non-standard DBMS applications and relational databases in general. The introduction also surveys related work in the fields of multidimensional access methods, relational DBMS and query processing, data warehousing and benchmarking. The chapter concludes with a roadmap of the following chapters of the thesis.

1.1 Motivation

Complex business applications like SAP R/3 [SAP99], data warehousing (DW) and data mining as well as non-standard DBMS applications like geographical information systems (GIS) and statistical databases have created a strong demand for efficient processing of complex queries on huge databases. These complex queries set new requirements on the query processing and access method algorithms of a DBMS.

The main reason for indexing a table of a database is to accelerate query execution. This is achieved by utilizing the restrictions imposed by a query in order to reduce the number of disk accesses. We focus on using multidimensional indexes to organize any table of a database. In the following we use the terminology of relational DBMS because of their commercial importance. Since access methods are relevant to any DBMS regardless of the DBMS paradigm, the methodology described in this thesis can also be used to accelerate deductive DBMS (DDBMS, e.g., [Ull89, Spe9]), object-oriented DBMS (OODBMS, e.g., [BDK92]) or multidimensional DBMS (MDBMS, e.g., [PDF+98]).

We consider each tuple of a table to be a point in multidimensional space. A table then describes a certain subspace of the multidimensional space defined by the cross product of the domains of all attributes. We use a multidimensional access method to organize that multidimensional space, which makes it possible to cluster the data with respect to multiple attributes at the same time. As we will show, this multidimensional organization of a relational table is in many cases superior to one-dimensional clustering (i.e., organizing the data on disk for efficient access with respect to one attribute). However, current commercial DBMS either do not support multidimensional indexes at all or only use them as an add-on in the context of geo-spatial applications [Ora97, IBM97].

In the following we describe how data warehousing, non-standard DBMS as well as standard DBMS may benefit from the techniques described in this thesis.

1.1.1 Data Warehousing Applications

Data warehousing applications [Inm96, Kim96, Dev97] cope with enormous amounts of data in the range of gigabytes and terabytes. While transactional (OLTP, online transaction processing) DBMS like bank applications usually use simple query patterns to retrieve a very small part of a database (usually one record) by a primary key access, data processing in data warehousing (OLAP, online analytical processing) involves complex queries that usually access a large portion of the database [CD97, WB97, GBL+96].

On the conceptual level a multidimensional (MD) view on the data models has been established by academia and industry for OLAP applications [CD97]. In the MD model the numeric (quantitative) data (measures, e.g., sales, cost), which is the focus of the analysis, is organized along multiple dimensions. The dimensions provide categorical (qualitative) data (e.g., container size of a product), which determines the context of the measures. Therefore the measures can be seen as values in a multidimensional space; one often refers to this model as a multidimensional cube.

An important concept of OLAP data models is the notion of dimension hierarchies. Hierarchies are used to provide structure to the otherwise flat dimensions. Often the data in the dimensions can be categorized or classified according to some additional characteristics (e.g., shops could be classified according to their location). Usually OLAP users are not interested in the single measures but in some form of summarized data (e.g., sales in a certain area). Hierarchies provide an appropriate method of describing the level of aggregation for a dimension.

Data processing in DW applications retrieves aggregated measures organized or classified according to several dimensions or hierarchies over the dimensions (e.g., the total sales for all coffee shops in Bavaria in 1999). For this reason multidimensional data models [BSH+98], multidimensional query languages (e.g., MDX [MS98b] or the OLAP Council approach [OLA98]) and even multidimensional DBMS (MDBMS) have been developed by the research community and implemented as commercial products. Typical OLAP operations are drill-down, roll-up and slice-and-dice [Kim96], and usually multiple dimensions are restricted at the same time. In general one can state that these operations in an MD model lead to range restrictions on the lowest hierarchy level of each dimension [Sar97].

To a large extent, relational DBMS are used for decision support applications, since these systems are well researched and are reported to provide more efficiency for huge databases than MDBMS. Regardless of whether a multidimensional or relational paradigm is used to model and query OLAP data, queries result in multidimensional range restrictions in combination with sort operations and aggregations. Therefore any DBMS storing OLAP data must efficiently handle this typical query pattern.

Pre-computation, clustering and indexing are common techniques to speed up query processing. Pre-computation results in the best query response time at the expense of load performance and secondary storage space. For DW applications, pre-computation is mostly discussed for aggregation operations [CD97]. However, one requirement of DW is to efficiently deal with ad-hoc queries. Deciding which queries to pre-compute then becomes extremely difficult. Pre-computation also leads to a view maintenance problem.

In Section 8. we will illustrate the applicability of our technique to OLAP scenarios by using a multidimensional index to cluster the data cube of a data warehouse in order to efficiently process OLAP queries. We present a methodology to cluster data organized along multiple hierarchical dimensions. For clustering the fact table of a relational OLAP implementation we present performance measurements for a fruit juice company using a star schema with a fact table of 6 million records (and an overall size of 7 GB). On this real-world data we experience a performance increase of up to a factor of ten compared to traditional techniques.
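The statement above that OLAP restrictions on a hierarchy level turn into range restrictions on the lowest hierarchy level can be illustrated with a minimal sketch. It assumes a hypothetical three-level location hierarchy (region, city, shop) and a hypothetical key assignment that numbers the shops in depth-first order; the names `HIERARCHY`, `assign_keys` and `total_sales` are illustrative and are not taken from the thesis.

```python
from bisect import bisect_left, bisect_right

# Hypothetical location hierarchy: region -> city -> shops (illustrative data only).
HIERARCHY = {
    "Bavaria": {"Munich": ["shop_a", "shop_b"], "Nuremberg": ["shop_c"]},
    "Hesse": {"Frankfurt": ["shop_d", "shop_e"]},
}


def assign_keys(hierarchy):
    """Number the leaves depth-first so every subtree owns a contiguous key interval."""
    leaf_key, interval = {}, {}
    next_key = 0
    for region in sorted(hierarchy):
        region_start = next_key
        for city in sorted(hierarchy[region]):
            city_start = next_key
            for shop in sorted(hierarchy[region][city]):
                leaf_key[shop] = next_key
                next_key += 1
            interval[(region, city)] = (city_start, next_key - 1)
        interval[(region,)] = (region_start, next_key - 1)
    return leaf_key, interval


def total_sales(facts, key_range):
    """Sum the measure of all facts whose shop key lies in the closed key_range.

    facts is a list of (shop_key, amount) pairs sorted by shop_key.
    """
    keys = [key for key, _ in facts]
    lo = bisect_left(keys, key_range[0])
    hi = bisect_right(keys, key_range[1])
    return sum(amount for _, amount in facts[lo:hi])


leaf_key, interval = assign_keys(HIERARCHY)
facts = sorted([
    (leaf_key["shop_a"], 10.0),
    (leaf_key["shop_c"], 5.0),
    (leaf_key["shop_d"], 7.0),
    (leaf_key["shop_b"], 2.5),
])
# The roll-up "all shops in Bavaria" is a single range on the shop keys.
bavaria_sales = total_sales(facts, interval[("Bavaria",)])
```

Under this assumed encoding, a restriction on any hierarchy level is one closed interval on the leaf keys, which is the kind of range restriction a clustering index can exploit.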

1.1.2 Non-Standard DBMS

Non-standard DBMS deal with complex objects which usually are not simply stored in tables or relations due to their interaction, relationship or size. Typical examples of non-standard DBMS are multimedia DBMS, CAD/CAM DBMS or geographical information systems (GIS). These systems usually aim at specific applications like similarity search, for example searching for a picture or sound similar to an example picture or sound specified by a query. In many application areas like CAD [BKK97], computer vision [Jag91], multimedia archives [SH94], medical imaging [KSF+96], molecular biology [AGM+90] or time sequence analysis [AFS93], these problems can usually be reduced to search problems by feature transformation. However, feature transformation often results in a high-dimensional data space, which should be indexed with index structures specifically designed for these data spaces [Böh98, WSB98].

This thesis does not deal with high-dimensional data spaces (see [Böh98] for a survey on indexing methods for high-dimensional spaces). However, proper data modeling often makes it possible to reduce the dimensionality to below 0 dimensions. In this case, the methods described in this thesis may also be applied to non-standard DBMS applications. Moreover, since we aim to integrate our approach into standard DBMS, our technology may enable standard DBMS to deal with some subset of non-standard applications.

1.1.3 Relational DBMS and DBMS in General

Since tuples in a relation and points in multidimensional space are merely two different views on a data set, the data stored in databases can be considered to be of multidimensional nature. The expressiveness of query languages makes it possible to write complex queries which mostly deal with more than one attribute. SQL queries, for instance, often involve multi-attribute restrictions and aggregations. These restrictions usually map onto ranges in several dimensions. In Section ..4 we will show some SQL range queries against a database schema which is typical for decision support applications. Numeric attributes and time values are prime candidates to be restricted to ranges. Yet restrictions in hierarchical data types will also map onto ranges, if the data modeling of Section is used.

The index structures used in present commercial database systems are mostly based on B-Trees, which do not support multidimensional range queries efficiently. In Chapters 7 and 8 we will show the benefit of multidimensional index structures for processing complex queries with multi-attribute restrictions.

A sort operator utilizing multi-attribute restrictions is one of the most important operations for the physical layer of a DBMS, since sorting is required for an efficient implementation of most operations of the relational algebra (e.g., for ordering a table, for grouping and aggregation, for projection as well as for sort-merge joins of several tables). In Section 4.7 we describe a single operator for efficiently processing sort operations with multi-attribute restrictions, the so-called Tetris algorithm [MB98]. The Tetris algorithm is a generalization of a multidimensional range query algorithm that efficiently combines sort operations with the evaluation of multi-attribute restrictions. The basic idea is to use the partial sort order imposed by a multidimensional partitioning in order to process a table in some total sort order. With the Tetris algorithm a multidimensional index can reduce resource requirements for virtually any operation of the relational algebra. Compared to the native access methods of a commercial DBMS, our prototype implementation of the Tetris algorithm shows significant speedups for queries of the TPC-D benchmark. In addition, temporary storage requirements for the sorting process are reduced and the first results of a sort operation are available much earlier for further processing. In Section 6.4 we will see how multidimensional index structures and the Tetris algorithm can be used to accelerate virtually any operation of the relational algebra.

Sort operations and multi-attribute restrictions are not only useful for RDBMS implementation, but for DBMS in general. Query processing in OODBMS, DDBMS or MDBMS also relies on efficient access to data in some sort order (e.g., for duplicate elimination or set operations). Thus the results of this thesis are applicable to any DBMS.

1.2 Objective

The goal of this work is to show the potential of integrating multidimensional indexes as first-class indexes into the kernel of a database system. We intend to provide a deeper insight into the problems and chances of multidimensional indexing. Therefore this thesis includes two analytical chapters: Chapter 3 studies multidimensional space partitioning and Chapter 6 derives a cost model for range queries with and without sort operations. This cost model is further used to analyze the range query performance in order to have a benchmark against which to judge our practical measurements.

Another objective is to show the practical feasibility of our approach. We therefore describe the main challenges of our prototype implementation in Chapter 4 and give real performance figures for both artificial and real-world data in Chapters 6 and 7. In order to illustrate further work in this area and give hints for further reading we survey related work in Section 1.3.

In addition we aim at introducing two new technologies, which were developed by the author during his work on the thesis, namely:
- the Tetris algorithm for processing queries with multi-attribute restrictions and sort operations (see Section 4.7)
- multidimensional hierarchical clustering for clustering data organized according to multiple hierarchical dimensions (see Section 5.3.4)

These techniques may significantly speed up query processing in a DBMS and therefore may be of great commercial value. While the Tetris algorithm is a technique which extends the already patented UB-cache idea [Bay97b], a patent application for multidimensional hierarchical clustering is planned by the author and his supervisor.
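The basic idea stated above, using the partial order of a multidimensional partitioning to produce a total sort order, can be sketched under strong simplifications. The sketch below is not the Tetris algorithm of the thesis: it assumes a regular two-dimensional grid instead of the partitioning of a UB-Tree, non-negative integer coordinates, and hypothetical names (`CELL`, `build_grid`, `sorted_box_scan`). It only shows why a restricted, sorted scan needs to hold one slice of the partitioning in memory at a time.

```python
from collections import defaultdict

CELL = 4  # hypothetical edge length of one grid cell (stand-in for one page)


def build_grid(points):
    """Partition 2-d points into square cells; each cell plays the role of one page."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(x // CELL, y // CELL)].append((x, y))
    return grid


def sorted_box_scan(grid, box):
    """Yield the points inside box = (xlo, xhi, ylo, yhi) ordered by x, slice by slice."""
    xlo, xhi, ylo, yhi = box
    for cx in range(xlo // CELL, xhi // CELL + 1):       # sweep slices in sort order
        cache = []                                        # holds one slice only
        for cy in range(ylo // CELL, yhi // CELL + 1):   # touch only cells cutting the box
            for x, y in grid.get((cx, cy), ()):
                if xlo <= x <= xhi and ylo <= y <= yhi:
                    cache.append((x, y))
        cache.sort()                                      # small in-memory sort
        yield from cache                                  # emitted before later slices are read


points = [(9, 1), (1, 7), (5, 5), (2, 2), (6, 1), (3, 9), (10, 6), (5, 2)]
box = (2, 9, 1, 6)
result = list(sorted_box_scan(build_grid(points), box))
```

Because the slices are disjoint and visited in increasing x, concatenating the per-slice sorts yields a globally sorted result, and the first tuples are available after reading only the first slice.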

Much literature has been published that surveys or compares multidimensional access methods. We only briefly describe the main papers and main approaches in the next section. Instead, our focus is to study the impact of multidimensional index structures on relational query processing. For performance comparisons of multidimensional indexes we also refer to the related work listed in Section 1.3.

1.3 Related Work

In the following we survey related work in the field of multidimensional access methods, relational DBMS and query processing, which may be used for reference or further reading.

1.3.1 RDBMS and Query Processing

An important task of query processing in RDBMS is to efficiently implement algorithms for the basic operations of the relational algebra [Cod70]. Usually, these algorithms apply to particular storage structures or access methods. [Gra93] gives a concise survey of query processing.

The selection operation is implemented either by a table scan or, if an index is available, by an index scan. Indexing is used to efficiently process a query if the result set defined by the query restrictions is fairly small. Most OLTP applications use B-Trees [BM72, Com79] as their standard indexing scheme. For point restrictions it is also possible to use hash indexes [FNP+79, ]. Favoring retrieval response time over update response time makes it possible to build several indexes on one table or data cube of a DW. Bitmap indexes (e.g., [OQ97, CI98, WB98]) are widely discussed as an improvement over B-Trees for DW applications, since they efficiently evaluate queries with multi-attribute restrictions. However, the overall result set still must be relatively small. This is a major drawback of bitmap indexes, since usually a relatively large part of a cube must be accessed in order to calculate aggregated measures.

Clustering places data that is likely to be accessed together physically close to each other. The goal of clustering is to limit the number of disk accesses required to process a query by increasing the likelihood that query results have already been cached. Clustering has been well researched in the field of file structures and access methods (e.g., [Sal88, GR97, Sto94]). B-Trees, for instance, provide one-dimensional clustering. Multidimensional clustering has been discussed in the field of multidimensional access methods (e.g., [GG97, Sam90]).

A great deal of research has been done in the field of access methods and access path selection. Especially for DW environments, specialized access methods have been proposed [Inf97, OQ97]. Often several indexes are created on one table in order to speed up query processing [Lum70, Mul7, MHW+90, Re97, GHR+97]. If the selection condition specifies a range in a single attribute, a clustering index greatly speeds up query processing. Conjunctive selection conditions are efficiently processed by composite indexes, intersection of record pointers or multidimensional indexes. We will investigate processing of multi-attribute restrictions in more detail in Section 2.4 and Chapter 6.

One of the most important operations in RDBMS is the join operation, which is used to combine several relations which were normalized during data modeling. The join operation is usually implemented by nested loop algorithms, join indexes, sort-merge algorithms, or hash algorithms. [ME92] surveys join processing in relational databases.

The relational operations of projection, union, intersection and set difference are efficiently implemented by processing a relation in some sort order and then using either an index scan or a merge-sort algorithm [Gra93].

Efficient sort operations and the use of restrictions to limit result sets are crucial to many query processing algorithms. Very often queries combine several operations of the relational algebra like join and restriction.

1.3.2 Multidimensional Access Methods

Multidimensional access methods are well researched in the field of spatial databases. [GG97] and [Sam90] provide excellent surveys of almost all of these methods. Multidimensional indexes are used to utilize spatial restrictions (e.g., range restrictions, intersection, overlap) and to efficiently compute spatial joins [Rot91, Gün93, BKS93].

Multidimensional data structures are usually classified as point access methods, storing points in multidimensional space, and spatial access methods, storing multidimensional extended objects of arbitrary volume, e.g., boxes, spheres, etc. Usually one also distinguishes between main memory structures, which are used to manage multidimensional data in main memory, and secondary storage structures, which are used for efficiently accessing large multidimensional databases on secondary storage.

A lot of effort has thus been put into indexing spatial data. It is argued that these data structures can also be used to index point data. However, point data has special properties that should be exploited by a multidimensional index.
The most important property is that the multidimensional space can be partitioned into disjoint subspaces without introducing redundancy when storing objects. Thus redundancy considerations as required for multidimensional extended objects [Ore89] are not necessary. A disjoint partitioning of a multidimensional space is very desirable since it makes it possible to give logarithmic performance guarantees for insertion, deletion and exact match queries [Bay96]. Furthermore, any relational table in an RDBMS stores multidimensional point data. This type of data is therefore of great commercial interest, and multidimensional indexes for point data are useful in a large market segment. Therefore in this thesis we focus on indexing multidimensional point data.

[GG97] distinguishes three categories of multidimensional point access methods: techniques based on hashing (grid files [NHS84], EXCELL [Tam82], multi-level grid files [Hin85], twin grid files [HSW88b] and multidimensional hashing [Fal85, Fal88]), hierarchical access methods (K-D-B-Tree [Rob81], LSD-Tree [HSW89], Buddy Tree [SK90], BANG File [Fre87], hB-tree [LS90], R-Trees [Gut84, SRF87, BKS+90, BKK96]) and space filling curves in combination with one-dimensional access methods [TH81, OM84, Jag90, AS83, FR89]. Yet another access technique is to use a combination of several one-dimensional access methods like inverted files [Lum70, MHW+90] or bitmap index intersection [OQ97].

We only briefly sketch the problems of the main approaches here. A detailed description is found in the original papers; a detailed comparison is found in [GG97].

Grid files give a two-access guarantee for retrieval, but have an extremely bad worst-case behavior for updates: Inserting a point may result in a non-local split of the grid and thus require a reorganization of the grid file. Furthermore, grid files have problems with dependencies in the multidimensional data distribution. For linearly dependent data the grid may require more storage than the tuples stored in the grid.

K-D-B-Trees exhibit a forced split effect, which does not allow one to give any space utilization guarantees. In the worst case a large number of pages may be completely empty.

hB-trees have a complex organization and extremely difficult algorithms, since they are a hybrid data structure. In addition hB-trees may store several references of a node to the same child node, which may result in a superlinear growth of the index nodes with respect to the number of regions in space.

R-Trees cannot give any performance guarantee for the basic operations, since they do not partition the multidimensional space into disjoint parts, but allow overlapping rectangles. Successors of the R-Tree like the R*-Tree [BKS+90] and the X-Tree [BKK96] use complicated algorithms or even introduce buckets of varying size to minimize overlaps. However, complicated algorithms cannot overcome this problem in general. Introducing buckets of varying size may cause the index to degenerate. So the basic problem of R-Trees still remains.
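The remark above that a grid may need more storage than the tuples it holds for linearly dependent data can be made concrete with a back-of-the-envelope model. This is a simplified sketch under stated assumptions, not an analysis from the thesis: all points lie on the main diagonal, every split hyperplane cuts the whole space and adds exactly one bucket, and splits are distributed cyclically over the dimensions. The function name `grid_directory_cells` is hypothetical.

```python
import math


def grid_directory_cells(num_points, bucket_capacity, dims):
    """Return (directory cells, occupied cells) for diagonal data in a grid.

    Assumptions (illustrative only): each split adds one bucket along the
    diagonal, and splits are assigned to the dimensions in cyclic order.
    """
    buckets = math.ceil(num_points / bucket_capacity)  # cells the diagonal passes through
    splits = buckets - 1
    base, extra = divmod(splits, dims)                 # cyclic split policy
    intervals = [base + 1 + (1 if i < extra else 0) for i in range(dims)]
    return math.prod(intervals), buckets


cells, occupied = grid_directory_cells(1_000_000, 100, 2)
```

In this model one million diagonal points with 100 points per bucket occupy 10,000 cells, while the two-dimensional directory has about 25 million cells, far more entries than there are tuples. The effect grows with every additional dimension.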
Yet a very promsng approach to store multmensonal pont ata s to map the ata onto a one-mensonal space fllng curve [Sag94] lke the Z-curve or the Hlbert curve an use the propertes of ths curve for effcent retreval. The bggest avantage of space-fllng curves over the technques escrbe before s that they allow a sjont parttonng of the multmensonal space. In aton the storage requrements o not egenerate for any ata strbuton. Well known one-mensonal nexng methos can be apple an multmensonal search problems are reuce to lnear search problems. Hence multmensonal nserton, eleton an pont query algorthms nhert the complextes of the corresponng one-mensonal access metho. Usng B-Trees as one-mensonal access metho allows to gve logarthmc performance guarantees for the basc operatons of nserton, eleton an pont queres. Our approach also reles on space fllng curves. Our plot mplementaton has some smlarty to the zk-b-tree approach escrbe n [OM84]. We also use the Z-curve to to transform

multidimensional point data into one-dimensional data. In contrast to zkd-B-trees, we do not store tuples in Z-representation, but only use Z-addresses to define a partitioning concept for the multidimensional space. This yields a better space partitioning: our approach has more freedom of choice to pick a suitable split point for the partitioning. In addition it allows more efficient tuple extraction algorithms, since it is not necessary to store the Z-representation of every tuple.

The major problem of multidimensional index methods in general is the curse of dimensionality (e.g., [WSB98]), since the number of possible partitionings explodes exponentially with the number of dimensions. This dimensional curse forbids clustering of high-dimensional data. However, problems that seem high-dimensional at first glance often have a reduced dimensionality, if proper data modeling is used. In data warehousing applications one seldom has more than 10 independent dimensions, usually even much fewer. If a data warehouse utilizes more dimensions, there are usually some dependencies between the dimensions. Customers, for example, tend to buy the same product. In the same way, many customers exist only for a certain time. If a data structure is to organize the multidimensional space, it would be beneficial if those dependencies were taken into account in order to limit the number of partitions.

1.4 Outline

This thesis is divided into three parts. Part One consists of the first three chapters and describes preliminaries of our work. The second part describes our approach to relational query processing with multidimensional indexes. In the third part we investigate our technique of Part Two both theoretically by a cost analysis and practically by examples and performance measurements. In the following we briefly sketch the contents of each chapter.

Chapter 2 gives basic definitions which will be used throughout the thesis. In addition we provide an overview of query processing concepts with a special focus on multidimensional range queries.

Chapter 3 gives a detailed analysis of multidimensional space partitioning. We identify space filling curves to create a linear ordering of a multidimensional space.
Relying on one special kind of space filling curve, namely the Z-curve, we then introduce the concept of Z-regions to define a partitioning of the multidimensional space. We investigate and prove several important properties of Z-regions, especially their local proximity and connection in space. We also investigate the limitations of our approach to multidimensional space partitioning with respect to data distributions and increasing dimensionality.

The UB-Tree, a multidimensional index based on space filling curves and B-Trees, is described in Chapter 4 together with its basic algorithms for insertion, deletion, point queries and range queries. In addition we introduce two new algorithms for UB-Trees in this chapter, namely the spiral algorithm for the evaluation of nearest neighbor queries and the Tetris algorithm for processing sort operations with multi-attribute restrictions.

Chapter 5 describes selected algorithmic problems of our prototype implementation on top of several commercial RDBMS. In this chapter we also investigate how to deal with various data types and data distributions. We especially consider how to organize a multidimensional space whose dimensions can be organized hierarchically. This concept of multidimensional hierarchical clustering is especially applicable to data warehousing applications. In Chapter 8 it is used to cluster the fact table of a relational data warehouse.

In Chapter 6 we derive a cost function for multi-attribute restrictions in multidimensionally partitioned universes. We further define a cost model and cost functions for the techniques that prevail in current RDBMS for the processing of multi-attribute restrictions. We then analyze the performance of the range query algorithm and the Tetris algorithm for UB-Trees and compare it to other access methods that prevail in present RDBMS.

Performance measurements in a laboratory environment with generated, uniformly distributed test data are presented in Chapter 7. Although we mainly investigate the range query performance, we also briefly sketch point queries and insertion there. This chapter is more of analytical interest and is aimed at providing a better practical understanding of the effects of multidimensional clustering. In addition, by actual performance measurements it underpins the correctness of the results that were derived in Chapter 6 using our cost model.

Chapter 8 describes the impact of our approach on query processing. We list transformation rules which may be used to implement the basic relational operators (selection, projection, ordering, grouping and aggregation, equi-join, set operations) with UB-Trees. We use these transformation rules to apply our multidimensional query processing techniques in two real world application scenarios, the TPC-D benchmark and a star schema data warehouse. Performance measurements and comparisons for the TPC-D schema with 1 GB of generated data and the data warehouse with 7 GB of real world data are also presented in that chapter.
Chapter 9 concludes the thesis with a summary and an outlook on future work.

1.5 How to Read the Thesis

We tried to use well-accepted technical terms whenever possible in order to avoid confusion with other research work in the field of multidimensional indexes and RDBMS. Thus experienced readers may skip most of Chapter 2, which essentially gives basic mathematical definitions as well as a quick overview of the state of the art in query processing and defines a basic terminology. However, in order to fully judge our approach of multidimensional clustering we recommend reading Section 2.3 in any case.

Chapter 3 is mainly relevant to the mathematically interested reader who wants to get a deeper insight into the field of space filling curves and their relevance to multidimensional data processing. It is also helpful to judge the chances and limitations of multidimensional indexing.

Chapter 4 explains how UB-Trees and query processing with UB-Trees work. UB-Trees are defined via space filling curves in this thesis. Therefore it might be helpful to have read the sections 3. and 3.4 before reading Chapter 4.

Chapter 5 describes some pitfalls that occurred when implementing a prototype of the UB-Tree access method on top of several RDBMS. This chapter might also be useful to anyone wanting to apply UB-Trees to a specific database schema, since it also introduces the concept of transformation functions, which in the case of variable UB-Trees and multidimensional hierarchical clustering has relevance to physical data modeling.

In Chapter 6 the mathematically interested reader finds a formal treatment of range query performance. Readers with a special focus on the field of query processing and query optimization might also have a look at this chapter, since it defines and analyzes a cost model for UB-Trees as well as other access structures.

The performance measurements of Chapters 7 and 8 are of interest to anybody wanting to evaluate the quality of our approach. Chapter 7 is meant to underpin the correctness of our cost model laid out in Chapter 6. The performance figures and query processing proposals of Chapter 8 are especially relevant to practitioners in the field of RDBMS development and RDBMS applications.


Chapter 2
Terminology and Basic Concepts

Every science, like a recurring decimal, has a beginning and no end. (Anton Chekov)

In relational database schemas, tables often bear composite primary keys concatenated of several attributes. For efficient query processing this composite key is used as primary index to physically organize the table on secondary storage. In the same way, secondary indexes for a table often consist of a concatenation of attributes. This thesis investigates the usability of a multidimensional access method for multi-attribute keys.

This chapter introduces the duality of points in multidimensional space and tuples of a relation. We explain our terminology and give basic definitions which will be used in the following chapters of this thesis. We also provide a classification for query types common in today's database applications. We address the problem of indexing in general and look at range queries more closely.

2.1 The Multidimensional Space: Duality of Points and Tuples

Relational database management systems (RDBMS) have a large market share for commercial applications. Although the techniques described in this thesis are very well applicable to speed up database management systems in general (e.g., hierarchical databases, CODASYL databases, or object oriented databases, see e.g., [Ull88]), we use the terminology of RDBMS in the following.

In the relational world data is stored in a set of relations. We use the term table synonymously to relation. Each table is organized into rows and columns. We use the term tuple synonymously to row and attribute synonymously to column. The number and type of the attributes is fixed for each relation; thus each tuple has the same number of typed attributes. We call this number of attributes the arity of a relation. The type of each attribute represents the set of permissible values, called domain.

In this thesis we consider a tuple to be a point in multidimensional space. We use the term universe to denote the multidimensional space defined by the domains of the attributes. Each attribute determines one dimension. The value of an attribute of a tuple is therefore the coordinate of a point in multidimensional space. The duality of points and tuples causes the following terms to be used synonymously in this thesis:

- (multidimensional) domain, universe, multidimensional space
- relation, table, subset of multidimensional space
- row, tuple, point
- attribute, column, coordinate, dimension
- arity, dimensionality

In general we write $|S|$ to denote the cardinality of any set $S$. For any string $s$ we write $|s|$ to denote the length of $s$.

2.1.1 Relations, Tuples, Attributes and the Multidimensional Space

Let $R$ be a relation having $d$ attributes $A_1, \ldots, A_d$ of domains $\mathbb{D}_1, \ldots, \mathbb{D}_d$. $R$ is a set of tuples $x = (x_1, \ldots, x_d)$. Let $<_i$ be a total order on $\mathbb{D}_i$ and $\lambda_i$ resp. $\upsilon_i$ the minimum resp. maximum value of $\mathbb{D}_i$. We call $d$ the arity of $R$. $|R|$ is the cardinality of $R$, i.e., the number of tuples stored in $R$.

When writing tuple literals we sometimes omit the brackets and commas. For instance, we write ab as abbreviation for the tuple literal (a, b).
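For notational convenience we define the set of dimension indices as $D = \{1, \ldots, d\}$.

To simplify theoretical analysis, we consider the domain $\mathbb{D}_i$ of $A_i$, $i \in D$, to be mapped to $\Omega_i$, a set of non-negative integer numbers. Thus, for each value $x_i$ of $A_i$ we get:

$$x_i \in \Omega_i = \{0, \ldots, r_i - 1\} \subset \mathbb{N}$$

Thus for $\Omega_i$ we get $\lambda_i = 0$ and $\upsilon_i = r_i - 1$.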

In addition we require $r_i = 2^{v_i}$ for some arbitrary $v_i \in \mathbb{N}$. This is not a restriction, since every finite totally ordered domain $(\mathbb{D}_i = \{a_1, \ldots, a_{r'}\}, <_i)$ with $r' \le r_i$ can be mapped monotonically to $\Omega_i$ by $f_i: \mathbb{D}_i \to \Omega_i$, so that $f_i(a_j) < f_i(a_k) \Leftrightarrow a_j <_i a_k$.

Definition 2-1 (multidimensional domain, $\Omega$): The multidimensional domain of the entire relation is the cross product

$$\Omega = \Omega_1 \times \ldots \times \Omega_d = \{0, \ldots, r_1 - 1\} \times \ldots \times \{0, \ldots, r_d - 1\}.$$

We call $\Omega$ the base space of $R$. Thus $R$ is a finite subset of $\Omega$, i.e., $R \subseteq \Omega$. The cardinality of $\Omega$ then can be calculated as:

$$|\Omega| = \prod_{i=1}^{d} r_i$$

Definition 2-2 ($\preceq$-order of $\Omega$): For the multidimensional space $\Omega$ we define a partial order. For $x, y \in \Omega$:

$$x \preceq y \Leftrightarrow x_i \le y_i \text{ for all } i \in D$$
$$x \prec y \Leftrightarrow x_i < y_i \text{ for all } i \in D$$

In the following we use the symbol $\mathbb{D}$ to denote a domain of values.

Definition 2-3 (<-neighbors): For any ordering relation $<$ and an ordered domain $(\mathbb{D}, <)$ two values $a, b \in \mathbb{D}$ are <-neighbors, if and only if $a < b$ and there exists no $c \in \mathbb{D}$ with $a < c < b$. For $a \in \mathbb{D}$ we define:

<-neighbors(a) = {b ∈ 𝔻 | b and a are <-neighbors or a and b are <-neighbors}

If the ordering relation $<$ is obvious, we just write neighbors(a) instead of <-neighbors(a).

Lemma 2-1 ($\preceq$-neighbors): Two points $x, y \in \Omega$ with $x \preceq y$ are $\preceq$-neighbors, if and only if there exists an index $i$ so that $x_j = y_j$ for all $j \in D \setminus \{i\}$ and $x_i$ and $y_i$ are $<_i$-neighbors.

Proof: The proof is a direct consequence of Definition 2-2 and Definition 2-3.

¹ Theoretically, the approach described in this thesis could also operate on infinite domains. Then one only needs a split function which partitions an interval of the domain into two disjoint intervals (see Section 3.6). However, it suffices to consider finite domains, since in a computer every domain like real numbers, integer numbers, character strings, etc. is represented by a finite domain.

Lemma 2-2: A point $x$ in $d$-dimensional space has at most $2d$ $\preceq$-neighbors.

Proof: The proof is a direct consequence of Lemma 2-1.

Definition 2-4 (distance of two values): For any totally ordered set $(\mathbb{D}, <)$ and $a, b \in \mathbb{D}$ with $a \le b$ we define

$$\mathrm{distance}(a, b, <) = \begin{cases} |\{c \in \mathbb{D} \mid a < c < b\}| + 1, & a \ne b \\ 0, & a = b \end{cases}$$

We often are not interested in the specific domains of the attributes. For an easy mathematical treatment we therefore define ^, a normalization operation of each domain to a value of the interval [0, 1].

Definition 2-5 (scalar normalization): If $a$ is a value in a domain $[a_{min}, a_{max}] \subset \mathbb{N}$, then we normalize in the following way:

$$\hat{a} := \frac{a - a_{min} + 1}{a_{max} - a_{min} + 1}$$

For any attribute value $x_i \in \Omega_i = \{0, \ldots, r_i - 1\}$, $i \in D$, of $A_i$ of a tuple $x \in \Omega$ we get:

$$\hat{x}_i := \frac{x_i + 1}{r_i}$$

Definition 2-6 (tuple normalization): For a tuple $x \in \Omega$ we define $\hat{x} := (\hat{x}_1, \ldots, \hat{x}_d)$.

2.1.2 Intervals

Definition 2-7 (one-dimensional intervals): For any totally ordered domain $(\mathbb{D}, <)$ and a pair of values $a, b \in \mathbb{D}$ with $a < b$ we define the one-dimensional interval:

$$[a, b] = \{c \in \mathbb{D} \mid a \le c \le b\}$$

If the point describing the lower bound (or the upper bound or both bounds) is not included in the interval, we write:

$$]a, b] = \{c \in \mathbb{D} \mid a < c \le b\}$$
$$[a, b[ = \{c \in \mathbb{D} \mid a \le c < b\}$$
$$]a, b[ = \{c \in \mathbb{D} \mid a < c < b\}$$

Definition 2-8 (multidimensional interval): For two points $y, z \in \Omega$ with $y \preceq z$ we extend Definition 2-7 to multidimensional intervals:

$$[[y, z]] = [y_1, z_1] \times \ldots \times [y_d, z_d] = \{(x_1, \ldots, x_d) \mid y_i \le x_i \le z_i \text{ for all } i \in D\}$$

We define $]]y, z]]$, $[[y, z[[$ and $]]y, z[[$ analogously.
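The normalization of Definitions 2-5 and 2-6 and the membership test behind Definition 2-8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalize`, `normalize_tuple` and `in_box` are hypothetical and not part of the thesis, points are plain tuples of integers, and the normalization is read as $(x_i + 1)/r_i$ for a domain $\{0, \ldots, r_i - 1\}$.

```python
from fractions import Fraction


def normalize(x: int, r: int) -> Fraction:
    """Scalar normalization of x in {0, ..., r-1}, read here as (x + 1) / r."""
    if not 0 <= x < r:
        raise ValueError("x must lie in {0, ..., r-1}")
    return Fraction(x + 1, r)


def normalize_tuple(x, ranges):
    """Tuple normalization: normalize every coordinate with its own range r_i."""
    return tuple(normalize(xi, ri) for xi, ri in zip(x, ranges))


def in_box(x, y, z) -> bool:
    """True if the point x lies in the closed multidimensional interval [[y, z]]."""
    return all(yi <= xi <= zi for xi, yi, zi in zip(x, y, z))
```

Exact fractions are used instead of floating point numbers so that the normalized values stay comparable without rounding effects; for instance, the largest value of a domain is always mapped to exactly 1.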

Note that multidimensional intervals are iso-oriented with respect to all dimensions, i.e., their faces are parallel to the coordinate axes. Because of their rectangular nature we also use the term box to denote a multidimensional interval.

2.1.3 Volumes

We use the term volume for all dimensionalities instead of using the terms length for linear spaces, area for two-dimensional spaces, volume for three-dimensional spaces or Jordan content for higher dimensional spaces. For simplicity, by volume of a part of a space we mean the normalized volume with respect to the entire space. Because of our discrete model we consider a single value of a domain to have the volume $1/r_i$.

Definition 2-9 (volume of a linear interval): For a linear (one-dimensional) interval $[x_i, y_i] \subseteq \Omega_i = \{0, \ldots, r_i - 1\}$ we define its volume:

$$\mathrm{vol}([x_i, y_i]) = \hat{y}_i - \hat{x}_i + \frac{1}{r_i}$$

The volume of the empty set is zero: $\mathrm{vol}(\emptyset) = 0$.

Thus a volume is a normalized number between 0 and 1. We generalize this one-dimensional volume definition to multidimensional intervals $[[x, y]]$:

Definition 2-10 (volume of a multidimensional interval):

$$\mathrm{vol}([[x, y]]) = \prod_{i=1}^{d} \mathrm{vol}([x_i, y_i])$$

Definition 2-11 (volume of a set of multidimensional intervals): We define the volume of a union of $k$ disjoint multidimensional intervals $S_j$, $j \in \{1, \ldots, k\}$, as the sum of the volumes of each interval:

$$\mathrm{vol}\Big(\bigcup_{j=1}^{k} S_j\Big) = \sum_{j=1}^{k} \mathrm{vol}(S_j)$$

Since each finite multidimensional point set can be decomposed into disjoint multidimensional intervals, the volume of any finite multidimensional point set can be calculated by this formula.

Lemma 2-3: The volume of the difference of two point sets $S$ and $Q \subseteq S$ is the difference of the volumes of $S$ and $Q$, i.e., $\mathrm{vol}(S \setminus Q) = \mathrm{vol}(S) - \mathrm{vol}(Q)$.

Proof: $S$ can be decomposed into a set of disjoint multidimensional intervals $\{S_1, \ldots, S_k, Q_1, \ldots, Q_j\}$, such that $Q = Q_1 \cup \ldots \cup Q_j$. Then,

$$\mathrm{vol}(S) = \mathrm{vol}(Q_1 \cup \ldots \cup Q_j) + \mathrm{vol}(S_1 \cup \ldots \cup S_k) = \mathrm{vol}(Q) + \mathrm{vol}(S \setminus Q)$$

2.1.4 Statistical Functions

Definition 2-12 (average): For any set $S$ of numbers we define its average:

$$\mathrm{avg}(S) = \frac{1}{|S|} \sum_{a \in S} a$$

Definition 2-13 (standard deviation): For any set of numbers $S$ we define its standard deviation:

$$\mathrm{std}(S) = \sqrt{\mathrm{avg}\big(\{(s - \mathrm{avg}(S))^2 \mid s \in S\}\big)} = \sqrt{\frac{1}{|S|} \sum_{a \in S} (a - \mathrm{avg}(S))^2}$$

2.1.5 Partitioned Relations

In the following we define the terms page and region. We use these terms to denote the partitioning of a relation with respect to the partitioning of the corresponding base space.

Relations are physically stored on pages of the secondary storage. A page is a physical unit of secondary storage that stores a certain capacity of tuples of a relation. A region is a subspace of the multidimensional base space of the relation. Because of the duality of points and tuples, a set of regions that partitions the base space $\Omega$ corresponds to a partitioning of the relation $R$ into a set of pages.

Definition 2-14 (region): A region is a subspace of $\Omega$. Regions are neither required to be rectangular nor connected. We write $\rho_1, \rho_2, \rho_3, \ldots$ for regions.

Definition 2-15 (page; page capacity): A page is a fixed size byte container to store tuples of a relation. We write $p_1, p_2, p_3, \ldots$ for pages. The capacity of a page is the maximum number of tuples (or bytes for variable length tuples) that a page may hold. We denote the page capacity by $C$.

Definition 2-16 (correspondence between pages and regions): A page $p_i$ corresponds to a region $\rho_i$ ($p_i \leftrightarrow \rho_i$), if all tuples stored on $p_i$ are located in the region $\rho_i$, i.e.,

$$p_i \leftrightarrow \rho_i \Leftrightarrow (x \in p_i \Rightarrow x \in \rho_i)$$

To physically store a relation $R$ in a DBMS, $R$ is partitioned into $P_R = \{p_1, \ldots, p_k\}$, a finite set of disjoint pages. Each page $p_i$, $i \in \{1, \ldots, k\}$, stores a limited number of tuples.

Definition 2-17 (region partitioning): A region partitioning of $\Omega$ for a partitioned relation $P_R = \{p_1, p_2, \ldots, p_k\}$ is a set of regions $\Theta = \{\rho_1, \ldots, \rho_k\}$ with

$$\bigcup_{i=1}^{k} \rho_i = \Omega \quad \text{and} \quad \forall i, j \in \{1, \ldots, k\},\ i \ne j:\ \rho_i \cap \rho_j = \emptyset \quad \text{and} \quad \forall i \in \{1, \ldots, k\}:\ p_i \leftrightarrow \rho_i$$

For B-Trees as used in standard RDBMS the region partitioning usually takes place with respect to one attribute or with respect to several attributes in some lexicographic order. Our region partitioning is more general and will be used in Chapter 3 to define a multidimensional partitioning of a relation.

2.2 Query Types

Relational queries are expressed by operators of the relational algebra and either deal with a single table or combine multiple tables [Cod70]. Single table queries restrict, re-arrange or aggregate the tuples of one relation [Ull88].

When asking a restriction query (denoted by the relational operator σ) we are interested in a subspace of a multidimensional universe. Depending on the shape and the volume of the query space we can distinguish different types of restriction queries. A large set of these queries can be reduced to partial range queries.

Re-arranging means sorting (denoted by ω), projecting (denoted by π) or grouping (denoted by γ) and aggregating (denoted by an aggregation function like sum or count) the tuples of a relation.

The most frequent operation to combine multiple tables is the join operation (denoted by ⋈), mostly the natural join.

In this thesis we will present a new processing technique for the operations illustrated in Figure 2-1.

Definition 2-18 (query, result set): A query is a predicate $\varphi(x)$ over the tuples $x$ of a relation $R$. The result set $RS$ of a query is the subset of tuples stored in $R$ satisfying the query predicate: $RS(R, \varphi) = \{x \in R \mid \varphi(x)\}$. The result set size is the cardinality of the result set, $|RS(R, \varphi)|$.

[Figure 2-1: Query Categories. Queries divide into single table queries (restriction, re-arrangement) and multiple table queries (join, natural join). Restrictions comprise the partial range queries (partial match query, range query, exact match query); re-arrangement comprises projection, sorting, and grouping and aggregation.]

With the duality of multidimensional points and tuples, every query predicate $\varphi(x)$ defines a query space $Q \subseteq \Omega$ with $\varphi(x) \Leftrightarrow x \in Q$. The result set is the set of tuples of the relation that is located in $Q$, i.e., $RS = \{x \in R \mid x \in Q\}$. The result set size is the number of tuples of $R$ located in $Q$.

Definition 2-19 (selectivity): The selectivity of a query $\varphi$ is the number of tuples in the result set of a query compared to the number of tuples stored in the relation:

$$\mathrm{selectivity}(R, \varphi) = \frac{|RS(R, \varphi)|}{|R|}$$

Definition 2-20 (restriction): The query space of a query $Q$ is defined by a restriction in none, some or all attributes.

Typical restrictions are point restrictions, interval restrictions, restrictions depending on another attribute or restrictions depending on the data distribution in the database. We mostly consider independent restrictions, i.e., the restriction in one attribute does not depend on the restrictions of the other attributes or the data distribution of the relation (see Section 3.8 for a treatment of dependent dimensions). Independent restrictions, for instance, are point restrictions and interval restrictions, if the points or intervals for each dimension are independent.

We model queries with independent restrictions by an interval in each attribute. If an attribute is not restricted, we use the interval $[-\infty, \infty]$ resp. $[\lambda_i, \upsilon_i]$. A point restriction in an attribute is considered to be an interval with identical lower and upper bounds.
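As an illustration of this modeling step, the following Python sketch turns independent restrictions into a query box and evaluates its volume (Definitions 2-9 and 2-10) and its selectivity (Definition 2-19) on a small in-memory relation. All names (`query_box`, `volume`, `result_set`, `selectivity`) and the encoding of a restriction as `None`, an integer or a `(low, high)` pair are assumptions made for this example only; the relation is a plain list of integer tuples over domains $\{0, \ldots, r_i - 1\}$.

```python
from fractions import Fraction
from math import prod


def query_box(restrictions, ranges):
    """Build (y, z) from one restriction per attribute.

    None       -> unrestricted, i.e. the full interval [lambda_i, upsilon_i]
    int        -> point restriction, i.e. identical lower and upper bound
    (low, high)-> interval restriction
    """
    lower, upper = [], []
    for res, r in zip(restrictions, ranges):
        if res is None:
            lo, hi = 0, r - 1
        elif isinstance(res, int):
            lo = hi = res
        else:
            lo, hi = res
        lower.append(lo)
        upper.append(hi)
    return tuple(lower), tuple(upper)


def volume(box, ranges):
    """Normalized volume: product over all dimensions of (z_i - y_i + 1) / r_i."""
    y, z = box
    return prod(Fraction(zi - yi + 1, r) for yi, zi, r in zip(y, z, ranges))


def result_set(relation, box):
    """All tuples of the relation located in the query box."""
    y, z = box
    return [x for x in relation if all(a <= b <= c for a, b, c in zip(y, x, z))]


def selectivity(relation, box):
    """Result set size divided by the cardinality of the relation."""
    return Fraction(len(result_set(relation, box)), len(relation))
```

For uniformly distributed data the selectivity is expected to be close to the volume of the query box; for skewed data the two values can differ considerably, which is why both are kept as separate functions here.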

Definition 2-21 (restriction interval, query box): A restriction interval is a one-dimensional interval defining the restriction of one attribute in a query. A query box $Q$ is a multidimensional interval defined by the restriction intervals of a query, i.e.,

$$Q = [[y, z]] = [y_1, z_1] \times \ldots \times [y_j, z_j] \times \ldots \times [y_d, z_d]$$

Without loss of generality we will use normalized restriction intervals for an easier mathematical treatment, i.e., both the lower bound and the upper bound of the restriction interval are values between 0 and 1. In the following we will often identify a restriction by the volume of the corresponding restriction interval.

Definition 2-22 (index candidate, index attribute, result attribute): We call an attribute $x_i$ of a relation $R$ an index candidate of a query $\varphi(x)$ over $R$, if $x_i$ is restricted and/or ordered during the query processing of $\varphi$. We call an index candidate an index attribute, if it is actually indexed by some index $I$ (cf. Section 2.3) on $R$. We call an attribute of $R$ a result attribute, if it is not restricted or ordered when processing $\varphi$.

In this thesis the terms index candidate and index attribute denote physical concepts of a data model (which are used to derive secondary storage structures like indexing, partitioning, clustering or query materialization). In contrast to that the terms candidate key and key attribute as used in relational data modeling (e.g., [Ull88]) denote logical concepts of a data model.

Although result attributes are of great practical relevance, apart from increasing the size of the tuple they often do not influence the behavior of a multidimensional index. Typically result attributes are projected (without duplicate elimination) or aggregated during the processing of a query.

For an easier notation from now on we will omit result attributes of a tuple whenever possible. To be more specific, we will consider $d$ (or $d_R$, when talking about several relations) to be the number of the index candidates of a relation $R$. If a relation has additional result attributes, we denote the arity of a relation by $d'$ (or $d'_R$). The result attributes are $x_{d+1}, \ldots, x_{d'}$.

Note that the terms index candidate and result attribute are query dependent.
For some query an attribute might be an index candidate, while it is a result attribute in another query. However, for many application scenarios like data warehousing the set of index candidates/attributes and the set of result attributes are constant and both sets are disjoint over a quite large set of typical queries.

We use the star schema of the TPC-D benchmark [TPCD97] to illustrate different categories of queries by examples. For the examples of this section we use the ORDER and LINEITEM relations. Figure 2-2 shows the entire schema of the TPC-D benchmark. The value SF is the scaling factor and determines the number of tuples of each relation. The TPC-D benchmark was mainly developed to evaluate the capabilities of RDBMS for complex queries which are frequently found in decision support applications. Next to the schema and data distribution definition the specification consists of 17 pre-defined queries and two update functions.

² Besides increasing the tuple size, result attributes are decisive for the performance of non-clustering indexes, since these attributes are not stored in the index but require one additional random access to the data file. This is analyzed in detail in Section 2.3.

For illustration purposes we define some additional queries in the next sections. We assume that O_TOTALPRICE, O_ORDERDATE, L_SHIPDATE, L_DISCOUNT and L_QUANTITY are index attributes which are used for the multidimensional organization of the corresponding table. Thus $d_{LINEITEM} = 3$, $d'_{LINEITEM} = 16$, $d_{ORDER} = 2$ and $d'_{ORDER} = 9$. We state each query both in a text version and in an SQL version.

PART (P_), 200k*SF tuples: PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
SUPPLIER (S_), 10k*SF tuples: SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
PARTSUPP (PS_), 800k*SF tuples: PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
CUSTOMER (C_), 150k*SF tuples: CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
ORDER (O_), 1500k*SF tuples: ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY, CLERK, SHIPPRIORITY, COMMENT
LINEITEM (L_), 6000k*SF tuples: ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
NATION (N_), 25 tuples: NATIONKEY, NAME, REGIONKEY, COMMENT
REGION (R_), 5 tuples: REGIONKEY, NAME, COMMENT

Figure 2-2: Schema of the TPC-D Benchmark

2.2.1 Partial Match Query

A partial match query restricts some dimensions to a point, while other dimensions are left unspecified. Formally, for a given set of indices $S \subseteq D$ and a point $p \in \Omega$ the result set of the partial match query $PM(p, S, R)$ is:

$$PM(p, S, R) = \{x \in R \mid x_i = p_i \text{ for all } i \in S\}$$

Example 2-1a: list all orders of a given date (Figure 2-3a)

SELECT O_ORDERKEY
FROM ORDER
WHERE O_ORDERDATE = :order_date

Example 2-1b: list all parts shipped on a given date (Figure 2-3b)

SELECT L_PARTKEY
FROM LINEITEM
WHERE L_SHIPDATE = :ship_date

A special form of a partial match query is an exact match query (also called point query) that restricts all dimensions to a point. The result set $EM(p, R)$ of an exact match query defined by a multidimensional point $p \in \Omega$ is:

$$EM(p, R) = \{x \in R \mid x_i = p_i \text{ for all } i \in D\} = PM(p, D, R)$$

Exact match queries are used to retrieve the result attributes of a point defined by equality restrictions in the index attributes. These queries are often used for existence checking or referential integrity checking.

Example 2-1c: list all orders of a given date with a given total price (Figure 2-3c)

SELECT O_ORDERKEY
FROM ORDER
WHERE O_ORDERDATE = :order_date
AND O_TOTALPRICE = :total_price

Example 2-1d: list all parts that have been shipped on a given date with a quantity of 500 and a discount of 3% (Figure 2-3d)

SELECT L_PARTKEY
FROM LINEITEM
WHERE L_SHIPDATE = :ship_date
AND L_DISCOUNT = 0.03
AND L_QUANTITY = 500

[Figure 2-3: Partial match queries (a, b) and exact match queries (c, d). Panels (a) and (c) show the ORDER universe with the axes ORDERDATE and TOTALPRICE; panels (b) and (d) show the LINEITEM universe with the axes SHIPDATE, DISCOUNT and QUANTITY.]
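The two set definitions above translate almost literally into code. The following Python sketch is illustrative only: the function names are hypothetical, dimensions are numbered from 0 instead of 1, and the relation is a plain list of tuples with made-up toy values.

```python
def partial_match(p, S, relation):
    """PM(p, S, R): tuples that agree with the point p in every dimension of S."""
    return [x for x in relation if all(x[i] == p[i] for i in S)]


def exact_match(p, relation):
    """EM(p, R): partial match query that restricts all dimensions."""
    return partial_match(p, range(len(p)), relation)


# Toy ORDER tuples (order date as a day number, total price); values are invented.
orders = [(10, 100), (10, 250), (11, 100)]
same_day = partial_match((10, None), {0}, orders)  # only dimension 0 is restricted
one_order = exact_match((10, 250), orders)
```

Coordinates of `p` outside `S` are never inspected, so they may be left as `None` in a partial match query.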

2.2.2 Range Queries

A range query restricts all index candidates to an interval. For a pair of tuples $y, z \in \Omega$ we construct the query box $Q = [[y, z]]$. The result set $RQ(Q, R)$ of a partial range query is:

$$RQ(Q, R) = \{x \in R \mid (x_1, \ldots, x_d) \in Q\}$$

Note that query boxes are multidimensional intervals and therefore iso-oriented. Moreover, the restriction in one dimension is independent of the restriction in all other dimensions.

Example 2-2a: return the number of orders in 1997 with a total price in a given interval (Figure 2-4a)

SELECT COUNT(O_ORDERKEY)
FROM ORDER
WHERE O_ORDERDATE >= '1.1.1997' AND O_ORDERDATE <= '31.12.1997'
AND O_TOTALPRICE >= :min_price AND O_TOTALPRICE <= :max_price

Example 2-2b: calculate the revenue for all shipments in 1997 with a quantity of less than 500 parts and a discount between 3% and 5% (Figure 2-4b)

SELECT SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM LINEITEM
WHERE L_SHIPDATE BETWEEN '1.1.97' AND '31.12.97'
AND L_DISCOUNT BETWEEN 0.03 AND 0.05
AND L_QUANTITY < 500

A range query is called partial range query, if some attributes are not restricted, i.e., $y_i = -\infty$ and $z_i = +\infty$ for some $i \in D$. Since we are dealing with finite domains, it suffices to use the maximum boundaries instead of $+\infty$ and $-\infty$, i.e., $y_i = \lambda_i$ and $z_i = \upsilon_i$.

Example 2-2c: calculate the total price of all orders of 1997 (Figure 2-4c)

SELECT SUM(O_TOTALPRICE)
FROM ORDER
WHERE O_ORDERDATE >= '1.1.1997' AND O_ORDERDATE <= '31.12.1997'

Example 2-2d: list all shipments in May 97 with a discount between 3% and 5% (Figure 2-4d)

SELECT L_ORDERKEY
FROM LINEITEM
WHERE L_SHIPDATE BETWEEN '1.5.97' AND '31.5.97'
AND L_DISCOUNT BETWEEN 0.03 AND 0.05

[Figure 2-4: Range queries (a, b) and partial range queries (c, d). Panels (a) and (c) show the ORDER universe with the axes ORDERDATE and TOTALPRICE; panels (b) and (d) show the LINEITEM universe with the axes SHIPDATE, DISCOUNT and QUANTITY.]
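Whether a query box describes an exact match, partial match, range or partial range query depends only on how each dimension is restricted. The following Python sketch makes that case distinction explicit. It is an assumption-laden illustration: the names `range_query` and `classify` are hypothetical, `lam` and `ups` stand for the domain bounds $\lambda_i$ and $\upsilon_i$, and a dimension whose bounds coincide is always counted as a point restriction.

```python
def range_query(y, z, relation):
    """RQ([[y, z]], R): tuples located in the closed query box."""
    return [x for x in relation
            if all(yi <= xi <= zi for xi, yi, zi in zip(x, y, z))]


def classify(y, z, lam, ups):
    """Name the query type of the box [[y, z]] over domains [lam_i, ups_i]."""
    points = intervals = unspecified = 0
    for yi, zi, li, ui in zip(y, z, lam, ups):
        if yi == zi:
            points += 1            # point restriction
        elif yi == li and zi == ui:
            unspecified += 1       # full domain, i.e. not restricted
        else:
            intervals += 1         # proper interval restriction
    if points == len(y):
        return "exact match query"
    if intervals == 0:
        return "partial match query"
    if unspecified == 0:
        return "range query"
    return "partial range query"
```

With this convention a partial range query is simply a range query in which the unrestricted dimensions carry the interval $[\lambda_i, \upsilon_i]$, so `range_query` can evaluate all four query types.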

Partial match queries and exact match queries as described in Section 2.2.1 are special cases of partial range queries. For partial match queries either $y_i = z_i$ or both $y_i = -\infty$ and $z_i = +\infty$ for all $i \in D$. For point queries $y_i = z_i$ holds for all $i \in D$.

Thus partial range queries can be classified depending upon whether the restriction in each dimension is a point, an interval or whether this dimension is left unspecified. Table 2-1 shows this classification for the different types of partial range queries described in the previous sections, where $d$ is the number of dimensions and $m$ and $n$ are integer numbers such that $m + n \le d$.

                       point   interval   unspecified
exact match query      d       -          -
partial match query    n       -          d-n
range query            -       d          -
partial range query    m       n          d-m-n

Table 2-1: Number of dimensions restricted to points, intervals or unrestricted for several types of partial range queries

2.2.3 Range Query Sets and Arbitrary Query Spaces

A range query set consists of a union of query boxes $S = Q_1 \cup \ldots \cup Q_n$. The result of the range query set $S$ is the set of tuples

$$RQS(S, R) = \{x \in R \mid (x_1, \ldots, x_d) \in S\}$$

Range query sets can be used to model arbitrary query spaces, since every non iso-oriented query space can be decomposed into or approximated (covered) by a set of iso-oriented query boxes. Figure 2-5 (c and d) shows such query volumes. For certain predicates, this set might become very large. In these cases it might be useful to construct a cover consisting of a fixed number of query boxes that include the query space.

[Figure 2-5: Sets of query boxes (a, b) and arbitrary query spaces (c, d). Panels (a) and (c) show the ORDER universe with the axes ORDERDATE and TOTALPRICE; panels (b) and (d) show the LINEITEM universe with the axes SHIPDATE, DISCOUNT and QUANTITY.]
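A range query set and the simplest possible cover, a single bounding box, can be sketched as follows. The code is a hedged illustration and not the evaluation strategy of the thesis: the names `range_query_set` and `bounding_box` are hypothetical, a box is a pair `(y, z)` of integer tuples, and the relation is scanned in memory.

```python
def in_box(x, box):
    """True if the point x lies in the closed box (y, z)."""
    y, z = box
    return all(yi <= xi <= zi for xi, yi, zi in zip(x, y, z))


def range_query_set(boxes, relation):
    """RQS(S, R) for S = union of the given boxes; each tuple is reported once."""
    return [x for x in relation if any(in_box(x, box) for box in boxes)]


def bounding_box(boxes):
    """Smallest single query box that covers all given boxes."""
    lowers = [y for y, _ in boxes]
    uppers = [z for _, z in boxes]
    return (tuple(map(min, zip(*lowers))), tuple(map(max, zip(*uppers))))
```

Replacing many boxes by one bounding box keeps the number of range queries constant, but the cover may include tuples outside the original query space, so a post-filtering step with the original predicate is then required.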

Example 2-3a: list all orders of 1997 with a total price in a first interval, together with all orders between Oct. 97 and Sept. 98 with a total price in a second interval, together with all orders between Oct. 98 and Mar. 99 with a total price in a third interval (Figure 2-5a) (SQL omitted due to the length of the statement)

Example 2-3b: list all parts that were shipped in May 1997 with a quantity per shipment in a given interval and discounted with either 3% or 5% (Figure 2-5b)

SELECT L_PARTKEY
FROM LINEITEM
WHERE L_SHIPDATE BETWEEN '1.5.97' AND '31.5.97'
AND L_DISCOUNT IN (0.03, 0.05)
AND L_QUANTITY BETWEEN :min_quantity AND :max_quantity

2.2.4 Nearest Neighbor Queries

A nearest neighbor query returns the tuples in the database that have the least distance to a given point with respect to a specified distance function.³ The distance function depends on the application. Typical distance functions for geometric nearest neighbor queries include the Euclidean distance function and the Manhattan chessboard distance. Another important application of nearest neighbor queries are preference queries, where personal preferences are specified by a point and one is interested in finding the tuples in the database that best match these preferences.

Formally, a nearest neighbor query for a certain distance function can be described by

$$NNQ(x, R) = \{y \in R \mid \mathrm{distance}(x, y) \text{ is minimal}\}$$

Nearest neighbor queries differ from the queries in the previous sections, since instead of a query box only one point is specified as input parameter. The space that needs to be retrieved from the database to answer such a query depends on the multidimensional distribution of the data. The result set of a nearest neighbor query often is a single point and in general is much smaller than that of range queries. However, in Section 4.4 we will show that nearest neighbor queries may also be efficiently answered using a range query algorithm.
A one-dimensional nearest neighbor query to find the nearest neighbor to value b for attribute A of relation R with the distance function based on the order < (cf. Definition 2-4) can be expressed in SQL by:

SELECT MIN(A)
FROM (SELECT MIN(A) AS A FROM R WHERE A > b
      UNION
      SELECT MAX(A) AS A FROM R WHERE A < b) AS NEIGHBORS

³ Note that the neighbor concept used in the context of nearest neighbor queries differs from the neighbor concept defined in Section 2.1.1: whereas Section 2.1.1 defines a neighbor in space (space neighbor), here we mean a neighbor that is actually stored in the relation (data neighbor). Only for dense relations, where each point of the base space is actually stored as a tuple in the relation, the space neighbor and the data neighbor of a tuple are identical.
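The formal definition of $NNQ(x, R)$ can be made concrete with a brute-force scan. The Python sketch below is an illustration only and not the spiral algorithm of Chapter 4: the function names are hypothetical, the relation is an in-memory list of tuples, and the two distance functions shown are the Euclidean and the Manhattan distance.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points of equal dimensionality."""
    return math.dist(x, y)


def manhattan(x, y):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_neighbors(x, relation, distance=euclidean):
    """NNQ(x, R): all stored tuples with minimal distance to the point x."""
    if not relation:
        return []
    best = min(distance(x, y) for y in relation)
    return [y for y in relation if distance(x, y) == best]
```

The result depends on the chosen distance function: for the query point (0, 0) and the tuples (3, 3), (0, 5) and (4, 1), the Euclidean distance selects only (4, 1), whereas the Manhattan distance yields a tie between (0, 5) and (4, 1).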

Example 2-4a: show the order that had a total price around a given value and was shipped around a given date (Figure 2-6a)

SELECT O_ORDERKEY
FROM ORDER
WHERE O_ORDERDATE AROUND :order_date
AND O_TOTALPRICE AROUND :total_price

Example 2-4b: show the shipment that shipped a part with a discount around 4% and a quantity around a given number of pieces on a given date (Figure 2-6b)

SELECT L_ORDERKEY
FROM LINEITEM
WHERE L_SHIPDATE = :ship_date
AND L_DISCOUNT AROUND 0.04
AND L_QUANTITY AROUND :quantity

[Figure 2-6: Nearest neighbor query. Panel (a) shows the ORDER universe with the axes ORDERDATE and TOTALPRICE; panel (b) shows the LINEITEM universe with the axes SHIPDATE, DISCOUNT and QUANTITY.]

2.2.5 Further Queries

Especially in geometric applications, further queries are of interest, e.g., the intersection between two sets of objects or the union of two sets of objects. For two sets of extended objects, other examples are enclosure queries, containment queries and adjacency queries. [GG97] argue that determining the intersection between two sets of objects provides an efficient filter step for answering any of these queries. Since the intersection between two sets of extended objects can efficiently be answered by range queries taking the extended objects as query volume, these problems can also be reduced to range queries.

Complex queries involve several tables, which are joined by some join condition. The most frequent method of joining tables is the equi-join, where matching values of attributes of several tables define the result of the join. Usually equi-joins are realized by sort-merge joins or hash joins. In multidimensional space we can easily describe the processing of a sort-merge join: each relation is processed in slices with respect to the join attribute. This allows the tuples of each relation to be processed in sort order of the join attribute. For identical join attributes in both tuples, the tuples are merged and the new tuple is added to the result set.

Processing a relation in sort order of any attribute is also necessary for further operations of the relational algebra such as sorting, projection, and grouping and aggregation. We will discuss this processing method in more detail later in this thesis.

⁴ The keyword AROUND does not exist in SQL. We just use it here for convenience as a distance function for an attribute.

Example 2-5 (TPC-D Shipping Priority Query, Q3): This query retrieves the shipping priority and potential revenue of the orders having the largest revenue among those that had not been shipped as of a given date (cf. [TPC97]).

    SELECT L_ORDERKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
           O_ORDERDATE, O_SHIPPRIORITY
    FROM CUSTOMER, ORDER, LINEITEM
    WHERE C_MKTSEGMENT = 'FOOD'
      AND C_CUSTKEY = O_CUSTKEY
      AND L_ORDERKEY = O_ORDERKEY
      AND O_ORDERDATE < ….5.98
      AND L_SHIPDATE > ….6.98
    GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
    ORDER BY REVENUE DESC, O_ORDERDATE

For Example 2-5 we use C_CUSTKEY, C_MKTSEGMENT, L_ORDERKEY, L_SHIPDATE, O_ORDERKEY, O_ORDERDATE and O_CUSTKEY as index attributes. On each of the three tables an interval restriction in one attribute is imposed by the query. We first join CUSTOMER and ORDER via CUSTKEY by simultaneously processing the tuples of each relation in slices in the order indicated by the arrows in Figure 2-7. Afterwards we sort the intermediate result of CUSTOMER ⋈ ORDER on ORDERKEY and join it with LINEITEM. For each table we use the restriction to reduce the size of each slice.

Figure 2-7: Joins and range queries in multidimensional space (CUSTOMER over C_CUSTKEY and C_MKTSEGMENT; ORDER over O_CUSTKEY, O_ORDERKEY and O_ORDERDATE; LINEITEM over L_ORDERKEY and L_SHIPDATE; arrows mark the sort direction)

2.2.6 Query Processing

An important task of query processing in RDBMS is to efficiently implement algorithms for the basic operations of the relational algebra. Usually these algorithms apply to particular storage structures or access methods. [Gra93] gives a concise survey of query processing.

The selection operation is implemented either by a table scan or, if an index is available, by an index scan. Common access methods for indexes are B-Trees [BM72] and hash indexes. For retrieval-intensive environments further indexing methods like bitmap indexes have been proposed (e.g., [OQ97], [Inf97]). If the selection condition specifies a range in a single attribute, a clustering index greatly speeds up query processing. Conjunctive selection conditions are efficiently processed by composite indexes, intersection of record pointers or multidimensional indexes. We will investigate processing of multi-attribute restrictions in more detail in Section 2.4 and Chapter 6.

The join operation is usually implemented by nested loop algorithms, join indexes, sort-merge algorithms, or hash algorithms. [ME92] surveys join processing in relational databases.

Projection, union, intersection, and set difference are efficiently implemented by processing a relation in some sort order and then using either an index scan or a merge-sort algorithm. Efficient sort operations and the use of restrictions to limit result sets are crucial to many query processing algorithms.

Very often queries combine several operations of the relational algebra like join and restriction. In Chapter 8 we will see how a multidimensional organization of a table can be used to efficiently process queries with multi-attribute restrictions and sort operations.

2.3 Access Methods

In this section we briefly describe the main characteristics of access methods that may be used to process the query types introduced in Section 2.2. Actual query processing strategies will then be described in Section 2.4. Access methods transfer data from secondary storage in order to answer a query on a database.
The simplest access method is a full table scan (FTS), which reads an entire relation and for each tuple checks the predicate ϕ of the query in main memory. Indexes are optional auxiliary data structures associated with a table. Given an index key value, the rows that contain that value can be directly located through the index. Generally, indexes are used to provide keyed access to rows within a table. The goal is to use the restrictions defined by a query to reduce the number of disk pages that have to be retrieved from secondary storage.

A variety of indexes for efficient access to data stored in large databases has been implemented in commercial database systems or is being investigated by the research community. In commercial DBMS heap structures [Knu68, Knu73], hashing [FNP+79] and B-Trees ([BM72], [BU77], see [Com79] for a survey) are used to store tables. The most prevalent data structure is the B-Tree family, since it gives logarithmic performance guarantees with respect

to the number of tuples stored in a table for the basic operations of insertion, deletion and point queries. In addition, B-Trees efficiently process one-dimensional range queries.

2.3.1 Characteristics of Secondary Storage

Secondary storage consists of a stack of rotating disks (cf. Figure 2-8). Each disk consists of a group of tracks, each of which consists of a number of blocks. A block is the smallest unit of transfer of the secondary storage. Each disk has a read/write head (R/W-head) that moves radially over the disk. To read a block, the R/W-head moves to the track containing that block. Since the disk is spinning, a block is read when it passes the R/W-head. Typical block sizes range from 512 bytes to 8 kB. DBMS usually define database pages, each of which consists of a constant number of consecutive disk blocks. Thus database pages can be read with a single positioning of the R/W-head, i.e., by one random access. DBMS page sizes usually range up to 64 kB. Since tuples are often smaller than a database page, a page in general holds more than one tuple.

Figure 2-8: A hard disk (disk stack, disk surface, track, block, read/write head)

Each random access to a disk page takes some time $t_\pi$ to position the R/W-head of the disk to the corresponding page, some time $t_\tau$ to transfer the page from disk to main memory and some time $t_\xi$ to extract the relevant tuples from the page. $t_\pi$ consists of the track positioning time $t_\chi$, i.e., the time to position the R/W-head to the corresponding track, and the latency time $t_\lambda$, i.e., the time the disk needs to spin until the correct page appears under the R/W-head. $t_\pi$ and $t_\tau$ are times spent for I/O, while $t_\xi$ is time spent in the CPU. To distinguish between I/O-time and CPU-time we define $t_{I/O} = t_\pi + t_\tau$ and $t_{CPU} = t_\xi$. All in all we get the time to retrieve a page from secondary storage:

$t_{PAGE} = t_{I/O} + t_{CPU} = t_\pi + t_\tau + t_\xi$

For current hard disks and CPUs typical values are $t_\pi = 10$ ms, $t_\tau = 0.6$ ms and $t_\xi = 0.4$ ms [IBM97]. So the positioning time of a hard disk, and therefore the number of random page accesses, is the limiting factor when retrieving a tuple.

2.3.2 Clustering

The idea of clustering is to store data that is likely to be used together in physical proximity, in order to reduce the number of I/Os necessary to retrieve the data. If the physical proximity is restricted to the tuples on a page, we speak of tuple clustering. If physical proximity in addition also holds between pages, we speak of page clustering. In order to speed up joins, clustering might store the join partners of two relations physically close together. For range queries in attribute A, tuples should be stored in order of attribute A. Then one I/O is likely to retrieve a disk page storing several tuples that are necessary to create the result set of a query. Thus clustering reduces the number of random page accesses that are necessary to answer a query.

Definition (tuple clustering; page clustering): Tuple clustering stores tuples of one or several relations on one disk page, if the tuples are likely to be used together to create the result set of a query. If the tuples do not fit on one page, the tuples have to be stored on several pages. Normally new pages are physically placed on disk in insertion order. Page clustering in addition to tuple clustering also maintains physical clustering between disk pages.

The left part of Figure 2-9 shows unclustered data, tuple clustered data, and page clustered data. Page clustering is hardly feasible in OLTP environments, since each insertion or deletion requires a reorganization of the entire file. Therefore page clustering only exists statically, e.g. after a mass loading process or after the reorganization of a relation.

Clustering needs a grouping function, i.e., an equivalence relation and possibly an ordering on the equivalence classes, to arrange the data on disk. In the following we only consider clustering the tuples of one relation. In this case clustering is useful for range restrictions. For single attribute range queries the clustering order is defined by the sort order on the restricted attribute. For multidimensional range queries there are several possibilities for orderings. Here we just give some orders as examples.
The specific characteristics of these orders will be investigated in Chapter 3.

Example 2-6 (compound ordering, concatenated ordering): We can create a multidimensional ordering of attributes by concatenating the attributes in some order and using a lexicographic ordering between the attributes, where the ordering < of each attribute is used to compare values of this attribute. This ordering is asymmetrical, since it favors the leftmost attributes of the concatenation. We call the ordering defined above compound ordering (or concatenated ordering). For $x_1 \in \{a,b,c,d\}$ and $x_2 \in \{a,b,c,d\}$ the two-dimensional compound ordering $\preceq_{A_1 A_2}$ (see Section 3.1 for a definition of compound ordering) is:

aa $\preceq_{A_1 A_2}$ ab $\preceq_{A_1 A_2}$ ac $\preceq_{A_1 A_2}$ ad $\preceq_{A_1 A_2}$ ba $\preceq_{A_1 A_2}$ bb $\preceq_{A_1 A_2}$ bc $\preceq_{A_1 A_2}$ bd $\preceq_{A_1 A_2}$ ca $\preceq_{A_1 A_2}$ cb $\preceq_{A_1 A_2}$ cc $\preceq_{A_1 A_2}$ cd $\preceq_{A_1 A_2}$ da $\preceq_{A_1 A_2}$ db $\preceq_{A_1 A_2}$ dc $\preceq_{A_1 A_2}$ dd.
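The asymmetry of the compound ordering can be made visible with a few lines of code. The sketch below is only an illustration under the assumption that tuples are represented as two-character strings over {a, b, c, d}; the helper name `positions` is hypothetical. It lists the 16 tuples in compound order and compares where the tuples matching a restriction on the first attribute and on the second attribute end up.

```python
from itertools import product

values = "abcd"
# Compound ordering: lexicographic comparison of the concatenated attributes.
compound = sorted(a1 + a2 for a1, a2 in product(values, repeat=2))

def positions(predicate):
    """Positions in the compound order of all tuples matching the predicate."""
    return [i for i, t in enumerate(compound) if predicate(t)]

# Restriction on the leftmost attribute: the matches are adjacent.
first_attr_b = positions(lambda t: t[0] == "b")
# Restriction on the second attribute: the matches are spread over the whole order.
second_attr_b = positions(lambda t: t[1] == "b")

print(compound)
print(first_attr_b, second_attr_b)
```

With a page capacity of 4 tuples, the first restriction touches one page while the second touches all four, which is the imbalance the text attributes to favoring the leftmost attribute.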

Example 2-7 (Z-ordering): Another way to order the two-dimensional tuples is Z-ordering (see Section 3.1 for a definition of Z-ordering):

aa $\preceq$ ab $\preceq$ ba $\preceq$ bb $\preceq$ ac $\preceq$ ad $\preceq$ bc $\preceq$ bd $\preceq$ ca $\preceq$ cb $\preceq$ da $\preceq$ db $\preceq$ cc $\preceq$ cd $\preceq$ dc $\preceq$ dd.

Z-ordering will be described and analyzed in more detail in Chapter 3.

Figure 2-9: Clustering and dimensionality (unclustered data; one-dimensional tuple clustering and page clustering; multidimensional tuple clustering and page clustering under compound ordering and under Z-ordering; page capacity: 4 tuples; the pages are shown in their linear physical storage order on disk)

Clustering and dimensionality are illustrated in Figure 2-9. Here 16 two-dimensional tuples $\{a,b,c,d\} \times \{a,b,c,d\}$ are stored on 4 disk pages. Each page is surrounded by a box in the figure; the data on each page is stored in the order as read from left to right. While the data are not ordered for unclustered data, they are at least ordered in the first attribute for one-dimensional clustering. For compound ordering and Z-ordering the data are clustered linearly with respect to these multidimensional orderings.

The importance of clustering can immediately be seen from the following retrieval cost estimations (we just consider one-dimensional clustering now; we will deal with the multidimensional case in the later chapters of this thesis). In accordance with [HR96] we use a cost model that takes I/O-time for random page accesses and I/O-time and CPU-time for page transfers into account. We assume that the prefetching strategy of the file system reads a

physical cluster of L consecutive pages from disk with one random access into the read-ahead cache. This takes time $t_\pi + (t_\xi + t_\tau) \cdot L$. Reading k pages in consecutive order therefore takes:

$c_{scan}(k) = \lceil k/L \rceil \cdot t_\pi + \max(k, L) \cdot (t_\tau + t_\xi)$

2.3.2.1 Non-clustered Access / Random Access

Without clustering one random access to a page is necessary for each tuple. If we do not take caching of pages in main memory into account, each tuple requires a random access to the disk. For each random access it is necessary to position the R/W-head, to transfer the page and to extract the tuples from the page. Thus reading k pages with random access takes:

$c_{random}(t_\pi, t_\tau, t_\xi, k) = k \cdot (t_\pi + t_\tau + t_\xi)$

If main memory cache is used, the performance of random accesses may be significantly improved. One way to achieve this is to keep the data pages in cache after their retrieval. Then a page does not need to be fetched again from disk if a further tuple of this page also has to be retrieved. This requires a big portion of main memory, whose size in the worst case is equivalent to the number of tuples in the result set multiplied by the size of a page (if each tuple is stored on a separate page). Another strategy is to identify each tuple by its row identifier, which is the physical location of the tuple on the disk. First the row ids of all tuples in the result set are determined by the index access. Then these row ids are sorted, and the pages are retrieved from disk in this sorted sequence of row ids. This ensures that each page is only accessed once. This strategy requires no main memory for disk pages, but here the locations need to be cached and sorted. Both strategies are only applicable if the result set of the corresponding query is small. Otherwise not enough main memory will be available and the system will start swapping. This will lead to a performance that is worse than the time needed for scanning the entire relation.

2.3.2.2 Tuple Clustered Access

If tuples are stored in a tuple clustered way, all tuples of a page, except for the first page and the last page, contribute to a result set specified by a range in the clustering order. If C is the capacity of one page in tuples, tuple clustering reduces the number of random accesses by a factor 1/C. Thus for k tuples at most $\lceil k/C \rceil + 1$ pages need to be randomly accessed and k tuples are extracted from these pages. Therefore the theoretical retrieval time for k tuples by tuple clustered access is:

$c_{tuple}(C, t_\pi, t_\tau, t_\xi, k) = \min(\lceil k/C \rceil + 1, k) \cdot (t_\pi + t_\xi + t_\tau)$

Tuple clustering is used by B-Trees to speed up range queries. In this case a B-Tree is used to physically organize a relation on secondary storage. Because of this organization such a relation is called index organized table (IOT) in many commercial DBMS [Ora97, IBM97]. We consequently also use this term to denote a clustering B-Tree.

2.3.2.3 Page Clustered Access

If tuples are stored in a page clustered way, prefetching techniques can be used to further reduce the number of random accesses. Each random access retrieves not only one, but L consecutive pages and stores them in cache memory. The next L - 1 page accesses do not require any I/O, since the data is already available in main memory. Thus the number of random accesses gets reduced by a factor of L:

$c_{page}(C, L, t_\pi, t_\tau, t_\xi, k) = \min(\lceil \lceil k/C \rceil / L \rceil + 1, k) \cdot t_\pi + \max(\lceil k/C \rceil, L) \cdot (t_\tau + t_\xi)$

Example 2-8: We investigate the theoretical retrieval times for random access, tuple clustered access and page clustered access for k = 1, 5, 10, 50, 100, 1000 and 10000 tuples. We assume a secondary storage with an average positioning time $t_\pi = 10$ ms and an average data transfer time $t_\tau = 0.6$ ms per page. We further assume that C = 50 tuples fit on one data page. In the case of page clustering we further assume that L = 16 pages are prefetched with one random access. Figure 2-10 shows the tremendous advantage of both types of clustering over random access. An interesting fact is that page clustered access gets CPU-bound very quickly. If we assume that processing the tuples on a page (extraction of the tuples from the retrieved page and transfer to the address space of the user process) takes $t_\xi = 0.4$ µs, the CPU component of the formula for page clustered access exceeds the I/O component when retrieving more than k = 544 tuples.
However, random access and tuple clustering are generally I/O-bound.

Figure 2-10: Theoretical performance of random access, tuple clustering and page clustering (time in ms, split into CPU, transfer and positioning time, for random, tuple clustered and page clustered access with k = 1, 5, 10, 50, 100, 1000 and 10000 tuples)
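The three retrieval cost estimates can be compared numerically. The sketch below is an illustration, not the computation behind the figure: the ceiling brackets, the parameter values ($t_\pi$ = 10 ms, $t_\tau$ = 0.6 ms, $t_\xi$ = 0.4 ms, C = 50, L = 16) and the function names are assumptions made for this example.

```python
from math import ceil

# Assumed parameter values in milliseconds (illustrative, see lead-in).
T_POS, T_TRANSFER, T_CPU = 10.0, 0.6, 0.4
C, L = 50, 16   # assumed tuples per page and prefetched pages per random access

def c_random(k):
    """One random page access per tuple."""
    return k * (T_POS + T_TRANSFER + T_CPU)

def c_tuple(k):
    """Tuple clustering: at most ceil(k/C) + 1 randomly accessed pages."""
    pages = min(ceil(k / C) + 1, k)
    return pages * (T_POS + T_TRANSFER + T_CPU)

def c_page(k):
    """Page clustering: one positioning per L pages, at least L pages transferred."""
    data_pages = ceil(k / C)
    accesses = min(ceil(data_pages / L) + 1, k)
    return accesses * T_POS + max(data_pages, L) * (T_TRANSFER + T_CPU)

for k in (1, 5, 10, 50, 100, 1000, 10000):
    print(f"k={k:>5}  random={c_random(k):>9.1f}  "
          f"tuple={c_tuple(k):>7.1f}  page={c_page(k):>6.1f}")
```

Under these assumptions, random access grows linearly with the number of tuples, while both clustered variants grow with the number of pages; for a single tuple page clustering is the slowest of the three because a whole prefetch unit is transferred.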

Since a prefetching factor L basically means an increased page size, moving to larger page sizes may significantly speed up retrieval of large result sets. Therefore some database vendors offer database page sizes of up to 64 kB. Exact match queries suffer from this strategy, however, since in this case a large amount of unnecessary data needs to be transferred to retrieve just a single tuple.

Although page clustering is hardly maintainable, it is important for one special case: a full table scan (FTS) without using any index is done in this way. Our performance measurements indicate that with a prefetching factor of L = 16 an FTS is about 10 times faster than an index scan of an entire relation [Pe98]. Thus, for queries with a selectivity of more than 10% an FTS is the best access method in a single user environment. With multiple transactions, indexes may compare more favorably even in this case: an FTS puts an enormous load on the system, both in CPU time and I/O-time. In addition, concurrent users may be prevented from updating tuples during an FTS.

2.3.3 Non-Clustering Indexes

Only one physical clustering order is possible for a table on secondary storage without introducing redundancy by storing the table several times. Secondary indexes are used for accessing tuples when the restricted attributes are not included in the clustering order of the clustering primary index. Secondary indexes are a replica of a table that stores the index attributes, i.e., a certain subset of the attributes of a table, in some clustering order together with one additional attribute, which is a reference to the physical location of the entire tuple. We call this reference row identifier or tuple identifier (TID). Thus, when just restricting and retrieving index attributes, secondary indexes can also be regarded as clustered. However, as soon as at least one non-index attribute needs to be retrieved, a random access to a page via the row identifier is necessary for each tuple.

Row identifiers of secondary indexes are often implemented as pointers to physical storage, i.e., a concatenation of a page number and an offset to a tuple on that page.
In most cases a concatenation of integer numbers is sufficient to represent row ids. Oracle 8, for instance, uses row ids of 6 bytes, which consist of three concatenated parts: data file, page number in file, and row number on page [Ora97].

So-called bitmap indexes use a bitmap representation for the set of row identifiers having the same index key value [OQ97]. Thus, one bitmap is stored together with each different index key value and consists of as many bits as there are rows in the table. Each bit of the bitmap corresponds to one row and is set if the corresponding row possesses that key value.

The physical location of a tuple stored in a table organized by a clustering B-Tree (IOT) may change because of page splits and page merges [BM72]. Therefore creating a secondary index on an IOT prevents using physical locations for tuple identifiers. Primary keys or some surrogate of them must be used instead. This requires one additional step of indirection (and often one additional page access) if a secondary index on an IOT is used to retrieve a tuple. Because of these complications some DBMS vendors do not allow the creation of a secondary index on an IOT [Ora97].
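The bitmap representation described above can be sketched with arbitrary-precision integers. This is a minimal illustration, assuming row numbers start at 0 and that each bitmap is kept as a Python `int`; the helper names and the sample column values are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct key value to an int whose bit i is set iff row i has that key."""
    index = {}
    for row, key in enumerate(column):
        index[key] = index.get(key, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Row numbers of the set bits, in ascending (physical) order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample data: two low-cardinality columns of a six-row table.
segment = ["FOOD", "AUTO", "FOOD", "BUILDING", "AUTO", "FOOD"]
priority = ["HIGH", "LOW", "LOW", "HIGH", "HIGH", "HIGH"]

seg_idx = build_bitmap_index(segment)
pri_idx = build_bitmap_index(priority)

# A conjunctive restriction becomes a bitwise AND of the two bitmaps.
matching = rows_of(seg_idx["FOOD"] & pri_idx["HIGH"])
print(matching)
```

Because the bits are ordered by row number, the result comes out already sorted by physical position, so no separate sort of row identifiers is needed before fetching the rows.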

2.4 Answering Range Queries in Present RDBMS

Most commercial RDBMS [Ora97, Inf98, IBM97, TAS98] do not use multidimensional indexes to process multi-attribute range queries. Thus we survey how traditional single-attribute access methods can be used to answer multidimensional range queries. For almost all commercial DBMS this means using B-Trees in combination with special query processing strategies.

2.4.1 Compound B-Trees

Many RDBMS vendors use a concatenation of multiple attributes to create an asymmetrical multidimensional clustering index with compound ordering (see Example 2-6). This results in extremely unbalanced query response times when restricting different attributes (with restrictions of identical selectivities). The concatenation order of the attributes is a crucial factor for the query performance, since the first attribute in the concatenation order is preferred as the main clustering attribute. The asymmetry could only be cured by storing and maintaining at least d, and for optimal performance d!, index replicas, each of which has a different attribute concatenation order. Because of its storage requirements this is hardly feasible for d > 4, since for 4 dimensions it is already necessary to store 24 replicas of the index. So a query profile is necessary to decide which indexes to create. [GHR+97] analyzes this very difficult index selection problem for data warehousing scenarios. In addition to the tremendous storage requirements, it is not possible to use this approach in OLTP settings, because the response times for tuple insertion and deletion are multiplied.

This type of index is called concatenated index or compound index by Informix, Oracle, DB2 and TransBase, or star index by RedBrick [Re97]. In the following we will use the term compound B-Tree for this index type. Very often this index is used as a non-clustering index; Oracle and TransBase may also use this type of index as an IOT to organize a relation.

2.4.2 Multiple Secondary B-Trees or Bitmap Indexes

Another approach is the so-called inverted file or multiple secondary indexes approach. Here a non-clustering secondary index is placed on each attribute.
For answering a multidimensional range query [[y, z]], the query is divided into d one-dimensional intervals $[y_1, z_1], \ldots, [y_d, z_d]$, one for each attribute. Each of these intervals is then answered by the corresponding secondary index, resulting in a set of tuple identifiers (TIDs). The intersection of these d sets of TIDs determines the answer of the range query. For each TID in the intersection the corresponding tuple needs to be retrieved (see also Figure 2-11).

This approach has many performance problems: for d attributes, d indexes need to be maintained. This results in enormous storage requirements and increases OLTP response times for tuple insertion and deletion. Thus one usually selects a subset of these indexes, which then results in a difficult index selection problem. For OLAP databases this index

selection problem (in combination with the materialized view selection problem) is discussed in [GHR+97]. Moreover, intersecting tuple identifiers is a very expensive operation: if an attribute is not very selective, the set of TIDs for this attribute is very large. Each individual set must be stored and sorted. To make the query performance even worse: since the data is not clustered, for small result sets an average of one random access to secondary storage is necessary to retrieve each tuple in the result set by its TID.

To improve the performance some database system vendors use bitmap indexes, since bitmaps provide a more compact representation of tuple identifiers for index attributes with low selectivity. Bitmaps avoid the sorting process, since each bitmap by definition is a list of tuple identifiers sorted in the order of the physical storage location. However, the necessity to access d bitmap indexes and the unclustered access to the tuples still remain as major performance bottlenecks. In addition, the index selection and maintenance problems still exist. When inserting a tuple, the length of each bitmap needs to be updated. This expensive insertion operation heavily limits the usability of bitmap indexes to applications with bulk updates.

Figure 2-11: Using multiple B-Trees for answering range queries (secondary B-Trees on dimension 1, dimension 2, ..., dimension d; one TID sort per dimension; TID merge; access to the data file)
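The multiple secondary indexes approach can be sketched as follows. This is a simplified in-memory illustration, assuming each secondary index is a sorted list of (value, TID) pairs and that the TID is simply the row position; all function names are hypothetical and no real DBMS interface is implied.

```python
from bisect import bisect_left, bisect_right

def build_secondary_index(rows, attr):
    """Sorted list of (attribute value, TID) pairs; the TID is the row position."""
    return sorted((row[attr], tid) for tid, row in enumerate(rows))

def tids_in_interval(index, low, high):
    """TIDs whose attribute value lies in the closed interval [low, high]."""
    start = bisect_left(index, (low, -1))
    end = bisect_right(index, (high, float("inf")))
    return {tid for _, tid in index[start:end]}

def range_query(rows, indexes, box):
    """box maps an attribute position to (low, high); one index scan per attribute."""
    tid_sets = [tids_in_interval(indexes[a], lo, hi) for a, (lo, hi) in box.items()]
    result_tids = sorted(set.intersection(*tid_sets))   # TID merge
    return [rows[tid] for tid in result_tids]            # one fetch per remaining TID

rows = [(1, 5), (2, 7), (3, 1), (4, 6), (5, 5)]
indexes = {0: build_secondary_index(rows, 0), 1: build_secondary_index(rows, 1)}
print(range_query(rows, indexes, {0: (2, 4), 1: (5, 7)}))
```

Each per-attribute TID set can be far larger than the final result, which is the cost the text points out: the work is proportional to the selectivity of the individual intervals, not to the size of the answer.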

2.4.3 Indexes for Processing Range Queries used in Present RDBMS

In this section we survey the access methods that, besides an FTS, commercial RDBMS use to answer multidimensional range queries. Oracle implements both IOTs and non-clustering B*-Tree indexes and also offers a bitmap representation of row ids for non-clustering indexes. Bitmap indexes can also be used to index foreign columns, i.e., columns that are not part of the table, but are somehow joined to the table. As mentioned before, Oracle 8 does not allow one to create secondary indexes on tables organized as an IOT. TransBase offers IOTs and non-clustering B*-Trees. All indexes can be used either to index a single attribute or to combine several attributes in a compound B-Tree. DB2 only allows secondary B-Trees, which are in general not clustered. Clustering in DB2 can only be achieved statically by a reorganization tool. Although Oracle and Informix provide extenders for spatial data relying on kd-trees [Ben75] or R-Trees [Gut84], these are not applicable for general indexing. Multidimensional access methods are not integrated in the core of any of these DBMS. Table 2-1 lists the standard index types of the DBMS Oracle, TransBase and DB2.

DBMS          | index type            | ordering         | clustering | row id                                | index concept
Oracle 8      | secondary index       | single, compound | no (5)     | 6 bytes, concatenated integer number  | B*-Tree
Oracle 8      | index organized table | single, compound | yes        | -                                     | B*-Tree
Oracle 8      | bitmap index          | single           | no         | bitmap                                | B*-Tree
TransBase 4.3 | primary index         | single, compound | yes        | -                                     | B*-Tree
TransBase 4.3 | secondary index       | single, compound | no (5)     | primary key or integer number         | B*-Tree
DB2 UDB       | secondary index       | single, compound | no (5)     | 4 bytes, concatenated integer number  | B*-Tree

Table 2-1: Indexes in present RDBMS

5 Clustering of the entire data tuple is only achieved by mass loading. This page clustering is destroyed when inserting tuples. For an optimal performance raw devices should be used, since the clustering of the DBMS might otherwise be destroyed by the file system. However, the index parts of the tuples are clustered in the index nodes of the B-Tree.

There is no excellent beauty that hath not some strangeness in the proportion. (Francis Bacon)

Chapter 3 Multidimensional Space Partitioning

Space filling curves create a linearization of a multidimensional space and thus can be used for multidimensional space partitioning. Each point is indexed by its ordinal number on a space filling curve. Therefore we use space filling curves to address each point in multidimensional space. A special subspace (region) is constructed by intervals of these addresses. The region concept defines a totally ordered disjoint partitioning of Ω, which may be used to define a clustering index for multidimensional data. In order to get a symmetrical multidimensional index, the multidimensional clustering of spatial neighbors should be preserved by the space filling curve. To preserve spatial proximity a space filling curve must be self-similar, i.e., a fractal curve should be used. [Jag90] investigates several space filling curves for indexing. [Sag94] gives a concise mathematical treatment of space filling curves. In this chapter we investigate several space filling curves and some of their characteristics which are relevant to data processing. Since we rely on the Z-curve for our approach of multidimensional clustering and indexing, we investigate the Z-curve and its properties in more detail.

3.1 Space Filling Curves and Total Multidimensional Orderings

In the following we define two space filling curves, the compound curve (C-curve) and the Lebesgue curve (Z-curve). Without detailed analysis we also state some facts about the Hilbert curve (H-curve). We then investigate the characteristics continuity and monotonicity for one-dimensional orderings of a multidimensional space, as these properties are most relevant to data processing.

Definition 3-1 (space filling function): We call a function $f: S \subset \mathbb{R} \to \Omega$ a space filling function, if $f: S \to \Omega$ is a bijective function. 6

Lemma 3-1: A space filling function defines a one-dimensional ordering for a multidimensional space.

Proof: The multidimensional space is ordered by f(0), f(1), f(2), ...

Space filling curves are usually created by iterating some leitmotif to infinity [Sag90]. For practical applications in computer science it suffices to consider finite iterations. We define two curves for Ω with d dimensions of identical cardinalities $r = r_1 = \ldots = r_d$ and $s = \log_2 r$ by the following formulas:

Definition 3-2 (C-value, C-address): For $x \in \Omega$ and the binary representation of each attribute $x_i = x_{i,s-1} x_{i,s-2} \ldots x_{i,0}$ we define the compound lexicographic value (C-value or C-address) C(x):

$C(x) = \sum_{i=1}^{d} \sum_{j=0}^{s-1} x_{i,j} \cdot 2^{(i-1) \cdot s + j}$

Definition 3-3 (Z-value, Z-address): For $x \in \Omega$ and the binary representation of each attribute $x_i = x_{i,s-1} x_{i,s-2} \ldots x_{i,0}$ we define the Z-value (or Z-address) Z(x):

$Z(x) = \sum_{j=0}^{s-1} \sum_{i=1}^{d} x_{i,j} \cdot 2^{j \cdot d + i - 1}$

Without formal definition we call the values of the Hilbert curve H-values or H-addresses.

6 In mathematical treatments a space filling curve is the image of a continuous space filling function. We drop this requirement for our treatment of finite iterations of space filling functions for discrete finite spaces Ω.
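The two address functions differ only in where each attribute bit is placed. The following sketch illustrates this, assuming a point is a tuple of non-negative integers that each fit into s bits and that the first tuple element plays the role of attribute 1 (the least significant one in both formulas); the function names are chosen for the example.

```python
def c_value(x, s):
    """C-address: concatenate the s-bit attributes, last attribute most significant."""
    # Bit j of attribute i (1-based) lands at position (i-1)*s + j.
    return sum(xi << (i * s) for i, xi in enumerate(x))

def z_value(x, s):
    """Z-address: interleave the attribute bits step by step."""
    d = len(x)
    z = 0
    for j in range(s):                       # bit position within each attribute
        for i, xi in enumerate(x):           # i is 0-based here
            # Bit j of attribute i+1 lands at position j*d + i.
            z |= ((xi >> j) & 1) << (j * d + i)
    return z

# Two dimensions with r = 4 values each, i.e. s = 2 bits per attribute.
print(c_value((3, 0), 2), c_value((0, 1), 2))
print(z_value((1, 0), 2), z_value((0, 1), 2), z_value((2, 0), 2))
```

In the compound address all bits of one attribute stay together, so the last attribute dominates the order; in the Z-address the attributes take turns at every bit level, which is what makes the ordering symmetric across dimensions.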

Lemma 3-2: For $a \in \{0, \ldots, 2^{sd} - 1\}$ with the binary representation $a = a_{sd-1} \ldots a_0$ the inverse function $x = C^{-1}(a)$ is calculated as:

$x = (x_1, \ldots, x_d)$ with $x_i = \sum_{j=0}^{s-1} a_{(i-1) \cdot s + j} \cdot 2^j$

Proof: According to Definition 3-2:

$a = a_{sd-1} \ldots a_0 = C(x) = \sum_{i=1}^{d} \sum_{j=0}^{s-1} x_{i,j} \cdot 2^{(i-1) \cdot s + j}$

Thus the binary digits at the positions $a_{(i-1) \cdot s}, \ldots, a_{(i-1) \cdot s + s - 1}$ in that order form the binary string representation of attribute $x_i$. Therefore

$x_i = \sum_{j=0}^{s-1} a_{(i-1) \cdot s + j} \cdot 2^j$

Applying the above formula for attributes $x_1, \ldots, x_d$ builds the inverse $(x_1, \ldots, x_d) = C^{-1}(a)$.

Lemma 3-3: For $a \in \{0, \ldots, 2^{sd} - 1\}$ with the binary representation $a = a_{sd-1} \ldots a_0$ the inverse function $x = Z^{-1}(a)$ is calculated as:

$x = (x_1, \ldots, x_d)$ with $x_i = \sum_{j=0}^{s-1} a_{j \cdot d + i - 1} \cdot 2^j$

Proof: According to Definition 3-3:

$a = a_{sd-1} \ldots a_0 = Z(x) = \sum_{j=0}^{s-1} \sum_{i=1}^{d} x_{i,j} \cdot 2^{j \cdot d + i - 1}$

Thus the binary digits at the positions $a_{i-1}, a_{d+i-1}, a_{2d+i-1}, \ldots, a_{(s-1) \cdot d + i - 1}$ in that order form the binary string representation of attribute $x_i$. Therefore

$x_i = \sum_{j=0}^{s-1} a_{j \cdot d + i - 1} \cdot 2^j$

Applying the above formula for attributes $x_1, \ldots, x_d$ builds the inverse $(x_1, \ldots, x_d) = Z^{-1}(a)$.
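Both inverse functions amount to picking the right address bits for each attribute. A minimal sketch, assuming addresses are non-negative integers below $2^{sd}$ and that the first element of the returned tuple corresponds to attribute 1; the function names are chosen for the example.

```python
def c_inverse(a, s, d):
    """Split a C-address into d consecutive blocks of s bits each."""
    mask = (1 << s) - 1
    return tuple((a >> (i * s)) & mask for i in range(d))

def z_inverse(a, s, d):
    """Collect every d-th bit of a Z-address for each attribute."""
    x = [0] * d
    for j in range(s):
        for i in range(d):
            # Address bit j*d + i becomes bit j of attribute i+1.
            x[i] |= ((a >> (j * d + i)) & 1) << j
    return tuple(x)

# Two dimensions, s = 2 bits per attribute.
print(c_inverse(3, 2, 2), c_inverse(4, 2, 2))
print(z_inverse(1, 2, 2), z_inverse(2, 2, 2))
```

The addresses 3 and 4 are adjacent on the compound curve but decode to (3, 0) and (0, 1), and the Z-addresses 1 and 2 decode to (1, 0) and (0, 1); in both cases consecutive addresses belong to points that are not neighbors in space.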

Lemma 3-4: $C(\Omega) = Z(\Omega) = \{0, \ldots, 2^{sd} - 1\} \subset \mathbb{R}$

Proof:

(1) $C(\Omega) = \{0, \ldots, 2^{sd} - 1\}$

From Definition 3-2 we derive that C(0,...,0) = 0. For $x \in \Omega \setminus \{(0, \ldots, 0)\}$ the C-value C(x) is larger than zero, since there is at least one attribute $x_i$ whose binary representation has at least one bit set. Thus C(0,...,0) is the minimum of C(x). With the same argument, the maximum value of C(x) is obtained for x = (r-1, ..., r-1) and $C(r-1, \ldots, r-1) = 2^{d \cdot \log_2 r} - 1 = 2^{sd} - 1$. As a consequence of Lemma 3-2, C(x) is a bijective function. Since $|\Omega| = r^d = 2^{d \cdot \log_2 r} = 2^{sd}$, the co-domain of C(x) consists of $2^{sd}$ different values. With the minimum and maximum values for C(x) we get: $C(\Omega) = \{0, \ldots, 2^{sd} - 1\}$

(2) $Z(\Omega) = \{0, \ldots, 2^{sd} - 1\}$

In analogy to (1) we derive Z(0,...,0) = 0 and $Z(r-1, \ldots, r-1) = 2^{sd} - 1$ from Definition 3-3. Z(x) is also bijective. Thus we get $Z(\Omega) = \{0, \ldots, 2^{sd} - 1\}$

Definition 3-4 (compound curve and Z-curve): We call the image of $C^{-1}$ compound curve (C-curve) and the image of $Z^{-1}$ Lebesgue curve (Z-curve).

Universes with different cardinalities for each dimension result in a more complex formula for C-values and Z-values. However, the basic properties of compound curves and Z-curves remain valid in this case as well. Without loss of generality we use a universe with identical cardinalities for all dimensions, because it is easier to understand the basic ideas and the formulas get less complex. Figure 3-1 shows the ordering defined by three different space filling functions, the C-curve (a), the Z-curve (b), and the Hilbert curve (c).

Figure 3-1: Space filling curves: (a) C-curve, (b) Z-curve, (c) Hilbert curve

Lemma 3-5: The C-curve creates the ordering $\preceq_{A_d \ldots A_2 A_1}$ on the multidimensional space Ω.

Proof: According to Definition 3-2 C-values result in a binary representation of each attribute in the form

$C(x) = x_{d,s-1} x_{d,s-2} \ldots x_{d,0} \; x_{d-1,s-1} \ldots x_{d-1,0} \; \ldots \; x_{1,s-1} \ldots x_{1,0}$

For $x \in \Omega$ this is identical to the binary concatenation of the attributes in the order $x_d \ldots x_2 x_1$. Thus, for $x, y \in \Omega$:

$C(x) < C(y) \Leftrightarrow x_d \ldots x_2 x_1 < y_d \ldots y_2 y_1 \Leftrightarrow x \preceq_{A_d \ldots A_2 A_1} y$

Definition 3-5 (Z-ordering, $\preceq$): We call the ordering of the multidimensional space defined by Z-values Z-ordering and use $\preceq$ to denote Z-ordering. We call the lexicographic ordering on the steps of Z-addresses Z-ordering [OM84], since a path through ordinal numbers in each step reflects the letter Z in the two-dimensional case (see Figure 3-1b).

Lemma 3-6 (C-distance of two points): For two points $x, y \in \Omega$ with $x \preceq_{A_d \ldots A_2 A_1} y$ their distance on the C-curve is:

C-distance(x, y) = distance(x, y, $\preceq_{A_d \ldots A_2 A_1}$) = C(y) - C(x)

Proof: The proof is a direct consequence of the definition of distance given in Chapter 2 and Lemma 3-5.

Lemma 3-7 (Z-distance of two points): For two points $x, y \in \Omega$ with $x \preceq y$ their distance on the Z-curve is:

Z-distance(x, y) = distance(x, y, $\preceq$) = Z(y) - Z(x)

Proof: The proof is a direct consequence of the definition of distance given in Chapter 2 and Definition 3-5.

3.2 Properties of Space Filling Curves

After defining the terms continuity and monotonicity for space filling curves, we prove that the compound curve and the Z-curve are not continuous but monotonic.

Definition 3-6 (continuity): A linear ordering of a multidimensional space $(\Omega, \preceq)$ is continuous, if and only if for every $x, y \in \Omega$:

address(x) and address(y) are neighbors $\Rightarrow$ x and y are neighbors

Lemma 3-8: The compound curve and the Z-curve are not continuous.

Proof:

(1) C-curve: For $\Omega = \{0, \ldots, r-1\} \times \ldots \times \{0, \ldots, r-1\}$ the C-values r-1 and r are neighboring compound addresses. However, $C^{-1}(r-1) = (r-1, 0, \ldots, 0)$ and $C^{-1}(r) = (0, 1, 0, \ldots, 0)$ are not neighbors in Ω.

(2) Z-curve: The Z-addresses 1 and 2 are neighboring Z-addresses. However, $Z^{-1}(1) = (1, 0, 0, \ldots, 0)$ and $Z^{-1}(2) = (0, 1, 0, \ldots, 0)$ are not neighbors in Ω.

Definition 3-7 (monotonicity): A one-dimensional ordering < of a multidimensional space $(\Omega, \preceq)$ is monotonic, if and only if for every $x, y \in \Omega$:

$x \le y \Rightarrow$ address(x) $\le$ address(y)

Lemma 3-9: The compound curve and the Z-curve are monotonic.

Proof: If $x \le y$, then for all dimensions $j \in D$ either $x_j = y_j$ or $x_j < y_j$. If $x_i < y_i$ for dimension i, then there exists a bit position k, so that $x_{i,k} < y_{i,k}$ and $x_{i,a} = y_{i,a}$ for all a with $k < a \le s-1$. In both address schemes a higher bit of an attribute is mapped to a higher address position, so the highest address bit in which x and y differ is such a position k of some attribute i, where x has a 0 and y has a 1. With Definition 3-2 this immediately yields C(x) < C(y); with Definition 3-3 we obtain Z(x) < Z(y).
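Both properties can be checked by brute force on a small universe. The sketch below is an illustration for d = 2 and s = 2 only, assuming the bit placement of the Z-address from Definition 3-3 with the first tuple element as attribute 1; the helper names are chosen for the example.

```python
from itertools import product

def z_value(x, s):
    """Z-address: bit j of attribute i+1 (0-based i) goes to position j*d + i."""
    d = len(x)
    return sum(((xi >> j) & 1) << (j * d + i)
               for j in range(s) for i, xi in enumerate(x))

s, d = 2, 2
points = list(product(range(1 << s), repeat=d))

# Monotonicity: x <= y in every dimension implies Z(x) <= Z(y).
monotonic = all(z_value(x, s) <= z_value(y, s)
                for x in points for y in points
                if all(a <= b for a, b in zip(x, y)))

def are_neighbors(x, y):
    """Space neighbors differ by exactly 1 in exactly one dimension."""
    return sum(abs(a - b) for a, b in zip(x, y)) == 1

by_address = sorted(points, key=lambda p: z_value(p, s))
# Continuity would require consecutive addresses to belong to space neighbors.
jumps = [(p, q) for p, q in zip(by_address, by_address[1:]) if not are_neighbors(p, q)]

print(monotonic, jumps[0])
```

The check confirms monotonicity on this universe and reports the first violation of continuity between the Z-addresses 1 and 2, the same pair used in the proof above.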

3.3 Symmetry of Space Filling Curves

The Z-curve has the important property that in many cases the spatial proximity of points is preserved. However, sometimes the distance of two points in Z-values is shorter than the actual spatial distance (distance shrinking). 7 In other cases the Z-distance may be larger than the spatial distance (distance enlargement).

Figure 3-2: Distance shrinking (a) and distance enlargement (b)

Note that for the domains $\Omega_i$, $i \in D$, the distance between two neighboring values is 1.

Definition 3-8 (successor and predecessor of a value): For $\Omega_i$, $i \in D$, and $a \in \Omega_i$ we define the functions

succ(a) = a + 1, if $a < \max(\Omega_i)$
pred(a) = a - 1, if $a > \min(\Omega_i)$

Definition 3-9 (successor and predecessor of a tuple in one dimension): Since the domain of each attribute of a tuple is totally ordered, for $i \in D$ we can extend the definition of successors and predecessors to multidimensional tuples $x \in \Omega$:

$succ_i(x) = (x_1, \ldots, x_{i-1}, succ(x_i), x_{i+1}, \ldots, x_d)$
$pred_i(x) = (x_1, \ldots, x_{i-1}, pred(x_i), x_{i+1}, \ldots, x_d)$

Definition 3-10 (neighbors in one dimension of Ω): For attribute $A_i$, $i \in D$, of a tuple $x \in \Omega$ we define:

$neighbors_i(x) = \{pred_i(x)\}$, if $x_i = r - 1$
$neighbors_i(x) = \{succ_i(x)\}$, if $x_i = 0$
$neighbors_i(x) = \{pred_i(x)\} \cup \{succ_i(x)\}$, otherwise

Note that $\bigcup_{i \in D} neighbors_i(x) = neighbors(x)$.

7 Note that distance shrinking does not occur for the Hilbert curve, since a neighbor on the Hilbert curve is always a neighbor in multidimensional space.

Definition 3-11 (average neighbor distance for a point in one dimension): For a point x we define the average distance of a neighbor with respect to dimension i for the compound curve resp. the Z-curve:⁸
$$\text{C-nd}_i(x) = \mathrm{avg}(\{\text{C-distance}(x, y) \mid y \in \mathrm{neighbors}_i(x)\}) \quad\text{resp.}\quad \text{Z-nd}_i(x) = \mathrm{avg}(\{\text{Z-distance}(x, y) \mid y \in \mathrm{neighbors}_i(x)\})$$

Lemma 3-10: For a point x the average distance of a neighbor on the C-curve with respect to dimension i is:
$$\text{C-nd}_i(x) = r^{i-1}$$

Proof: The proof is a direct consequence of the definition of the C-value.

For the following proofs we define a notation for extracting a bit from a value or an expression:

Definition 3-12 (bit of an expression): For any expression (e) yielding a scalar attribute we denote the j-th rightmost bit of the result of (e) by $(e)_j$. For any expression (e) yielding a tuple we use $(e)_{i,j}$ to denote the j-th rightmost bit of $A_i$ of the result of (e) for any $i \in D$ and $j \in \{0, \ldots, s-1\}$.

Example 3-1: $(2+3)_2 = (101_2)_2 = 1$ and $(1+1)_0 = (10_2)_0 = 0$.

Definition 3-13 ($\varepsilon_i$): For multidimensional domains we define $\varepsilon_i = (x_1, \ldots, x_d)$ with $x_j = 0$ for all $j \in D \setminus \{i\}$ and $x_i = 1$.

Lemma 3-11: For a point x the average distance of a neighbor on the Z-curve with respect to dimension i is
$$\text{Z-nd}_i(x) = \begin{cases} \sum_{j=0}^{s-1} \big((x_i+1)_j - (x_i)_j\big)\, 2^{jd+i-1}, & x_i = 0 \\ \frac{1}{2}\sum_{j=0}^{s-1} \big((x_i+1)_j - (x_i-1)_j\big)\, 2^{jd+i-1}, & x_i \in \{1, \ldots, r-2\} \\ \sum_{j=0}^{s-1} \big((x_i)_j - (x_i-1)_j\big)\, 2^{jd+i-1}, & x_i = r-1 \end{cases}$$

⁸ Note that according to Definition 3-10 each point has one or two neighbors per dimension, depending on whether it lies on the border with respect to that dimension or not.

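Proof: A point $x = (x_1, \ldots, x_d)$ has at most two neighbors in dimension i, namely the two points
$$x - \varepsilon_i = (x_1, \ldots, x_{i-1}, x_i - 1, x_{i+1}, \ldots, x_d) \quad\text{and}\quad x + \varepsilon_i = (x_1, \ldots, x_{i-1}, x_i + 1, x_{i+1}, \ldots, x_d)$$
If $x_i = 0$ or $x_i = r-1$ there is only one neighbor, otherwise two neighbors exist.

For $x_i = 0$ only the neighbor $x + \varepsilon_i$ exists. Since x and $x + \varepsilon_i$ agree in all attributes except $A_i$, the bits of all other attributes cancel and we get:
$$\text{Z-nd}_i(x) = Z(x + \varepsilon_i) - Z(x) = \sum_{j=0}^{s-1} \big((x_i+1)_j - (x_i)_j\big)\, 2^{jd+i-1}$$

Analogously, for $x_i = r-1$ only one neighbor with respect to dimension i exists. Thus:
$$\text{Z-nd}_i(x) = Z(x) - Z(x - \varepsilon_i) = Z(x_1, \ldots, x_{i-1}, r-1, x_{i+1}, \ldots, x_d) - Z(x_1, \ldots, x_{i-1}, r-2, x_{i+1}, \ldots, x_d) = \sum_{j=0}^{s-1} \big((x_i)_j - (x_i-1)_j\big)\, 2^{jd+i-1}$$

For $x_i \in \{1, \ldots, r-2\}$ two neighbors with respect to dimension i exist. We therefore get:
$$\text{Z-nd}_i(x) = \frac{\big(Z(x + \varepsilon_i) - Z(x)\big) + \big(Z(x) - Z(x - \varepsilon_i)\big)}{2} = \frac{1}{2}\sum_{j=0}^{s-1} \big((x_i+1)_j - (x_i-1)_j\big)\, 2^{jd+i-1}$$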

Example 3-2: Figure 3-3 (a and b) shows the C-values and Z-values of the space filling curves for the [0, 7] × [0, 7] universe of Figure 3-1. Although we have not derived a closed formula for the calculation of the values of the Hilbert curve, we call these values H-values and also include them in Figure 3-3c. Next to the right border and below the lower border of Figure 3-3 (a and b) we list the values of $\text{C-nd}_i(x)$ and $\text{Z-nd}_i(x)$. As the formulas suggest, $\text{Z-nd}_1(x_1, x_2)$ is constant when varying $x_2$ for a fixed $x_1$, $\text{Z-nd}_2(x_1, x_2)$ is constant when varying $x_1$ for a fixed $x_2$, and the values of $\text{Z-nd}_2$ are twice the corresponding values of $\text{Z-nd}_1$. $\text{C-nd}_1(x)$ and $\text{C-nd}_2(x)$ are constant over the entire universe. If one defines $\text{H-nd}_i(x)$ in the same way as $\text{C-nd}_i(x)$ and $\text{Z-nd}_i(x)$, $\text{H-nd}_i(x)$ is neither constant in the sense of $\text{Z-nd}_i(x)$ nor in the sense of $\text{C-nd}_i(x)$.

Figure 3-3: C-values, Z-values and H-values (panels (a) to (i): ordinal numbers, $\text{nd}_1(x)$ and $\text{nd}_2(x)$ for the C-curve, the Z-curve and the Hilbert curve)

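Definition 3-14 (cumulated neighbor distance for one dimension): The cumulated neighbor distance for dimension i is
$$\text{Z-nd}(i) = \sum_{x \in \Omega} \text{Z-nd}_i(x) \quad\text{resp.}\quad \text{C-nd}(i) = \sum_{x \in \Omega} \text{C-nd}_i(x)$$

Lemma 3-12:
$$\text{Z-nd}(i) = 2^{i-1}\, r^{d-1} \left(1 + \frac{r^d - 1}{2^d - 1}\right)$$

Proof: For the proof we define $Z_i(a) = \sum_{j=0}^{s-1} a_j\, 2^{jd+i-1}$ for $a \in \{0, \ldots, r-1\}$. By Lemma 3-11, $\text{Z-nd}_i(x)$ depends only on $x_i$, so each value of $x_i$ occurs $r^{d-1}$ times in the sum over Ω. Then the cumulated neighbor distance for the Z-curve is:
$$\text{Z-nd}(i) = \sum_{x \in \Omega} \text{Z-nd}_i(x) = r^{d-1} \left[ \big(Z_i(1) - Z_i(0)\big) + \frac{1}{2}\sum_{a=1}^{r-2} \big(Z_i(a+1) - Z_i(a-1)\big) + \big(Z_i(r-1) - Z_i(r-2)\big) \right]$$
The middle sum telescopes to $Z_i(r-1) + Z_i(r-2) - Z_i(1) - Z_i(0)$, which gives
$$\text{Z-nd}(i) = r^{d-1} \left[ \tfrac{1}{2} Z_i(1) - \tfrac{3}{2} Z_i(0) + \tfrac{3}{2} Z_i(r-1) - \tfrac{1}{2} Z_i(r-2) \right]$$
With $Z_i(0) = 0$, $Z_i(1) = 2^{i-1}$, $Z_i(r-1) = 2^{i-1} \sum_{j=0}^{s-1} 2^{jd} = 2^{i-1}\, \frac{r^d - 1}{2^d - 1}$ (using $2^{sd} = r^d$) and $Z_i(r-2) = Z_i(r-1) - 2^{i-1}$ we obtain
$$\text{Z-nd}(i) = 2^{i-1}\, r^{d-1} \left(1 + \frac{r^d - 1}{2^d - 1}\right)$$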

Lemma 3-13:
$$\text{C-nd}(i) = r^d\, r^{i-1}$$

Proof:
$$\text{C-nd}(i) = \sum_{x \in \Omega} \text{C-nd}_i(x) = \sum_{x \in \Omega} r^{i-1} = |\Omega|\, r^{i-1} = r^d\, r^{i-1}$$

Definition 3-15 (degree of symmetry of a space filling curve): The degree of symmetry of a space filling curve is the negative standard deviation of the cumulated neighbor distances over all dimensions, i.e.,
$$\text{C-symmetry}(\Omega) = -\mathrm{std}(\{\text{C-nd}(i) \mid i \in D\}) \quad\text{resp.}\quad \text{Z-symmetry}(\Omega) = -\mathrm{std}(\{\text{Z-nd}(i) \mid i \in D\})$$

Example 3-3: The compound curve for the universe of Figure 3-1 has the cumulated neighbor distances C-nd(1) = 64 and C-nd(2) = 512. The Z-curve of Figure 3-1 has the following cumulated neighbor distances: Z-nd(1) = 176 and Z-nd(2) = 352. This results in a C-symmetry of −316.8 and a Z-symmetry of −124.5. Without further analysis we list the corresponding numbers for the Hilbert curve: H-nd(1) = 230, H-nd(2) = 362, resulting in an H-symmetry of −93.3.
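The C- and Z-curve figures of Example 3-3 can be recomputed directly from Definitions 3-11, 3-14 and 3-15. The sketch below is a brute-force check, not the thesis' derivation. It assumes the 8 × 8 universe, the bit-interleaving convention in which attribute 1 occupies the lowest bit of each step, and the sample standard deviation (divisor n − 1), which is the variant that matches the quoted symmetry values. All function names are illustrative.

```python
from itertools import product
from statistics import stdev

def z_value(x, s):
    """Z-value of tuple x: interleave s bits per attribute, attribute 1 lowest."""
    d = len(x)
    return sum(((xi >> j) & 1) << (j * d + i)
               for j in range(s) for i, xi in enumerate(x))

def c_value(x, r):
    """Compound value of tuple x: attribute 1 varies fastest."""
    return sum(xi * r ** i for i, xi in enumerate(x))

def cumulated_nd(address, r, d, i):
    """Sum over all points of the average address distance to the
    neighbors in dimension i (0-based index)."""
    total = 0.0
    for x in product(range(r), repeat=d):
        dists = [abs(address(x[:i] + (x[i] + step,) + x[i + 1:]) - address(x))
                 for step in (-1, 1) if 0 <= x[i] + step < r]
        total += sum(dists) / len(dists)
    return total

r, s, d = 8, 3, 2
c_nd = [cumulated_nd(lambda x: c_value(x, r), r, d, i) for i in range(d)]
z_nd = [cumulated_nd(lambda x: z_value(x, s), r, d, i) for i in range(d)]
print("C-nd:", c_nd, "C-symmetry:", round(-stdev(c_nd), 1))
print("Z-nd:", z_nd, "Z-symmetry:", round(-stdev(z_nd), 1))
```

Under these assumptions the script yields 64 and 512 for the compound curve and 176 and 352 for the Z-curve, with symmetries of −316.8 and −124.5.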

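Theorem 3-1 (symmetry theorem): For practical values $d \in \{2, \ldots, 10\}$ and $r \in \{2, \ldots, 10^9\}$ the Z-curve has a higher degree of symmetry than the compound curve.

Proof: With
$$\Psi(d, r) = r^{d-1} \left(1 + \frac{r^d - 1}{2^d - 1}\right)$$
the cumulated neighbor distance for dimension i of the Z-curve is:
$$\text{Z-nd}(i) = 2^{i-1}\, \Psi(d, r)$$
We calculate the average $\mu_Z$ and the standard deviation $\sigma_Z$ of the cumulated neighbor distance over all dimensions:
$$\mu_Z = \mathrm{avg}(\{\text{Z-nd}(i) \mid i \in D\}) = \frac{1}{d} \sum_{i=1}^{d} 2^{i-1}\, \Psi(d, r) = \frac{2^d - 1}{d}\, \Psi(d, r)$$
$$\sigma_Z = \mathrm{std}(\{\text{Z-nd}(i) \mid i \in D\}) = \sqrt{\frac{1}{d-1} \sum_{i=1}^{d} \big(2^{i-1}\, \Psi(d, r) - \mu_Z\big)^2} = \Psi(d, r) \sqrt{\frac{1}{d-1} \left( \frac{4^d - 1}{3} - \frac{(2^d - 1)^2}{d} \right)}$$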

The cumulated neighbor distance for dimension i of the compound curve is:
$$\text{C-nd}(i) = r^d\, r^{i-1}$$
We calculate the average $\mu_C$ and the standard deviation $\sigma_C$ of the cumulated neighbor distance over all dimensions:
$$\mu_C = \mathrm{avg}(\{\text{C-nd}(i) \mid i \in D\}) = \frac{1}{d} \sum_{i=1}^{d} r^d\, r^{i-1} = \frac{r^d\, (r^d - 1)}{d\, (r - 1)}$$
$$\sigma_C = \mathrm{std}(\{\text{C-nd}(i) \mid i \in D\}) = \sqrt{\frac{1}{d-1} \sum_{i=1}^{d} \big(r^d\, r^{i-1} - \mu_C\big)^2} = r^d \sqrt{\frac{1}{d-1} \left( \frac{r^{2d} - 1}{r^2 - 1} - \frac{(r^d - 1)^2}{d\, (r - 1)^2} \right)}$$

For all practical values of $d \in \{2, \ldots, 10\}$ and $r \in \{2, \ldots, 10^9\}$ the ratio $\sigma_Z / \sigma_C$ is less than 1. Due to its length, we omit the formal proof. The basic idea of the proof is that $\sigma_Z / \sigma_C$ decreases monotonically, which is proven by forming the derivatives. We just show the graph of $\sigma_Z / \sigma_C$ in Figure 3-4. The graph even suggests that for $r \to \infty$, $\sigma_Z / \sigma_C$ converges to a fixed value for each dimensionality. Overall, for $d \in \{2, \ldots, 10\}$ and $r \in \{2, \ldots, 10^9\}$ the degree of symmetry of the Z-curve is higher than the degree of symmetry of the compound curve.
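Since the formal proof is omitted, the ratio can at least be evaluated numerically. The sketch below assumes the closed forms Z-nd(i) = 2^(i−1) · r^(d−1) · (1 + (r^d − 1)/(2^d − 1)) and C-nd(i) = r^d · r^(i−1), which agree with the numbers of Example 3-3, and it assumes the sample standard deviation. Exact rational arithmetic is used because the values grow far beyond floating-point range for large r and d. The helper names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

def sample_variance(values):
    """Exact sample variance (divisor n - 1) of a list of integers."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sigma_ratio(d, r):
    """sigma_Z / sigma_C for d dimensions and domain cardinality r = 2**v."""
    # r**d - 1 is divisible by 2**d - 1 when r is a power of two
    psi = r ** (d - 1) * (1 + (r ** d - 1) // (2 ** d - 1))
    z_nd = [2 ** (i - 1) * psi for i in range(1, d + 1)]
    c_nd = [r ** d * r ** (i - 1) for i in range(1, d + 1)]
    return sqrt(sample_variance(z_nd) / sample_variance(c_nd))

for d in (2, 3, 5, 10):
    print(d, [round(sigma_ratio(d, 2 ** v), 4) for v in (2, 3, 10, 30)])
```

For d = 2 and r = 8 this gives 176/448 ≈ 0.3929, the ratio of the two symmetry values in Example 3-3. The printed rows can be compared against the curves of Figure 3-4.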

Figure 3-4: Relative degree of symmetry $\sigma_Z / \sigma_C$, plotted over r (cardinality of each domain) for 2 to 10 dimensions

Hilbert curves are out of the scope of this thesis. Without proof we just state the following facts: The Hilbert curve is continuous and has a higher degree of symmetry than the Z-curve. Unlike the Z-curve, however, it is not monotonic in the sense of Definition 3-7. For our pilot implementation of the UB-Tree, we use the Z-curve, because of its easy implementation (see Section 5.3). Since the Hilbert curve exhibits better symmetry and continuity properties than the Z-curve, implementing the UB-Tree using a Hilbert curve might result in a better space partitioning and therefore in a better performance of the multidimensional index. This might be future work in this field. However, compared to compound clustering both the Hilbert curve and the Z-curve are far superior because of their higher degree of symmetry.

3.4 Regions covered by Space Filling Curves

In the following we define the region concept, which describes a subspace of a multidimensional space that is covered by some part of a space filling curve.

Definition 3-16 (C-region): A C-region $[\alpha :_C \beta]$ is the space covered by an interval of the C-curve and is defined by two C-addresses α and β.

Definition 3-17 (Z-region): A Z-region $[\alpha :_Z \beta]$ is the space covered by an interval of the Z-curve and is defined by two Z-addresses α and β. We often write $[\alpha : \beta]$ instead of $[\alpha :_Z \beta]$, if it is clear from the context that we denote a Z-region.

For an 8 × 8 universe, i.e., s = 3 and d = 2, Figure 3-5a shows a C-region starting at C-address 7 and Figure 3-5b shows the Z-region $[4 :_Z 20]$. Figure 3-5c shows a C-region partitioning with five C-regions $[0 :_C 6]$, $[7 :_C \ldots]$, $[\ldots :_C 31]$, $[32 :_C 51]$ and $[52 :_C 53]$. Figure 3-5d shows a partitioning with five Z-regions $[0 :_Z 3]$, $[4 :_Z 20]$, $[21 :_Z 35]$, $[36 :_Z 47]$ and $[48 :_Z 63]$. Figure 3-5e shows ten points which, for a page capacity of two points per page, might be stored in a partitioned relation with the C-region partitioning of Figure 3-5c or the Z-region partitioning of Figure 3-5d.

Figure 3-5: C-regions and Z-regions (panels (a) to (e))

Lemma 3-14: A Z-region $[\alpha : \beta]$ covers the multidimensional interval defined by the boundaries $Z^{-1}(\alpha)$ and $Z^{-1}(\beta)$ with $Z^{-1}(\alpha) \preceq Z^{-1}(\beta)$, i.e.,
$$\{Z(x) \mid x \in [[Z^{-1}(\alpha), Z^{-1}(\beta)]]\} \subseteq [\alpha : \beta] \quad\text{but in general}\quad \{Z(x) \mid x \in [[Z^{-1}(\alpha), Z^{-1}(\beta)]]\} \ne [\alpha : \beta]$$
and thus
$$[[x, y]] \subseteq \{Z^{-1}(\alpha) \mid \alpha \in [Z(x) : Z(y)]\}$$

Proof:
(1) $\{Z(x) \mid x \in [[Z^{-1}(\alpha), Z^{-1}(\beta)]]\} \ne [\alpha : \beta]$: The Z-region $[Z(y) : Z(z)] = [4 : 20]$ (Figure 3-6c) defined by y = (2, 0) and z = (6, 0) is a proper superset of the space $[[y, z]] = [2, 6] \times [0, 0]$ (Figure 3-6d).

Figure 3-6: Z-regions and query spaces (panels (a) to (d))
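Part (1) of the proof can be replayed for the 8 × 8 universe. The following sketch assumes the bit-interleaving convention in which attribute 1 occupies the lowest bit of each step. The helper names are illustrative only.

```python
def z_value(x, s=3):
    """Z-value of tuple x: interleave s bits per attribute, attribute 1 lowest."""
    d = len(x)
    return sum(((xi >> j) & 1) << (j * d + i)
               for j in range(s) for i, xi in enumerate(x))

def z_inverse(z, d=2, s=3):
    """Tuple whose Z-value is z (inverse of z_value)."""
    x = [0] * d
    for j in range(s):
        for i in range(d):
            x[i] |= ((z >> (j * d + i)) & 1) << j
    return tuple(x)

lower, upper = (2, 0), (6, 0)
alpha, beta = z_value(lower), z_value(upper)

# All points of the Z-region [alpha : beta]
region = {z_inverse(a) for a in range(alpha, beta + 1)}
# All points of the query box [[lower, upper]] = [2, 6] x [0, 0]
box = {(x1, 0) for x1 in range(lower[0], upper[0] + 1)}

print(alpha, beta)               # bounds of the Z-region
print(len(box), len(region))     # the box is much smaller than the region
print(box < region)              # proper-subset test
```

The box contains 5 points while the Z-region [4 : 20] contains 17, so a range query evaluated on this region has to filter out the 12 points that lie outside the query box.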

(2) $\{Z(x) \mid x \in [[Z^{-1}(\alpha), Z^{-1}(\beta)]]\} \subseteq [\alpha : \beta]$ and $[[x, y]] \subseteq \{Z^{-1}(\alpha) \mid \alpha \in [Z(x) : Z(y)]\}$: This is a direct consequence of the monotonicity of the Z-curve in multidimensional space (Lemma 3-9).

The above lemma also holds for C-regions:

Lemma 3-15: A C-region $[\alpha :_C \beta]$ covers the multidimensional interval defined by the boundaries $C^{-1}(\alpha)$ and $C^{-1}(\beta)$, i.e.,
$$\{C(x) \mid x \in [[C^{-1}(\alpha), C^{-1}(\beta)]]\} \subseteq [\alpha :_C \beta] \quad\text{but in general}\quad \{C(x) \mid x \in [[C^{-1}(\alpha), C^{-1}(\beta)]]\} \ne [\alpha :_C \beta]$$
and thus
$$[[x, y]] \subseteq \{C^{-1}(\alpha) \mid \alpha \in [C(x) :_C C(y)]\}$$

Proof: In analogy to the proof of Lemma 3-14.

3.5 Disconnected Z-Regions

Since the Z-curve is not continuous, two neighboring points on the Z-curve may not be neighboring points in the multidimensional space. This means that a Z-region can consist of spatially disconnected subsets of points.

Definition 3-18 (connection in space): We call two sets of points Q and P spatially connected, if there exists a pair of points $x \in Q$ and $y \in P$ that differ in only one attribute and whose distance is the limit of resolution, i.e., distance(x, y) = 1.

Figure 3-7: Connected (a) and disconnected (b) Z-regions

Theorem 3-2 (connection theorem): Any Z-region consists of at most two spatially disconnected sets of points, and such Z-regions exist.

Proof: For this proof we use the terms point and address interchangeably. This is valid since there is a one-to-one mapping between addresses and points. We consider a Z-address to be a series of bits, whereby $\alpha_i$ denotes the i-th bit of address α read from left to right. If a Z-region consists of i disconnected sets of points, we can group the Z-addresses belonging to that Z-region into intervals, each of which only contains a connected set of points. For i such sets we consequently obtain i of these intervals.

We assume that a Z-region $[\alpha : \gamma^*]$ consists of three not connected sets of points. By introducing the addresses $\alpha^*$, β, $\beta^*$ and γ, we obtain three spatially disconnected intervals $[\alpha : \alpha^*]$, $[\beta : \beta^*]$, $[\gamma : \gamma^*]$ with $\beta = \alpha^* + 1$ and $\gamma = \beta^* + 1$. For $Z_s$-addresses with length k the interval bounds are uniquely represented by $\alpha^* = \beta_1 \ldots \beta_j\, 0\, 1 \ldots 1$, $\beta = \beta_1 \ldots \beta_j\, 1\, 0 \ldots 0$ and $\gamma = \gamma_1 \ldots \gamma_k$. Now we distinguish two cases:

1. $\beta_1 \ldots \beta_j = \gamma_1 \ldots \gamma_j$: Since $\gamma > \beta$, there exists a position $a > j + 1$, so that $\gamma_a = 1$ and $\beta_a = 0$. By subtracting 1 from the attribute belonging to bit position a of the tuple cartesian(γ), we obtain a Z-address smaller than γ. However, by the subtraction we never reset a bit in a position to the left of a. Since $a > j$, the resulting Z-address is always less than γ, but not smaller than β. Thus we always obtain Z-addresses in the interval $[\beta, \beta^*]$. In contradiction to the assumption, this means that $[\beta : \beta^*]$ and $[\gamma : \gamma^*]$ are spatially connected.

2. $\beta_1 \ldots \beta_j \ne \gamma_1 \ldots \gamma_j$: This means that $\beta^* = \beta_1 \ldots \beta_j\, 1 \ldots 1$. So every Z-address with the prefix $\beta_1 \ldots \beta_j\, 1$ is contained in $[\beta : \beta^*]$. Adding 1 to the attribute belonging to bit position j + 1 of the tuple cartesian($\alpha^*$) sets bit j + 1 of the Z-address of that tuple (and possibly resets some of the bits with a position greater than j + 1). This results in a Z-address greater than or equal to β, which has the prefix $\beta_1 \ldots \beta_j\, 1$. This Z-address is contained in the interval $[\beta, \beta^*]$. In contradiction to the assumption, this means that $[\alpha : \alpha^*]$ and $[\beta : \beta^*]$ are spatially connected.

Summing up, the assumption of three disconnected sets of points in a Z-region is false in both cases. Combining this with the example of Figure 3-7 proves the theorem.

Spatial connection is an important property of Z-regions, since it guarantees spatial proximity independently of the dimensionality. This allows a Z-region partitioning to organize the multidimensional space while preserving multidimensional neighborhood of data even for skewed data distributions. It makes it possible to construct efficient algorithms for range queries and sorted reading.
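For a small universe the connection theorem can also be checked exhaustively. The sketch below is an illustration under assumptions, not part of the proof. It takes the 8 × 8 universe, treats a Z-region as a closed interval of Z-values, uses the bit-interleaving convention in which attribute 1 occupies the lowest bit of each step, and counts connected components where two points are adjacent if they differ by 1 in exactly one attribute (Definition 3-18). The function names are hypothetical.

```python
def z_inverse(z, d, s):
    """Tuple whose bit-interleaved Z-value is z (attribute 1 holds the lowest bit)."""
    x = [0] * d
    for j in range(s):
        for i in range(d):
            x[i] |= ((z >> (j * d + i)) & 1) << j
    return tuple(x)

def count_components(points):
    """Number of spatially connected components of a set of points."""
    remaining = set(points)
    components = 0
    while remaining:
        components += 1
        stack = [remaining.pop()]
        while stack:                      # depth-first flood fill
            p = stack.pop()
            for i in range(len(p)):
                for step in (-1, 1):
                    q = p[:i] + (p[i] + step,) + p[i + 1:]
                    if q in remaining:
                        remaining.remove(q)
                        stack.append(q)
    return components

d, s = 2, 3
size = 1 << (d * s)                       # number of Z-addresses in the universe
worst = max(count_components(z_inverse(z, d, s) for z in range(a, b + 1))
            for a in range(size) for b in range(a, size))
print(worst)
```

The Z-interval [1 : 2] from the proof of Lemma 3-8 is the smallest region with two components, since (1, 0) and (0, 1) differ in two attributes.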

3.6 Geometric View of Z-Region Partitioning

Earlier in this chapter we defined Z-region partitioning algorithmically. Z-region partitioning can also be defined geometrically by the concepts of Z-areas and Z-regions [Bay96]. In this section we present this different view on Z-regions and prove its equivalence to the Z-region partitioning defined before. We therefore introduce a second notation of Z-addresses complementing the standard notation ($Z_s$-address, Z-address). For the geometric view we define the concept of increment Z-addresses ($Z_i$-address). In addition we introduce the notion of step and length of a Z-address as well as the volume of Z-areas and Z-regions.

Definition 3-19 (Z-area; increment Z-address ($Z_i$-address)): We iteratively define a Z-area Λ as a special subspace of a d-dimensional cube as follows: Split the cube with respect to every dimension in the middle, resulting in $2^d$ subcubes numbered from 1 to $2^d$. A Z-area Λ of level 1 is the union of the first $\alpha_1$ closed subcubes. $\alpha_1$ determines Λ uniquely. We call $\alpha_1$ the Z-address of Λ and write Λ = area($\alpha_1$). The empty Z-area has the address ε, i.e., area(ε) = ∅.

To enlarge a Z-area, we iteratively add a Z-area with Z-address $\alpha_2 \in \{0, 1, \ldots, 2^d - 1\}$ of the next subcube with number $\alpha_1 + 1$. The Z-address of this enlarged Z-area Λ′ is $\alpha_1.\alpha_2$, which is lexicographically larger than the address $\alpha_1$ of area Λ. Next we may enlarge Λ′ by adding a Z-area of the next subcube $\alpha_2 + 1$ of $\alpha_1 + 1$, etc. We call the Z-address representation defined by this schema increment representation and append a subscripted i to address literals to denote that the literals are in increment representation. We write $Z_i$-address to denote a Z-address in increment representation.⁹

In Section 3.6.1 we will define a standard representation for Z-addresses. If a distinction between standard representation and increment representation is not necessary in our explanations, we will use the term Z-address. Otherwise we will use $Z_i$-address or $Z_s$-address to denote a Z-address in increment representation or standard representation, respectively.

Example 3-4: The left part of Figure 3-8 shows four Z-areas area($0.0.1_i$), area($1.3.2_i$), area($2.1_i$), and area($3_i$) of a two-dimensional universe. The shaded subcubes of the two-dimensional universe belong to the corresponding Z-area. area($0.0.1_i$) consists of 0 subcubes of the first level, 0 subcubes of the second level and 1 subcube of the third level. area($1.3.2_i$) consists of 1 subcube of the first subdivision level, 3 subcubes of the second subdivision level and 2 subcubes of the third subdivision level. In the same way area($2.1_i$) consists of 2 subcubes of the first subdivision level and 1 subcube of the second subdivision level. area($3_i$) merely consists of 3 subcubes of the first subdivision level.

⁹ Actually the Z-addresses must be constructed depending on the multidimensional domain. That is, the Z-address representation depends on the data type (rational, irrational, complex, etc.). However, in practical computer systems irrational domains do not exist, and any data type can be mapped to a finite set of natural numbers. Thus it suffices to consider Z-addresses with integer numbers for each level.

Figure 3-8: Z-areas and Z-regions (panels: area($0.0.1_i$), area($1.3.2_i$), area($2.1_i$), area($3_i$), the resulting Z-regions, and the point data)

In the following we suppress trailing zeros of $Z_i$-addresses and denote Z-addresses by α, β, γ.

Definition 3-20 (step and length of a $Z_i$-address): We call $\alpha_j$ the j-th step of the $Z_i$-address $\alpha = \alpha_1.\alpha_2.\ldots.\alpha_k$. We call k the length of the $Z_i$-address α. By $\alpha_{j,i}$ we mean the i-th bit of the j-th step of α in binary representation.

Note that the volume of a subcube decreases exponentially with its step number. We therefore obtain a fine partitioning of the multidimensional space with relatively short $Z_i$-addresses.

For the following lemma we loosen the definition of $Z_i$-addresses and allow the value $2^d$ to occur in the last step of a $Z_i$-address. In this case some Z-areas have two $Z_i$-addresses.

Lemma 3-16: If for any $i > 0$ we have $\alpha_i < 2^d$, the $Z_i$-addresses $\alpha = \alpha_1.\ldots.(\alpha_i + 1)$ and $\alpha^* = \alpha_1.\ldots.\alpha_i.2^d$ define the same area.

Proof: The union of the $2^d$ subcubes at subdivision level i + 1 equals the $(\alpha_i + 1)$-th subcube at level i. Since the previous steps of α and $\alpha^*$ are identical and thus describe the same area, the overall area is identical.

Lemma 3-16 means that we can add steps to each $Z_i$-address until we reach the limit of resolution without changing the Z-area defined by this $Z_i$-address. The $Z_i$-addresses $2.1_i$ and $3_i$ in Figure 3-8 are merely a short form of the addresses $2.0.4_i$ and $2.3.4_i$ respectively. Removing trailing subcube numbers $2^d$ can be thought of as an overflow arithmetic in addresses, taking place upon the completion of an entire subcube of the previous subdivision level when enlarging an area. Note that it is not possible to add steps to ε. We will use this relaxed $Z_i$-address definition as an auxiliary construct in some proofs of this chapter.

Lemma 3-17: The lexicographic ordering of Z-addresses (denoted by $\preceq$) and set containment of areas in space (denoted by $\subseteq$) are isomorphic:
$$\mathrm{area}(\alpha) \subseteq \mathrm{area}(\beta) \Leftrightarrow \alpha \preceq \beta$$

Proof: If and only if α = β, the same subcubes are added to area(α) and area(β) at each step. This results in area(α) = area(β).

(1) $\alpha \prec \beta \Rightarrow \mathrm{area}(\alpha) \subset \mathrm{area}(\beta)$: If $\alpha \prec \beta$, then there exists an index i so that $\alpha_j = \beta_j$ for all $j < i$ and $\alpha_i < \beta_i$. At subdivision point i we therefore add more subcubes to area($\beta_1.\ldots.\beta_{i-1}$) than to area($\alpha_1.\ldots.\alpha_{i-1}$). All subcubes that are added to α at later subdivision points are a subset of the subcube defined by $\beta_i$. Overall we get area(α) ⊂ area(β).

(2) $\mathrm{area}(\alpha) \subset \mathrm{area}(\beta) \Rightarrow \alpha \prec \beta$: If area(α) ⊂ area(β) there is a subdivision level j such that area($\alpha_1.\ldots.\alpha_{j-1}$) = area($\beta_1.\ldots.\beta_{j-1}$) and area($\alpha_1.\ldots.\alpha_j$) ⊂ area($\beta_1.\ldots.\beta_j$). Thus at subdivision level j area($\beta_1.\ldots.\beta_j$) must consist of more subcubes than area($\alpha_1.\ldots.\alpha_j$). Because of the subcube numbering $\beta_j$ then must be larger than $\alpha_j$. Overall we get $\alpha \prec \beta$.

Definition 3-21 (tuple, Z-address of a tuple): A tuple (or pixel) is a smallest possible subcube at the limit of the resolution, but the resolution may be chosen as fine as desired. The Z-address of a tuple is identical to the Z-address of the area defined by including the tuple as the last and smallest subcube contained in this area.

Theorem 3-3 (mapping between tuples and addresses): There exists a one-to-one map between the Cartesian coordinates of a tuple and Z-addresses.

Proof: The one-to-one mapping between the Cartesian coordinates $(x_1, x_2, \ldots, x_d)$ of a d-dimensional tuple and its Z-address α is defined by the addressing scheme of Definition 3-19. As stated in Definition 3-21, a tuple is identified by the area containing the tuple as the last point. The mapping from tuples to Z-addresses can directly be derived from Definition 3-19. Similarly, the area corresponding to α and its last point x are calculated by building the union of subcubes for Z-address α according to Definition 3-19. These two algorithms define the functions of the one-to-one map between Z-addresses and tuples.

Definition 3-22 (mapping between tuples and addresses): We use the following notations for the mapping between a d-dimensional tuple $x = (x_1, \ldots, x_d)$ and its Z-address α:
Z-address(x) = α and cartesian(α) = x

Lemma 3-18: Since the two maps are inverses of each other we get:
cartesian(Z-address(x)) = x and Z-address(cartesian(α)) = α

Definition 3-23 (Z-region): A Z-region is the difference between two Z-areas: If $\alpha \preceq \beta$ then we define the Z-region between α and β as
$$]\alpha : \beta] := \mathrm{area}(\beta) \setminus \mathrm{area}(\alpha)$$
where "\" means set difference.

Note that Z-regions as defined by Definition 3-23 are open below and closed above. Thus a set of ordered $Z_i$-addresses $\{\alpha_1, \ldots, \alpha_n\}$ builds a set of Z-regions $\{]\alpha_1 : \alpha_2], \ldots, ]\alpha_{n-1} : \alpha_n]\}$. These Z-regions are disjoint and therefore partition or tile the universe.

Example 3-5: The areas in Figure 3-8 are used to create five Z-regions: $]\varepsilon : 0.0.1_i]$, $]0.0.1_i : 1.3.2_i]$, $]1.3.2_i : 2.1_i]$, $]2.1_i : 3_i]$, $]3_i : 4_i]$. Each Z-region is shaded with a different gray.

Note that the lower bound address α of a Z-region $]\alpha : \beta]$ represents a point cartesian(α) that does not belong to $]\alpha : \beta]$, i.e., $\alpha \notin\, ]\alpha : \beta]$ and $\beta \in\, ]\alpha : \beta]$. Thus a Z-region can be regarded as a one-dimensional interval of Z-addresses with respect to the Z-ordering $\preceq$: the Z-region $]\alpha : \beta]$ represents the space corresponding to all points located in the Z-interval $]\alpha, \beta]$.

3.6.1 Address Representation

The subcube enumeration defined in the previous section creates variable length $Z_i$-addresses, where the length of the address denotes the number of subdivisions. This address representation is useful when storing Z-addresses, since fewer subdivisions result in shorter addresses and thus in a better memory utilization. For address calculation and theoretical considerations, however, a different Z-address representation, the so-called standard representation of Z-addresses, is desirable.

To calculate the $Z_s$-address for a tuple x the address calculation has to proceed as follows: For each dimension $i \in D$ and each subdivision step $s \in \{1, \ldots, \log_2 r_i\}$ the current split-point $r_i / 2^s$ must be compared with $x_i$. The comparison is a binary decision between $x_i < r_i / 2^s$ and $x_i \ge r_i / 2^s$. It is sufficient to use one bit to store each of these decisions in the address. If $x_i \ge r_i / 2^s$, $r_i / 2^s$ must be subtracted from $x_i$ to correctly process the next step of the subdivision. The subdivision process continues up to the limit of resolution in each attribute. A $Z_s$-address then consists of a sequence of subdivision steps, each of which consists of one bit for each dimension.

It is easy to incorporate varying attribute lengths, i.e., attributes with different resolutions, into this algorithm: When the limit of the resolution in one attribute is reached at a certain subdivision step, this attribute is not used for the subdivision any further. The number of bits in each further step is reduced in this case.

Definition 3-24 (number of steps for an attribute): The number of steps for attribute $A_i$ of a domain with cardinality¹⁰ $r_i$ is determined by its resolution:
$$\mathrm{steps}(i) = \log_2 r_i$$

Definition 3-25 (length of a step): The number of dimensions in step k (i.e., the length of step k in bits) is:
$$\mathrm{steplength}(k) = |\{i \in D \mid \mathrm{steps}(i) \ge k\}|$$

Definition 3-26 (standard representation of a Z-address ($Z_s$-address)): We call the representation of Z-addresses obtained by Algorithm 3-1 standard representation and append a subscripted s to address literals to denote that the standard representation is used. We write $Z_s$-address to denote addresses in standard representation. Each pass through the outer loop of the algorithm defines a step of a $Z_s$-address.

Input:  x = (x_1, ..., x_d): d-dimensional tuple
        r = (r_1, ..., r_d): cardinality vector for the domain of the tuple
Output: α: Z-address(x) in standard representation

    // the algorithm requires the dimensions to be sorted
    // according to their resolution in descending order
    for s = 1 to steps(1)
        for i = 1 to steplength(s)
            if x_i < r_i / 2^s then
                α_{s,i-1} = 0
            else
                α_{s,i-1} = 1
                x_i = x_i - r_i / 2^s
            end if
        end for
    end for

Algorithm 3-1: $Z_s$-address calculation by subdivision

¹⁰ Note that we define $r_i$ to be $2^v$ for some $v \in \mathbb{N}$.
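The subdivision procedure of Algorithm 3-1 can be sketched in Python as follows. One deviation is an assumption on our side: instead of requiring the dimensions to be sorted by resolution, the sketch skips every dimension whose resolution is already exhausted and packs the remaining decision bits from right to left in dimension order. The function name `zs_address` is hypothetical.

```python
def zs_address(x, r):
    """Return the Z_s-address of tuple x as a list of step values.

    x: attribute values, r: domain cardinalities (each a power of two).
    In every step the lowest-numbered active dimension contributes the
    rightmost bit (assumption, see lead-in).
    """
    steps = [ri.bit_length() - 1 for ri in r]   # log2(r_i) for powers of two
    x = list(x)                                  # work on a copy
    address = []
    for s in range(1, max(steps) + 1):
        step_value, position = 0, 0
        for i in range(len(x)):
            if steps[i] < s:                     # resolution of A_i exhausted
                continue
            split = r[i] >> s                    # current split-point r_i / 2^s
            if x[i] >= split:                    # upper half: store a 1 bit
                step_value |= 1 << position
                x[i] -= split                    # continue inside that half
            position += 1
        address.append(step_value)
    return address

print(zs_address((3, 1, 7), (4, 4, 8)))   # three steps of 3, 3 and 1 bits
print(zs_address((2, 0), (8, 8)))         # two-dimensional 8 x 8 universe
```

For a universe with cardinalities (4, 4, 8) and the tuple (3, 1, 7) the sketch produces the steps 5, 7 and 1. For the 8 × 8 universe the steps of (2, 0) are 0, 1, 0, i.e., the bit string 000100 with value 4.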

Note that steps of a Z-address are numbered beginning with 1 from left to right in this thesis, whereas the bits of a step are numbered beginning with 0 from right to left. So the leftmost bit of a step corresponds to the rightmost dimension in the order of dimensions.

$Z_s$-addresses always describe an area with positive volume, i.e., at least the point at the limit of the resolution must be included in the Z-area. Thus the empty Z-area does not have a $Z_s$-address, i.e., there is no counterpart to ε.

Example 3-6: For a 3-dimensional universe with $r_1 = 4$, $r_2 = 4$, $r_3 = 8$ we obtain steps(1) = 2, steps(2) = 2 and steps(3) = 3. The lengths of the steps are: steplength(1) = 3, steplength(2) = 3 and steplength(3) = 1. Calculating the address for the tuple $p = (x_1, x_2, x_3) = (3, 1, 7)$ yields the split-point (2, 2, 4) for step 1. Thus the first step of the $Z_s$-address consists of the three bits 101. The next split-point (1, 1, 2) compared to the modified p = (1, 1, 3) yields 111. At this step the resolution of dimensions 1 and 2 is exhausted. Thus the third step just consists of one dimension (i.e., one bit). The split-point (1) compared to the modified and now one-dimensional tuple p = (1) results in 1 for step 3. The entire $Z_s$-address is Z-address(p) = $101.111.1_s$ = $5.7.1_s$.

Definition 3-27 (contribution of a dimension to a $Z_s$-address): The contribution $\mathrm{contrib}_i(\alpha)$ of dimension i to a $Z_s$-address α is:
$$\mathrm{contrib}_i(\alpha) = \sum_{j=1}^{\mathrm{steps}(i)} \alpha_{j,i-1}\, 2^{\mathrm{steps}(i) - j}$$

Example 3-7: The contributions for α = Z-address(p) of Example 3-6 are $\mathrm{contrib}_1(\alpha) = 3$, $\mathrm{contrib}_2(\alpha) = 1$ and $\mathrm{contrib}_3(\alpha) = 7$.

Lemma 3-19: For $Z_s$-addresses the contribution of an attribute is just the attribute itself.

Proof: According to Algorithm 3-1 each bit $c_j$ of the contribution $c = \mathrm{contrib}_i(\alpha) = c_{\mathrm{steps}(i)-1} \ldots c_0$ represents a binary decision between "less than" and "greater than or equal to" with respect to the split-point of the corresponding subdivision step. Summing up all bits of the contribution from right to left and weighting each bit j with the value of its position $2^j$ then results in attribute $A_i$ with x = cartesian(α).

However, in a later section we will describe a variant of the address calculation where $\mathrm{contrib}_i(\alpha) \ne x_i$.
Lemma 3-20 (domain of a step of a standard address): $Z_s$-addresses result in a subcube numbering between 0 and $2^d - 1$.

Proof: Each subdivision step of the $Z_s$-address calculation consists of d binary decisions, one for each dimension. Since each decision divides the space in the middle, after d decisions the space has been divided in every dimension. We thus have obtained a subcube of the multidimensional space. By storing the binary decision of each dimension in a bit, we obtain a bit string with d bits. Each combination of bits denotes a different subcube. With d bits, $2^d$ subcubes numbered from 0 to $2^d - 1$ exist.

Lemma 3-21 (length of a $Z_s$-address): The length in bits of $Z_s$-addresses is identical for each tuple of a given universe. If the domain of dimension i consists of $r_i$ distinct values, this length is calculated as:¹¹
$$\mathrm{addresslength}(\Omega) = \sum_{i=1}^{d} \mathrm{steps}(i) = \sum_{j=1}^{\mathrm{steps}(1)} \mathrm{steplength}(j)$$

Proof: For each dimension Algorithm 3-1 divides the domain in the middle until the limit of resolution is reached. Each of these subdivisions is represented by one bit. To reach the limit of resolution for a domain consisting of $r_i$ distinct values, steps(i) = $\log_2 r_i$ subdivisions are necessary. Doing this for all dimensions results in $\sum_{i=1}^{d} \mathrm{steps}(i)$ subdivisions, each of which is reflected by one bit of the address. This proves the first part of the equation. According to the definition of $Z_s$-addresses the second part of the formula describes the length in bits as the sum of the lengths of the steps of the address in bits.

Example 3-8: The 3-dimensional universe of Example 3-6 with $r_1 = 4$, $r_2 = 4$, $r_3 = 8$ yields the number of bits for each dimension: $\log_2 r_1 = 2$, $\log_2 r_2 = 2$ and $\log_2 r_3 = 3$. Thus $Z_s$-addresses for this universe have a length of 7 bits.

Definition 3-28 (address incrementation and decrementation): For a Z-address α we define the incremented Z-address α + 1 (and the decremented Z-address α − 1) as follows: Consider α as a bit string and add the binary number 1 to α (or subtract 1 from α). The resulting bit string is then split into steps again and thereby defines the incremented Z-address α + 1 (or the decremented Z-address α − 1).

Example 3-9: Incrementing and decrementing α = Z-address(p) = $5.7.1_s$ of Example 3-6 yields:
α + 1 = $5.7.1_s$ + 1 = $6.0.0_s$
α − 1 = $5.7.1_s$ − 1 = $5.7.0_s$

¹¹ Since Z-addresses are calculated with dimensions sorted in descending order of their resolution, steps(1) denotes the number of steps of a Z-address.

Theorem 3-4 (isomorphism between $Z_s$-addresses and $Z_i$-addresses): For Z-addresses ≠ ε the standard representation and the increment representation of Z-addresses are isomorphic.

Proof: For this proof it is important to remember that $Z_i$-addresses can be enlarged to the length of $Z_s$-addresses (Lemma 3-16). Since there exists no counterpart to ε in $Z_s$-addresses, we must exclude ε from the proof. The subcubes of a Z-area defined by a $Z_i$-address are then numbered from 1 to $2^d$, whereas the subcubes of $Z_s$-addresses are numbered from 0 to $2^d - 1$.

The fundamental difference between standard representation and increment representation is that, with respect to further subdivision steps, in standard representation a subcube number denotes the upper left corner of a subcube, whereas the lower right corner of the subcube is denoted in increment representation. In order to address points which are (in Z-ordering) less than the lower right corner, the subcube number has to be decremented by 1 in increment representation. Therefore the domain of the subcube numbers of $Z_i$-addresses must be enhanced by an additional subcube number zero for each but the last subdivision step (at the limit of resolution no further subdivisions take place, thus a step number of zero is not possible for the last step of a $Z_i$-address). This is illustrated in Figure 3-9.

Considering the area containment up to the last subdivision step there is no difference between standard representation and increment representation, since by adding the additional subcube number zero both representations denote the same subcube by the same number. Because of the different subcube numberings at the limit of resolution (i.e., for the last step) the subcube number in increment representation is larger by one than the subcube number in standard representation.

Figure 3-9: Address representation: (a) standard representation, (b) increment representation

The proof of Theorem 3-4 immediately yields a bijective function to switch between $Z_i$-addresses and $Z_s$-addresses. If we enlarge the $Z_i$-address to the length of the $Z_s$-address and write it as a binary string, we get:
$$\alpha_{\mathrm{standard}} = \alpha_{\mathrm{increment}} - 1$$
In the same way, we calculate:
$$\alpha_{\mathrm{increment}} = \alpha_{\mathrm{standard}} + 1$$

Note that in increment representation the very first step consists of d + 1 bits, since it must be possible to represent the number $2^d$ in this step.

Example 3-10: The point p = (3, 1, 7) of Example 3-6 has the $Z_s$-address $5.7.1_s$, which yields the binary string (0)1011111. Incrementing this $Z_s$-address by 1 yields the string (0)1100000. Splitting this binary string into steps yields (0)110.000.0. Removing trailing zeros yields the $Z_i$-address $6_i$.

Example 3-11: In the same way we calculate the $Z_s$-addresses for the Z-areas area($0.0.1_i$), area($1.3.2_i$), area($2.1_i$), area($3_i$) and area($4_i$) of the two-dimensional universe with resolution 8 of Figure 3-8:

Z_i-address | bit string | decremented bits | splitting in steps | Z_s-address
0.0.1_i | (0)000001 | (0)000000 | (0)00.00.00 | 0.0.0_s
1.3.2_i | (0)011110 | (0)011101 | (0)01.11.01 | 1.3.1_s
2.1.0_i | (0)100100 | (0)100011 | (0)10.00.11 | 2.0.3_s
3.0.0_i | (0)110000 | (0)101111 | (0)10.11.11 | 2.3.3_s
4.0.0_i | (1)000000 | (0)111111 | (0)11.11.11 | 3.3.3_s

Table 3-1: Transformation from $Z_i$-addresses to $Z_s$-addresses

Lemma 3-22 (open regions and closed regions): $]\alpha - 1 : \beta] = [\alpha : \beta]$

Proof: $]\alpha - 1 : \beta]$ denotes the space area(β) \ area(α − 1). This is the subspace of the multidimensional space covered by the Z-address interval $]\alpha - 1, \beta]$. The point corresponding to the Z-address α is included in $]\alpha - 1 : \beta]$, i.e., $]\alpha - 1, \beta] = [\alpha, \beta]$. Thus the Z-region $[\alpha : \beta]$ as defined in Section 3.4 is identical to the Z-region $]\alpha - 1 : \beta]$ as defined in this section.

Lemma 3-23: The ordering of Z-values defined earlier in this chapter is identical to the ordering defined by Algorithm 3-1.

Proof: Algorithm 3-1 calculates the ordinal numbers for the tuples of Ω by storing, for each subdivision step and each dimension $i \in D$, the result of a binary decision between "less than" and "greater than or equal to" the current split-point in one bit of the binary representation of the address α. According to Definition 3-3, Z(x) is calculated in exactly the same way. Therefore Z-values are merely another interpretation of the addresses calculated by Algorithm 3-1.
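The transformation of Table 3-1 amounts to one integer decrement or increment on the concatenated bit string. The sketch below assumes a universe in which every step has exactly d bits (as in the two-dimensional 8 × 8 universe of Figure 3-8), so it does not cover the varying step lengths of Example 3-10. The function names are illustrative only.

```python
def zi_to_zs(zi, d, k):
    """Convert an increment address (list of steps, trailing zero steps
    may be omitted) to a standard address with k steps of d bits each."""
    steps = list(zi) + [0] * (k - len(zi))      # re-append suppressed zeros
    n = 0
    for step in steps:
        n = (n << d) + step                     # first step may equal 2**d
    n -= 1                                      # alpha_standard = alpha_increment - 1
    mask = (1 << d) - 1
    return [(n >> (d * (k - 1 - j))) & mask for j in range(k)]

def zs_to_zi(zs, d):
    """Convert a standard address (list of k steps) to an increment
    address with trailing zero steps removed."""
    k = len(zs)
    n = 0
    for step in zs:
        n = (n << d) | step
    n += 1                                      # alpha_increment = alpha_standard + 1
    mask = (1 << d) - 1
    zi = [n >> (d * (k - 1))]                   # first step keeps the carry bit
    zi += [(n >> (d * (k - 1 - j))) & mask for j in range(1, k)]
    while len(zi) > 1 and zi[-1] == 0:          # suppress trailing zeros
        zi.pop()
    return zi

for zi in ([0, 0, 1], [1, 3, 2], [2, 1], [3], [4]):
    zs = zi_to_zs(zi, d=2, k=3)
    print(zi, "->", zs, "->", zs_to_zi(zs, d=2))
```

Each row of the output should round-trip to the increment address it started from, which mirrors the bijection stated after Theorem 3-4.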

3.6.2 Volumes of Z-Areas and Z-Regions

The volume of a Z-area Λ is the fraction of the entire space that is covered by Λ. Since area(ε) = ∅, vol(area(ε)) = 0. If a Z-area has a positive volume, this volume can easily be calculated from the $Z_s$-address of the Z-area:

Lemma 3-24: If $\beta_1 \ldots \beta_k$ is the bit sequence of a $Z_s$-address numbered from left to right, the volume of the corresponding area is calculated as
$$\mathrm{vol}(\mathrm{area}(\beta_1 \ldots \beta_k)) = 2^{-k} + \sum_{i=1}^{k} \beta_i\, 2^{-i}$$

Proof: According to the definition of $Z_s$-addresses, each bit $\beta_i$ of a standard address represents a decision between two parts of space with equal volume. If $\beta_i$ is set, the first half of the space is completely contained in the area. Otherwise, only a part of the first half of the space is contained in the area. This part is described by the following bits of the address. Since each subdivision step divides the previous subspace into two spaces of equal volume, $\beta_i$ describes a binary decision in a subspace with a volume of $2^{-(i-1)}$: If $\beta_i$ is set, a subspace with volume $2^{-i}$ is included in the area. Thus $\beta_i\, 2^{-i}$ describes the contribution of the subspace of volume $2^{-i}$ to the overall volume of area(β). At the limit of resolution either one or two subcubes with volume $2^{-k}$ are included. If $\beta_k$ is zero, only the first subcube is included, otherwise both subcubes are included. We immediately get the above formula by weighting each set bit by the volume of the corresponding subspace, adding up these volumes and correctively adding $2^{-k}$ for the last step.

Example 3-12: The volumes of the Z-areas in Figure 3-8 are:
vol(area(ε)) = 0
vol(area($0.0.1_i$)) = vol(area($00.00.00_s$)) = 1/64
vol(area($1.3.2_i$)) = vol(area($01.11.01_s$)) = 1/4 + 1/8 + 1/16 + 1/64 + 1/64 = 30/64
vol(area($2.1_i$)) = vol(area($10.00.11_s$)) = 1/2 + 1/32 + 1/64 + 1/64 = 36/64
vol(area($3_i$)) = vol(area($10.11.11_s$)) = 1/2 + 1/8 + 1/16 + 1/32 + 1/64 + 1/64 = 48/64
vol(area($4_i$)) = vol(area($11.11.11_s$)) = 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64 + 1/64 = 1
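The volume formula is easy to evaluate with exact fractions. The following sketch assumes the formula vol = 2^(−k) + Σ β_i · 2^(−i) for a $Z_s$-address given as a bit string read from left to right. The function name `area_volume` is hypothetical.

```python
from fractions import Fraction

def area_volume(bits):
    """Volume of the Z-area whose Z_s-address has the given bit string.

    bits: string of '0'/'1' characters, leftmost bit first; dots that
    separate the steps are ignored.
    """
    bits = bits.replace(".", "")
    k = len(bits)
    volume = Fraction(1, 2 ** k)                 # last subcube at the limit of resolution
    for i, bit in enumerate(bits, start=1):
        volume += Fraction(int(bit), 2 ** i)     # bit i contributes 2^-i if set
    return volume

for zs in ("00.00.00", "01.11.01", "10.00.11", "10.11.11", "11.11.11"):
    v = area_volume(zs)
    print(zs, v, "=", v * 64, "/ 64")
```

Note that `Fraction` reduces its results, so 30/64 is printed as 15/32. Multiplying by 64, as in the last column, recovers the numerators in units of one pixel of the 8 × 8 universe.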

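Lemma 3-25: If $\alpha_1 \ldots \alpha_k$ and $\beta_1 \ldots \beta_k$ are the bit sequences of the addresses α and β, the volume of the Z-region $]\alpha : \beta]$ is calculated as
$$\mathrm{vol}(]\alpha : \beta]) = \mathrm{vol}(\mathrm{area}(\beta)) - \mathrm{vol}(\mathrm{area}(\alpha)) = \sum_{i=1}^{k} (\beta_i - \alpha_i)\, 2^{-i}$$

Proof: $]\alpha : \beta] = \mathrm{area}(\beta) \setminus \mathrm{area}(\alpha)$. Since area(α) ⊆ area(β) by Lemma 3-17, the volume of the set difference is the difference of the volumes. Applying Lemma 3-24 to both areas immediately yields the formula, because the two correction terms $2^{-k}$ cancel.

Example 3-13: The volumes of the Z-regions in Figure 3-8 are:
vol($]\varepsilon : 0.0.1_i]$) = vol(area($0.0.1_i$)) − vol(area(ε)) = 1/64 − 0 = 1/64
vol($]0.0.1_i : 1.3.2_i]$) = vol(area($1.3.2_i$)) − vol(area($0.0.1_i$)) = 30/64 − 1/64 = 29/64
vol($]1.3.2_i : 2.1_i]$) = 6/64
vol($]2.1_i : 3_i]$) = 12/64
vol($]3_i : 4_i]$) = 16/64

3.7 Independent Dimensions

In the following we consider multidimensional universes which are partitioned into Z-regions by point data which is distributed independently in each dimension.

Definition 3-29 (dependence and independence of dimensions): We call two dimensions $x_i$ and $x_j$ of a multidimensional data distribution dependent, if there exists a mapping f such that $f(x_i) = x_j$. Otherwise we call $x_i$ and $x_j$ independent.

In the following P(x) is the probability of the occurrence of a tuple $x \in \Omega$ in the relation R. By $P_i(x_i)$, $i \in D$, we denote the probability of the occurrence of the attribute value $x_i \in \{1, \ldots, r_i\}$. By $F_i(x_i)$ we denote the cumulated distribution for the attribute value $x_i \in \{1, \ldots, r_i\}$ of dimension i, i.e.,
$$F_i(x_i) = \sum_{c \le x_i} P_i(c)$$
For independent data distributions of each attribute the probability $P(x_1, \ldots, x_d)$ is the product of the probabilities $P_i(x_i)$, i.e.,
$$P(x_1, \ldots, x_d) = \prod_{i=1}^{d} P_i(x_i)$$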

In the following we consider several cases of independent data distributions for each dimension, namely uniformly distributed data, Gaussian distributed data and combinations of both.

3.7.1 Uniformly Distributed Data

The simplest data distribution for a domain is uniformly distributed data. For uniformly distributed data the probability of the occurrence of a specific value is constant and identical for each value, i.e.,

$$P_i(x_i) = 1 / r_i$$

Uniform distribution is in general only an approximation for computer-generated data. Real-world applications seldom create uniformly distributed data. Nevertheless, this type of distribution is interesting from a theoretical point of view: If the data distribution of an attribute is unknown, it is often useful to assume uniform distribution in order not to disfavor a certain value. In addition, the theoretical analysis is greatly simplified by the assumption of uniformly distributed data.

Figure 3-0: Uniformly and Gaussian distributed data (panels a, b, c)

Figure 3-0a shows the point and Z-region distribution for a partitioned relation with 5000 points stored on about 00 disk pages (i.e., about 00 Z-regions), where the values of both dimensions are distributed uniformly in the corresponding domain.

3.7.2 Gaussian Distributed Data

A very frequent data distribution for practical applications is the Gaussian distribution (or normal distribution). The density and distribution functions of the Gaussian distribution are:

$$P_i(x_i) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$

$$F_i(x_i) = \Phi_{\mu,\sigma}(x_i) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x_i} e^{-\frac{(u - \mu)^2}{2\sigma^2}} \, du$$

Gaussian distributed data is located around the average value µ with a standard deviation of σ. Therefore the data is clustered around the center µ. Figure 3-0b shows a multidimensional data distribution, where both dimensions are independently Gaussian distributed, and the corresponding region partitioning of the multidimensional space. Figure 3-0c shows a multidimensional data distribution, where the horizontal dimension is Gaussian distributed and the vertical dimension is uniformly distributed. These pictures show an underlying principle of Z-region partitioning: Z-region partitioning adapts to the density of the data, while trying to preserve locality in the partitions (regions).

3.7.3 Idealized Uniform Partitioning

In this section we investigate a special case of uniformly distributed data, namely an idealized uniformly partitioned universe.

Definition 3-30 (idealized uniform partitioning): A set of P addresses is idealized uniformly distributed, if all of these addresses only differ in the ⌈log_2 P⌉ leftmost bits, whereby all possible combinations of the ⌊log_2 P⌋ leftmost bits exist and all bits with a position greater than ⌈log_2 P⌉ are set. We call the partitioning of the multidimensional space introduced by the sequence of these Z-addresses idealized uniform partitioning.

An idealized uniform partitioning produces rectangular regions, where each region has either the shape of a subspace with volume 2^{-⌈log_2 P⌉} or consists of two of these subspaces. An idealized uniform partitioning for the data of Figure 3-a is illustrated in Figure 3-b. If further data is inserted, the idealized uniform partitioning consists of even more quadratic Z-regions (Figure 3-c). This continues until ⌈log_2 P⌉ = ⌊log_2 P⌋. Then the partitioning consists of a complete "chessboard"; further insertion then halves the quadratic regions.

Figure 3-: Idealized uniform partitioning (panels a, b, c)

Independent and uniformly distributed key attributes are not very common in practice. However, the concept of Variable UB-Trees (see Chapter 6.4) allows one to create a uniform partitioning of independently distributed dimensions regardless of their data distribution. Many of the observations derived for uniformly partitioned data also hold for Variable UB-Trees.

3.8 Dependent Dimensions

Dependencies or correlations between attributes often occur in practical applications. The salary, for instance, is very often related to the age of a person, and the horsepower of a car's engine is normally related to the price of the car. In the following sections we investigate several types of dependent data distributions.

3.8.1 Linear Dependency

Definition 3-31 (linear dependency): Two dimensions i and j are linearly dependent, if the value of x_i is a linear function of the value of x_j, i.e., for m ≠ 0 and an arbitrary value c:

$$x_i = m \, x_j + c$$

Linearly dependent dimensions can cause a dependence of particular bits of the values b_i of x_i and b_j of x_j. This restricts the number of bit combinations possible for addresses and therefore reduces the number of splits that are necessary to entirely subdivide the universe. Figure 3-a and Figure 3-b show two special cases of linear dependency, namely the identity (m = 1, c = 0) and the inverse identity (m = −1, c = 0).
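The effect of the identity dependency on address bits can be made concrete with a small experiment. The sketch below is illustrative only: it assumes that Z-addresses are formed by plain bit interleaving (most significant bits first, dimensions in the order given), and the helper names `z_address` and `distinct_prefixes` are hypothetical. For the identity, the two interleaved bits of every level are always equal, so an address prefix of a given length distinguishes fewer parts of the data than for independent data.

```python
def z_address(point, bits):
    """Interleave the bits of all coordinates, most significant bit first."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for coordinate in point:
            z = (z << 1) | ((coordinate >> b) & 1)
    return z


def distinct_prefixes(points, bits, prefix_len):
    """Set of address prefixes of length prefix_len that occur in the data."""
    total = bits * len(points[0])
    return {z_address(p, bits) >> (total - prefix_len) for p in points}


if __name__ == "__main__":
    bits = 3
    independent = [(x, y) for x in range(8) for y in range(8)]
    identity = [(x, x) for x in range(8)]
    for n in (2, 4):
        print(n, len(distinct_prefixes(independent, bits, n)),
              len(distinct_prefixes(identity, bits, n)))
```

With two address bits the independent data falls into four parts, while the identity data occupies only two of the four possible bit combinations.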

Figure 3-: Dependent dimensions (panels a, b, c)

3.8.2 Further Dependencies

We use sine dependency to illustrate the behavior of Z-region splits for non-linear dependencies. Sine dependencies occur if one dimension is periodically growing and shrinking. This is true for construction workers during summer and winter, for stock rates, etc.

Definition 3-32 (sine dependency): Two dimensions i and j are sine dependent, if the value of x_i can be calculated by a sine function of the value of x_j, i.e., for f ≠ 0, a ≠ 0:

$$x_i = a \sin(f \, x_j)$$

As with linear dependencies, sine dependency causes correlations between some bits of the attributes. This again means that one split of a Z-region partitions the universe with respect to several attributes. For more complicated dependencies this effect may not be as powerful as for the simple cases of sine dependency and linear dependency, since it might not occur globally in the entire universe, but locally constrained to some part of the universe. Figure 3-c shows sine dependency and the corresponding Z-region partitioning for specific values of a and f.

3.9 High Dimensionalities

Preserving spatial proximity becomes increasingly difficult with increasing dimensionality. In d-dimensional space the number of neighbors of each point that is not situated on some border of the universe grows with d. When recursively partitioning a d-dimensional space s times in each dimension, 2^{s·d} partitions are created. Since the Z-regions of a Z-region partitioning correspond to database pages, each additional partitioning level requires an exponential increase in the database size.

Definition 3-33 (total split depth; ideal split depth per dimension): We call the number of complete recursive splits of a Z-region partitioning its total split depth l. For a Z-region partitioning consisting of P idealized uniformly distributed Z-regions, the total split depth is calculated as:

$$l(P) = \lfloor \log_2 P \rfloor$$

⌊l(P)/d⌋ then is the lower bound for the ideal split depth per dimension. The rightmost l(P) mod d dimensions have one additional complete split level. (We have defined the address calculation to use the order d, d−1, d−2, ..., 2, 1 of dimensions to calculate addresses. Therefore the rightmost dimensions are subdivided first.) Thus the upper bound of the ideal split depth per dimension is ⌊l(P)/d⌋ + 1. Therefore we can calculate the ideal split depth for dimension i as:

$$l_i(P) = \begin{cases} \lfloor l(P)/d \rfloor + 1, & l(P) \bmod d > d - i \\ \lfloor l(P)/d \rfloor, & \text{otherwise} \end{cases}$$

Figure 3-3 displays the ideal split depth per dimension for Z-region partitionings with up to 20 dimensions. The figure shows that dimensionalities larger than 6 result in an ideal split depth of at most 5 for tables of 80 GB. However, an ideal split depth of 5 means 2^5 = 32 subdivisions in each dimension. Sufficiently high dimensionalities never exceed an ideal split depth of 3 even for huge databases. According to the figure, dimensionalities larger than 20 almost always consist of only one split level for each dimension, and for still higher dimensionalities one can expect that many of the dimensions will not be split at all and thus do not contribute to the space partitioning.

This means that for average to large tables the Z-region partitioning yields a suitable space partitioning for dimensionalities less than or equal to 6. The multidimensional space partitioning deteriorates exponentially with increasing dimensionality, since the table size needs to grow exponentially to compensate for an increase in dimensionality.
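The ideal split depth per dimension follows directly from the total split depth and the rule that the rightmost l(P) mod d dimensions receive one additional split level. The following sketch illustrates this under stated assumptions: dimensions are numbered 1..d, the function names are hypothetical, and the page count of 10^7 for an 80 GB table with 8 kB pages uses decimal units.

```python
def total_split_depth(pages):
    """floor(log2(pages)), computed exactly on integers (pages >= 1)."""
    return pages.bit_length() - 1


def ideal_split_depths(pages, d):
    """Ideal split depth of dimensions 1..d for `pages` idealized regions."""
    base, extra = divmod(total_split_depth(pages), d)
    # the rightmost `extra` dimensions (d, d-1, ...) are split once more
    return [base + 1 if i > d - extra else base for i in range(1, d + 1)]


if __name__ == "__main__":
    pages = 10_000_000                 # assumed: 80 GB table, 8 kB pages
    for d in (2, 6, 12, 20):
        print(d, ideal_split_depths(pages, d))
```

The depths of all dimensions always add up to the total split depth, which makes the exponential cost of additional dimensions visible: doubling d roughly halves the depth available per dimension.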

Figure 3-3: Split depth per dimension for idealized uniformly distributed regions (split depth per dimension plotted against table size in pages, respectively bytes for 8 kB pages, from 8 kB to 8 TB, for 1 to 20 dimensions)

3.10 Utilization of the Multidimensional Space

Definition 3-34 (actual domain of a dimension): For a relation R(x_1,...,x_d), x_i ∈ D_i for all i ∈ D, the actual domain A_i of dimension i is:

$$A_i = \{ x_i \mid (x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) \in R \}$$

In most cases, the actual domain of an attribute and the data distribution in that domain are not known in advance. In real world applications, usually A_i ⊂ D_i, since not all anticipated values exist in the database. On the other hand one easily runs into problems when the specified domain is too small, e.g., the currently very much discussed year-2000 problem, where D_i = [0, 99] and soon A_i ⊄ D_i.

Definition 3-35 (partition of an actual domain): For i ∈ D, s ∈ {0, 1, ..., log_2 r_i} and k ∈ {1, ..., 2^s} we define the k,i,s-partition of the actual domain as the values of the actual domain which are located between (k−1)·2^{-s}·100% and k·2^{-s}·100% of the domain:

$$\mathrm{partition}_{i,s}(k) = \{ x_i \mid \hat{x}_i \in \; ](k-1) \, 2^{-s}, \; k \, 2^{-s}] \text{ for } x_i \in A_i \}$$

Definition 3-36 (prefix length of a partition): The prefix length of a partition of an actual domain is the length of the common prefix of the partition in binary representation, i.e., the number of common bits of all values in the partition:

$$\text{prefix-length}_{i,s}(k) = |\text{common-prefix}(\mathrm{partition}_{i,s}(k))|, \quad \text{if } \mathrm{partition}_{i,s}(k) \neq \emptyset$$

Example 3-4: For the domains D_1 = D_2 = D_3 = [0, 7] we assume the actual domains of the relation R to be

A_1 = {000, 001, 010, 011, 100, 101, 110, 111}, A_2 = {000, 001, 010, 011}, A_3 = {000, 001}

This yields the following partitions:

partition_{1,0}(1) = A_1
partition_{1,1}(1) = {000, 001, 010, 011}, partition_{1,1}(2) = {100, 101, 110, 111}
partition_{1,2}(1) = {000, 001}, partition_{1,2}(2) = {010, 011}, partition_{1,2}(3) = {100, 101}, partition_{1,2}(4) = {110, 111}

partition_{2,0}(1) = A_2
partition_{2,1}(1) = {000, 001, 010, 011} = A_2, partition_{2,1}(2) = ∅
partition_{2,2}(1) = {000, 001}, partition_{2,2}(2) = {010, 011}, partition_{2,2}(3) = ∅, partition_{2,2}(4) = ∅

partition_{3,0}(1) = A_3
partition_{3,1}(1) = {000, 001} = A_3, partition_{3,1}(2) = ∅
partition_{3,2}(1) = {000, 001} = A_3, partition_{3,2}(2) = ∅, partition_{3,2}(3) = ∅, partition_{3,2}(4) = ∅

Thus while the space partitioning of the domain yields a full partitioning of the actual domain for dimension 1, the actual domain of dimension 2 is unevenly partitioned. The actual domain of dimension 3 is not partitioned at all. In this case we get the following binary prefix lengths:

prefix-length_{1,0}(1) = 0, prefix-length_{1,1}(k) = 1 for k ∈ {1, 2} and prefix-length_{1,2}(k) = 2 for k ∈ {1, 2, 3, 4}
prefix-length_{2,0}(1) = 1, prefix-length_{2,1}(1) = 1 and prefix-length_{2,2}(k) = 2 for k ∈ {1, 2}
prefix-length_{3,0}(1) = 2, prefix-length_{3,1}(1) = 2 and prefix-length_{3,2}(1) = 2

All other prefix lengths are undefined (⊥).

Definition 3-37 (actual split depth): The actual split depth l_i^* for a dimension i is the number of bits of A_i which are used for the space partitioning of the UB-Tree. If for any fixed i ∈ D and s < log_2 r_i, prefix-length_{i,s}(k) is either constant or undefined for every k ∈ {1, ..., 2^s}, the actual split depth for each dimension can be calculated by Algorithm 3-.
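The partitions and prefix lengths of Example 3-4 can be recomputed with a few lines of code. This is a sketch under an assumption that is not spelled out in the text: for values written as fixed-width bit strings, the k-th of the 2^s equal parts of the domain is taken to be the set of values whose s leading bits encode k−1. The function names are hypothetical.

```python
import os


def partition(actual_domain, s, k):
    """Values of the actual domain in the k-th of 2**s equal parts (k >= 1)."""
    return [v for v in actual_domain if s == 0 or int(v[:s], 2) == k - 1]


def prefix_length(values):
    """Number of common leading bits, or None for an empty partition."""
    if not values:
        return None
    return len(os.path.commonprefix(values))


if __name__ == "__main__":
    a1 = [format(v, "03b") for v in range(8)]
    a2 = ["000", "001", "010", "011"]
    a3 = ["000", "001"]
    for name, domain in (("A1", a1), ("A2", a2), ("A3", a3)):
        for s in range(3):
            lengths = [prefix_length(partition(domain, s, k))
                       for k in range(1, 2 ** s + 1)]
            print(name, s, lengths)
```

`os.path.commonprefix` compares strings character by character, which is exactly the common bit prefix needed here; `None` plays the role of an undefined prefix length.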

Input:
    p : number of pages
    d : number of dimensions
    prefix_length_{i,j} : length of the common prefix of the first j bits of the actual domain of attribute i
    r : cardinality of the domain of each attribute
Output:
    l_1^*, ..., l_d^* : actual split depths

    l_1^* := ... := l_d^* := 0
    s := l(p)
    i := 1; j := 0
    repeat
        if prefix_length_{i,j} < j then
            l_i^* := l_i^* + 1
            s := s − 1
        end if
        if i < d then
            i := i + 1
        else
            i := 1
            j := j + 1
        end if
    until s = 0 or j > log_2 r

Algorithm 3-: Calculation of the actual split depth

Providing unused extra values at some border of the domain or not using all intermediate values causes an imperfect domain exhaustion for an attribute. Since the Z-region partitioning performs a recursive subdivision in the middle for each step, this may result in a lower actual split depth for some dimensions.

Example 3-5: For P = 32 the actual split depths of Example 3-4 are l_1^* = 3, l_2^* = 2 and l_3^* = 1. This means that the first partitioning in dimension 3 takes place after three subdivision steps in dimension 1 and two subdivision steps in dimension 2. This yields exactly the same multidimensional ordering as the compound ordering x_1, x_2, x_3. Moreover, if the table size is not large enough to split 6 times, i.e., P < 32, the partitioning in dimension 3 is not reflected by the Z-region partitioning at all.

Definition 3-38 (normalized actual split depth per dimension and normalized actual split depth): We define the normalized actual split depth for dimension i as:

$$\bar{l}_i = \begin{cases} \max_{j \in D} l_j^*, & l_i^* = \log_2 r_i \\ l_i^*, & \text{otherwise} \end{cases}$$

The normalized actual split depth is defined as:

$$\bar{l} = \sum_{j \in D} \bar{l}_j$$

The normalized actual split depth normalizes the actual split depth for universes of different resolutions per dimension. If one attribute is partitioned entirely, its split depth is increased to the maximum actual split depth over all dimensions. This yields identical normalized split depths, if all dimensions are partitioned completely. Different actual split depths mean that the actual domains do not partition the universe evenly. This causes a layered multidimensional partitioning that reminds us of a puff pastry. We therefore call this effect of the incomplete domain utilization the puff pastry effect.

Definition 3-39 (puff pastry, puff pastry degree): If for a Z-region partitioning of a relation R the normalized actual split depths are not identical for all dimensions i ∈ D, we call the partitioning a puff pastry. The degree of the puff pastry is measured as the normalized standard deviation of the normalized actual split depths:

$$\text{puff-pastry-degree}(R) = \mathrm{std}(\{ \bar{l}_i \mid i \in D \})$$

The puff pastry degree is a measure for the asymmetry of a Z-region partitioning. The normalized actual split depth is used to take dimensions with variable cardinality into account. The value of the puff pastry degree is a number between 0 and 1, whereby 0 means a perfect Z-region partitioning without any puff pastry. If splits only take place with respect to one dimension, the most extreme puff pastry degree of 1 is achieved.

Example 3-6: For P = 2^4 = 16 the relation R of Example 3-4 has a puff pastry degree of 0.75, for P = 2^5 = 32 the puff pastry degree is . The data of Figure 3-0a and Figure 3-0b as well as the data of Figure 3-a and Figure 3-b also have a puff pastry degree of 0.00. Figure 3-0c has actual split depths of l_1^* and l_2^* = 6, resulting in a puff pastry degree of 0.5.

Figure 3-4 shows three examples of multidimensional space utilization. Figure 3-4a and Figure 3-4b show data which is distributed uniformly in the vertical dimension and Gaussian with a very small standard deviation in the horizontal dimension. In Figure 3-4a the average is exactly the center of the universe. Thus, for 50% of the values in the horizontal dimension the leftmost bit is set and for 50% of the values the leftmost bit is cleared. This results in an actual split depth of in the first dimension versus an actual split depth of 4 for the second dimension, yielding a puff pastry degree of 0.6. In Figure 3-4b the average is slightly shifted to the left. Now all values of the horizontal dimension have the same prefix. Therefore the split depth in the horizontal dimension is reduced to zero, i.e., no subdivision with respect to this dimension has taken place. Figure 3-4b therefore shows a real puff pastry with a puff pastry degree of 1.00.

The Z-region partitioning of Figure 3-4c is filled with data of 5 two-dimensional Gaussian distributions with different average values. Here five puff pastries for five data clusters superimpose each other. Since four of these clusters are in different parts of the universe, which results in different prefixes for the first two bits of the Z-region addresses, the overall puff pastry degree is not as strong as in Figure 3-4b.

Figure 3-4: Utilization of the multidimensional space (panels a, b, c)

In extreme cases as in Figure 3-4b the puff pastry effect results in a splitting similar to compound ordering. Thus the puff pastry effect seriously influences the multidimensional behavior of a Z-region partitioning and may in the worst case totally destroy the multidimensional ordering. Therefore it is important to restrict domains as far as possible. Together with reorganization algorithms this strategy can avoid the puff pastry effect.

If the data distribution of all dimensions is known in advance, the Z-region partitioning can take these data distributions into account and thereby avoid the puff pastry effect. This is achieved by the so-called Variable UB-Tree, which is described in more detail in Chapter 6.4.


Part II: Our Approach to Query Processing with Multidimensional Indexes


Chapter 4 The UB-Tree

"With information we can go anywhere in the world, we are like turtles, our houses always on our backs." (John Le Carré)

The totally ordered addresses of a region partitioning created by a space filling curve can be stored in any variant of a B-Tree. This allows one to create a multidimensional index for a universe partitioned into regions. For our prototype implementation we use Z-region partitioning, which is implemented easily while showing beneficial properties for multidimensional clustering of tuples (see Chapter 3).

First we introduce the basic concepts of the universal B-Tree (UB-Tree). Then we describe algorithms for insertion and deletion. The algorithm for exact match queries is derived from the corresponding algorithm for exact match queries in B-Trees. Range queries are performed by determining the smallest set of Z-regions of the multidimensionally partitioned relation that builds a cover for the query box. The insertion, deletion, point query, and range query algorithms were first described in [Bay96].

In addition to these algorithms we introduce two further algorithms in this thesis: Nearest neighbor queries are efficiently handled by the Spiral algorithm, a refinement of the range query algorithm which just retrieves Z-regions that store a nearest neighbor candidate. Our second algorithm is the Tetris algorithm, a technique to process a multidimensionally partitioned relation in the sort order of any attribute.

4.1 Concept of the UB-Tree

The UB-Tree [Bay96] uses a space filling curve to create a partitioning of a multidimensional universe while preserving multidimensional clustering. Using the Lebesgue curve (Z-curve) it is a variant of the zkd-B-Tree [OM84]. The UB-Tree utilizes a B*-Tree to store the Z-addresses of a Z-region partitioning of the multidimensional space. Each Z-region is mapped onto one disk page. At insertion time a full Z-region [α : β] is split into two Z-regions by introducing a new Z-address γ with α < γ < β. γ is chosen so that the first half (in Z-order) of the tuples stored on Z-region [α : β] is distributed to [α : γ] and the second half is stored on [γ : β]. Thus a worst case storage utilization of 50% is guaranteed. There is some freedom of choice for the Z-region split. For optimal query performance the split algorithm for UB-Trees tries to maintain rectangular regions and minimize fringes whenever possible. The UB-Tree requires logarithmic time (in the cardinality of the partitioned relation) for the basic operations of insertion, point retrieval, and deletion.

We write page(α : β) for the page corresponding to the Z-region [α : β]. Depending on the context we also use the notion "page" and the notation page(α : β) for the set of tuples stored in Z-region [α : β]. By count(α : β) we denote the number of objects located on page(α : β).

Definition 4-1 (UB-Tree): A UB-Tree is any variant of a B-Tree in which the keys are Z-addresses of Z-regions, ordered by the Z-order. Since the lower address of a Z-region [γ : δ] is determined by the upper address β of its neighboring predecessor [α : β], it suffices to store the upper address of each Z-region in the B-Tree. Each leaf page holds the tuples belonging to the corresponding Z-region. For secondary UB-Trees only tuple identifiers are stored on the corresponding leaf page (see Figure 4-1).

Figure 4-1: The UB-Tree (UB-index: B-Tree nodes storing region addresses of Z-regions; UB-file: B-Tree leaves storing the tuples belonging to one Z-region)

Example 4-1: The five Z-regions in Figure 3-8 build a UB-Tree for the point data displayed in the lower right corner of Figure 3-8. Although Z-regions differ in volume, each Z-region stores about the same number of tuples because of the storage utilization guarantees of UB-Trees. Both the upper left corner and the lower right quarter of the universe contain five points, although the size (volume) of the region covering the lower right quarter of the universe is 16 times larger.
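Because only the upper address of each Z-region is stored, locating the region of a tuple reduces to a search for the first stored address that is not smaller than the tuple's Z-address. The sketch below illustrates this with a sorted Python list standing in for the B-Tree. It is an illustration under assumptions: Z-addresses are modeled as integers obtained by plain bit interleaving, and the names `z_address` and `RegionIndex` are hypothetical. The region boundaries in the demo correspond to regions covering 1, 29, 6, 12 and 16 of 64 cells.

```python
from bisect import bisect_left


def z_address(point, bits):
    """Interleave the bits of all coordinates, most significant bit first."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for coordinate in point:
            z = (z << 1) | ((coordinate >> b) & 1)
    return z


class RegionIndex:
    """Sorted upper Z-addresses of all regions; stands in for the UB-index."""

    def __init__(self, upper_addresses):
        self.upper = sorted(upper_addresses)

    def find(self, z):
        """Return (lower, upper) of the unique region containing address z."""
        i = bisect_left(self.upper, z)          # first upper address >= z
        lower = self.upper[i - 1] + 1 if i > 0 else 0
        return lower, self.upper[i]


if __name__ == "__main__":
    index = RegionIndex([0, 29, 35, 47, 63])    # five regions on an 8x8 grid
    print(index.find(z_address((0, 0), 3)))
    print(index.find(z_address((7, 7), 3)))
```

In a real UB-Tree the same search is carried out by descending the B-Tree, which gives the logarithmic cost mentioned above.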

4.2 Insertion into UB-Trees

A tuple x to be inserted into the relation R is specified by its coordinates (x_1, x_2, ..., x_d) with Z-address ξ = Z(x_1, x_2, ..., x_d). x belongs to the unique Z-region [α : γ] satisfying α ≤ ξ ≤ γ. Note that ξ must be computed only to a precision which is sufficient to determine the proper Z-region. x is inserted into the leaf page corresponding to that Z-region, which is found by a point query. Since pages can store only a maximum number C of points, pages may overflow and be split like in B-Trees. [α : γ] is split by introducing a new Z-area with Z-address β such that α < β < γ. The Z-region [α : γ] is partitioned by β into [α : β] and [β : γ]. The objects on page(α : γ) are distributed onto page(α : β) and page(β : γ) accordingly.

β is constructed by increasing area(α) as follows: Add to area(α) subcubes from [α : γ] in increasing order until the number of objects in [α : β] is between ½C − ε and ½C + ε. If the next subcube in this process contains too many objects, it is recursively subdivided until the condition can be met. The parameter ε is used to get shorter split addresses, i.e., a better space partitioning. Section 5.4 shows how the parameter ε is used to reduce the number of Z-regions that overlap a query box and thus improves the performance of range queries in UB-Trees.

Input:
    x : tuple to store in the UB-Tree
Output:
    none

    ξ = Z(x)
    find [α : γ] in the UB-Tree, such that α ≤ ξ ≤ γ
    retrieve page(α : γ)
    insert x into page(α : γ)
    if count(α : γ) > C
        choose β ∈ [α : γ], so that ½C − ε ≤ count(α : β) ≤ ½C + ε
        split page(α : γ) into page(α : β) and page(β : γ)
    end if

Algorithm 4-1: Insertion algorithm for tuple x

The algorithm relies on the B-Tree operations to find the correct region. For n tuples stored in the database it inherits the I/O complexity of O(log_C n). The CPU complexity is also similar to that of B-Trees. The only difference is the address transformation, which requires a very small amount of CPU time. We give performance figures for Z-address calculation later in this thesis.

Figure 4-2 shows the insertion process into a UB-Tree. First the entire universe is stored on one disk page, and the corresponding multidimensional space is represented by one Z-region (Figure 4-2a). After repeated insertion of tuples that page will overflow and is split. The multidimensional space is split into two Z-regions (Figure 4-2b). After further insertions the upper Z-region of Figure 4-2b, indicated by the black arrow, is split (Figure 4-2c). After further insertions an additional split takes place (Figure 4-2d). Figure 4-2e shows the universe after several further splits. The Z-region split in Figure 4-2f illustrates that splits in UB-Trees are local operations. Other regions are not affected by a split.
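The insertion procedure with page splits can be sketched with two parallel sorted lists standing in for the UB-index and the UB-file. This is a toy model, not the thesis implementation: tuples are represented by their integer Z-addresses only, all addresses are assumed to be distinct, and the split address is simply the median address (ε = 0) instead of a short address chosen within the ε window. The class name `ToyUBTree` is hypothetical.

```python
from bisect import bisect_left, insort


class ToyUBTree:
    """Regions are kept as sorted upper addresses plus one page per region."""

    def __init__(self, capacity, max_address):
        self.capacity = capacity
        self.upper = [max_address]   # upper Z-address of each region
        self.pages = [[]]            # pages[i]: sorted addresses in region i

    def insert(self, z):
        i = bisect_left(self.upper, z)          # region with alpha <= z <= gamma
        insort(self.pages[i], z)
        if len(self.pages[i]) > self.capacity:  # page overflow: split the region
            self._split(i)

    def _split(self, i):
        page = self.pages[i]
        mid = len(page) // 2
        beta = page[mid - 1]                    # new region boundary
        self.upper.insert(i, beta)
        self.pages[i:i + 1] = [page[:mid], page[mid:]]


if __name__ == "__main__":
    tree = ToyUBTree(capacity=4, max_address=63)
    for n in range(64):
        tree.insert((n * 37) % 64)              # a fixed permutation of 0..63
    print(tree.upper)
    print([len(p) for p in tree.pages])
```

A split touches only the overflowing region and its page, which mirrors the observation that splits in UB-Trees are local operations.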

Figure 4-2: Insertion into UB-Trees (panels a to f, each showing the effect of one split)

4.3 The Point Query Algorithm

To find a tuple x = (x_1, x_2, ..., x_d) we compute its address ξ := Z(x_1, x_2, ..., x_d) with sufficient precision to find the unique region [α : β] with the property α ≤ ξ ≤ β and fetch page(α : β) from the UB-file. This is achieved by searching the UB-index, using address ξ as the search key. page(α : β) must contain x with the additional information or the row id of x.

Input:
    x : tuple, only index attributes are specified
Output:
    x : tuple, all attributes are specified

    ξ = Z(x)
    find [α : β] in the UB-Tree, such that α ≤ ξ ≤ β
    retrieve page(α : β) into main memory
    search content of page(α : β) to find x

Algorithm 4-2: Point query algorithm to find tuple x

This algorithm inherits the complexity of the underlying access structure for storing the addresses. The only additional overhead is the address calculation algorithm. An efficient implementation with a CPU complexity linear in the tuple size is described in Section 5.3.

4.4 Deletion from UB-Trees

When tuples are deleted from [α : β], they are removed from page(α : β). If after this deletion page(α : β) contains fewer than ½C − ε elements, then page(α : β) is merged with the following page(β : γ) and the Z-region [β : γ] disappears. If the resulting page(α : γ) overflows, it is split again "in the middle" by introducing a new Z-area with address δ and the regions [α : δ] and [δ : γ] with the corresponding pages page(α : δ) and page(δ : γ). This final split of regions and pages is analogous to the underflow technique between pages of B-Trees [BM72].

Input:
    x : tuple to delete from the UB-Tree
Output:
    none

    // for easy illustration the algorithm does not handle the special
    // case of the tuple being stored on the root page or last leaf
    ξ = Z(x)
    search [α : β] in the UB-Tree, such that α ≤ ξ ≤ β
    retrieve page(α : β)
    delete x from page(α : β)
    if count(α : β) < ½C − ε
        merge page(α : β) with the neighboring page(β : γ) into page(α : γ)
        if count(α : γ) > C
            choose δ ∈ [α : γ] with count(α : δ) ≥ ½C − ε and count(δ : γ) ≥ ½C − ε
            split page(α : γ) into page(α : δ) and page(δ : γ)
        end if
    end if

Algorithm 4-3: Deletion algorithm for a tuple x

The complexity considerations of insertion also apply to the deletion algorithm. For T tuples stored in the database the I/O complexity is O(log_C T). The CPU complexity is also similar to that of B-Trees.

4.5 The Range Query Algorithm

To answer a range query, only those Z-regions which properly intersect the query box must be fetched from the database and thus from the disk [Bay96]. Initially the range query algorithm calculates and retrieves the first Z-region that is overlapped by the query box. Then the next intersecting Z-region is calculated and retrieved. This is repeated until a minimal cover for the query box has been constructed, i.e., the region that contains the ending point of the query box has been retrieved.

Definition 4-2 (subcube of an address): For any Z-address α = α_1.….α_k we define subcube(α) as the hypercube-shaped multidimensional interval which has Z^{-1}(α) as its end point and a normalized length of 2^{-k} in each dimension:

$$\mathrm{subcube}(\alpha) = \{ x \in \Omega \mid Z^{-1}(\alpha_1 . \ldots . (\alpha_k - 1)) < x \le Z^{-1}(\alpha_1 . \ldots . \alpha_k) \}$$

Input:
    y, z : tuples that define a query box with Z(y) ≤ Z(z)
Output:
    X : result set of the range query

    ξ = Z(y); ω = Z(z); X = ∅
    repeat
        find [α : β] in the UB-Tree, such that α ≤ ξ ≤ β
        X = X ∪ {(x_1, ..., x_d) ∈ [α : β] | (x_1, ..., x_d) ∈ [[y, z]]}
        ξ = Z-address of the first point intersecting the query box with ξ > β
    until ξ > ω

Algorithm 4-4: Range query algorithm for a query box [[y, z]]

It is important to note that for any Z-region the calculation of the next point intersecting the query box is performed solely in main memory. [Bay96] describes an algorithm that for a Z-region [α : β] and a query box Q calculates the largest Z-address γ ∈ [α : β] so that subcube(γ) ∩ Q ≠ ∅. The algorithm then traverses all subcubes with greater Z-addresses, until one of these subcubes intersects Q. The smallest intersecting point p in Z-order then defines the next relevant Z-region, which is retrieved by a point query. Note that because of Lemma 3-6 the next subcube to subcube(ξ_1.….ξ_j.2^d) is subcube(ξ_1.….ξ_j + 1), which allows a more efficient traversal through the multidimensional space. However, the algorithm still has a CPU complexity which is exponential in the number of dimensions of the UB-Tree, which limits this algorithm to dimensionalities below 4. In Section 5.7. we sketch the implementation idea of a linear algorithm that solely operates on a binary representation of Z-addresses.

Figure 4-3: Processing a range query (panels a to f)

Figure 4-3 illustrates the order in which the range query algorithm calculates and retrieves Z-regions for the range query illustrated by the black-bordered query box. For this example we use the notation ]α : β ] for [α : β ]. The Z-regions overlapping the query box are shaded.

Initially the algorithm uses the coordinate y of the query box Q = [[y, z]] for a point query and thereby retrieves the Z-region ] : ] from the UB-index. Then the next Z-region that overlaps the query box must be determined (see also Figure 4-4a): The last subcube of ] : ] overlapping the query box is subcube( ), which contains the point x as the last intersecting point with respect to Z-ordering. The next subcube, subcube( ), has an intersection with Q. The first intersecting point is p. A point query with p yields the Z-region ] : ]. Thus this Z-region may contain tuples which are part of the result set of the range query.

Finding the next Z-region works in the same way: The last intersecting subcube of ] : ] is subcube(0. ). The next intersecting subcube is subcube(0. ). The corresponding point query retrieves the Z-region ] : ], which again may contain tuples of the result set.

For determining the next Z-region (see also Figure 4-4b) the algorithm starts with subcube(0..3 ). subcube(0..4 ) has no intersection with the query box. The next subcube with a larger Z-address is subcube(0.3 ), which has an intersection with the query box in p. The point query then retrieves the Z-region ] : 0... ]. Here we already see that the algorithm skipped Z-region ] : ], since this Z-region does not overlap the query box. In the same way the algorithm calculates and retrieves the Z-regions ]0... : ] and ] : ].

Figure 4-4: Zoom into Figure 4-3 (panels a and b, showing the points x and p and the subcubes traversed)

The visualization already shows the principal benefit of the UB-Tree for multidimensional range queries: Only Z-regions that may contribute to the result set of a range query need to be retrieved from disk in order to answer the query. Thus, in contrast to one-dimensional access methods as described in Section .4, the number of disk accesses of the UB-Tree range query algorithm is correlated to restrictions defined by all dimensions, not just to the restriction in one dimension. We will analyze the theoretical and practical behavior of the range query algorithm in Chapters 6 and 7 and compare it to one-dimensional access methods.

Two strategies for returning the tuples of a query box are possible:

Immediate tuple delivery: As soon as a Z-region ρ is retrieved from the UB-index, the corresponding page(ρ) is retrieved from the UB-file. The tuples on page(ρ) are then extracted immediately and the tuples in the query box are returned. This strategy allows shorter response times and does not require any caching.

Deferred tuple delivery: After the upper Z-address of Z-region ρ has been retrieved from the UB-index, it is stored in a region result set. After all Z-addresses have been retrieved, the tuples of the entire region result set are materialized. This separation of UB-index access and UB-file access allows one to predict the tuple retrieval time before actually doing the retrieval. This may especially be useful for large result sets, where after retrieving the regions the user might be asked whether he really wants to materialize the result set.

Theorem 4-1 (range query theorem): The range query algorithm transforms the multidimensional interval [[y, z]] into a set of one-dimensional address intervals {]α, β], ]β, γ], ...}, which with respect to the given region partitioning is the smallest cover for [[y, z]].

Proof: The proof is a direct consequence of Algorithm 4-4 and the fact that Z-regions are one-dimensional intervals.

Figure 4-5: Transforming a query box into a set of Z-intervals (query box Q in multidimensional space mapped onto the Z-address space)
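The transformation of a query box into one-dimensional address intervals can be illustrated by brute force on a small grid. The sketch below is not the range query algorithm of this chapter: it enumerates every cell of the query box instead of computing the next intersecting point, which is only feasible for tiny universes. It assumes integer Z-addresses obtained by bit interleaving, and the function names as well as the region boundaries in the demo are hypothetical.

```python
from bisect import bisect_left
from itertools import product


def z_address(point, bits):
    """Interleave the bits of all coordinates, most significant bit first."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for coordinate in point:
            z = (z << 1) | ((coordinate >> b) & 1)
    return z


def query_box_intervals(low, high, bits):
    """Maximal runs of consecutive Z-addresses covered by the box [low, high]."""
    cells = product(*(range(lo, hi + 1) for lo, hi in zip(low, high)))
    intervals = []
    for z in sorted(z_address(c, bits) for c in cells):
        if intervals and z == intervals[-1][1] + 1:
            intervals[-1][1] = z            # extend the current run
        else:
            intervals.append([z, z])        # start a new run
    return [tuple(i) for i in intervals]


def covering_regions(intervals, upper_addresses):
    """Indexes of the regions (given by sorted upper addresses) hit by the box."""
    hit = set()
    for start, end in intervals:
        first = bisect_left(upper_addresses, start)
        last = bisect_left(upper_addresses, end)
        hit.update(range(first, last + 1))
    return sorted(hit)


if __name__ == "__main__":
    print(query_box_intervals((2, 2), (3, 3), 3))
    print(query_box_intervals((1, 1), (2, 2), 3))
    print(covering_regions(query_box_intervals((1, 1), (2, 2), 3),
                           [0, 29, 35, 47, 63]))
```

A box aligned with a subcube maps to a single address interval, whereas a box of the same size that straddles subcube borders falls apart into several intervals. Only the regions overlapping these intervals have to be read.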

Very often some of these one-dimensional intervals are connected and therefore might be merged into larger one-dimensional intervals consisting of several regions. For instance, the six Z-regions overlapped by the query box of Figure 4-3 are six one-dimensional intervals, which could be merged into the two intervals ] , ] and ] , ], each of which consists of three Z-regions (see Figure 4-5). This merging could reduce the number of I/Os necessary to retrieve the pages of a query box, since for page clustered data only two random accesses are necessary versus six random accesses for tuple clustered data.

4.6 The Spiral Algorithm for Nearest-Neighbor Queries

For UB-Trees nearest neighbor queries can efficiently be processed in the following way: To find the nearest neighbor of point x we first retrieve the Z-region ρ where x would be located. Then we investigate all nearest neighbors to x that are stored on ρ. We have found a nearest neighbor, if the nearest neighbor z on ρ is closer to x than any point on the Z-region border of ρ. Otherwise we call z a nearest neighbor candidate. We now have to retrieve the closest not yet retrieved Z-region, since this Z-region might contain a point with a lower distance to x than z. This process must be iterated until no point on a border of the retrieved Z-regions has a distance to x less than the distance of the current nearest neighbor candidate.

We call this algorithm the spiral algorithm because of the order that is used to retrieve the Z-regions in order to find the nearest neighbor.

Input:
    x : tuple, for which the nearest neighbor is wanted
Output:
    N : set of tuples containing the nearest neighbors to x

    ξ = Z(x)
    find [α : β] in the UB-Tree, such that α ≤ ξ ≤ β
    N = nearest neighbors of [α : β] to x
    R = [α : β]
    while there is a point y beyond the border of R such that for some z ∈ N: distance(x, z) > distance(x, y)
        retrieve the Z-region [γ : δ] bordering R, which has a point on its border that has the least distance to x
        N = nearest neighbors of ([γ : δ] ∪ N) to x
        R = R ∪ [γ : δ]
    end while
    return the nearest neighbors of N to x

Algorithm 4-5: Spiral algorithm to find the nearest neighbor to a point x

Example 4-2: Two nearest neighbor queries are illustrated in Figure 4-6a. To find the nearest neighbor to point x Z-region ρ is retrieved. The two points y and z are the nearest neighbors to x on ρ. Since no point on the border of ρ has a distance less than the distance of y or z, we already

have found the nearest neighbors. To find the nearest neighbor to point x we first retrieve Z-region ρ_3. On ρ_3 we find the nearest neighbor y. Since a point of the border of ρ has a closer distance to x than y, we have to retrieve ρ. Since no point of ρ has a closer distance than y and no further border of a region has a closer distance to x than y, we have found the nearest neighbor.

Figure 4-6: Three nearest neighbor queries (panels a and b)

To find the nearest neighbor to point x in the UB-Tree of Figure 4-6b, initially the Z-region ρ is retrieved, yielding the nearest neighbor y. Since points on the border of ρ have a distance to x that is less than the distance of y, next the region ρ is retrieved. ρ contains the point z, which has the distance a to x, which is less than the distance a of y. Therefore the nearest neighbor candidate now is z. Since no points on the border of not retrieved Z-regions have a lower distance to x than z, the algorithm terminates with z as the nearest neighbor.

Implementation and further analysis of the spiral algorithm is performed by a master's student supervised by the author [Str99]. Detailed analysis of the algorithm can be found there. We just stress that the algorithm in the worst case results in a range query. The query box of that range query is determined by the covering square to a circle around x. The radius of the circle is defined by the distance of the nearest neighbor to x.

4.7 The Tetris Algorithm for Sorted Processing of Query Boxes

Tables organized by a UB-Tree can be read in any sort order in O(n) disk accesses, where n is the number of pages of the table or the minimal number of regions covering a query box. [Bay97] proposes to partition the universe in a number of equally sized slices. Algorithm 4-4

109 SECTION 4.7: THE TETRIS ALGORITHM FOR SORTED PROCESSING OF QUERY BOXES 93 s use to retreve each slce. After retreval a slce s sorte an returne to the caller of the sort operaton. For non-unformly strbute unverses we propose a mofcaton of the range query algorthm an a cachng technque, the so calle Tetrs-Algorthm. The Tetrs algorthm s a generalzaton of the multmensonal range query algorthm (cf. Secton 4.4) that effcently combnes sort operatons wth the evaluaton of mult-attrbute restrctons. The basc ea s to use the partal sort orer mpose by a multmensonal parttonng n orer to process a table n some total sort orer. Essentally a plane sweep [PS85] over a query space efne by restrctons on a multmensonally parttone table s performe. The recton of the sweep s etermne by the sort attrbute. Intally the algorthm calculates the frst Z-regon that s overlappe by the query box, retreves t an caches t n man memory. Then t contnues to rea an cache the next Z-regons wth respect to the requeste sort orer, untl a complete thnnest possble slce of the query box (n the sortng menson) has been rea. Then the cache tuples of ths slce are sorte n man memory, returne n sort orer to the caller an remove from cache. The algorthm procees reang the next slce, untl all Z-regons ntersectng the query box have been processe. Only sk pages overlappng the query space are accesse. Wth suffcent, but moest, cache memory each sk page s accesse only once: To sort 50% of a.3 GB relaton, the Tetrs algorthm just requres a cache of.6 MB, whereas a stanar merge-sort algorthm nees a cache of 750 MB (see sectons 6.4. an 8..). 
Input:  y, z : tuples defining a query box with Z(y) ≤ Z(z)
        the sorting dimension
Output: stream of tuples in [[y, z]], sorted according to the sorting dimension

    ξ := Z(y); ω := Z(z)
    repeat
        search [α:β] in the UB-Tree, such that α ≤ ξ ≤ β
        store all tuples from page(α:β) in the cache
        if a new slice in the sorting dimension is complete
            s := endpoint of the slice in the sorting dimension
            sort all cached tuples
            output all cached tuples x whose coordinate in the sorting dimension is at most s
            remove all cached tuples x whose coordinate in the sorting dimension is at most s from cache
        end if
        ξ := Z-address of the next point intersecting the query box with respect to the sorting dimension
    until ξ has passed ω

Algorithm 4-6: Tetris algorithm for a query box [[y, z]]

Thus the Tetris algorithm works similarly to the range query algorithm. The only difference is that the calculation of the next intersecting Z-region does not return a Z-region according to Z-ordering, but according to the specified sort order (Tetris order): Initially the algorithm calculates the first Z-region that is overlapped by the query box, retrieves it and caches it in main memory. Then it continues to read and cache the next Z-regions with respect to the sort order, until a complete thinnest possible slice of the query box has been read. Then the cached tuples of this slice are sorted in main memory, returned in sort order to the caller and removed from cache. The algorithm proceeds reading the next slice, until all Z-regions which intersect the query box have been retrieved and output.

An example of sorted reading with the Tetris algorithm is illustrated in Figure 4-7, where the entire universe is sorted according to the vertical dimension, i.e., from bottom to top. The part of each region from which tuples are cached is shaded in this figure. Note that a large volume of a cached region does not mean that a lot of cache memory is needed: The maximum number of tuples in each region, and thus the maximum cache size for each cached region, is C tuples. The current sweep line is denoted by a thin white line. Thus the current slice processed by the Tetris algorithm is the shaded area up to the white line.

The algorithm starts by retrieving the Z-region in the very left corner (Figure 4-7a). Successive Z-regions are retrieved and cached (Figure 4-7b) until the first vertical slice is complete (Figure 4-7c). The tuples of this slice are then sorted in main memory. All tuples up to the end coordinate of the slice in the sorting dimension are output in sort order and removed from cache. The Z-regions of this slice cannot be removed from cache completely at this point, since each Z-region still might have some tuples that have not been output yet. However, the Z-regions are not cached in their entirety. Caching is only necessary for those tuples that have not been output yet.

In Figure 4-7d one additional Z-region has been read to complete the second slice. Then eight more Z-regions are read and cached until the third slice is complete (Figure 4-7e). The Tetris algorithm continues processing in this way until the last slice of the query box has been read (Figure 4-7f), i.e., the complete universe has been read in the desired sort order.

The visualization also gives a hint why we named this algorithm Tetris algorithm: The caching order of the regions reminds us of the Tetris computer game. Regions are cached until a slice in the sorting direction has been completed. Upon completion the slice is removed from cache.
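The slice-by-slice processing walked through above can be sketched in a few lines. This is a simplified illustration, not the thesis implementation: it assumes each Z-region is handed over as a pair of its lower bound in the sorting dimension and its tuples, and it replaces the Z-address based calculation of the next region by a plain sort on that lower bound. The name `tetris_sorted` and the input layout are hypothetical.

```python
import heapq


def tetris_sorted(regions, dim):
    """Yield all tuples of `regions` in ascending order of coordinate `dim`.

    regions: list of (lower_bound, tuples), where lower_bound is the smallest
    value any tuple of that region can take in dimension `dim` (assumption).
    """
    # Stand-in for the Tetris order: visit regions by their lower bound.
    pending = sorted(regions, key=lambda region: region[0])
    cache = []  # min-heap of (sort coordinate, tuple): the cached tuples

    for idx, (_, tuples) in enumerate(pending):
        for t in tuples:
            heapq.heappush(cache, (t[dim], t))

        # The current slice ends where the next unread region begins.
        if idx + 1 < len(pending):
            slice_end = pending[idx + 1][0]
        else:
            slice_end = None  # last region: everything may be output

        # Output and evict every cached tuple below the slice end; no unread
        # region can still contribute a smaller value.
        while cache and (slice_end is None or cache[0][0] < slice_end):
            yield heapq.heappop(cache)[1]


if __name__ == "__main__":
    demo = [(0, [(5, 1), (2, 0)]), (0, [(1, 3), (7, 2)]), (4, [(3, 4), (6, 6)])]
    print(list(tetris_sorted(demo, dim=1)))
```

The cache never holds more than the tuples of the regions touching the current slice, which is the property the text attributes to the Tetris algorithm; how close a real UB-Tree gets to this depends on the actual region shapes.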

[Figure 4-7: Sorted reading with the Tetris algorithm, panels (a) to (f)]

Figure 4-8 shows the first three vertical slices that occur while the Tetris algorithm reads the thick bordered query box in sort order according to the vertical dimension. The cached Z-regions for each slice are shaded; the slices are emphasized by white borders.

[Figure 4-8: Reading a query box in sort order, panels (a) to (c)]

The caching strategy of this algorithm can be improved even further. Figure 4-8 shows some Z-regions that might cache data outside the query box. Of course it is not necessary to cache this data, which allows for an even smaller cache size.

Both the Tetris algorithm and the range query algorithm determine the set of Z-regions overlapping a query box. The only difference is the order in which the Z-regions are produced. While the range query algorithm delivers the Z-regions in Z-order, the Tetris algorithm uses an order which depends on the sort attribute $A_j$. We call this order Tetris order $T_j$ [MZB99].

Formally, Tetris ordering produces a compound ordering from a Z-ordering by extracting the bits of the sort attribute from the Z-address and concatenating them with the remaining Z-address, i.e., for sorting with respect to attribute $A_j$ the ordinal number $T_j(\alpha)$ for a Z-address α is computed as:

$$T_j(\alpha) = (Z^{-1}(\alpha))_j \circ Z\big((Z^{-1}(\alpha))_1, \ldots, (Z^{-1}(\alpha))_{j-1}, (Z^{-1}(\alpha))_{j+1}, \ldots, (Z^{-1}(\alpha))_d\big)$$

The ordinal number $T_j(x)$ for a Cartesian tuple x is computed respectively:

$$T_j(x) = x_j \circ Z(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)$$

[Figure 4-9: Z-Ordering / Tetris Ordering]

Figure 4-9 shows the Tetris orderings $T_1$ and $T_2$ created from a Z-ordered 8 × 8 universe. Although Tetris ordering looks like a compound ordering in the figure, this is only true for the two-dimensional case. In general, Tetris ordering is a concatenation of one attribute with the Z-address of a tuple reduced by one attribute.

Although the idea of the Tetris algorithm has been developed by the author, implementation and a detailed analysis are beyond the scope of this thesis. Some very basic performance considerations are given in Chapters 5 and 7. We give a more detailed description of the Tetris algorithm in [MZB99].
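The ordinal number $T_j(x)$ defined above can be illustrated with a short sketch. It assumes all attributes have the same number of bits and that bit-interleaving takes the attributes in their given order within each step; both choices, and the names `z_address` and `tetris_ordinal`, are assumptions made for this illustration only.

```python
def z_address(coords, bits):
    """Interleave `bits` bits of every coordinate, most significant step first."""
    z = 0
    for step in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> step) & 1)
    return z


def tetris_ordinal(coords, j, bits):
    """T_j(x): the sort attribute x_j concatenated with Z of the other attributes."""
    coords = tuple(coords)
    rest = coords[:j] + coords[j + 1:]          # tuple reduced by attribute j
    return (coords[j] << (bits * len(rest))) | z_address(rest, bits)


if __name__ == "__main__":
    # In two dimensions the Tetris order is a plain compound order,
    # as noted for Figure 4-9.
    points = [(a, b) for a in range(8) for b in range(8)]
    by_tetris = sorted(points, key=lambda p: tetris_ordinal(p, 1, 3))
    print(by_tetris[:10])
```

With more than two dimensions the low-order part is a genuine Z-address of the remaining attributes, so tuples that share the same sort value stay multidimensionally clustered.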

Errors, like straws, upon the surface flow; he who would search for pearls must dive below. (John Dryden)

Chapter 5 Prototype Implementation

Relying on the built-in B-Tree of a DBMS, the UB-Tree is easily implemented on top of that DBMS. In this chapter we give an overview of the prototype implementation of UB-Trees. We describe basic implementation concepts and specifically illustrate the calculation of Z-addresses, since this calculation is crucial to both flexibility and performance of our algorithms. Next to the basic bit-interleaving algorithm, we illustrate how Z-addresses for arbitrary data types are calculated by the concept of transformation functions. Then we give examples for three transformation functions: For signed integer numbers it suffices to invert the sign bit to use bit-interleaving. To deal with independently distributed dimensions the concept of Variable UB-Trees (VUB-Trees) is introduced. Multidimensional hierarchical clustering (MHC) is useful for clustering a universe where hierarchies are built over the domain of each dimension. After the concept of transformation functions we address specific algorithmic problems that we had to solve when implementing the insertion, deletion, point and range query algorithms. Finally we describe the functionality that is provided by the UB-Tree library, a C library offering UB-Tree functionality.
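The remark on signed integers can be made concrete with a small sketch. It assumes a fixed-width two's-complement representation (32 bits by default); the function names `transform_signed` and `retransform_signed` are hypothetical and not part of the UB-Tree library.

```python
def transform_signed(value, bits=32):
    """Map a two's-complement integer to an unsigned key by inverting the sign bit.

    The unsigned order of the keys equals the numeric order of the inputs,
    so the keys can be fed into bit-interleaving.
    """
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit into the given width")
    raw = value & ((1 << bits) - 1)        # two's-complement bit pattern
    return raw ^ (1 << (bits - 1))         # invert the sign bit


def retransform_signed(key, bits=32):
    """Inverse mapping: recover the signed integer from its key."""
    raw = key ^ (1 << (bits - 1))
    return raw - (1 << bits) if raw >> (bits - 1) else raw


if __name__ == "__main__":
    for v in (-5, -1, 0, 1, 7):
        print(v, format(transform_signed(v, bits=8), "08b"))
```

Because the mapping is a bijection on the fixed-width range, the original value can be recovered after bit extraction.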

5.1 Overview of the Prototype

To practically evaluate the performance of the UB-Tree, a basic C library providing the algorithms described in this thesis was implemented by the author. Several master students and interns supervised by the author implemented applications on top of the C library:

    create   a program to fill a database with generated test data of various data distributions, dimensionality, and size [Fr97]
    rtest    a benchmark program to measure the query performance of UB-Trees and various other indexes [Fr97]
    regvis   a program to visualize the region partitioning of UB-Trees [Fr97]
    load     a mass loading tool to build large databases quickly [Bau97]

Several master students and interns enhanced the C library [Sch98, Bau98, Fen98] and helped to port the original TransBase/Solaris code to further RDBMSs and operating systems [Pe97, Ova99, Pfa99].

The prototype implementation of the UB-Tree was performed on top of the RDBMS TransBase on Solaris. The current version of the UB-Tree library supports the RDBMSs TransBase, Oracle 8 [Pe97] and DB2 [Ova99], and the operating systems Solaris 2.6, Windows NT [Pe97], and Linux [Bau98]. In addition the library was ported to Informix Universal Server [Pfa99]. The complexity of the library can be seen from the module hierarchy of Figure 5-1.

The entire system is implemented in ANSI C and ESQL and relies on the GNU tools gcc and gmake for code generation. The Oracle version relies on the Oracle Call Interface (OCI, see [Ora97]) and the DB2 version uses CLI [IBM97] to communicate with the DBMS server. Automated documentation generation is enabled by using doc++ [WZ98]. cvs [CVS98] is used for version management.

[Figure 5-1: Architecture of the pilot implementation. The module hierarchy consists of the test suite and applications (benchmark, mass loading, visualization, range query test, create), the high level interface (UB API), the intermediate level interface (operators, UB-Tree), and the low level interface with the modules utilities (bit operations, date/time operations, time measurement, byte order conversion, error handling, performance measurement), database (B-Tree, page handling, tuple handling, schema handling; DBMS: TransBase, Oracle, DB2, Informix), operating system (Solaris, NT, Linux), global address calculation (transformation, split point trees, bit interleaving; data types float, integer, character), and defines, constants, and templates.]

5.2 Basic Implementation Concepts

The UB-Tree is realized on top of a relational DBMS by utilizing the B*-Tree of that RDBMS. To show the easy portability of the UB-Tree library, we ported the initial TransBase 5.4 implementation to Oracle 8, DB2 UDB V5, and Informix Universal Server.

Each UB-Tree is a logical construct, which is physically realized by a relational table. From now on we call that relational table the UB-Tree index table. A UB-Tree index table contains one tuple for each page of the UB-Tree. As listed in Table 5-1, each tuple of the UB-Tree index table stores the Z-address of the corresponding Z-region, the number of tuples stored on the data page corresponding to that Z-region, and the content of the page corresponding to that Z-region. The primary key of this table is the Z-address. Note that the tuples stored in a UB-Tree do not correspond to the tuples stored in a UB-Tree index table: the number of tuples in the UB-Tree index table rather is the number of data pages of the UB-Tree. In the remainder of this chapter a tuple of the UB-Tree index table is called UB-page, whereas UB-tuple denotes a tuple that is stored in the UB-Tree, i.e., a tuple that is stored on a UB-page. [3]

    attribute         | data type | description
    Z-address         | bit data  | upper Z-address of the Z-region corresponding to that UB-page
    number of tuples  | number    | number of tuples stored on that UB-page
    data page content | bit data  | content of the UB-page

Table 5-1: Schema of a UB-Tree index table

The DBA has to ensure that each UB-page (i.e., each tuple of the UB-Tree index table) is stored on a separate physical disk page. This is achieved by choosing the size of each UB-page (size of the attributes Z-address, number of tuples, data page content plus some DBMS storage overhead) to be exactly the size of a physical database page. For some DBMSs additional settings are necessary to ensure the physical correspondence of DBMS pages and UB-pages [Pe97, Ova98].

[3] A straightforward idea for an implementation of the UB-Tree index table is to enhance each tuple of a relation with one additional column, namely its Z-address. The Z-address as primary key then leads to a multidimensional clustering [OM84]. However, this approach does not offer control over region boundaries during a page split; the region boundaries are defined by the tuples in this case. Our approach offers complete control over the choice of the region boundary during a Z-region split. We also preferred this approach since the UB-Tree is easier to integrate into a RDBMS kernel with this implementation.

The physical correspondence allows one to implement the UB-Tree on top of a RDBMS while obtaining I/O behavior identical to a kernel implementation of UB-Trees. However, significantly higher CPU and inter-process communication cost and an impedance mismatch are the price that must be paid for this simple implementation approach:

    - The UB-tuples on each UB-page must be extracted and set by special functions provided by the API of the UB-Tree library (UBAPI). The standard SQL functionality of the DBMS cannot be used to achieve that.
    - Each query or update operation requires two inter-process communications and ESQL parsing operations for each UB-page that needs to be accessed by the operation.
    - The performance of the prototype implementation heavily relies on the optimizer of the underlying DBMS, since SQL statements are used to access UB-pages.

The UB-Tree algorithms require efficient access to the Z-addresses stored in the UB-Tree. Therefore, a secondary B-Tree on the Z-address is created and forms the UB-index part of Figure 4-. The UB-Tree index table forms the UB-file illustrated in that figure. Table 5-2 lists the prototype implementation concepts and compares their implementation in TransBase 4.3, Oracle 8, and a DBMS kernel implementation.

    UB-Tree concept     | TransBase 4.3 implementation | Oracle 8 implementation | DBMS kernel implementation
    UB-Tree index table | relation | relation | clustering primary index
    UB-page             | tuple with the size of a physical page | tuple with the size of a physical page | physical DBMS page
    UB-tuple            | substring of the data page content attribute of the UB-Tree index table | substring of the data page content attribute of the UB-Tree index table | DBMS tuple
    Z-region            | defined by the Z-address attribute of the UB-Tree index table | defined by the Z-address attribute of the UB-Tree index table | binary string
    UB-index            | additional table storing the Z-address attribute of the UB-Tree index table | secondary index on the Z-address attribute of the UB-Tree index table | B-Tree node pages
    UB-file             | relation storing the UB-Tree index table | relation storing the UB-Tree index table | B-Tree leaf pages

Table 5-2: Prototype implementation concepts

In TransBase it is necessary to use an IOT (index-organized table) on the Z-address to implement the UB-Tree index table, since TransBase always clusters the data according to the primary key. Since UB-pages are physical DBMS pages, TransBase needs to perform a random leaf page access on the IOT to retrieve a Z-address. An efficient retrieval of a range of Z-addresses is not possible via the UB-Tree index table, since it is not possible to create a secondary index on a primary key attribute. Therefore the TransBase implementation introduces a new concept, a UB-Tree secondary index table, which stores only the Z-addresses of the UB-Tree index table. This allows efficient access to a range of Z-addresses: While only one Z-address is stored on a leaf page of the UB-Tree index table, a set of Z-addresses in consecutive order is stored on a leaf page of a UB-Tree secondary index table. This greatly reduces the number of random page accesses and significantly improves the range query and Tetris performance of the prototype implementation on TransBase.

However, a larger cache requirement is the price for this implementation. The upper nodes of both the UB-Tree index table and the UB-Tree secondary index table need to be cached in main memory (see Figure 5-2a). In contrast, the Oracle implementation just caches the UB-Tree index table (see Figure 5-2b). This is the reason for the performance gain of % of our prototype implementation on Oracle for tables larger than one GB [Pe98].

[Figure 5-2: Cache requirements and UB-Tree secondary index table: (a) with UB-Tree secondary index table, (b) without secondary index table]

In the remainder of this chapter we illustrate selected algorithmic problems and performance problems which we had to solve in order to realize our prototype.

5.3 Implementing the Address Calculation

The address calculation algorithm first relied solely on bit-interleaving and supported only positive integer numbers. Later the architecture was enhanced to allow plug-in functions for any arbitrary data type. Since certain data distributions cause a multidimensional partitioning with a high puff pastry degree and thus a bad performance of the UB-Tree algorithms, the address calculation was enhanced to take any data distribution into account. Another enhancement envisioned by the author is to modify bit-interleaving to weigh attributes for the interleaving process.

Bit-Interleaving

The performance of the UB-Tree crucially relies on an efficient implementation of the Z-address calculation. For tuples of positive integer numbers Definition 3-3 immediately yields an algorithm to calculate the standard address from the binary representation of a tuple: Z-values can be calculated by interleaving the bits of the attributes in a certain order.

In practical applications one often wants to index a multidimensional domain where the cardinality is not identical for all one-dimensional domains. Only a slight modification of the interleave operation is necessary to support a universe where the domain of each dimension does not consist of the same number of bits steps(r), i.e., there are some $i, j \in D$ such that $r_i \neq r_j$. In this case the number of bits is not identical for each step of an address, but step k consists of steplength(k) bits. Algorithm 5-1 shows this generalized algorithm for bit-interleaving.

Input:  x : tuple
Output: Z-address α

    // the algorithm requires the dimensions to be sorted
    // according to their resolution in descending order
    for step := 1 to max({steps(r_j) | j ∈ D})
        for i := 1 to steplength(step)
            copy bit step of x_i to bit i of α_step
        end for
    end for

Algorithm 5-1: Bit-interleaving to calculate α = Z(x)

With $r = \max(\{steps(r_j) \mid j \in D\})$ Algorithm 5-1 has a CPU complexity of $O(d \cdot r)$ bit operations (resp. $O(\sum_i r_i)$ for attributes of different length). The same holds for Algorithm 5-2 to calculate the Cartesian coordinates of a tuple from its address.

Input:  α : Z-address
Output: x : tuple

    // the algorithm requires the dimensions to be sorted
    // according to their resolution in descending order
    for step := 1 to max({steps(r_j) | j ∈ D})
        for i := 1 to steplength(step)
            copy bit i of α_step to bit step of x_i
        end for
    end for

Algorithm 5-2: Bit extraction to calculate x = Z^-1(α)
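Algorithms 5-1 and 5-2 can be sketched as follows. The sketch assumes non-negative integer coordinates, a list `res` holding the number of bits of each dimension in descending order, and that bit 1 of an attribute is its most significant bit; the function names `interleave` and `deinterleave` are hypothetical.

```python
def interleave(x, res):
    """Z(x) for dimensions of different resolution (cf. Algorithm 5-1).

    x:   coordinates, one non-negative integer per dimension
    res: number of bits per dimension, sorted in descending order
    """
    z = 0
    for step in range(1, max(res) + 1):
        for xi, ri in zip(x, res):
            if ri >= step:                      # dimension takes part in this step
                z = (z << 1) | ((xi >> (ri - step)) & 1)
    return z


def deinterleave(z, res):
    """Inverse operation: recover the coordinates from a Z-address (cf. Algorithm 5-2)."""
    x = [0] * len(res)
    pos = sum(res)                              # total number of address bits
    for step in range(1, max(res) + 1):
        for i, ri in enumerate(res):
            if ri >= step:
                pos -= 1
                x[i] = (x[i] << 1) | ((z >> pos) & 1)
    return tuple(x)


if __name__ == "__main__":
    address = interleave((5, 2, 1), (3, 2, 1))
    print(format(address, "06b"), deinterleave(address, (3, 2, 1)))
```

Both loops touch every address bit exactly once, which matches the linear bit-operation cost stated above.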

Algorithm 5-3 to extract one specific Cartesian coordinate from a Z-address has the CPU complexity $O(r_i)$.

Input:  α : Z-address
        i : number of the attribute to extract
Output: x_i : attribute i of tuple x

    // the algorithm requires the dimensions to be sorted
    // according to their resolution in descending order
    for step := 1 to steps(r_i)
        copy bit i of α_step to bit step of x_i
    end for

Algorithm 5-3: Bit extraction to calculate x_i = (Z^-1(α))_i

As one can see from the above algorithms, the calculation of $Z(x)$ and $Z^{-1}(\alpha)$ can be implemented efficiently by simple bit operations. The linear behavior of the bit-interleaving performance for tuples consisting of 3 bit integers for up to 0 dimensions on a SUN ULTRA SPARC 43 MHz is displayed in Figure 5-3. The figure shows that Z-address calculation takes less than ms even for 0-dimensional Z-addresses. Thus Z-address calculation is more than one order of magnitude faster than a random disk access, which usually takes 0ms. Using state-of-the-art CPUs with 300 MHz and more, the Z-address calculation is more than 00 times faster than a random disk access and thus is negligible.

[Figure 5-3: CPU time to calculate Z(x) and Z^-1(α); x-axis: number of dimensions, y-axis: time in µs]

Address Calculation for Arbitrary Data Types

So far we have described bit-interleaving for address calculation just for positive integer numbers. Algorithm 5-1 can be used to calculate the Z-address for any data type when the bit-lexicographic order on the binary representation of tuples matches the natural ordering on this type.

Definition 5-1 (lexicographic character order, lexicographic byte order, lexicographic bit order): For the following examples we write $\le_{C_1 \ldots C_n}$ for the lexicographic order on the character concatenation $C_1 \circ \ldots \circ C_n$, $\le_{B_1 \ldots B_n}$ for the lexicographic order on the byte concatenation $B_1 \circ \ldots \circ B_n$, and $\le_{b_1 \ldots b_n}$ for the lexicographic order on the bit concatenation $b_1 \circ \ldots \circ b_n$.

Lemma 5-1: For positive integer numbers with a fixed length binary representation the bit-lexicographic order on the binary representation is identical to the <-order on integer numbers.

Proof: The proof is a direct consequence of the definition of binary numbers and of the definition of lexicographic order.

Example 5-1: $65_{10} < 66_{10}$ maps to the binary representation $01000001 \le_{b_1 \ldots b_8} 01000010$.

Lemma 5-2: For character strings the bit-lexicographic order on the binary representation is identical to the lexicographic order on character strings.

Proof: The proof is a direct consequence of Lemma 5-1, the definition of ASCII codes and the fact that character strings are a concatenation of characters, which are ordered lexicographically.

Example 5-2: $AB \le_{C_1 C_2} BB$ maps to the ASCII codes $65\ 66 \le_{B_1 B_2} 66\ 66$, which maps to the binary representation $0100000101000010 \le_{b_1 \ldots b_{16}} 0100001001000010$.

Because of Lemma 5-1 and Lemma 5-2 bit-interleaving can be used to calculate the addresses for character strings and positive integer numbers.

If for a data type the bit-lexicographic order on the binary representation of tuples does not match the natural ordering on this data type, a transformation function with that desired property must be defined. For address calculation this transformation function is applied to the attribute prior to bit-interleaving. If the transformation is bijective, the inverse of the transformation function is applied after bit extraction to obtain Cartesian coordinates. Figure 5-4 shows the architecture for the address calculation for arbitrary data types.

[Figure 5-4: Dealing with arbitrary data types. (a) Address calculation: each attribute of the tuple is transformed into a bit string according to its data type, and the bit strings are interleaved into the address α. (b) Cartesian calculation: the bit strings are extracted from the address α and re-transformed into the attributes of the tuple.]

It is beyond the scope of this thesis to analyze transformation functions for further data types. We just give a definition of transformation functions here:

Definition 5-2 (transformation function): We call a function f from any ordered domain $(\Omega, <)$ to a binary string a transformation function if and only if, for $a, b \in \Omega$:

$$a < b \Rightarrow f(a) \le_{b_1 \ldots b_n} f(b)$$

For positive integers and character strings the transformation function is the identity function. For integer numbers in complement representation the simplest transformation function is to invert the sign bit. Transformation functions for real numbers and date/time data types were implemented under the supervision of the author and are described in the MISTRAL documentation [MOD99]. [4]

Note that in order to apply any algorithm from the previous chapter, like the Tetris algorithm, the spiral algorithm, or the range query algorithm, we require a transformation function to be neither bijective nor injective. However, if a transformation function is bijective, re-transformation to derive Cartesian tuples from an address is possible.

[4] Note that because of their binary nature, transformation functions are hardware dependent. One specific problem of our implementation was to deal with the little/big endian representation [PH90] of machine words and binary strings in various microprocessors.

Dealing with low Cardinality Domains: Enumeration Types

Character strings very often have a similar prefix in the first bits. All capital letters of the Roman alphabet in ASCII have the identical binary prefix 010; all digits have the constant binary prefix 0011. Thus combining character strings and integer numbers easily results in a strong puff pastry of the UB-Tree.

Many applications use the domain of character strings merely to represent an enumeration type, i.e., a domain consisting of a small set of distinct values. A typical example is the SHIPMODE attribute of the LINEITEM table of the TPC-D benchmark. SHIPMODE has the domain of character strings, although the set {REG AIR, AIR, RAIL, SHIP, TRUCK, MAIL, FOB} represents all permissible values for SHIPMODE. There is no order on the values of SHIPMODE. For instance, it makes no sense to ask for all tuples satisfying the textual comparison SHIPMODE < AIR.

Definition 5-3 (enumeration type): We call the data type of an attribute an enumeration type if its actual domain consists of a relatively small finite set of values, usually listed explicitly.

In order to maximize the entropy of an enumeration type of a domain Ω we define an order preserving one-to-one map f and its inverse function $f^{-1}$:

$$f : \Omega \rightarrow \{0, 1, \ldots, |\Omega| - 1\}, \text{ such that for } a, b \in \Omega: f(a) < f(b) \Leftrightarrow a < b$$

If there is no reasonable ordering on an enumeration type, we drop the requirement on f to be order preserving and merely require:

$$f : \Omega \rightarrow \{0, 1, \ldots, |\Omega| - 1\}, \quad f \text{ injective}$$

We call f a surrogate function for an enumeration type. For each value $a \in \Omega$ we call f(a) the surrogate of a. For a very compact representation we number surrogates in sequential order.

[Table 5-3: Two surrogate mappings for an enumeration type. For each of the SHIPMODE values REG AIR, AIR, RAIL, SHIP, TRUCK, MAIL, and FOB the table lists the surrogates assigned by the mappings f and g, each as a number and in binary representation.]

The surrogate mapping f of Table 5-3 for the enumeration type SHIPMODE uses running numbers from 0 to 6.
The function g defines a mapping which is more suitable than the mapping f with respect to the space partitioning of UB-Trees: If the order of values from top to bottom in Table 5-3 represents the insertion order of SHIPMODE values, a puff pastry as discussed in Section 3.0 is avoided even for an actual domain consisting only of the first four values. This is not true for the mapping f, since the first four values have the leading bit as common prefix. Thus if an enumeration type is built on the fly (i.e., its actual domain may grow during the course of time), a surrogate function should vary the leading bits first, since these bits will be used for space partitioning.

Using running numbers for the surrogate of an enumeration type may result in a sub-optimal space partitioning, since the leading bit might not be used (see Section 3.0). If running numbers are used and the binary representation of each surrogate is reversed with the function $mirror(b_1 \ldots b_n) = b_n \ldots b_1$, the leading bit of the surrogate will already partition the domain. However, since the natural order on integers is lost by mirroring, this mapping should not be used if an enumeration type is restricted not only to points, but also to ranges.

To store an enumeration type in a UB-Tree, we map the enumerated set bijectively onto a set of surrogates. For v distinct values in the enumeration, only $\lceil \log_2 v \rceil$ bits are used to represent the enumerated set. Since all of these bits are used, a puff pastry is avoided to a large extent. Thus the number of Z-regions overlapping the query box but not contributing to the result set is minimized by that mapping. Note that if the values are not uniformly distributed, a puff pastry effect will still exist. However, the degree of the puff pastry will be much lower than for the character string representation. If an enumeration is not required to be order preserving, mirroring the bits of a surrogate may help to further reduce the probability of a puff pastry.

Multidimensional Hierarchical Clustering

Often, especially in data warehousing applications, hierarchies are built on enumeration types to provide structure to otherwise flat dimensions. Hierarchies then are used to provide an appropriate method of describing the levels of semantically meaningful aggregations for a dimension. With a relational modeling this means that some attributes of a table are functionally dependent. We call this special kind of functional dependence hierarchical dependence.

Definition 5-4 (hierarchical dependence): We call an ordered sequence of n attributes $A_1, A_2, A_3, \ldots, A_{n-1}, A_n$ hierarchically dependent if and only if for any $i \in \{1, \ldots, n\}$ the domain of attribute $A_i$ functionally depends on the values of attributes $A_1, \ldots, A_{i-1}$.

Definition 5-5 (hierarchical independence): We call a sequence of attributes $A_1, A_2, A_3, \ldots, A_{n-1}, A_n$ hierarchically independent if they are not hierarchically dependent.

OLAP queries often impose restrictions with respect to hierarchies over multiple dimensions. These restrictions usually are point restrictions or interval restrictions on some hierarchy level [Sar97]. The result set satisfying these restrictions is usually quite large; for presentation it is grouped and aggregated or ranked. Clustering data with respect to multiple hierarchies can substantially speed up these operations.
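The mirror function introduced for enumeration surrogates earlier in this section can be sketched as follows. The sketch assumes surrogates are running numbers stored in a fixed number of bits; the function name `mirror` follows the text, while the parameter `bits` is an assumption of this illustration.

```python
def mirror(surrogate, bits):
    """Reverse the `bits`-bit binary representation b_1...b_n into b_n...b_1."""
    mirrored = 0
    for _ in range(bits):
        mirrored = (mirrored << 1) | (surrogate & 1)   # move the lowest bit to the front
        surrogate >>= 1
    return mirrored


if __name__ == "__main__":
    # Running numbers 0..6 in 3 bits: the mirrored surrogates alternate
    # their leading bit from the first few values on.
    for number in range(7):
        print(number, format(mirror(number, 3), "03b"))
```

Mirroring twice returns the original surrogate, so the mapping stays bijective; as noted in the text, it gives up the natural integer order and is therefore only suitable for point restrictions.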

Hierarchies on Dimensions

For our definition of multidimensional hierarchical clustering we use a set concept to formally define hierarchies: A dimension A consists of a base type having a set of values $\{v_1, \ldots, v_n\}$. A hierarchy of depth h over A is an ordered set of levels, i.e., $H = \{0, \ldots, h\}$ (see Figure 5-5). Each hierarchy level i of H over A is a set of sets $\{m^i_1, \ldots, m^i_j\}$ with $m^i_k \subseteq A$ for $k = 1, \ldots, j$. Each $m^i_k$ is a member set (or member) of the hierarchy of level i containing all members of a category. Usually a member m is assigned a name label(m) (e.g., Orange Juice) instead of enumerating all values $v_k \in m$. The subset relationship between the members of two neighboring levels i and i+1 defines a hierarchical relation (i.e., partial ordering) between the levels (e.g., the product OJ 0,7L is in the product category Orange Juice). Increasing the level of a hierarchy increases the granularity of the categorization, i.e., the data is classified according to finer categories.

[Figure 5-5: Example hierarchy in member set representation. Levels: All Products (level 0), Product Category (Orange Juice, Apple Juice), and Container Size (e.g., OJ 0,33L, OJ 0,7L, Apple Juice 0,5L). The legend marks level label, member ordinal, and member label.]

The nodes of H are the hierarchy members (or member labels) connected by edges which are defined by the subset relationship between members of neighboring levels. The children of a member $m^i_k$ of level i are all members $m^{i+1}_l$ of the lower level i+1 that are subsets of $m^i_k$, i.e.,

$$children(m^i_k) = \{ m^{i+1}_l \mid m^{i+1}_l \subseteq m^i_k \}$$

(e.g., the container sizes of Apple Juice form the children set of Apple Juice). With

    (1) the base set as the only member of level 0 (i.e., level 0 is {A}),
    (2) $m^i_k \cap m^i_l = \emptyset$ for all i, k, l and $k \neq l$,
    (3) $\bigcup children(m^i_k) = m^i_k$ for all i, k,

a hierarchy H forms a hierarchy tree [5] with the root level 0. The parent of a member $m^i_k$ of level i then is the member $m^{i-1}_l$ of the upper level i-1 that is a superset of $m^i_k$, i.e.,

$$parent(m^i_k) = \{ m^{i-1}_l \mid m^{i-1}_l \supseteq m^i_k \}$$

(e.g., Orange Juice is the parent of OJ 0,7L).

[5] We will explain how to deal with complex hierarchies (i.e., directed acyclic graphs) later in this thesis. Formally these hierarchies are modeled by dropping the requirement on H to be an ordered set of levels. Neighboring levels are then defined by coarsest refinement.

126 0 CHAPTER 5: PROTOTYPE IMPLEMENTATION The bjectve functon or m efnes a numberng scheme for the chlren of a member m of H. Or m assgns each subset (chl) of m a number between 0 an the total number of chlren of m.e., or m : chlren( m) {0,..., chlren( m) } (see Fgure 5-5 for an example). Herarches shoul never relate members of fferent mensons, snce mensons are nepenent an thus such a herarchy coul be splt up n two separate herarches (see Secton ) Clusterng Herarches For each menson herarchy OLAP queres usually restrct some herarchy level to a pont. Sometmes an orerng on the levels of a herarchy exsts (e.g., a tme herarchy has an orer on the ays, months an years). Then nterval restrctons on a herarchy level may also occur. Our goal s to cluster ata wth respect to that partal orer efne by the herarchy levels. For queres wth large result sets one-mensonal clusterng reuces sk accesses by a factor of P. Clusterng of one-mensonal objects an sngle object herarches has been scusse to a large extent (e.g., [ZSL98], [BK89], [Sal88]). If the orer of mensons urng rll own s known n avance, clusterng the ata n ths orer wll result n a goo query performance. In prncple, a concatenate clusterng nex (.e., B * -Tree) on the herarchy levels of all mensons n one lexcographc orer s mantane. However, wth mensons wth h herarchy levels over menson, there are ( Σ h )! Π ( h!) possble lexcographc orerngs. For a 4-mensonal ata cube (wth 4 herarchy levels for menson, 4 for menson, for menson 3 an for menson 4) there are possble orerngs. Thus there s a hgh probablty that the pre-efne clusterng orer wll not be very useful for a partcular query Encong Herarches by Surrogates To effcently encoe herarches, we ntrouce the concept of compoun surrogates for herarches. 
Since we require hierarchies to form a disjoint partitioning, a uniquely identifying compound surrogate for each child node of a hierarchy member exists and can be recursively calculated by concatenating ($\circ$) the compound surrogate of the member with the running number of the child node as calculated by the surrogate function ord from Section ….

Definition 5-6 (compound surrogate): For a member $m_i$ of hierarchy level $i$ of hierarchy $H$ we define its compound surrogate:

$$cs(H, m_i) = \begin{cases} \mathrm{ord}_{\mathrm{father}(m_i)}(m_i), & \text{if } i = 1 \\ cs(H, \mathrm{father}(m_i)) \circ \mathrm{ord}_{\mathrm{father}(m_i)}(m_i), & \text{otherwise} \end{cases}$$

Definition 5-7 (path through a hierarchy): A path $\Phi$ through a hierarchy of depth $h$ is specified by a list of members $m_1, \ldots, m_h$, where $m_i$ is a member of hierarchy level $i$. The compound surrogate for a hierarchy path $\Phi$ then is calculated as:

$$cs(H, \Phi) = cs(H, m_h) = \mathrm{ord}_{\mathrm{father}(m_1)}(m_1) \circ \mathrm{ord}_{\mathrm{father}(m_2)}(m_2) \circ \ldots \circ \mathrm{ord}_{\mathrm{father}(m_h)}(m_h)$$

[Figure 5-6: Part of a hierarchy. CUSTOMER tree with the regions South Europe, North America and Asia, the nations Canada and USA, the trade types Retail and Wholesale, the customer type Bar, and the customers Kana's Sushi Bar and Joe's Sports Bar]

The hierarchy path North America → USA → Retail → Bar (Figure 5-6) has the compound surrogate:

$$\mathrm{ord}_{\text{Customer}}(\text{North America}) \circ \mathrm{ord}_{\text{North America}}(\text{USA}) \circ \mathrm{ord}_{\text{USA}}(\text{Retail}) \circ \mathrm{ord}_{\text{Retail}}(\text{Bar}) = 4 \circ \ldots$$

The upper limit of the domain for surrogates of level $i$ is calculated as the maximum fan-out (number of children) minus one of all members of level $i-1$ of a hierarchy $H$, i.e.,

$$\mathrm{surrogates}(H, i) = \max\{\mathrm{cardinality}(\mathrm{children}(H, m)) \text{ where } m \in \mathrm{level}(H, i-1)\} - 1$$

With $l_i = \lceil \log_2 \mathrm{surrogates}(H, i) \rceil$ a fixed length compound surrogate can be stored in a very compact way by binary encoding. 6

$$cs(H, \Phi) = cs(H, m_h) = \mathrm{ord}_{\mathrm{father}(m_1)}(m_1) \cdot 2^{\,l_2 + \ldots + l_h} + \mathrm{ord}_{\mathrm{father}(m_2)}(m_2) \cdot 2^{\,l_3 + \ldots + l_h} + \ldots + \mathrm{ord}_{\mathrm{father}(m_h)}(m_h)$$

With $h = 4$ and $l_1 = 3$, $l_2 = \ldots$, $l_3 = 3$ and $l_4 = 3$ the above formula leads to the compound surrogate cs(H, Bar) = ….

Usually growth expectations for a hierarchy are known well in advance. Often hierarchy trees are even static. Therefore it is possible to determine a reasonable number of bits for storing each surrogate of the compound surrogate of a hierarchy. Since hierarchy trees grow exponentially, the overall number of bits necessary to store a compound surrogate is relatively small. For instance, a hierarchy tree with four branches on 8 levels already represents $4^8 = 65536$ partitions and is stored by 16 bits.

6 In general we use variable length compound surrogates that need $l(m) = \lceil \log_2 |\mathrm{children}(m)| \rceil$ bits to store the surrogate for any child of $m$.
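The fixed length binary encoding above can be sketched as follows. This is a minimal illustration, not the MISTRAL implementation: the function name `compound_surrogate` and the example ordinals and bit lengths are assumptions chosen for illustration, since the concrete values of the example in the text are not fully recoverable.

```python
from typing import Sequence


def compound_surrogate(ordinals: Sequence[int], bit_lengths: Sequence[int]) -> int:
    """Concatenate the per-level ordinals of a hierarchy path into one integer.

    ordinals[i] is the running number of the member on level i+1 among the
    children of its father; bit_lengths[i] is the number of bits reserved
    for that level (l_i in the text).
    """
    if len(ordinals) != len(bit_lengths):
        raise ValueError("one bit length per hierarchy level is required")
    cs = 0
    for ordinal, bits in zip(ordinals, bit_lengths):
        if not 0 <= ordinal < (1 << bits):
            raise ValueError(f"ordinal {ordinal} does not fit into {bits} bits")
        # Shifting left by `bits` and OR-ing is the bit-level concatenation.
        cs = (cs << bits) | ordinal
    return cs


# Hypothetical path with ordinals 4, 1, 0, 2 and bit lengths 3, 2, 3, 3.
example = compound_surrogate([4, 1, 0, 2], [3, 2, 3, 3])
print(format(example, "011b"))  # 100 01 000 010 written without the gaps
```

Because the ordinals of upper levels occupy the most significant bits, comparing two such integers compares the paths level by level, which is why the encoding keeps the lexicographic order of the hierarchy.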

The lexicographic order on the hierarchy levels is preserved by this very compact fixed length encoding. Point restrictions on upper hierarchy levels result in range restrictions on the finest granularity of a hierarchy. For instance, the point restriction NATION = USA on the second level of the CUSTOMER hierarchy with $f(\text{North America}) = 4 = 100_2$ and $f(\text{USA}) = \ldots$ maps to the range restriction cs_customer between … and …. Thus, a star join with this surrogate encoding for the foreign keys of a fact table results in a range restriction on each compound surrogate, if some hierarchy level of each dimension is restricted to a point (e.g., customer region = USA). In the same way intervals on the children of one hierarchy level result in a range of the corresponding compound surrogates (e.g., year = 1998 and month between April and June). A star join on a schema with $d$ dimensions creates a $d$-dimensional interval restriction on the fact table.

Dealing with Complex Hierarchy Graphs

If two levels of a hierarchy graph are linked by several paths, there are several possibilities to define a hierarchy tree and therefore several ways to calculate the compound surrogates for physical clustering:

- If the order on the lowest level of granularity is identical for two hierarchy paths, then one path can be derived from the other path by an order preserving function on the lowest level of granularity. Then the clustering order for both hierarchy paths is identical. Thus, the clustering order for WEEK and MONTH in Figure 5-7a is identical. Both can be computed by an order preserving function from DAY, the lowest granularity level of the TIME hierarchy.
- If the query profile is known, the most useful path of the hierarchy graph used for restrictions, sort operations, or grouping should be chosen. Thus, if in Figure 5-7b queries on CUSTOMER usually restrict REGION and NATION, this path should be chosen for clustering.
- If the query profile is not known, all paths of a hierarchy graph may be used for clustering, since hierarchies may be used for restrictions independently during drill-down.
For clustering, the different paths then can be considered to be independent dimensions. In the hierarchy graph of Figure 5-7b both the REGION hierarchy and the CUSTOMER hierarchy might be used for clustering. However, this approach increases the clustering dimensionality and thus should be used with care.

[Figure 5-7: Complex hierarchy graphs. (a) TIME hierarchy over YEAR, MONTH, WEEK and DAY; (b) CUSTOMER hierarchy over REGION, NATION, CUSTOMER TYPE, TRADE TYPE, CUSTOMER SIZE and CUSTOMER]
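The mapping from a point restriction on an upper hierarchy level to a range restriction on the compound surrogate, as described in the surrogate encoding above, can be sketched as follows. The function name `surrogate_range` and the example ordinals and bit lengths are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def surrogate_range(prefix_ordinals: Sequence[int],
                    bit_lengths: Sequence[int]) -> Tuple[int, int]:
    """Return (low, high) of all fixed length compound surrogates that
    start with the given path prefix.

    prefix_ordinals fixes the upper levels of the hierarchy (the point
    restriction); the remaining levels are left unrestricted.
    """
    fixed = len(prefix_ordinals)
    if fixed > len(bit_lengths):
        raise ValueError("prefix is longer than the hierarchy depth")
    prefix = 0
    for ordinal, bits in zip(prefix_ordinals, bit_lengths):
        if not 0 <= ordinal < (1 << bits):
            raise ValueError(f"ordinal {ordinal} does not fit into {bits} bits")
        prefix = (prefix << bits) | ordinal
    free_bits = sum(bit_lengths[fixed:])
    low = prefix << free_bits                 # all unrestricted bits set to 0
    high = low | ((1 << free_bits) - 1)       # all unrestricted bits set to 1
    return low, high


# Hypothetical restriction of the first two levels to the ordinals 4 and 1
# in a hierarchy with bit lengths 3, 2, 3, 3.
low, high = surrogate_range([4, 1], [3, 2, 3, 3])
print(low, high)
```

Every member below the restricted node has a surrogate between `low` and `high`, so a point restriction on any upper level becomes a single interval on the stored key.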

Other issues in the context of complex hierarchies are unbalanced hierarchies, slowly changing dimensions and multiple inheritance.

Unbalanced hierarchies occur if some hierarchy members have more child levels than others. This means that the compound surrogates of Joe's Sports Bar and Kana's Sushi Bar in Figure 5-6 have different lengths. Using variable length compound surrogates or padding the shorter compound surrogate with zero bits solves this problem without any impact on clustering.

[Figure 5-8: Slowly changing dimensions. CUSTOMER tree with the regions South Europe, North America and Asia, the nations Canada and USA, the trade types Retail and Wholesale, and the customer types Bar (Year < 1997) and Restaurant (Year > 1997) for Joe's Sports Bar]

Slowly changing dimensions can be addressed by marking each node of a hierarchy tree with a validity time interval. An object is physically clustered and retrieved with respect to its validity time. Re-organization of the physical clustering is not necessary: even with a new classification from a certain point in time on, the existing clustering should be correct from a historic perspective. If the business type of Joe's Sports Bar changes from bar to restaurant in 1998 (cf. Figure 5-8), all previously clustered data still is correct. The total sales over all bars in 1997 must include Joe's Sports Bar, whereas it is included in restaurants for 1998. However, each object of a hierarchy needs information about re-classification in order to correctly calculate the total sales to Joe's Sports Bar over the last years.

Multiple inheritance (e.g., Joe's Sports Bar is considered to be both a bar and a restaurant at the same time) is solved similarly to slowly changing dimensions: one of the several possible paths to a hierarchy node is chosen for clustering. The other paths of a hierarchy graph to that object then merely store a pointer to the sub-tree that actually stores the object. If multiple aggregation paths are possible, precautions must be taken that only one of these paths is used for aggregation.

Addressing Sparsity

Sparsity is defined as the percentage of a domain that is not existent in the actual domain.
In data warehousing applications the multidimensional universe is often called a data cube. For a multidimensional universe, i.e., a data cube, sparsity is the ratio between the number of cells not containing any data and the overall number of cells of the data cube. Some OLAP tools allow one to mark dimensions as sparsely populated and then handle them specially.

However, a multidimensional cube is formed as the cross product over the domains of all dimensions. Therefore, even for non-sparse dimensions the sparsity of the entire cube soon becomes extremely high. The Juice & More schema of Section 8.3, for instance, is a star schema with four independent dimensions with a sparsity of 99.8%: sparsity(Juice & More) ≈ 0.9984, with

$$\mathrm{sparsity}(\text{star schema}) = 1 - \frac{|\text{Fact Table}|}{\prod_i |\text{Dim Table}_i|}$$

To our knowledge sparsities of more than 99% are typical for data warehousing applications. The TPC-D schema (see Figure -), for instance, can be regarded as a snowflake schema with shared hierarchies consisting of three independent dimensions:

- part + supplier (combined dimension with 0.8 million records coming from 0.2 million parts from 10 thousand suppliers)
- customer + order (combined dimension with 1.5 million orders from 150 thousand customers)
- time (2557 records for seven years on the aggregation level of a single day)

For a fact table of 6 million records (a TPC-D scaling factor of 1) the resulting data cube has a sparsity of more than 99.99999%.

Thus, in practice sparsity forbids materializing an entire data cube of raw data. Physical data organization in a multidimensional array is only feasible for highly aggregated data. However, serious decision support applications require a deep drill down into interesting areas of a data cube. Therefore it is necessary to have a physical representation of a sparsely populated data cube that allows efficient access to some part of that cube. With multidimensional hierarchical clustering, drill down defines a subspace of a data cube by range restrictions in several dimensions. Therefore a method to cluster sparse data with respect to several dimensions in combination with an efficient range query and sort algorithm is necessary for efficient handling of drill down queries. The surrogate calculating function of Section … can use any multidimensional access method to implement multidimensional hierarchical clustering.
However, using any variant of R-Trees [Gut84, BKS+90, BKK96] may result in sub-optimal performance, since R-Trees may subdivide the universe into overlapping tiles, which may result in multiple accesses to one disk page. Therefore the most interesting candidates are Grid-Files [NHS84], hB-trees [LS90], or space filling curves in combination with one-dimensional access methods [OM84, Jag90]. Multidimensional hierarchical clustering is particularly well suited to the UB-Tree, since the UB-Tree hierarchically organizes the data space. The hierarchy is directly reflected by the binary representation of the compound surrogates. Thus bit-interleaving as used by the UB-Tree causes a multidimensional partitioning whose boundaries directly reflect the hierarchy levels. Partial match queries with point restrictions on upper hierarchy levels then result in multidimensional range queries.
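Returning to the sparsity ratio defined at the start of this section, the following sketch evaluates it for a star schema. The function name `sparsity` is an assumption, the cardinalities are the TPC-D scale factor 1 table sizes as read from the text above, and the calculation assumes that every fact record occupies its own cell.

```python
from math import prod
from typing import Sequence


def sparsity(fact_rows: int, dimension_sizes: Sequence[int]) -> float:
    """Fraction of empty cells in the cube spanned by the dimensions.

    Assumes at most one fact record per cell, so the number of occupied
    cells equals the number of fact records.
    """
    cells = prod(dimension_sizes)
    if cells <= 0:
        raise ValueError("every dimension needs at least one member")
    return 1.0 - fact_rows / cells


# Assumed TPC-D figures: 6 million facts over part+supplier (0.8 million),
# customer+order (1.5 million) and time (2557 days).
tpcd = sparsity(6_000_000, [800_000, 1_500_000, 2557])
print(f"{tpcd:.12f}")
```

The cube has about 3 * 10^15 cells for 6 * 10^6 records, so fewer than two cells in a billion are occupied, in line with the sparsity of more than 99.99999% stated in the text.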

Processing OLAP Queries on Multidimensionally Clustered Data with the Tetris-Algorithm

Figure 5-9 illustrates how the Tetris algorithm is used to calculate the total sales for each different fruit juice for all customers in Asia on a relation partitioned by a two-dimensional UB-Tree and hierarchies on the CUSTOMER and PRODUCT dimensions. The restriction of the REGION to Asia results in an interval in the CUSTOMER dimension. The same holds for the restriction to Juice for PRODUCT. The boundaries of each query interval correspond to Z-region boundaries and thus minimize the number of Z-regions only partly contained in the query box. The query box is read in sort order from bottom to top; the aggregates for each juice type are calculated on the fly. The part of each Z-region from which tuples are cached is shaded. When all Z-regions intersecting the Orange Juice slice have been read, this slice is sorted and aggregated. In the same way the next slices (Apple Juice, Cherry Juice, etc.) are processed. This continues until the entire product interval defined by the restriction to Juice has been handled.

[Figure 5-9: Processing a query box in sort order with the Tetris algorithm. Axes: PRODUCT (Juice, with the slices Orange Juice and Apple Juice) and CUSTOMER (Asia)]

Materializing Aggregates

Multidimensional hierarchical clustering is not only applicable to the raw data itself, but can also be used to organize views with materialized aggregates. Higher aggregation levels result in a UB-Tree with shorter compound surrogates or reduced dimensionality. It makes sense to store pre-computed aggregates for the highest aggregation levels with restrictions in only one dimension, e.g., the total sales on a yearly basis. However, multidimensional hierarchical clustering allows one to derive many aggregates efficiently from the raw data. This avoids materialization of many aggregation levels and thereby reduces the view maintenance problem for summary tables to a large extent.

Dealing with non-uniformly distributed Data with Quantiles

In Section 3.0 we identified the puff pastry effect as a severe problem of UB-Trees for certain data distributions. The problem is due to the fact that UB-addresses are always calculated by splitting an attribute in the center of a subcube. The space partitioning can be improved considerably by taking the actual domain of each attribute into account. More precisely, if the attribute is not split in the middle of the domain, but in the middle of the data distribution, the space partitioning is greatly improved with respect to spatial proximity. This basically means that α-quantiles (i.e., the value of a domain where the cumulated data distribution exceeds α, see e.g., [Zw96]) are used to identify the Cartesian coordinate of each subdivision point α. Quantiles were previously used in combination with hashing techniques [KS87]. The variable UB-Tree is an approach which combines quantiles with UB-Trees.

The Variable UB-Tree

Definition 5-8 (variable UB-Tree, VUB-Tree): A variable UB-Tree (VUB-Tree) is a UB-Tree whose addresses are calculated by subdividing each dimension with respect to the quantiles according to the data distribution of this dimension.

VUB-Trees recursively subdivide the embedding space with respect to the data distribution. Thus for each dimension the first subdivision step compares each attribute value to the value for which the cumulated data distribution of this dimension exceeds 50% (50%-quantile). Accordingly, the two values for the second subdivision step are the values where the cumulated distribution exceeds 25% and 75% respectively (25%-quantile, 75%-quantile). Consequently, we will call these subdivision points split points. Note that for uniformly distributed data the split points of a VUB-Tree are identical to those of a UB-Tree.

VUB-Trees therefore use a transformation function to re-distribute the values of a domain. The resulting bit strings are equally distributed. This means that with variable UB-Trees the behavior of UB-Trees for uniformly distributed data is achieved for any data set which is distributed independently in the dimensions.
Figure 5-10 illustrates how the split points for various data distributions are derived. The thick line at 50% of the cumulated distribution calculates the split point for step 1, the two thinner lines at 25% and 75% calculate the split points for step 2, and the even thinner lines at 12.5%, 37.5%, 62.5% and 87.5% calculate the split points for step 3. Figure 5-10a displays polynomially distributed data. Some skewed data distribution is shown in Figure 5-10b. Figure 5-10c shows uniformly distributed data, which results in recursively halving the domain and thus yields the split points of non-variable UB-Trees.

The address calculation algorithm of VUB-Trees is just a slight modification of the UB-Tree address calculation. Values are now compared to split points instead of fixed subdivision points. Thus VUB-Trees use a special transformation function that calculates the bit string of an attribute with respect to the data distribution. The conventional bit interleaving algorithm is then used to calculate the addresses of the VUB-Tree.
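The two steps of the VUB address calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [Bau98]: the split points are taken from a sorted sample and kept in a flat sorted list (a binary search over that list makes the same decisions as descending a binary tree of split points), and the interleaving order (first dimension on the most significant position of each step) is an assumed convention. All function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def split_points(sorted_values: Sequence[float], bits: int) -> List[float]:
    """Quantile boundaries k / 2**bits for k = 1 .. 2**bits - 1,
    estimated from a sorted sample of one dimension."""
    n = len(sorted_values)
    buckets = 1 << bits
    return [sorted_values[min(n - 1, (k * n) // buckets)] for k in range(1, buckets)]


def transform(value: float, points: Sequence[float]) -> int:
    """Bit string of an attribute value (as an integer): the number of
    split points that are less than or equal to the value."""
    return bisect_right(points, value)


def interleave(coords: Sequence[int], bits: int) -> int:
    """Conventional bit interleaving of already transformed coordinates."""
    address = 0
    for level in range(bits - 1, -1, -1):      # most significant step first
        for c in coords:
            address = (address << 1) | ((c >> level) & 1)
    return address


# Skewed sample for one dimension: two bits per dimension give three split points.
sample = [1, 2, 4, 8, 16, 32, 64, 128]
points = split_points(sample, bits=2)
codes = [transform(v, points) for v in sample]
print(points, codes)
print(interleave([transform(64, points), transform(2, points)], bits=2))
```

Although the sample values are heavily skewed, every two-bit code is used equally often, which is the re-distribution effect the transformation function is meant to achieve.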

[Figure 5-10: Calculation of split points for various data distributions. Panels (a), (b) and (c); x-axis: attribute value; y-axis: cumulated distribution in % (0% to 100%)]

VUB-Trees yield a better space partitioning for non-uniformly distributed data. The transformation function performs a re-distribution of data. Uniformly distributed attribute bit strings are created from the data of any data distribution. Thus common prefixes do not exist for any attribute bit string. This avoids the puff pastry effect described in Section 3.0.

[Figure 5-11: UB-Tree and VUB-Tree for Gaussian/uniform data distribution; panels (a), (b), (c)]

Figure 5-11a shows data which is distributed uniformly in the vertical dimension and Gaussian in the horizontal dimension. The corresponding UB-Tree in Figure 5-11b shows a strong puff pastry, since the split points in the center of the space are not useful for the horizontal dimension. In contrast to that the VUB-Tree in Figure 5-11c shows an equal number of splits in both dimensions.

[Figure 5-12: UB-Tree and VUB-Tree for Gaussian data distribution; panels (a), (b), (c)]

Figure 5-12a shows data which is Gaussian distributed in both dimensions. Again the corresponding UB-Tree (Figure 5-12b) yields a puff pastry, which in this case is not as strong as in Figure 5-11b, since both dimensions have common prefixes. Since these prefixes do not have identical length, the VUB-Tree (Figure 5-12c) again yields a better partitioning than the UB-Tree.

[Figure 5-13: UB-Tree and VUB-Tree for centered Gaussian data distribution; panels (a), (b), (c)]

Figure 5-13a shows data with Gaussian distribution in both dimensions. Here the average and the standard deviation are identical for both dimensions. Therefore the partitioning does not yield a puff pastry (Figure 5-13b). However, variable UB-Trees produce a different space partitioning (Figure 5-13c). Figure 5-12 and Figure 5-13 show that for both UB-Trees and VUB-Trees the location of the data distribution does not matter if the prefixes in all dimensions are identical (see also Section 3.0).

Split Point Trees

Definition 5-9 (split point tree): A split point tree is a binary tree that stores the split point hierarchy. The root of the tree consists of the values where the cumulated distribution of each dimension exceeds ½. The left and right son store the values where the cumulated distributions exceed ¼ and ¾, respectively (25%-quantile, 75%-quantile). In general a split point tree of height $h$ stores $2^h - 1$ split points corresponding to the cumulated distribution in discrete steps of $k \cdot 2^{-h}$ for $k \in \{1, \ldots, 2^h - 1\}$.

A split point tree is an efficient way of storing the split points for each independent dimension of a variable UB-Tree. Assuming a split point tree of height $h$, a dimension then can contribute $h$ bits to the VUB-address. The total length of each address then is $d \cdot h$. Databases consisting of $P = 2^{d \cdot h}$ pages can be managed with a split point tree of height $h$. Thus, the height of the split point tree for each dimension is $h = \lceil \log_2 P / d \rceil$, which is quite small compared to the size of the database. For instance, a split point tree of height 4 for a 6-dimensional database is sufficient for a database size of up to $2^{24}$ pages. Table 5-4 lists the height of a split point tree for various database sizes and numbers of dimensions.
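The height formula above can be evaluated with a few lines of integer arithmetic. The function name `split_point_tree_height` is an assumption, and rounding up to the next whole level is assumed for page counts that are not exact powers of two.

```python
def split_point_tree_height(pages: int, dims: int) -> int:
    """Smallest height h such that 2**(dims * h) >= pages."""
    if pages < 1 or dims < 1:
        raise ValueError("pages and dims must be positive")
    total_bits = (pages - 1).bit_length()   # ceil(log2(pages)) for pages >= 1
    return -(-total_bits // dims)           # ceiling division by the dimensionality


def split_point_count(height: int) -> int:
    """Number of split points stored per dimension in a tree of that height."""
    return (1 << height) - 1


for dims in (2, 3, 6):
    h = split_point_tree_height(2**24, dims)
    print(dims, h, split_point_count(h))
```

For the example in the text, `split_point_tree_height(2**24, 6)` yields 4, so each of the six dimensions needs only 15 stored split points.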

[Table 5-4: Heights of a Split Point Tree for various Database Sizes and Dimensions]

Implementation of Variable UB-Trees

To implement variable UB-Trees, only the transformation function of the address calculation algorithm needs to be modified. A split point tree is used to determine the bit string of an attribute. Each decision in the split point tree decides to which part of the data distribution a point belongs. VUB-Trees were implemented by Michael Bauer in a master thesis [Bau98] supervised by the author. A more detailed description of split point trees and the algorithms involved can be found there.

Since VUB-Trees just change the attribute transformation, no modifications of other algorithms like range query, point query, or Tetris are necessary. However, VUB-Trees have two major drawbacks:

- The data distribution for each attribute must be known in advance.
- Dimensions must be independent.

Split point trees require the data distribution to be known in advance. This is possible if statistics on the data are available or expectations on the data exist. For OLAP data another efficient method is to use a bulk loading tool. A VUB-Tree can be built by two passes over the data. The first pass gathers the statistics and the second pass calculates the addresses for each tuple. If the data distribution changes, the entire VUB-Tree needs to be re-organized. Otherwise the VUB-address space might be exhausted, which causes the multidimensional clustering to be no longer useful, since many tuples will be addressed by the same address.

The dimensions of a VUB-Tree must be independent. For certain dependencies like the sine dependency the VUB-Tree cannot avoid the puff pastry effect or runs into problems with the height of the split point tree (see [Bau98]).

5.4 Implementing the Insertion Algorithm

The complexity of the insertion algorithm is logarithmic in the table size and relies essentially on the implementation of the point query algorithm. Since the implementation of the point query algorithm is discussed in Section 4.3, we focus on the Z-region split algorithm here.

Figure 5-14a shows that a Z-region partitioning usually does not subdivide a multidimensional space into rectangular Z-regions, but usually consists of Z-regions which are a union of several rectangular subcubes. We say that such a region has fringes. Thus fringes are a union of small subcubes which in addition to the largest subcube belong to one region. Fringes cannot be avoided in general, since the Z-region partitioning adapts to the data distribution as shown in Sections 3.7 and 3.8. However, it is desirable to minimize fringes, since a small fringe might cause a Z-region to overlap a query box. This Z-region then must be retrieved by the range query algorithm, although probably no points or only very few points of the Z-region are located in the query box.

A Z-region is split due to an overflow of the corresponding page. The Z-region split then creates two Z-regions from the Z-region being split. Minimizing fringes thus can be achieved by trying to create rectangular regions during the split. This may be achieved by giving this algorithm some more flexibility. The split algorithm consists of two parts:

- choose two points on the page not exactly in the middle, but close around the middle (ε-split)
- find the split address between the two points that yields the best rectangular partitioning

The best rectangular partitioning of a Z-region [α : β] is calculated by choosing a split address ξ where as many trailing bits as possible are set to 1. In the following we assume that a page is an ordered set of tuples $\{x_1, \ldots, x_C\}$. The split algorithm determines the Z-address ξ with the most trailing bits between the Z-addresses of the tuples $x_{\frac{1}{2}C - \varepsilon C}$ and $x_{\frac{1}{2}C + \varepsilon C}$. Depending on ε, fringes are avoided to a large extent. A worst case storage utilization of 50% − ε is guaranteed for each page.
For our tests with uniformly distributed data fringes get reduced to a large extent. With an ε of 5% the partitioning of uniformly distributed data gets very close to a uniformly idealized partitioning. Identical uniformly distributed data was spooled into the UB-Trees of Figure 5-14a, b and c. While the UB-Tree of Figure 5-14a only takes the tuple in the middle of the page into account for a region split, Figure 5-14b and c use an ε of …% and 40%, respectively. The picture clearly shows that fringes are minimized with growing ε at the expense of a reduced storage utilization. As we will see in Section 6..5, the range query performance is also improved by using an ε > 0. Our measurements showed that an ε of a few percent already has a substantial effect on the range query performance. For a detailed investigation of the effects of the ε-split please refer to [Sch98].
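The two parts of the ε-split described in this section can be sketched as follows. This is a minimal illustration and not the implementation evaluated in [Sch98]: the function names, the exact index arithmetic around the middle of the page, and the assumption that the two chosen tuples have different Z-addresses are all assumptions of this sketch.

```python
from typing import Sequence


def trailing_ones(x: int) -> int:
    """Number of consecutive 1-bits at the low end of x."""
    count = 0
    while x & 1:
        x >>= 1
        count += 1
    return count


def best_split_address(lo: int, hi: int) -> int:
    """Address xi with lo <= xi < hi that has the most trailing 1-bits."""
    if lo >= hi:
        raise ValueError("the two tuples must have different Z-addresses")
    for k in range(hi.bit_length(), -1, -1):
        candidate = lo | ((1 << k) - 1)   # smallest address >= lo ending in k ones
        if candidate < hi:
            return candidate
    return lo  # not reached: k = 0 always yields candidate == lo < hi


def epsilon_split(z_addresses: Sequence[int], eps: float) -> int:
    """Choose the split address for an overflowing page.

    z_addresses holds the sorted Z-addresses of the C tuples on the page;
    eps widens the window around the middle in which the split may fall.
    """
    c = len(z_addresses)
    mid = c // 2
    delta = int(eps * c)
    lo = z_addresses[max(0, mid - delta - 1)]
    hi = z_addresses[min(c - 1, mid + delta)]
    return best_split_address(lo, hi)


page = [3, 9, 12, 17, 22, 30, 41, 52]
print(epsilon_split(page, 0.0), epsilon_split(page, 0.25))
```

With ε = 0 the split must fall between the two middle tuples and only an address with two trailing 1-bits is available; with ε = 25% the window is wide enough to reach an address with five trailing 1-bits, at the price of an uneven distribution of the tuples over the two new pages.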

[Figure 5-14: Splitting Z-regions with ε = 0%, ε = …% and ε = 40%; panels (a), (b), (c)]

5.5 Implementing the Deletion Algorithm

The deletion algorithm works analogously to the deletion algorithm for B-Trees [BM72]. However, when a deletion causes a page underflow and the corresponding page merge causes an overflow, the subsequent page split utilizes the splitting algorithm of Section 5.4 to create a good space partitioning. See [Sch98] for details.

5.6 Implementing the Point-Query Algorithm

The point query algorithm for a tuple x reads a single Z-region [α : β] and the corresponding page from a UB-Tree. The region address β is defined as the nearest upper neighbor in the UB-Tree index table to the Z-address ξ of tuple x. As already mentioned in Section ..4, this nearest neighbor is found by a single ESQL query. This region address then is used by a second ESQL query to retrieve the entire data page.

    ESQL SELECT MIN(z-address) INTO region_address
         FROM UB-Tree_index_table
         WHERE z-address >= ξ

    ESQL SELECT number_of_tuples, data_page_content INTO tuples, page
         FROM UB-Tree_index_table
         WHERE z-address = region_address

The performance of the first ESQL query heavily relies on the DBMS optimizer. In general queries with an aggregation function and a restriction are evaluated by first retrieving all tuples satisfying the restriction and then applying the aggregation function on the retrieved tuple set. However, if a B*-Tree index on z-address exists, the optimal execution plan for this statement is different: the combination of the aggregation function MIN() with a ≥-restriction in the WHERE-clause allows one to determine the result by traversing just one path in the B*-Tree. A point query with ξ yields the page that stores the region address as the first value greater than or equal to ξ. However, only the TransBase optimizer created this optimal execution plan.
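The effect of the optimal execution plan, a single descent to the first stored address that is not smaller than ξ, can be mimicked in memory with a binary search. The sketch below is only an illustration of the nearest upper neighbor lookup; the function name and the sample region addresses are assumptions, and a sorted Python list stands in for the B*-Tree index on z-address.

```python
from bisect import bisect_left
from typing import Sequence


def region_address(region_addresses: Sequence[int], xi: int) -> int:
    """Smallest stored region address that is >= xi.

    region_addresses must be sorted ascending, as the entries of an index
    on z-address would be.
    """
    pos = bisect_left(region_addresses, xi)
    if pos == len(region_addresses):
        raise LookupError("xi lies beyond the last Z-region")
    return region_addresses[pos]


# Hypothetical region addresses (upper bounds beta of three Z-regions).
index = [19, 31, 63]
print(region_address(index, 20))  # the tuple with Z-address 20 lives in the region ending at 31
```

The binary search touches a logarithmic number of entries, just as the single path traversal in the index does, whereas the naive plan would first collect every address above ξ and only then take the minimum.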

Figure 5-15 shows the access plans of Oracle (a) and TransBase (b) for the SELECT-MIN-query. Figure 5-15c shows the actual operation that TransBase performs on the UB-Tree secondary index table in order to find the region address.

[Figure 5-15: Implementation of the point query algorithm. (a) MIN(z-address) applied to a selection σ on z-address over the UB-Tree index table; (b) combined MIN(σ z-address) on the UB-Tree index table; (c) operation on the index]

For DB2 and Oracle it was necessary to rewrite the query in the following way:

    ESQL DECLARE CURSOR minquery FOR
         SELECT z-address FROM UB-Tree_index_table
         WHERE z-address >= ξ
         ORDER BY z-address
    ESQL OPEN CURSOR minquery
    ESQL FETCH minquery INTO region_address
    ESQL CLOSE CURSOR minquery

    ESQL SELECT number_of_tuples, data_page_content INTO tuples, page
         FROM UB-Tree_index_table
         WHERE z-address = region_address

The ORDER BY clause of the rewritten query forces the optimizer to favor a secondary index on Z-address over a full table scan. Since Z-addresses are delivered in ascending order by the query, the first FETCH returns the minimum Z-address. Therefore the rewrite is equivalent to the original query.

5.7 Implementation of the Range Query Algorithm

The range query algorithm as described in Section 4.5 is exponential in the number of dimensions. A linear version of this algorithm was developed by the author. In addition the algorithm was optimized to operate solely on Z-addresses, which saves CPU time for transformations between Z-addresses and Cartesian tuples.

The main function of the range query algorithm is the function which, for a given Z-region, determines the next Z-region intersecting the query box. Next to the Z-addresses of start point y and end point z of a query box Q = [[y, z]], the region address β of a Z-region [α : β] suffices to perform the calculation.

5.7.1 Determining the next Intersection

The crucial task of the range query algorithm is to calculate the next region in Z-order which intersects the query box Q after a Z-region [α : β] (which intersects the query box) has been retrieved. This algorithm has been steadily refined in the MISTRAL project during the evolution of this thesis. While the first version of the algorithm was exponential in the number of dimensions and required a transformation of Z-addresses into Cartesian co-ordinates for each iteration, the final version merely requires O(a) bit operations (set bit and clear bit) for standard Z-addresses consisting of a bits.

The exponential version of the algorithm has been described in Section 4.4. In principle, this algorithm is divided into two steps:

- determination of the Z-address λ of the last point of the Z-region [α : β] intersecting the query box
- determination of the Z-address φ of the next Z-region intersecting the query box

Since each subdivision step divides the multidimensional space into $2^d$ subcubes, this algorithm is exponential in the number of dimensions. In addition, it requires a transformation of Z-addresses into Cartesian co-ordinates in order to test the intersection of the query box with the subcube. The overall CPU-complexity of this algorithm (in address transformations, integer comparisons and bit-operations) is therefore exponential in d.

A linear version of this algorithm can be developed, observing that dimensions are independent and query boxes are iso-oriented. Thus the intersection can be tested for each dimension independently. Then, to construct the Z-address φ it is not necessary to traverse $2^d$ subcubes for each step of φ and to check if that subcube intersects the query box. Instead, it suffices to test if dimension j intersects the interval $[y_j, z_j]$. Thus only d tests are necessary. However, a transformation of the address to Cartesian coordinates is still necessary for each test, thus the overall complexity is $O(a/d \cdot a/d) = O(a^2/d^2)$.

The linear algorithm can be further improved into a bit-oriented algorithm by transforming the co-ordinates y and z into addresses ψ and ζ.
Then a transformation of the current Z-addresses to Cartesian coordinates is not necessary to perform the intersection calculation; the intersection calculation is solely done on the Z-address representation. (It is possible to implement an intersection test on Z-addresses which determines by simple bit operations if a Z-address is located in a query box defined by two Z-addresses [MOD99].) Each bit of a Z-address is inspected at most twice by the bit-oriented algorithm in order to determine the first Z-address of the next Z-region intersecting the query box. Thus the overall complexity of the algorithm is O(a).
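One way to test whether a Z-address lies in a query box without converting it to Cartesian coordinates is to compare the bits of each dimension in place. The sketch below illustrates this idea with per-dimension bit masks; it is an assumption of this sketch and not necessarily the test documented in [MOD99]. The function names and the interleaving convention (first dimension on the most significant position of each step) are hypothetical.

```python
from typing import Sequence


def interleave(coords: Sequence[int], bits: int) -> int:
    """Z-address of a point: interleave the bits of all coordinates."""
    address = 0
    for level in range(bits - 1, -1, -1):
        for c in coords:
            address = (address << 1) | ((c >> level) & 1)
    return address


def dimension_mask(dim: int, dims: int, bits_per_dim: int) -> int:
    """Mask selecting the bits that belong to one dimension of a Z-address."""
    mask = 0
    for level in range(bits_per_dim):
        mask |= 1 << (level * dims + (dims - 1 - dim))
    return mask


def in_query_box(address: int, psi: int, zeta: int, dims: int, bits_per_dim: int) -> bool:
    """True if the point with this Z-address lies in the box [[psi, zeta]].

    Masking keeps the bits of one dimension at their original positions,
    which preserves the order of the coordinate values, so the masked
    integers can be compared directly.
    """
    for dim in range(dims):
        mask = dimension_mask(dim, dims, bits_per_dim)
        if not (psi & mask) <= (address & mask) <= (zeta & mask):
            return False
    return True


# Two dimensions with two bits each; query box from (1, 1) to (2, 3).
psi = interleave([1, 1], 2)
zeta = interleave([2, 3], 2)
inside = interleave([2, 2], 2)
outside = interleave([3, 1], 2)
print(in_query_box(inside, psi, zeta, 2, 2), in_query_box(outside, psi, zeta, 2, 2))
```

The point (3, 1) has a Z-address between ψ and ζ in Z-order but lies outside the box, which shows why a plain interval comparison of Z-addresses is not sufficient and a per-dimension test is needed.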

Additional details about the implementation of the range query algorithm can be found in the MISTRAL Source Code Documentation [MOD99]. Several experiments were performed to evaluate the practicability of the algorithm [Fr97]. The major results of these experiments were:

- The exponential algorithm is easy to implement; however, it is useful only for up to three dimensions.
- The linear algorithm is useful for up to 6-dimensional Z-addresses, if the result set size is sufficiently small. Otherwise the overhead for the Z-address/Cartesian transformation plays a significant role.
- The bit-oriented algorithm is insensitive to the complexity of the tuple transformation. Therefore UB-Trees using complex tuple transformations in order to create a good space partitioning (like the surrogate calculation (Sections … and 5.3.4) or variable UB-Trees (Section 5.3.5)) can be built without additional cost on the range query algorithm.

Dealing with Sets of Query Boxes

Retrieving a result set defined by a set of query boxes $\{Q_1, \ldots, Q_n\}$ is possible by only accessing each page of the result set once. This is achieved by calculating the first intersection $\varphi_i$ for each $Q_i$ as mentioned in Section …. These intersections $\{\varphi_1, \ldots, \varphi_n\}$ are stored in a heap sorted in Z-order. $\mathrm{MIN}(\varphi_i)$ then determines the next Z-region [α : β] to be retrieved by the range-query-set algorithm. Then all $\varphi_i$ with $\varphi_i \le \beta$ can be updated in the heap by removing these $\varphi_i$ and storing a new $\varphi_i$ calculated by the algorithm of Section …. This algorithm requires space O(n) for n query boxes and has a CPU-time complexity of O(a·n) for Z-addresses consisting of a bits. The I/O-complexity of the algorithm is O(m) (plus the overhead for the B-Tree access to each page) if m regions overlap the query boxes.

Further details on the implementation of the algorithm as well as performance measurements are described in [Fen98]. In addition, investigations about the approximation of arbitrary query volumes can be found there. Here we just sketch the main results:

- Dealing with sets of query boxes induces no additional CPU- and I/O-overhead to the range query algorithm.
- Additional main memory in O(n) is required to process a set of n query boxes.
- In the case of one query box the algorithm degenerates to the standard range query algorithm.
- Disk accesses are performed in Z-order, so if the relation is page clustered, this page clustering will be exploited by the algorithm.
- When sets of query boxes are used to approximate arbitrary query volumes, for uniformly distributed data the best performance is achieved when the volume of each query box gets close to the average volume of each Z-region.

5.8 The UB-Tree Library

The UB-Tree library consists of an API and an operator interface. The UB-API provides a C-interface for data manipulation and data definition functions. The functionality of the UB-API includes C-functions for

- UB_create, UB_drop, UB_rename
- UB_open, UB_close
- UB_insert, UB_delete
- UB_pointquery, UB_rangequery

While UB-API functions deal with result sets of tuples which are materialized in one processing step, the operator interface deals with streams of tuples. The operator interface essentially provides the same functionality as the UB-API. However, since the operator interface works with tuple streams and allows pipeline processing, it is better suited for using the UB-Tree functionality in an operator tree of an RDBMS. Next to the functions listed above a preliminary version of the Tetris algorithm for uniformly distributed data is implemented with the operator interface.

The UB-API has a tuple handling which allows one to use signed integers, unsigned integers and character strings for the calculation of Z-addresses. The concept of transformation functions has been implemented to easily plug in support for further data types. The current implementation allows one to calculate Z-addresses for UB-Trees with up to 3 index dimensions. A detailed description of the UB-Tree library is found on the MISTRAL web server [MOD99].

Part III Analysis of Query Processing with Multidimensional Indexes


Ideals are like the stars, we never reach them, but like the mariners of the sea, we chart our course by them. (Carl Schurz)

Chapter 6 Performance Analysis

Analyzing the cost of query processing with certain access methods is crucial to cost-based query optimization. Next to that it allows one to simulate query processing without actually creating the database and thus saves time and resources while gaining a better understanding of access methods. In addition a cost function is a benchmark for the query performance since it defines the expected response time and thus allows one to judge the quality of an access method implementation. Cost functions can further be used to predict the result sizes of a query or tell a user the expected processing time of a query before query execution has started.

Since multi-attribute restrictions and sort operations are very common in many applications (cf. Section .), in this chapter we investigate the cost of range queries and the Tetris algorithm for UB-Trees. We define cost functions for page accesses for idealized uniformly partitioned universes. The quality of our cost function is proven by comparing the predicted number of page accesses with the actual number of page accesses measured with our prototype implementation. We also explain how our cost function is linked to the selectivity of a multidimensional query box. Then we define a cost model for range queries with sort operations for various further access methods and do an analytical performance comparison of UB-Trees with bitmap indexes, clustering B-Trees, non-clustering B-Trees and a full table scan. Chapters 7 and 8 will show that the results predicted by our cost functions also hold in practice.

6.1 The Cost of UB-Tree Range Queries

For an analysis of UB-Tree range query performance it is desirable to have a cost function for the retrieved pages, i.e., the regions overlapped by a query box. This enables the prediction of the run time of a range query and yields a basis for query optimization. In addition, a cost function makes it possible to produce a statistically relevant number of measurements by simulating range queries with varying table sizes and dimensionalities. This is especially useful when practical measurements with our pilot implementation cannot be performed due to their memory requirements or their long run time. The cost function also allows a theoretical analysis of range query performance and provides an excellent insight into the importance of the attribute order. It also explains the page jump phenomenon of UB-Tree range queries.

Several master students supervised by the author used this cost function to simulate the behavior of the UB-Tree for several database sizes and dimensionalities [Fr97], to analyze the effect of region fringes and to judge the quality of the region splitting strategy [Sch98], to analyze the space partitioning of VUB-Trees [Bau98] and to analyze the behavior of UB-Trees for sets of query boxes [Fen98].

To be useful for a broad range of applications, a cost function should not require too much knowledge about the queried database. The minimum requirements for input parameters of a cost function are the database size and the query restriction. For multidimensional index structures the dimensionality of the index is also necessary. These parameters are in general easy to obtain and maintain. Deriving a cost function from these input parameters alone is not achievable in general, since the data distribution is another decisive factor for query performance. Yet it is possible to develop such a cost function for a certain case of space partitioning, the so-called idealized uniform partitioning, consisting of uniformly distributed and independent attributes.

6.1.1 A Cost Function for Perfect Idealized Uniform Partitioning

In the following we use $P$ for the number of data pages of a UB-Tree table. The query box is specified by the lower bounds vector $y$ and the upper bounds vector $z$. The lower bound and upper bound of the restriction in dimension $j$ are denoted by $y_j$ and $z_j$. The UB-Tree has been built over a total of $d$ dimensions.

Definition 6-1 (perfect idealized uniform partitioning): A perfect idealized uniform partitioning (i.e., $P = 2^{d \cdot k}$ for some $k > 0$, cf. Figure 6-1a for a two-dimensional example) subdivides the multidimensional space into $P$ hypercubes with volume $2^{-d \cdot k}$. Otherwise we call an idealized uniform partitioning imperfect.

Thus for perfect idealized uniform partitioning each dimension $j$ of the universe has been partitioned recursively $l_j(d, P) = \log_2 P / d = k$ times. In this case the number of Z-regions intersected by a query box $[[y, z]]$ is identical to the number of subcubes overlapped by $[[y, z]]$.

Figure 6-1: The cost function for perfect idealized uniform partitioning (panels (a) and (b), showing the query box Q[[y,z]])

In the following we assume the boundaries of the query interval in each dimension to be normalized to the interval $[0, 1]$; the normalized bounds are written $\hat{y}_j$ and $\hat{z}_j$. If we have $s$ complete split levels in dimension $j$, the number of slices of the multidimensional space that are overlapped by the query box $[[y, z]]$ in dimension $j$ can be determined by calculating the number of slices that are contained in the interval $[0, \hat{z}_j]$, but not in the interval $[0, \hat{y}_j]$, i.e., the number of slices in $[0, \hat{z}_j]$ minus the number of slices in $[0, \hat{y}_j]$. Since the slice containing $\hat{y}_j$ is also a slice overlapped by the query box, we must increment the above number by one to get the correct number of slices.

If the number of slices for the interval $[0, \hat{c}]$ is calculated as $\lfloor \hat{c} \cdot 2^s \rfloor$, a non-existing slice $2^s + 1$ is added for $\hat{z}_j = 1$ by the formula derived above. We must correct this error for the case $\hat{y}_j < 1$ and $\hat{z}_j = 1$. This is achieved by decrementing the number of slices by one. For $\hat{y}_j = \hat{z}_j = 1$ the subtraction removes the error, therefore no correction is necessary here.

Thus the number of slices $n(y_j, z_j, l_j)$ in dimension $j$ overlapped by the query interval $[y_j, z_j]$ for $l_j$ complete splits in dimension $j$ can be calculated by the following formula:

$$n(y_j, z_j, l_j) = \begin{cases} \lfloor \hat{z}_j \cdot 2^{l_j} \rfloor - \lfloor \hat{y}_j \cdot 2^{l_j} \rfloor, & \text{if } \hat{z}_j = 1 \text{ and } \hat{y}_j < 1 \\ \lfloor \hat{z}_j \cdot 2^{l_j} \rfloor - \lfloor \hat{y}_j \cdot 2^{l_j} \rfloor + 1, & \text{otherwise} \end{cases}$$

The number of subcubes intersected by the query box, $c(d, P, y, z)$, then is the product of $n(y_j, z_j, l_j(d, P))$ over all dimensions:

$$c(d, P, y, z) = \prod_{j=1}^{d} n(y_j, z_j, l_j(d, P))$$
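The slice count and the cost function for perfect idealized uniform partitioning can be sketched in a few lines of Python. The function names `slices` and `cost_perfect` are hypothetical, and the sketch assumes query bounds that are already normalized to [0, 1] and a page count of the form 2^(d*k).

```python
import math


def slices(y, z, l):
    """Slices overlapped by the normalized interval [y, z] after l complete splits."""
    n = math.floor(z * 2**l) - math.floor(y * 2**l) + 1
    if z == 1.0 and y < 1.0:
        n -= 1          # remove the non-existing slice counted for z = 1
    return n


def cost_perfect(d, P, y, z):
    """Number of subcubes (pages) overlapped by the query box [[y, z]]."""
    k = round(math.log2(P) / d)
    if k < 1 or 2 ** (d * k) != P:
        raise ValueError("P must equal 2**(d*k) for some k > 0")
    return math.prod(slices(yj, zj, k) for yj, zj in zip(y, z))


if __name__ == "__main__":
    # Two dimensions, 16 pages: every dimension is split twice (4 slices each).
    print(cost_perfect(2, 16, [0.3, 0.0], [0.6, 0.2]))
```

A query box covering the whole universe yields P, which is a quick plausibility check for the boundary correction at an upper bound of 1.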

6.1.2 A Cost Function for Semi-perfect Idealized Uniform Partitioning

An imperfect idealized uniform partitioning produces rectangular regions, where each region either has the shape of a subspace with volume $2^{-\lceil \log_2 P \rceil}$ or consists of two of these subspaces. In this case the multidimensional space has been partitioned recursively $\lfloor \lfloor \log_2 P \rfloor / d \rfloor$ times for some dimensions and $\lfloor \lfloor \log_2 P \rfloor / d \rfloor + 1$ times for some other dimensions. One dimension may exist where some parts of the space have already been partitioned $\lfloor \lfloor \log_2 P \rfloor / d \rfloor + 1$ times, while other parts of the space have only been partitioned $\lfloor \lfloor \log_2 P \rfloor / d \rfloor$ times. Because of these considerations we distinguish two cases of imperfect partitioning:

Definition 6-2 (semi-perfect and probabilistic idealized uniform partitioning): If $P = 2^k$ for some $k > 0$, we call an imperfect idealized uniform partitioning semi-perfect. Otherwise we call it probabilistic.

Definition 6-3 (probabilistic dimension): For a probabilistic idealized uniform partitioning we call a dimension probabilistic if, with respect to this dimension, some parts of the space have been partitioned $\lfloor \lfloor \log_2 P \rfloor / d \rfloor + 1$ times, while other parts of the space have only been partitioned $\lfloor \lfloor \log_2 P \rfloor / d \rfloor$ times.

Lemma 6-1: Each probabilistic idealized uniform partitioning has exactly one probabilistic dimension.

Proof: Bit interleaving takes place in a fixed order of dimensions. Our implementation of bit interleaving starts with the rightmost dimension. $2^l$ splits need to take place to completely split the space with respect to split level $l$ (i.e., bit $l$ of the binary representation of the Z-address). After these $2^l$ splits have taken place (i.e., enough data has been inserted into the UB-Tree), the next split takes place at split level $l + 1$ (i.e., bit $l + 1$ of the binary representation of the Z-address). This split level corresponds to the next bit in the binary representation of Z-addresses as obtained by bit interleaving. Therefore it splits the next dimension in the order of dimensions used by bit interleaving. Since splits complete one split level before moving to the next split level, only one dimension may have both subspaces with split level $l$ and subspaces with split level $l + 1$.

As a consequence of the proof of Lemma 6-1, the index of the probabilistic dimension in the order of dimensions used by bit interleaving (Algorithm 5-) is calculated as:

$$\mathrm{probabilistic}(d, P) = d - (\lfloor \log_2 P \rfloor \bmod d)$$

Example 6-1: Split levels for perfect and imperfect uniform partitioning are illustrated in Table 6-1 for a 6-dimensional space. For a table size of 64 pages the space is perfectly partitioned ($k = 1$, $d = 6$) with one split level for each dimension. With 512 ($k = 9$) pages this space is partitioned semi-perfectly with one split level in the first three dimensions and two split levels in the last three dimensions. With a page number of 700, the partitioning is probabilistic with dimension 3 as probabilistic dimension.

                        Split Levels per Dimension
Pages                   Dim 1   Dim 2   Dim 3   Dim 4   Dim 5   Dim 6
64 (perfect)            1       1       1       1       1       1
512 (semi-perfect)      1       1       1       2       2       2
700 (probabilistic)     1       1       >1      2       2       2

Table 6-1: Split levels

For a semi-perfect idealized uniform partitioning the number of complete splits $l_j(d, P)$ with respect to dimension $j$ is calculated as

$$l_j(d, P) = \begin{cases} l(d, P), & \text{if } j \le d - (\lfloor \log_2 P \rfloor \bmod d) \\ l(d, P) + 1, & \text{otherwise} \end{cases} \qquad \text{where } l(d, P) = \left\lfloor \frac{\lfloor \log_2 P \rfloor}{d} \right\rfloor$$

With $l_j(d, P)$ as defined above, $c(d, P, y, z)$ is calculated in the same way as for perfect idealized uniform partitioning:

$$n(y_j, z_j, l_j) = \begin{cases} \lfloor \hat{z}_j \cdot 2^{l_j} \rfloor - \lfloor \hat{y}_j \cdot 2^{l_j} \rfloor, & \text{if } \hat{z}_j = 1 \text{ and } \hat{y}_j < 1 \\ \lfloor \hat{z}_j \cdot 2^{l_j} \rfloor - \lfloor \hat{y}_j \cdot 2^{l_j} \rfloor + 1, & \text{otherwise} \end{cases}$$

The key attributes are independent and the query box $[[y, z]]$ is iso-oriented with respect to each dimension. Therefore the total number of pages is obtained by multiplying the numbers of slices in each dimension:

$$c(d, P, y, z) = \prod_{j=1}^{d} n(y_j, z_j, l_j(d, P))$$
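The classification of a partitioning and the split levels of Example 6-1 can be reproduced with a short Python sketch. The function names are hypothetical; dimensions are numbered 1 to d, and the floor of the binary logarithm is taken from the bit length of the integer page count.

```python
def floor_log2(P):
    """Floor of log2(P) for a positive integer P."""
    return P.bit_length() - 1


def classify(d, P):
    """Return 'perfect', 'semi-perfect' or 'probabilistic' for P pages in d dimensions."""
    log_p = floor_log2(P)
    if 2**log_p != P:
        return "probabilistic"
    return "perfect" if log_p % d == 0 else "semi-perfect"


def complete_splits(d, P):
    """Number of complete split levels for each dimension 1..d."""
    log_p = floor_log2(P)
    base = log_p // d
    boundary = d - (log_p % d)
    return [base if j <= boundary else base + 1 for j in range(1, d + 1)]


def probabilistic_dimension(d, P):
    """Index of the dimension that is split next (meaningful for probabilistic P)."""
    return d - (floor_log2(P) % d)


if __name__ == "__main__":
    for pages in (64, 512, 700):
        print(pages, classify(6, pages), complete_splits(6, pages))
    print(probabilistic_dimension(6, 700))
```

For 700 pages the complete split levels are the same as for 512 pages; the additional regions only affect the probabilistic dimension.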

6.1.3 A Cost Function for Probabilistic Idealized Uniform Partitioning

If an idealized uniform partitioning is probabilistic, the formula $n(y_j, z_j, l_j(d, P))$ needs to be modified for the probabilistic dimension to take the probability of an incomplete split into account.

Figure 6-2: The cost function for probabilistic uniform partitioning (panels (a) and (b), showing the query box Q[[y,z]])

The complete split levels produce only $2^{\lfloor \log_2 P \rfloor}$ pages, thus $P - 2^{\lfloor \log_2 P \rfloor}$ additional regions are needed to obtain the given number of pages, i.e., the table size. These regions are created from the $2^{\lfloor \log_2 P \rfloor}$ pages by splitting these pages with respect to attribute $j$. Therefore the probability of an additional split in an attribute $j$ is:

$$\mathrm{probability}_j(d, P) = \begin{cases} \dfrac{P}{2^{\lfloor \log_2 P \rfloor}} - 1, & \text{if } j = \mathrm{probabilistic}(d, P) \\ 0, & \text{otherwise} \end{cases}$$

If the probability of an incomplete split is taken into account, the number of slices overlapped by a query range in a certain dimension can be derived from the value for the complete splits. By subtraction we calculate how many slices would be overlapped additionally if another complete split existed. For each of these splits the probability of its existence is $\mathrm{probability}_j(d, P)$. The average number of additional slices may then be calculated by multiplication. Thus, the average number of slices in dimension $j$ overlapped by the range $[y_j, z_j]$ is:

$$n_j(d, P, y_j, z_j) = n(y_j, z_j, l_j(d, P)) + \big( n(y_j, z_j, l_j(d, P) + 1) - n(y_j, z_j, l_j(d, P)) \big) \cdot \mathrm{probability}_j(d, P)$$

The key attributes are independent and the query box $[[y, z]]$ is iso-oriented with respect to each dimension. Thus the total number of pages is obtained by multiplying the numbers of slices in each dimension, as in the previous sections.
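A minimal Python sketch of the probabilistic cost function follows. All function names are hypothetical; the sketch assumes normalized query bounds, dimensions numbered 1 to d, and an integer page count. It interpolates linearly between the slice counts for l and l + 1 splits in the probabilistic dimension only.

```python
import math


def slices(y, z, l):
    """Slices overlapped by the normalized interval [y, z] after l complete splits."""
    n = math.floor(z * 2**l) - math.floor(y * 2**l) + 1
    return n - 1 if (z == 1.0 and y < 1.0) else n


def complete_splits(d, P, j):
    """Complete split levels of dimension j (1-based)."""
    log_p = P.bit_length() - 1
    base = log_p // d
    return base if j <= d - (log_p % d) else base + 1


def split_probability(d, P, j):
    """Probability of one additional split in dimension j."""
    log_p = P.bit_length() - 1
    if j != d - (log_p % d):
        return 0.0
    return P / 2**log_p - 1


def expected_slices(d, P, j, y, z):
    """Average number of slices in dimension j overlapped by [y, z]."""
    l = complete_splits(d, P, j)
    base = slices(y, z, l)
    extra = slices(y, z, l + 1) - base
    return base + extra * split_probability(d, P, j)


def cost(d, P, y, z):
    """Expected number of pages overlapped by the query box [[y, z]]."""
    return math.prod(
        expected_slices(d, P, j, y[j - 1], z[j - 1]) for j in range(1, d + 1)
    )


if __name__ == "__main__":
    # A query box covering the whole 6-dimensional universe should cost P pages.
    print(cost(6, 700, [0.0] * 6, [1.0] * 6))
```

For a perfect or semi-perfect page count the probability evaluates to zero, so the same function also covers the two earlier cases.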

6.1.4 Cost Function and Selectivity

For independently uniformly distributed data the restriction of each attribute in percent of space also defines the selectivity of that attribute. Thus the restriction $[y_j, z_j]$ in attribute $A_j$ has a selectivity of $s_j = \hat{z}_j - \hat{y}_j$. $n_j(d, P, y_j, z_j)$ can be considered to be $\breve{s}_j \cdot \sqrt[d]{P}$, with $\breve{s}_j$ as a special ceiling function rounding the selectivity $s_j$ of dimension $j$ to the next partitioning grid point. Then the cost function can also be represented as:

$$c(d, P, s) = \prod_{j=1}^{d} n_j(d, P, 0, s_j) = P \cdot \prod_{j=1}^{d} \breve{s}_j$$

Note that by using $0$ and $s_j$ instead of $y_j$ and $z_j$ some accuracy of the cost function is lost, since the position of the query box influences the number of Z-regions overlapped by a query box. Thus $c(d, P, y, z)$ should be used instead of $c(d, P, s)$ if not only the selectivity, but also the start and end point of the query in space are known.

6.1.5 The Cost Function: Theory and Practice

Figure 6-3 shows the number of pages predicted by the cost function as well as the number of pages actually retrieved for two range query series. Both measurement series were conducted on a six-dimensional UB-Tree of 8 pages storing 0 million uniformly distributed tuples. The left series shows a query that constantly restricts five dimensions to 40% of their domain and varies the sixth dimension from % to 100% of its domain. The right series varies the restriction in each dimension from % of its domain to 100% of its domain at the same time. Thus the left series creates a linearly growing result set, whereas the result set of the right series grows with the 6th power.

Figure 6-3: Cost function and reality (two panels, predicted and retrieved pages plotted against the restriction in %)

Both series show the accuracy of the cost function with an average prediction error of 8%. The maximum deviation is % in the left series and 30% in the right series.

The deviation is due to two facts:

The data is merely distributed uniformly; there is actually no idealized uniform partitioning.
The model is probabilistic, thus a certain error is inherent to the cost model.

The effect of the ε-splitting algorithm (cf. Section 5.4) on the exactness of the cost function is shown in Figure 6-4. The six-dimensional UB-Tree of this figure consists of million tuples of uniformly distributed data stored on about 000 pages (the actual number of pages differs for each ε, since the degree of filling of each page depends on ε). The figure shows the pages actually retrieved for various values of ε as well as the theoretically expected value from the cost function for a query which restricts five dimensions to 5% of their domain and varies the sixth dimension from % to 100%. The larger ε gets, the better the region partitioning approximates an idealized uniform partitioning. The figure therefore shows the accuracy of the cost function and the potential of the region splitting algorithm to reduce the number of pages overlapped by a range query. In general a trade-off between space partitioning (i.e., range query performance) and page utilization exists. However, as the figure shows, an ε of around 3% is already quite effective. Due to the probabilistic nature of the cost function, an ε of 100% sometimes even retrieves fewer pages than predicted. For more details on the ε-splitting algorithm please refer to Sections 4. and 5.4.

Figure 6-4: ε-split and cost function (number of Z-regions intersecting the query box plotted against the restriction in %, for several values of ε and for the predicted value)

6.2 A Cost Model

In the following we define a cost model to compare the cost of various access methods for range queries. Our cost model takes both CPU time and I/O time for query processing into account. For I/O time we consider both clustered access and random access.

Clustering places data that is likely to be accessed together physically close to each other. The goal of clustering is to limit the number of disk accesses required to process a query by increasing the likelihood that query results have already been cached. We distinguish two kinds of clustering:

Tuple clustering stores tuples of one or several relations on one disk page, if the tuples are likely to be used together to create the result set of a query. If the tuples do not fit on one page, they have to be stored on several pages. Normally new pages are physically placed on disk in insertion order.
Page clustering, in addition to tuple clustering, also maintains physical clustering between disk pages.

Let $t_\pi$ be the (average case or worst case) positioning time of a hard disk, $t_\tau$ be the transfer time of a hard disk and $t_\xi$ the CPU time (see footnote 7) spent for page processing after retrieval. We assume that the prefetching strategy of the file system reads a physical cluster of $L$ consecutive pages from disk into the read-ahead cache with one random access. This takes time $t_\pi + (t_\tau + t_\xi) \cdot L$. In addition we assume that each page has a capacity of $C$ tuples.

With this cost model page clustering (i.e., reading $k$ tuples in consecutive order) takes time

$$c_{page}(C, L, t_\pi, t_\tau, t_\xi, k) = \min\left( \left\lceil \frac{\lceil k / C \rceil}{L} \right\rceil + 1,\; k \right) \cdot t_\pi + \max\left( \lceil k / C \rceil,\; L \right) \cdot (t_\tau + t_\xi)$$

Tuple clustering requires positioning the read/write head of the hard disk for every page and therefore takes time

$$c_{tuple}(C, t_\pi, t_\tau, t_\xi, k) = \min\left( \lceil k / C \rceil + 1,\; k \right) \cdot (t_\pi + t_\xi + t_\tau)$$

Random access without caching requires positioning the read/write head of the hard disk for every tuple and therefore takes time

$$c_{random}(t_\pi, t_\tau, t_\xi, k) = k \cdot (t_\pi + t_\tau + t_\xi)$$

In the following considerations we assume that a query retrieves large result sets. We consequently omit the $+1$ in the cost formulas for page clustering and tuple clustering, since this additional summand is negligible for large result sets. Basically this means that we assume a best case distribution of the tuples on pages so that the first tuple of the result set also is the first tuple on a page (i.e., a result set of fewer than $C$ tuples will always be stored on one page).

Footnote 7: Note that $t_\xi$ heavily depends on the specific query. For complex restrictions like multidimensional intervals or IN-clauses of SQL, $t_\xi$ may be considerably higher than for simple single-attribute restrictions.
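The three access cost formulas can be compared directly in a small Python sketch. The function names and the parameter values in the usage example are illustrative assumptions, not measurements from the thesis; page counts are rounded up, which is one reading of the formulas.

```python
import math


def c_page(C, L, t_pos, t_trans, t_cpu, k):
    """Time to read k tuples with page clustering (clusters of L pages prefetched)."""
    pages = math.ceil(k / C)
    return min(math.ceil(pages / L) + 1, k) * t_pos + max(pages, L) * (t_trans + t_cpu)


def c_tuple(C, t_pos, t_trans, t_cpu, k):
    """Time to read k tuples with tuple clustering (one positioning per page)."""
    return min(math.ceil(k / C) + 1, k) * (t_pos + t_trans + t_cpu)


def c_random(t_pos, t_trans, t_cpu, k):
    """Time to read k tuples by random access (one positioning per tuple)."""
    return k * (t_pos + t_trans + t_cpu)


if __name__ == "__main__":
    # Assumed values: 30 tuples per page, 16 pages per prefetch, times in ms.
    C, L, t_pos, t_trans, t_cpu, k = 30, 16, 10.0, 0.5, 0.5, 3000
    print(c_page(C, L, t_pos, t_trans, t_cpu, k))
    print(c_tuple(C, t_pos, t_trans, t_cpu, k))
    print(c_random(t_pos, t_trans, t_cpu, k))
```

With these assumed values the positioning time dominates, so the gap between the three variants is driven almost entirely by the number of random accesses.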

6.3 The Cost of Range Queries

For retrieving or sorting a relation in combination with multidimensional hierarchical restrictions we define cost functions for response times and intermediate temporary storage. Our analysis considers a UB-Tree, a composite secondary index (CSI, clustering B*-Tree) over all attributes (foreign keys of each dimension and measure attributes) of a fact table, a single secondary index (SSI, non-clustering B*-Tree) on the attribute with the least selectivity and a full table scan (FTS). In addition we analyze the performance of bitmap index intersection (BII), which combines the bitmaps of each restricted attribute to determine the result set of the query. Note that an index organized table (IOT) as introduced in Section .3 can be regarded as a special case of a CSI, which stores the entire tuple in the B-Tree leaf pages and therefore does not have to store any tuple identifier there.

An FTS answering multidimensional range queries with selectivity $s_j$ in dimension $j$ can exploit prefetching techniques to reduce the number of random page accesses at the expense of having to read the entire table.

Using a CSI with a composite B*-Tree in lexicographic order $A_1, \ldots, A_d$ allows one to use the index for the restriction in $A_1$ at the expense of having a random access for each page.

An SSI on $A_j$ requires a random page access for each tuple satisfying the restriction in $A_j$, since no clustering of the tuples is available. The number of random accesses of an SSI is limited to $P$ if the row identifiers of the SSI are sorted and then processed in physical page order for data page retrieval. For point restrictions on the index attribute, sorting of row identifiers may even be avoided: index pages for tuples with identical index attributes may be organized in the physical order of the row identifiers. Point restrictions then yield a list of row identifiers sorted according to the physical location of the tuple. As a result, an SSI does not degenerate further and behaves similarly to an FTS in the worst case.

BII requires a random access for each tuple satisfying the restrictions in all attributes. In addition the corresponding part of each bitmap index has to be retrieved. In analogy to an SSI, the result of BII is a bitmap which is used to access data pages in physical order. Thus multiple random accesses to one data page will not occur. Figure 6-5 shows how two specific bitmap indexes process a two-dimensional partial match query.

Figure 6-5: Bitmap index intersection (bitmap for organization TM, bitmap for region Asia, result of bitmap intersection, accessed disk pages shaded for Page 1 to Page 5; 80% of pages are accessed)

For each restriction, the bitmap is retrieved from the corresponding bitmap index. After intersecting these two bitmaps by a bitwise AND operation, the tuples corresponding to 1-bits are retrieved (the zero bits in the figure are denoted by a "." instead of a "0"). In the figure we assume $C = 10$ tuples to fit on one page, thus ten consecutive bits correspond to the tuples on one disk page. The selectivities for both dimensions are 3% respectively 34%, resulting in an overall selectivity of 0%. Since the data is not clustered on the pages, the query needs to retrieve 80% of the fact table to retrieve 0% of the tuples. In practice this ratio is even worse: actual values for $C$ range between 0 and 400 for 8 kB pages.

The shaded part of each cube in Figure 6-6 shows the part of a three-dimensional database which is retrieved by the corresponding access method to answer a three-dimensional range query with ($s_1$ = 5%, $s_2$ = 33%, and $s_3$ = 7%): an FTS retrieves the entire database, exploiting clustering and prefetching. In contrast to that, an SSI will rarely utilize any clustering benefits for small result sets. BII retrieves each bitmap by clustered access, whereas the data itself will often be spread over many data pages and then must be retrieved by random access to each page. However, for larger result sets the probability rises that prefetching might be applicable for bitmap indexes. This means that BII will not be much less efficient than an FTS in the worst case. A CSI with $A_3$ as first attribute in concatenation order utilizes clustering but only exploits the 7% restriction on $A_3$. In contrast to that, the UB-Tree utilizes the restrictions of all dimensions and retrieves the data in a clustered way.

Figure 6-6: Access methods and clustering (FTS: page clustering; SSI: random access; BII: random tuple access; CSI: tuple clustering; UB-Tree: tuple clustering)

6.3.1 Cost Functions

Using the cost model of Section 6.2 we calculate the cost of processing a fact table consisting of $T$ tuples stored on $P$ pages restricted by a multidimensional interval $Q = [[y, z]] = [y_1, z_1] \times \ldots \times [y_j, z_j] \times \ldots \times [y_d, z_d]$ with a selectivity of $s_j$ in attribute $A_j$. For UB-Trees we assume a $d$-dimensional hierarchical organization of the table. For secondary indexes we assume $B_j$ to be the size in pages of a secondary index on $A_j$. Figure 6-7 displays cost formulas for these access methods.

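FTS:
$$c_{FTS}(P) = \left( \frac{t_\pi}{L} + t_\tau + t_\xi \right) \cdot P$$

CSI:
$$c_{CSI}(P, s_1, \ldots, s_d) = (t_\pi + t_\tau + t_\xi) \cdot P \cdot s_1$$

SSI on dimension $j$:
$$c_{SSI}(T, s_1, \ldots, s_d, B_j) = (t_\pi + t_\tau + t_\xi) \cdot s_j \cdot B_j + \left( \frac{t_\pi}{\mathrm{prefetch}(T, P, s_1, \ldots, s_d)} + t_\tau + t_\xi \right) \cdot \frac{T \cdot s_j}{\mathrm{cluster}(T, P, s_1, \ldots, s_d)}$$

BII (first term: bitmap retrieval and intersection; second term: tuple retrieval by random access, with prefetching for large result sets):
$$c_{BII}(T, s_1, \ldots, s_d, B_1, \ldots, B_d) = (t_\pi + t_\tau + t_\xi) \cdot \sum_{j=1}^{d} s_j \cdot B_j + \left( \frac{t_\pi}{\mathrm{prefetch}(T, P, s_1, \ldots, s_d)} + t_\tau + t_\xi \right) \cdot \frac{T \cdot \prod_{j=1}^{d} s_j}{\mathrm{cluster}(T, P, s_1, \ldots, s_d)}$$

UB:
$$c_{UB}(d, P, y, z) = (t_\pi + t_\tau + t_\xi) \cdot \prod_{j=1}^{d} n_j(d, P, y_j, z_j) \approx (t_\pi + t_\tau + t_\xi) \cdot P \cdot \prod_{j=1}^{d} s_j$$

Figure 6-7: Cost functions for retrieval of a multidimensional interval

Interestingly, non-clustering indexes may utilize tuple clustering for large result sets, since the probability that tuples of the result set are located on the same page increases with growing result set size. Our cost functions take this clustering effect of non-clustering secondary indexes into account by introducing a function cluster that grows from 1 to $C$ with growing result set size of a query and interpolates linearly with the selectivity:

$$\mathrm{cluster}(T, P, s_1, \ldots, s_d) = \max(1, C \cdot \mathrm{selectivity}(Q)) = \begin{cases} \max\left(1, C \cdot \prod_{j=1}^{d} s_j\right) & \text{for secondary index intersection} \\ \max(1, C \cdot s_j) & \text{for a single secondary index on } A_j \end{cases}$$

Thus min(cluster) = 1 and max(cluster) = $C$.

In the same way secondary indexes may utilize page clustering when retrieving large result sets. The function prefetch($T, P, s_1, \ldots, s_d$) denotes the actual prefetching for the retrieval of a result set by secondary indexes and grows from 1 to $L$: with growing result set size the probability that tuples are located on consecutive pages increases.

$$\mathrm{prefetch}(T, P, s_1, \ldots, s_d) = \min\left(L, \max\left(1, \frac{L \cdot T \cdot \mathrm{selectivity}(Q)}{P}\right)\right) = \begin{cases} \min\left(L, \max\left(1, \frac{L \cdot T \cdot \prod_{j=1}^{d} s_j}{P}\right)\right) & \text{for SII} \\ \min\left(L, \max\left(1, \frac{L \cdot T \cdot s_j}{P}\right)\right) & \text{for SSI on } A_j \end{cases}$$

Thus min(prefetch) = 1 and max(prefetch) = $L$. For T 60 6 tuples on P pages, $C = 30$ and $L = 16$, Figure 6-8 shows the functions prefetch and cluster depending on the selectivity of the query. As soon as the selectivity exceeds $1/C$ (3,33%) several tuples of the result set are stored on one page. Prefetching already shows benefits when the selectivity exceeds $1/(L \cdot C) \approx 0{,}2\%$.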
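The two interpolation functions can be written down as a minimal Python sketch. The function names mirror the text, but the signatures are simplified assumptions: the selectivity of the query is passed in directly instead of the individual s_j, and the table size in the usage example is an illustrative value chosen so that T / P equals C = 30.

```python
def cluster(C, selectivity):
    """Average number of result tuples per retrieved page, between 1 and C."""
    return max(1.0, C * selectivity)


def prefetch(T, P, L, selectivity):
    """Average number of result pages fetched per random access, between 1 and L."""
    return min(float(L), max(1.0, L * T * selectivity / P))


if __name__ == "__main__":
    T, P, C, L = 60_000_000, 2_000_000, 30, 16   # assumed example values
    for sel in (0.001, 0.01, 0.05, 0.5, 1.0):
        print(sel, cluster(C, sel), prefetch(T, P, L, sel))
```

Both functions stay at 1 for very selective queries and saturate at C and L respectively, which matches the stated minimum and maximum values.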

Figure 6-8: prefetch($T, P, s_1, \ldots, s_d$) and cluster($T, P, s_1, \ldots, s_d$) (panels (a) and (b), showing the factors prefetch, cluster and prefetch * cluster plotted against the selectivity of the query)

These definitions are used for prefetch and cluster in the cost functions of Figure 6-7. Note that in order to utilize tuple clustering or page clustering, the tuple identifiers of a secondary index (intersection) need to be sorted with respect to the physical storage order. Sorting can be avoided if the restriction is a point restriction and tuples with identical index keys are stored in physical order in the secondary index.

Note that tuples are delivered in physical order and not in the sort order of any index if clustering is utilized in this way by secondary indexes. With optimized storage structures, like a bitmap representation for row identifiers, partial match queries can be handled efficiently. However, the loss of the index key order limits the usability of the approach for range queries in combination with sort operations on an index attribute (see Section 6.4).

6.3.2 Simulation Results

Current operating systems usually prefetch L 6 pages with one random access. We assume t_π 0 [ms], t_τ 0,6 [ms] and t_ξ 0,4 [ms]. We assume a four-dimensional organization of the UB-Tree.

Figure 6-9a shows the cost [in s] for a range query with ($s_1$ = 33%, $s_2$ = 25%, $s_3$ = 17%, $s_4$ = 100%) against four-dimensional UB-Trees compared to other access techniques. The table size is varied from one page up to one million pages.

Varying the selectivity of the restriction in $A_1$ for a table size of $P$ = 878k pages (about 7 GB for a page size of 8 kB) shows that UB-Trees are superior to a CSI on the most selective attribute $A_3$, since this CSI cannot exploit any restriction but the one on $A_3$. UB-Trees can exploit the restrictions on $A_1$ and $A_2$ in addition to the restriction on $A_3$. Thus UB-Trees are also superior to an FTS and to BII using bitmap indexes for all four dimensions (Figure 6-9b).

Note that due to the high selectivity in $A_3$ an SSI always degenerates to an FTS. For an overall selectivity of 75% · 17% · 25% · 100% = 3,1875% an FTS is already preferable to BII. Since bitmap indexes do not cluster the data, the result set defined by the restrictions in all dimensions must be sufficiently small for BII to be competitive.

Figure 6-9: Simulation of a four-dimensional range query ((a) cost [time] in s plotted against the table size in pages; (b) cost [time] in s plotted against the selectivity in dimension $A_1$ in %)

6.4 The Cost of Range Queries with Sort Operations

The Tetris algorithm of Section 4.7 allows one to process queries with multi-attribute restrictions and sort operations. Since only very little main memory is needed, in general no external sorting will be necessary. In this section the response times and storage requirements of the Tetris algorithm are compared to those of a merge sort algorithm in combination with any of the access methods of Section 6.3.

6.4.1 Cost Functions for Secondary Storage

The Tetris cache is considerably smaller than the temporary storage of $P \cdot \prod_{j} s_j$ pages required by the merge sort algorithm that is necessary after the data has been retrieved by an FTS or any index on a restricted attribute. To sort by $A_j$ the Tetris algorithm just needs to cache one slice, i.e.,

$$\mathrm{cache}_{Tetris}(d, P, y, z, j) = \prod_{i = 1, \ldots, d;\; i \ne j} n_i(d, P, y_i, z_i)$$

For a two-dimensional UB-Tree the above formula results in a square root function of the number of Z-regions overlapping the query box, i.e., $\mathrm{cache}_{Tetris}(2, P, s_1, s_2, j) = \sqrt{P \cdot s_1 \cdot s_2}$ for $j \in \{1, 2\}$.
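The size of the Tetris cache can be sketched in Python as the product of the slice counts of all dimensions except the sort dimension. The function names are hypothetical, the sort dimension is given as a 0-based index, and the slice counts in the usage example are assumed values for a two-dimensional query box.

```python
import math


def tetris_cache_pages(slices_per_dim, sort_dim):
    """Pages of one slice: product of slice counts over all dimensions but sort_dim."""
    return math.prod(n for i, n in enumerate(slices_per_dim) if i != sort_dim)


def pages_to_sort(slices_per_dim):
    """Pages overlapped by the whole query box, i.e., what a merge sort must hold."""
    return math.prod(slices_per_dim)


if __name__ == "__main__":
    n = [129, 65]                       # assumed slice counts in dimensions 1 and 2
    print(tetris_cache_pages(n, 0))     # cache needed when sorting by dimension 1
    print(pages_to_sort(n))             # temporary storage of a full merge sort
```

The comparison shows why the cache stays small: it grows with the extent of the query box in the other dimensions only, not with the number of slices in the sort dimension.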

6.4.2 Cost Functions for Response Time

For the following considerations we assume a merge sort algorithm using a main memory of $M$ pages and a merge degree of $m$. We divide the sort process into a retrieval phase (which retrieves the data to create initial runs for the merge sort) and a sort phase (which actually performs the merge sort).

Because of the multi-attribute filtering of the retrieval phase, the data set to be sorted is usually smaller than the entire table. With $s_j$ denoting the selectivity of the restriction in attribute $A_j$ and independence of the attributes, $P \cdot \prod_j s_j$ pages need to be sorted.

The cost functions of Section 6.3 can be used to calculate the cost of the retrieval phase to create the initial runs for the sort operation. If an access method does not return the tuples in the requested sort order, sorting with the cost of $c_{sort}$ takes place. If $M \ge P \cdot \prod_j s_j$, sorting takes place in main memory; $c_{sort}$ then is the cost of an internal sort operation. $c_{sort}$ is zero if a CSI with $A_1$ as first attribute is used for the retrieval phase and the sorting attribute is $A_1$, since the data then is already retrieved in the desired sort order.

$$c_{sort}(d, P, C, m, M, s_1, \ldots, s_d) = \begin{cases} t \cdot P \prod_{j} s_j \cdot \log\left( P \prod_{j} s_j \right), & \text{if } M > P \prod_{j} s_j \\ \underbrace{P \prod_{j} s_j}_{\text{pages to sort}} \cdot \underbrace{\left\lceil \log_m \frac{P \prod_{j} s_j}{M} \right\rceil}_{\text{number of merge phases}} \cdot \underbrace{(t_\pi + t_\tau + t_\xi)}_{\text{consecutive access, read \& write}}, & \text{otherwise} \end{cases}$$

As shown in the previous section, SSI and CSI on restricted attributes are only efficient for fairly small result sets. In this case sorting would take place in main memory. One can expect that as soon as external sorting is necessary, SSI and CSI are no longer efficient for the retrieval phase. Therefore we do not consider SSI and CSI in this section, since we focus on external sorting here.

The Tetris algorithm has to sort each cached slice. Since the algorithm reads $n(y_j, z_j, l_j)$ slices, the overall cost of internal sorting according to $A_j$ is:

$$c_{Tetris}(d, P, y, z, j) = t \cdot n(y_j, z_j, l_j) \cdot \mathrm{cache}_{Tetris}(d, P, y, z, j) \cdot \log\left( \mathrm{cache}_{Tetris}(d, P, y, z, j) \right)$$

6.4.3 Cost Functions for Interactive Response Times

When the Tetris algorithm has completed a slice, the slice is usually sorted internally and then is available in sort order. Thus first results are available for further processing after a time of $\mathrm{cache}_{Tetris}(d, P, y, z, j) \cdot (t_\pi + t_\tau + t_\xi)$. For a CSI (or IOT) on a restricted attribute and for an FTS it is necessary to wait until the entire merge sort is complete. This yields a tremendous performance advantage of the Tetris algorithm for pipelined processing and interactive response times.

6.4.4 Simulation Results

Using the same parameters as in Section 6.3 and additionally using a main memory cache of 3 MB and a merge degree of m for the merge sort algorithm, Figure 6-10 shows the cost [in s] for sorting the result set of a fact table defined by restrictions in multiple hierarchical dimensions.

Figure 6-10: Simulation of sorting a four-dimensional query box ((a) cost [time] in s plotted against the table size in pages; (b) cost [time] in s plotted against the selectivity of $A_1$ in %; curves for CSI $A_3$, FTS, CSI $A_1$, BII and UB-Tree)

Figure 6-10a shows the cost [in s] for sorting a four-dimensional query box with ($s_1$ = 33%, $s_2$ = 25%, $s_3$ = 17%, $s_4$ = 100%) according to $A_1$. Again the table size is varied from one page to one million pages. Since the selectivity of each restriction exceeds 10%, processing this query by a single secondary index usually degenerates to an FTS. The speed-up of the Tetris algorithm for UB-Trees grows superlinearly with increasing table size, since all other access methods require an external merge sort.

Varying the selectivity of the restriction in $A_1$ for a table size of $P$ = 878k pages in Figure 6-10b shows the superiority of the Tetris algorithm, since multidimensional clustering allows one to exploit multi-attribute restrictions to reduce the number of random accesses and at the same time avoids an expensive external sort operation. Sorting with Tetris takes place in main memory as long as this memory suffices to hold one slice of the query box.

Figure 6-11 (a, b) shows that the temporary storage for the merge sort algorithm used by FTS, BII, and CSI $A_3$ soon exceeds the main memory sorter cache of M 3 MB when processing the queries of Figure 6-10 (a, b). In contrast to that, sorting with Tetris never requires more than 4 MB of cache for one slice and thus sorting can take place in main memory.

Figure 6-11: Temporary storage required for the sorter cache (simulation) ((a) sorter cache size in MB plotted against the table size in pages; (b) sorter cache size in MB plotted against the selectivity of $A_1$ in %; curves for merge sort, Tetris and CSI $A_1$)

A CSI or SSI on $A_1$ does not require any sorter cache. The trade-off of these two access methods is the inability to use restrictions in multiple dimensions. Overall, the Tetris algorithm for UB-Trees outperforms any access method either with respect to response time or with respect to both response time and temporary storage requirements.

6.5 Summary: Cost Analysis

Using our cost functions we found that for sort operations with restrictions in some attributes UB-Trees and the Tetris algorithm are superior to one-dimensional access methods, unless a strongly preferred sort order on one attribute per relation exists or the restrictions are not selective enough to make up for the tenfold speed of the FTS. Another limitation of our technique is the number of dimensions, as investigated in Section 3.9: increasing dimensionality exponentially reduces the potential of the multidimensional space partitioning to create a total sort order in one dimension. Our theoretical and practical analysis shows that multidimensional indexes of up to 6 dimensions are handled very well with table sizes around GB. These dimensionalities are typical for data warehousing applications and in particular for the TPC-D benchmark (see Section 8.). With larger table sizes even further attributes could be added to the multidimensional index in order to speed up queries with restrictions or sorted processing in these attributes.


In science, as in life, learning and knowledge are distinct, and the study of things, and not of books, is the source of the latter. (Thomas H. Huxley)

Chapter 7
Performance Measurements

Large-scale experiments with our prototype implementation are reported in this chapter. We investigate the behavior of our prototype implementation of the UB-Tree for relations storing artificially generated, uniformly distributed data of up to 0 million tuples. We show the performance of insertion into UB-Trees, clustering B-Trees and multiple non-clustering B-Trees for various dimensionalities. After a brief summary of the main results of point query investigations we look more closely at range query performance. We first identify the main factors that influence the range query behavior of the UB-Tree. Then several types of measurements are defined. Performance figures for range queries measured with our prototype implementation on top of two relational DBMS are listed and interpreted. Due to legal considerations the DBMS have been made anonymous. Finally, performance figures are given for partial match queries that restrict only a subset of the dimensions of the UB-Tree. This chapter underpins our observations from Chapter 3 in practice and also demonstrates the accuracy of our cost functions from Chapter 6 for uniformly distributed data.

7.1 Insert Performance

Figure 7-1 shows measurements of the insert performance of the prototype implementation on DBMS on a 67 MHz ULTRA SPARC with an IBM hard disk with 8 ms positioning time for 3-dimensional (a), 6-dimensional (b), and -dimensional data (c). Each measurement series shows the time in ms for the insertion of one additional tuple for the database size shown on the horizontal axis. The performance of UB-Trees is compared to insertion into an index organized table (IOT) on the attributes and to insertion into multiple secondary indexes over 3, 6 resp. dimensions (MSI).

Figure 7-1: Insert performance (panels (a), (b) and (c), time in ms per additional tuple plotted against the number of tuples in the database, for MSI, IOT and UB-Tree)

Figure 7-2 shows how the insertion time for a UB-Tree insert is distributed over CPU for address calculation, I/O for page retrieval, CPU for page modification and I/O for page update. The time distribution in the bars is from top to bottom as listed in the legend.

Figure 7-2: Time distribution for UB-Tree insertion (time in ms per additional tuple over table size in tuples; stacked series IO:StorePage, CPU:UpdatePage, IO:GetPage, CPU:GetPage)

All in all, the performance of UB-Tree insertion is similar to insertion into an index organized table. This is not surprising, since compared to an index organized table the UB-Tree merely requires additional CPU operations for address calculation. This additional address calculation overhead is negligible. For a 6-dimensional integer tuple it uses less than % of the total insertion time. For a UB-Tree of height h, UB-Tree insertion is about $d(h-1)/h$ times faster than insertion into multiple B-Trees. The factor $(h-1)/h$ in the above formula is due to the fact that for a given database the UB-Tree is on average one level higher than each of the secondary indexes. Detailed measurements and investigations of insert performance on the prototype implementation of UB-Trees can be found in [Fr97] and [Bau97].

7.2 Exact Match Query Performance

A point can be found in $O(\log_k T)$ time, where T is the number of objects in the relation and $k = \tfrac{1}{2}C$, since UB-Trees are balanced and searched exactly like the variant of B-Tree used as the underlying data structure for the UB-Tree. Thus the point query performance of a UB-Tree is similar to that of an IOT. The additional address calculation overhead is negligible. Detailed measurements and investigations of exact match query performance on the prototype implementation of UB-Trees on DBMS1 can be found in [Fr97].
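The address calculation mentioned above can be illustrated with a minimal sketch. It assumes that a Z-address is obtained by plain bit interleaving of non-negative integer coordinates (the usual Z-order construction); the function name `z_address`, the fixed bit width per dimension and the choice of which dimension contributes the most significant bit are assumptions made for this example, not details taken from the prototype.

```python
def z_address(point, bits):
    """Interleave the bits of a d-dimensional point into one integer.

    point: tuple of non-negative integers, one per dimension
    bits:  number of bits used per dimension (assumed equal for all dimensions)
    """
    z = 0
    # Walk from the most significant bit to the least significant bit and
    # take one bit from every dimension in turn.
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            z = (z << 1) | ((coord >> bit) & 1)
    return z


if __name__ == "__main__":
    # Sorting points by Z-address yields the linear order in which they
    # would be stored in the underlying B-Tree.
    points = [(1, 1), (0, 1), (1, 0), (0, 0)]
    for p in sorted(points, key=lambda q: z_address(q, 1)):
        print(p, z_address(p, 1))
```

The cost of this computation is a fixed number of bit operations per tuple, which fits the observation above that the overhead is small compared to the I/O cost of an insertion or a point query.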

7.3 Storage Requirements

As expected, the storage requirements of UB-Trees are equivalent to those of concatenated B-Trees (IOTs). UB-Trees require more storage than a simple sequential file organization (FTS), since one additional B-Tree is maintained in order to store the Z-addresses of the multidimensional space partitioning. However, this yields a tremendous reduction in storage requirements over multiple secondary indexes (MSI), which require maintaining one B-Tree for each dimension.

                          | FTS    | IOT (A1..A6) | MSI (6 dim.) | UB-Tree
Table size in pages / MB  | 5 / 4, | 7759 / 55,   | / 53,3       | 66 / 43,
Height of B-Tree          |        |              |              |
Leaf page utilization     | 100%   | 89,677%      | 90,003%      | 99,30%

Table 7-1: Table sizes of a 1 million 8 Byte tuples relation on DBMS1 (kB pages)

Table 7-1 and Table 7-2 show the storage requirements for a one million respectively ten million tuple DBMS1 relation organized as FTS, IOT, 6 multiple secondary indexes and a 6-dimensional UB-Tree on kB pages. Due to the simpler page handling of our prototype implementation (e.g., no variable length attributes, only support for numbers and character data), storage requirements for UB-Trees are lower than those of native DBMS1 B-Trees.

                          | FTS       | IOT (A1..A6) | MSI (6 dim.) | UB-Tree
Table size in pages / MB  | 5608 / 5, | 9684 / 59,   | / 55,8       | 8 / 4,4
Height of B-Tree          |           |              |              |
Leaf page utilization     | 100%      | 89,677%      | 90,003%      | 99,30%

Table 7-2: Table sizes of a 10 million 8 Byte tuples relation on DBMS1 (kB pages)

For DBMS2, Table 7-3 shows the storage requirements for several 6-dimensional relations with different tuple sizes and table sizes. Due to some implementation overhead of B-Trees in DBMS2, the table size of a sequential file (FTS) for small relations is considerably lower than the table size of B-Trees as used for IOT and UB-Tree.

Tuples             | million, million, million, 4 million
Tuple size         | 48 Byte    | 48 Byte    | 8 Byte     | 8 Byte      | 8 Byte     | 8 Byte
FTS pages / MB     | 784 / 6,0  | 569 / ,    | 674 / 49,  | / 69,       | / 538,     | / 077,6
IOT pages / MB     | 395 / 08,7 | 780 / 7,3  | 700 / 93,  | / 474,3     | 964 / 947, | / 98,6
UB-Tree pages / MB | 369 / 06,0 | 794 / ,    | / 39,5     | 489 / 376,0 | 95 / 75,   | / 30,

Table 7-3: Table sizes of several tables on DBMS2 (8kB pages)

7.4 Range Query Performance

Answering a range query over a database which is organized as a UB-Tree requires time proportional to the number of Z-regions overlapping the query box. For non-uniformly distributed data this number is not just a function of the restriction in each dimension, but also depends on the data distribution. Thus the parameters that influence the behavior of the UB-Tree range query algorithm are:
- query box volume
- query box position
- query box width in each dimension (degree of query box generation [Fr97])
- table size
- Z-region partitioning (i.e., data distribution, split parameters, etc.)
- dimensionality

Many of these parameters were investigated in detail in the master theses of Nils Frielinghaus [Fr97] and Roland Pieringer [Pe98]. There the cost function of Section 6. was used to simulate idealized uniformly partitioned UB-Trees. In addition, measurements on a prototype implementation of the UB-Tree were conducted for various dimensionalities, database sizes, and data distributions. Here we just sketch the main results and present the most interesting measurement series.

For uniformly distributed data the range query performance decreases exponentially with the dimensionality of the UB-Tree (cf. also Section 3.9). The more dimensions are restricted, the better the range query performance gets, since each restriction is utilized by the UB-Tree (see also Section 7.4.3). However, restricting a dimension to less than $2^{-l}$ of its domain for a split level of $l$ in that dimension does not further reduce the number of pages, since the split level defines the limit of resolution of the partitioning grid. The position of the query box on the partitioning grid can increase or decrease the number of regions overlapped by the query box even for uniformly distributed data and therefore influences the range query performance [Fr97]. The finer the grid, the less this effect is observed; the grid is finer for larger databases or lower dimensionalities. Dimensions that are not restricted by a query harm the query performance: an entire slice of Z-regions with respect to the unrestricted dimension has to be retrieved.

Summing up, the dimensions stored in the UB-Tree should be used for restriction. Thus proper index modeling is still necessary with UB-Trees, since the curse of high dimensionality forbids adding all attributes as dimensions to an index.

Figure 7-3: Range queries in sparsely (a) and densely (b) populated parts of a universe

In Figure 7-3 (a and b) range queries against the same non-uniformly distributed UB-Tree are shown. In this figure the Z-regions that intersect the query box are shaded. The query box of Figure 7-3b has a result set of 67 points and overlaps 7 regions. Although the query box of Figure 7-3a has the same volume, it only covers a sparsely populated part of the universe and thus only 78 points in 3 regions are retrieved by the range query. Since a Z-region corresponds to a leaf page of the UB-Tree, Figure 7-3 shows that the number of disk accesses is proportional to the result set size of the range query.

Figure 7-4: Query box volumes (panels a and b)

Figure 7-4 (a and b) shows two query boxes of different volume located in different parts of a uniformly distributed UB-Tree. The larger the query box in volume, the more Z-regions are overlapped by the query box and retrieved by the range query algorithm. Since with a larger query box volume many Z-regions are entirely contained in the query box, one can also state that the number of retrieved Z-regions is proportional to the result set of the query box.

Figure 7-5: Range queries and scalability (panels a and b)

Figure 7-5a displays a range query against a UB-Tree storing 000 tuples on 5 Z-regions. The UB-Tree in Figure 7-5b stores tuples on about 500 Z-regions. Thus Figure 7-5 shows that the query box is approximated more closely by the Z-region partitioning as the database size increases.

7.4.1 Types of Measurements

Definition 7-1 (range query measurement for an access method): A range query measurement for an access method is an experiment that executes a range query on a single table with ranges in attributes A1,...,Ad using a certain access method for table access and measures certain parameters (e.g., response time or number of I/Os) of the query execution. Depending on the access method used for the table access we speak of
- FTS measurement for full table scans
- IOT Ai measurement for an index organized table on attribute Ai as first key in concatenation order
- UB-Tree measurement for UB-Trees
- SSI Ai measurement for a single secondary index on attribute Ai
- SII measurement for an intersection of secondary indexes

Definition 7-2 (measurement for a set of access methods): A measurement uses the same restrictions to perform a range query measurement for each access method of a set of access methods.

Definition 7-3 (measurement series): A measurement series is an ordered sequence of measurements for a set of access methods.

In the following sections we investigate two types of measurement series on a relation with d index attributes, namely c%-measurement series and cube measurement series.

Definition 7-4 (c%-measurement series): A c%-measurement series restricts d-1 attributes to c% of their domain, whereas one attribute (the variable attribute) is varied from 0% to 100%.

For independently uniformly distributed data the selectivity of a c%-measurement with a selectivity of x% in the variable attribute is $(c\%)^{d-1} \cdot x\%$.

Figure 7-6: Characteristics of a c%-measurement series (left: query box volume along the variable dimension and result set size in thousand tuples over the selectivity of the variable attribute; right: Z-regions intersecting the query box over the selectivity of the variable attribute)

The upper left part of Figure 7-6 shows the volume of the query box for three measurements of a two-dimensional c%-measurement series. Below that two-dimensional visualization the figure shows the linearly growing result set size for a 35%-measurement series against a six-dimensional database storing $10^7$ tuples. The right part of Figure 7-6 shows the number of Z-regions intersected by the query boxes of a 35%-measurement series against a six-dimensional UB-Tree storing $10^7$ uniformly distributed tuples on 8 Z-regions. In the figure one can see a typical characteristic of c%-measurements for UB-Trees: the number of Z-regions intersecting the query box is a staircase function for a c%-measurement series. This staircase behavior is due to the fact that for uniformly distributed data the Z-region partitioning is a discrete d-dimensional grid with clearly defined partitioning points (which are the points 50%, then 1/4 and 3/4, then $1/2^3$, $3/2^3$, $5/2^3$ and $7/2^3$, etc., depending on the database size, cf. Section 6.). A small enlargement of the query box in the variable dimension does not exceed a grid point and thus does not cause any further Z-region to be intersected. As soon as the query box exceeds a grid point, one additional slice of Z-regions is intersected. The height of the staircase (i.e., the number of additionally intersected Z-regions) in the new slice of the grid depends on c%, the restriction in the variable dimension and on the number of dimensions of the grid. A larger value for c% means that the query box covers a larger part of the multidimensional space and therefore a higher number of Z-regions will be intersected when exceeding a grid point. Similarly, with a higher dimensionality more Z-regions are intersected when a partitioning point is exceeded. With a linearly growing result set the staircase function approximates a linear function. Further analysis of the staircase phenomenon can be found in [Fr97].

Definition 7-5 (cube measurement series): A cube measurement series varies the restriction on all attributes from 0% to 100% at the same time.

For a cube measurement on independently uniformly distributed data the selectivity of the restriction in each dimension is identical. Thus a selectivity of x% in each dimension results in an overall selectivity of $(x\%)^d$ for the query. In analogy to Figure 7-6, Figure 7-7 shows the volumes for three measurements of a two-dimensional cube measurement series. In addition it shows the result set size and the number of intersected Z-regions for a cube measurement series against a 6-dimensional UB-Tree of $10^7$ tuples on 8 pages. The number of intersected Z-regions again shows a staircase behavior, which in this case approximates a polynomial function.

Figure 7-7: Characteristics of a cube measurement series (left: query box volume along all dimensions and result set size in million tuples over the selectivity of all attributes; right: Z-regions intersecting the query box over the selectivity of all attributes)

Thus with growing restriction in the variable attribute(s) the result set of a cube measurement series grows polynomially, whereas the result set of a c%-measurement series grows linearly.
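The staircase behavior described above can be reproduced with a small sketch. It is a simplified model, not the cost function of Chapter 6: it assumes an idealized regular grid in which every dimension has the same split level, and a query box anchored at the origin of the universe. The function names and the three-dimensional example are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import ceil, prod


def cells_hit(percent_per_dim, split_level):
    """Number of grid cells overlapped by a query box anchored at the origin.

    percent_per_dim: box width per dimension in percent of the domain (> 0)
    split_level:     each dimension is cut into 2**split_level equal slices
    """
    slices = 2 ** split_level
    # In each dimension the box overlaps ceil(width * slices) slices;
    # exact fractions avoid floating point errors at grid points.
    return prod(max(1, ceil(Fraction(p, 100) * slices)) for p in percent_per_dim)


def c_series(c, x, d, split_level):
    """c%-measurement: d-1 dimensions fixed at c%, one dimension at x%."""
    return cells_hit([x] + [c] * (d - 1), split_level)


def cube_series(x, d, split_level):
    """Cube measurement: all d dimensions restricted to x%."""
    return cells_hit([x] * d, split_level)


if __name__ == "__main__":
    for x in range(10, 101, 10):
        print(x, c_series(35, x, d=3, split_level=2), cube_series(x, d=3, split_level=2))
```

In this toy setting (three dimensions, four slices per dimension, c = 35%) the count of the c%-series stays at 4 while the variable restriction is at most 25% and jumps to 8 as soon as the box crosses the 25% grid point, which is the step pattern discussed above. The cube series grows with the third power of the number of slices crossed.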
This means that cube measurement series are a good way to theoretically analyze the performance degeneration of a multidimensional index from very small result sets (usually just one or very few tuples) to very large result sets of up to 100% (the entire relation). c%-measurement series indicate whether an index degenerates if the restriction is varied only in one attribute and therefore allow judging the symmetry of a multidimensional index over the dimensions.

7.4.2 Comparative Performance Measurements

With the prototype implementation of the UB-Tree, performance measurements were conducted on several commercial DBMS. Here we just list the results of DBMS1 and DBMS2. The results on the other DBMS are qualitatively equivalent to either DBMS1 or DBMS2 and can be found in [Ova99] and [Pfa99]. To get comparable results, precautions were taken in order to eliminate caching effects. Descriptions of these precautions can be found in [Fr97] for DBMS1 and [Pe98] for DBMS2. Because of different resources and configurations of the database servers for DBMS1 and DBMS2, the measurements on both DBMS are not comparable quantitatively. This holds especially because of the different table sizes. However, a qualitative comparison is possible. Qualitatively, the performance gain of the UB-Tree compared to an IOT is identical for DBMS1 and DBMS2. The only qualitative difference is the performance of an FTS: DBMS1 implements relations as IOTs; an FTS retrieves the data in primary key order by tuple clustered access. Thus an FTS in DBMS1 is identical to reading 100% of a relation via an IOT. In contrast, DBMS2 utilizes page clustering for FTSs. Therefore an FTS in DBMS2 is more efficient than in DBMS1.

Figure 7-8: DBMS1 (time in s over restriction in %; panel a: SII, FTS, IOT A1, IOT A2 - A6, UB-Tree; panel b: SII, FTS, IOT A, UB-Tree)

The measurements of Figure 7-8 were conducted on DBMS1 on a SUN ULTRA SPARC II 67 MHz with an IBM 8ms hard disk. The test table stores 10 million independently uniformly distributed 6-dimensional tuples on 8 pages.
Figure 7-8a shows a 35% measurement sequence, where A2 to A6 are the constant dimensions, whereas the restriction in A1 is varied from 0% to 100%. As mentioned before, an FTS in DBMS1 is identical to an IOT on A1 without any restriction in the index attribute. An IOT on A2, A3, A4, A5, or A6 has to retrieve 35% of the database. This results in a response time of 35% of that of an FTS. Actually the FTS response time is not constant, but slightly increases with a growing selectivity in A1. The linearly growing result set causes linearly growing CPU time for result set processing and interprocess communication time for result set transfer. While this time is clearly visible for FTS and IOT on A2, A3, A4, A5, or A6, it is also included in the response times of the other access methods.

Intersection of secondary indexes in DBMS1 requires the retrieval of 35% of the five secondary indexes on A2 to A6 and a certain percentage of the secondary index on A1. After that an expensive intersection operation and a random access for each tuple of the result set are performed. Therefore SII is worse than tuple clustered access of the IOTs already for queries with a selectivity of less than %. The UB-Tree requires less time than any IOT, since it allows a tuple clustered access to the result set defined by the restrictions in all attributes, whereas an IOT on A1 only utilizes the restriction on A1 and therefore grows linearly on a much larger scale than the UB-Tree. The performance figures for the UB-Tree hardly depend on the variable attribute: if A1 were left constant and any other attribute were varied, the response time of the UB-Tree would be similar to that reported in the figure (actually it depends on the number of recursive splits in that attribute; see [Fr97] for measurement charts or Chapters 3 and 6 for an analytical explanation).

Selectivity in A1 | UB-Tree | IOT A1 | IOT A2  | SII     | FTS
0%                | 3,9 s   | 4, s   | 94, s   | 890,3 s | 458,9 s
40%               | 6,6 s   | 8, s   | 00,4 s  | 0,9 s   | 477,3 s

Table 7-4: Response times for 35%-measurements on DBMS1 with A1 as variable attribute

Figure 7-8b shows a cube measurement sequence where the selectivity of each attribute is varied from 0% to 100% at the same time. For a selectivity of less than 8% in each dimension (an overall selectivity of $(8\%)^6$ for the query, i.e., a result set of about 3 tuples!)
SII is preferable to an FTS, since the part of each B-Tree that needs to be retrieved as well as the result set are sufficiently small. Since the result set grows polynomially with the 6th power, the response time of the FTS grows with the 6th power due to CPU time for result set processing. This additional CPU time is also included in the response time of the other indexes. For a selectivity of 100% FTS, IOT, and UB-Tree take the same time to respond to the query, since the data is retrieved by tuple clustered access by all of these access methods. Up to a selectivity of 75% in each dimension (an overall selectivity of $(75\%)^6 \approx 17.8\%$, i.e., a result set of about 1.8 million tuples) the UB-Tree has a significant performance advantage, since the restrictions in all attributes are used to limit the number of page accesses to answer the query.

Selectivity in each dimension | UB-Tree | IOT   | SII     | FTS
0%                            | 0,9 s   | 0,7 s | 35, s   | 449,8 s
40%                           | 7,5 s   | 8, s  | 753,3 s | 475,9 s

Table 7-5: Response times for cube measurements on DBMS1
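The selectivity figures quoted for the two kinds of measurement series can be checked with a few lines of arithmetic. The sketch below assumes independently uniformly distributed data, six dimensions, and a test table of 10^7 tuples (the table size is read from the garbled source and is an assumption here); the function names are hypothetical.

```python
def c_series_selectivity(c, x, d):
    """Overall selectivity of a c%-measurement: d-1 attributes at c, one at x."""
    return c ** (d - 1) * x


def cube_selectivity(x, d):
    """Overall selectivity of a cube measurement: all d attributes at x."""
    return x ** d


if __name__ == "__main__":
    tuples = 10_000_000  # assumed size of the test table
    d = 6
    for x in (0.08, 0.40, 0.75, 1.00):
        cube = cube_selectivity(x, d)
        series = c_series_selectivity(0.35, x, d)
        print(f"x={x:.2f}  cube: {cube:.3e} ({cube * tuples:,.0f} tuples)  "
              f"35%-series: {series:.3e} ({series * tuples:,.0f} tuples)")
```

Under these assumptions a restriction of 8% in each of six dimensions leaves only a handful of tuples, while 75% in each dimension still selects less than a fifth of the relation, which is why the cube series spans result sets from almost empty to the entire table.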

Summing up, our measurements indicate that the range query performance of UB-Trees on DBMS1 is more symmetrical than that of an IOT. It also shows a better absolute performance than SII when a sufficient number of attributes is specified. In our 6-dimensional test database this is already true for 2 or 3 dimensions. UB-Tree range queries are on average several orders of magnitude faster than IOTs and SII. We measured an increase in speed of several thousands compared to SSI and, depending on the restriction, between two and one hundred compared to a compound index. Performing an index scan over the whole relation with a UB-Tree results in a performance similar to a scan over a clustered primary compound B-Tree (see also [Fr97]).

Figure 7-9: DBMS2 (time in s over restriction in %; panel a: SSI, IOT A1, IOT A2-A6, FTS, UB; panel b: SSI, IOT, FTS, UB)

Figure 7-9 shows 35%-measurements and cube measurements on DBMS2 performed on a 4 CPU Pentium Pro 200 MHz with Windows NT 4.0. The test table stores 50 thousand independently uniformly distributed 6-dimensional tuples on 794 pages. Besides the more efficient FTS and the use of an SSI instead of SII, these measurement series are qualitatively identical to those of DBMS1 on Solaris 2.5.1 as described above. The only difference to DBMS1 is that DBMS2 does not perform an intersection of secondary indexes for this query. Using bitmap indexes resulted in a performance worse than any of the IOTs. Instead, we measured the use of a single secondary index (SSI) on the most selective attribute. Further measurements on DBMS2 (varying tuple size, varying table size, comparison between NT and Solaris) showed the behavior predicted by our cost model (see [Pe98]).

Selectivity in A1 | UB-Tree | IOT A1 | IOT A2 | SSI    | FTS
0%                | ,4 s    | 55,0 s | 98,3 s | 45,7 s | ,4 s
40%               | ,4 s    | 08,6 s | 98,5 s | 75, s  | ,3 s

Table 7-6: Response times for 35%-measurements on DBMS2 with A1 as variable attribute

Selectivity in each dimension | UB-Tree | IOT    | SSI    | FTS
0%                            | 0, s    | 55,8 s | 48,4 s | 0,9 s
40%                           | 4,5 s   | , s    | 85, s  | ,4 s

Table 7-7: Response times for cube measurements on DBMS2

7.4.3 Partial Range Queries

The following measurements illustrate the performance of the UB-Tree for partial range queries. The measurements were also performed with DBMS2 on a 4 CPU Pentium Pro 200 MHz with Windows NT 4.0. The test table stores 50 thousand independently uniformly distributed 6-dimensional tuples on 794 pages. While Section 7.4.2 restricted each of the attributes of the UB-Tree to an interval smaller than the domain of an attribute, some dimensions are not restricted in these measurements. The measurement series here resembles a 35%-measurement series with the only difference that fewer than d-1 dimensions are restricted to 35%. The number a in the legend entry UB a+1 of Figure 7-10 shows how many dimensions are restricted to 35%. The other d-a-1 dimensions are not restricted. The graph shows a variation of a from 1 to 5 for the 6-dimensional UB-Tree (note that UB 5+1 is the actual 35%-measurement series of the previous section). IOT var in the legend of the figure means that an IOT on the variable attribute of the 35%-measurement is used for answering the query, whereas IOT 35% resp. IOT 100% use an IOT on an attribute with a selectivity of 35% resp. 100% for processing the query.

The measurement shows that as soon as two dimensions are restricted the UB-Tree is superior to IOTs, unless the restriction in the first attribute of the IOT is sufficiently small. The performance increase of the UB-Tree with a growing number of restricted dimensions shows that the UB-Tree indeed utilizes the restriction in all dimensions in order to reduce I/O. As soon as at least 4 out of 6 dimensions are restricted, the UB-Tree is also superior to an FTS even though the FTS of DBMS2 uses page clustering. The UB-Tree is also superior to an FTS if at least 3 out of 6 attributes are restricted with a selectivity of less than 50% or out of six attributes are restricted to a selectivity of less than 5%.

Figure 7-11 resembles a cube measurement series, where the UB a legend entry means that only a out of d attributes are restricted. UB 6 of that figure is equivalent to the cube measurement series of the previous section. IOT var here means that an IOT on a variable dimension is used for query processing, whereas IOT const uses an IOT on a non-restricted attribute. Again the figure shows that further restricted attributes are used by the UB-Tree to reduce the response time for query processing, while FTS and IOTs cannot take any advantage of additional restrictions. An IOT on a variable attribute is superior to the UB-Tree if only this attribute is restricted by the query (i.e., the query defines a hyperplane, only one dimension is restricted). Of course a specialized one-dimensional index like an IOT is in this case better than a multidimensional UB-Tree. However, as soon as at least two out of six attributes are restricted, the UB-Tree is superior to an IOT. As soon as at least 3 out of 6 attributes are restricted to a selectivity of less than 50%, the UB-Tree is also superior to an FTS even though the FTS of DBMS2 uses page clustering.

Figure 7-10: Varying the number of restricted dimensions for 35% restrictions (DBMS2) (time in s over restriction in %; series IOT 100%, IOT 35%, IOT var, FTS, UB 0+1 to UB 5+1)

Figure 7-11: Varying the number of restricted dimensions for cube restrictions (DBMS2) (time in s over restriction in %; series IOT const, IOT var, FTS, UB 1 to UB 6)

Change is not made without inconvenience, even from worse to better. (Richard Hooker)

Chapter 8 Impacts on Relational Query Processing

UB-Trees and the Tetris algorithm can be used to accelerate almost any query processing operation: relational queries or SQL queries consist of restrictions, projections, ordering, grouping and aggregation, and join operations. In the presence of multidimensional restrictions or sorting these operations are efficiently implemented by using either the range query algorithm or the Tetris algorithm. In this chapter we investigate the impacts of our approach on query processing in RDBMS. We present performance measurements for two application scenarios: we selected three queries of the TPC-D benchmark to show the potential of the UB-Tree range query algorithm and the Tetris algorithm. We then show the benefits of Multidimensional Hierarchical Clustering by performance measurements of queries in a star schema typical for data warehousing applications. The performance results reported in this chapter were measured for one of our project partners with our prototype implementation of UB-Trees on top of DBMS. We compare the performance of UB-Trees to native query processing techniques of DBMS, namely access via an index organized table (IOT), which essentially stores a relation in a clustered B*-Tree, and access via a full table scan (FTS) of an entire relation. In addition we measured the performance of a single secondary B*-Tree index (SSI) and of an intersection of multiple bitmap indexes (BII) to answer multidimensional range queries.

8.1 Relational Operations with UB-Trees and the Tetris Algorithm

In accordance with the definition of O(n) (e.g., [AHU74]) we define the terms CPU-complexity, asymptotic CPU-complexity CPU(n), I/O-complexity, asymptotic I/O-complexity IO(n), space complexity and asymptotic space complexity SPACE(n). These terms will be used to investigate the complexity of the operations of the relational algebra using UB-Trees with the range query algorithm and the Tetris algorithm.

Definition 8-1 (CPU-complexity, asymptotic CPU-complexity CPU(n)): The CPU-complexity of an algorithm is the CPU time needed by the algorithm expressed as a function of the size of the input data. An algorithm with a CPU-complexity defined by a function g(n) has an asymptotic CPU-complexity CPU(f(n)), if there exists a constant c such that $g(n) \le c \cdot f(n)$ for all but some finite (possibly empty) set of non-negative values for n.

Definition 8-2 (I/O-complexity, asymptotic I/O-complexity IO(n)): The I/O-complexity of an algorithm is the I/O time (usually measured by the number of random disk accesses) needed by the algorithm expressed as a function of the size of the input data. An algorithm with an I/O-complexity defined by a function g(n) has an asymptotic I/O-complexity IO(f(n)), if there exists a constant c such that $g(n) \le c \cdot f(n)$ for all but some finite (possibly empty) set of non-negative values for n.

Definition 8-3 (SPACE-complexity, asymptotic SPACE-complexity SPACE(n)): The SPACE-complexity of an algorithm is the temporary storage space needed by the algorithm expressed as a function of the size of the input data. An algorithm with a SPACE-complexity defined by a function g(n) has an asymptotic SPACE-complexity SPACE(f(n)), if there exists a constant c such that $g(n) \le c \cdot f(n)$ for all but some finite (possibly empty) set of non-negative values for n.

8.1.1 Asymptotic complexity of the UB-Tree Range Query Algorithm

In the following we assume a multidimensional query box with the selectivities $s_1, \dots, s_d$ over a relation of T tuples stored on P pages. We also assume Z-addresses to have a length of a bits. The range query algorithm has a CPU-complexity which depends on the number of Z-regions intersecting the query box. According to our cost functions of Section 6. this number is related to the selectivity of the restrictions of the query box. In addition the CPU-complexity of the range query algorithm depends linearly on the length of the Z-addresses. The Z-address length is identical to the length of the index attributes of a tuple (see Section 5.3). Thus the range query algorithm also depends linearly on the tuple size (see Section 5.7). The I/O-complexity solely depends on the number of Z-regions intersecting the query box, and only one Z-region needs to be stored to perform the range query algorithm. Thus the storage space complexity of the range query algorithm is constant with respect to the problem size.

Using the results of Section 6. and 5.7, the range query algorithm for a query box has an asymptotic CPU-complexity of

$CPU\left(a \cdot P \cdot \prod_{j=1}^{d} s_j\right)$ bit operations,

an asymptotic I/O-complexity of

$IO\left(P \cdot \prod_{j=1}^{d} s_j\right)$ random disk accesses,

and an asymptotic space complexity of $SPACE(1)$.

8.1.2 Asymptotic complexity of the Tetris Algorithm

The Tetris algorithm has to retrieve the same number of Z-regions as the range query algorithm. The number of CPU operations for determining the next Z-region in Tetris order is also identical to that of the range query algorithm. Thus from a complexity point of view the only difference between the Tetris algorithm and the range query algorithm is the storage complexity, which in this case is determined by the Tetris cache. We again assume a multidimensional query box with the selectivities $s_1, \dots, s_d$ over a relation of T tuples stored on P pages. We also assume a UB-Tree with a Z-address length of a bits. Then the Tetris algorithm to sort the query box with respect to attribute $A_k$ has an asymptotic CPU-complexity of

$CPU\left(a \cdot P \cdot \prod_{j=1}^{d} s_j\right)$,

an asymptotic I/O-complexity of

$IO\left(P \cdot \prod_{j=1}^{d} s_j\right)$ random disk accesses,

and an asymptotic space complexity of

$SPACE\left(\frac{P \cdot \prod_{j=1}^{d} s_j}{l_k}\right)$ disk pages.

In the above formula $l_k = n_k(d, P, y_k, z_k)$ is the number of slices of the query box in dimension k with respect to the multidimensional space partitioning (see Section 6.).
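As a worked illustration of these bounds, the following sketch evaluates the leading terms for a concrete query box. It reads the bounds as a·P·∏s_j bit operations, P·∏s_j page accesses, and a Tetris cache of P·∏s_j / l_k pages; constant factors are ignored, so the returned numbers are orders of magnitude rather than predictions. The function names and the example values (including the slice count l_k) are hypothetical.

```python
from math import prod


def range_query_cost(pages, selectivities, address_bits):
    """Leading terms of the range query bounds (constants omitted)."""
    regions = pages * prod(selectivities)      # Z-regions hit, about P * prod(s_j)
    return {
        "io_pages": regions,                   # one random access per Z-region
        "cpu_bit_ops": address_bits * regions, # linear in the Z-address length a
        "space_pages": 1,                      # only one region is held at a time
    }


def tetris_cost(pages, selectivities, address_bits, slices_k):
    """Same CPU and I/O terms as the range query; the cache holds one slice."""
    cost = range_query_cost(pages, selectivities, address_bits)
    cost["space_pages"] = cost["io_pages"] / slices_k
    return cost


if __name__ == "__main__":
    # Hypothetical example: 1000 pages, three restricted dimensions,
    # 64-bit Z-addresses, 5 slices in the sort dimension.
    print(range_query_cost(1000, [0.5, 0.5, 0.2], 64))
    print(tetris_cost(1000, [0.5, 0.5, 0.2], 64, slices_k=5))
```

The example makes the difference between the two algorithms visible: both touch the same number of pages, but sorting with Tetris needs temporary space for one slice of the query box instead of a single page.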

8.1.3 Asymptotic complexity of Operations of Relational Query Processing

The UB-Tree range query algorithm may be used to efficiently process multi-attribute restrictions. The Tetris algorithm may be used to implement most operations of the relational algebra or SQL with linear I/O-complexity, if the Tetris cache size suffices to hold one processing slice of the Tetris algorithm [Bay97b]. This assumption is realistic, since our measurements showed that sorting 50 % of a .3 GB table only requires a cache size of .6 MB (cf. Section 8..). The actual cache size depends on the multidimensional partitioning of the table and thus on the data distribution. While it is not possible to give an upper bound for the cache size (besides the relation size), experiments with various database sizes showed that usually a reasonable partitioning exists which keeps the cache size very low.

We denote the selection operation by σ, the join operation by ⋈, grouping by γ, ordering by ω, set union by ∪, intersection by ∩ and difference by \. These operations are implemented by operator trees processing several operators on tuple streams. Next to the access primitives range and tetris (which implement the UB-Tree range query algorithm and the Tetris algorithm) we use the operator merge to merge two sorted streams of tuples based on identity in an attribute set, the operator remove-duplicate, which eliminates duplicates of a sorted stream of tuples, and aggregate, which performs grouping and the required aggregations. In addition we use the tuple stream operators union, intersection and difference that perform set union, intersection and difference on an ordered stream of tuples.

The range query algorithm and the Tetris algorithm can be used to implement the following operations of SQL or the relational algebra, since an efficient implementation of each of these operations may utilize a sorted stream of tuples:
- projecting R to attr with duplicate elimination: π_attr(R)
- sorting R with respect to attr: ω_attr(R)
- grouping R with respect to attr and aggregating attributes with aggregation functions as specified in agg: γ_attr,agg(R)
- equi-joining R with S with respect to attr: R ⋈_attr S
- set union of R and S: R ∪ S
- set intersection of R and S: R ∩ S
- set difference of R and S: R \ S

If multi-attribute restrictions on the relation(s) are used in combination with any of the above operations, these restrictions are utilized on the fly and thus reduce both the Tetris cache size and the number of page accesses necessary to perform the operation.

σ_con(R) → range_con(R) (selection)
π_attr(σ_con(R)) → remove-duplicate(tetris_con(R, attr)) (projection)
ω_attr(σ_con(R)) → tetris_con(R, attr) (ordering)
γ_attr,agg(σ_con(R)) → aggregate(tetris_con(R, attr), agg) (grouping and aggregation)
σ_con1(R) ⋈_attr σ_con2(S) → merge(tetris_con1(R, attr), tetris_con2(S, attr)) (join)
σ_con1(R) ∪ σ_con2(S) → union(range_con1(R), range_con2(S)) (set union)
σ_con1(R) ∩ σ_con2(S) → intersect(range_con1(R), range_con2(S)) (set intersection)
σ_con1(R) \ σ_con2(S) → difference(range_con1(R), range_con2(S)) (set difference)

Figure 8-1: Transformation rules for the implementation of algebraic operations

Figure 8-1 gives transformation rules for projection, ordering, grouping and aggregation and join in combination with multi-attribute restrictions, so that the Tetris operator can be used for efficient implementation of these operations. In the figure R, S denote relations, con, con1, con2 denote conditions defining multidimensional intervals, attr denotes an attribute of R respectively S and agg denotes a specification of attributes and an aggregation function for the attributes. The rules given in this figure may be used as transformation rules for algebraic query optimization.

Applying the transformation rules of Figure 8-1, Table 8-1 lists the asymptotic CPU-, I/O- and space-complexity of the basic operations of relational query processing and contrasts them with their complexity with a B-Tree implementation. In the table we write $V = P \cdot \prod_{j=1}^{d} s_j$ to denote the size of the result set of condition con (analogously $V_R$, $V_S$ for con R and con S). We assume the B-Tree to be built on the attribute and denote the selectivity of that attribute by s. If the index attribute is also the sort attribute attr, then s 0 and V P. We denote the length of a tuple in bits by a.
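The stream operators used in these rules can be sketched on top of ordinary Python iterators. The sketch assumes that the input streams are already sorted on the relevant attribute, which is what a tetris access would deliver; the function names mirror the operators above, but the signatures, the tuple layout and the example relations are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicate(sorted_stream):
    """Eliminate duplicates from a sorted stream in one pass."""
    for value, _ in groupby(sorted_stream):
        yield value


def aggregate(sorted_stream, key, agg):
    """Group a stream sorted on key and apply agg to each group of rows."""
    for k, rows in groupby(sorted_stream, key=key):
        yield k, agg(list(rows))


def merge(left, right, key):
    """Equi-join two streams that are both sorted on key (merge join)."""
    done = (None, None)
    left_groups = groupby(left, key=key)
    right_groups = groupby(right, key=key)
    lk, lrows = next(left_groups, done)
    rk, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lk < rk:
            lk, lrows = next(left_groups, done)
        elif lk > rk:
            rk, rrows = next(right_groups, done)
        else:
            # Matching key: combine every left row with every right row.
            buffered = list(rrows)
            for l in lrows:
                for r in buffered:
                    yield l + r
            lk, lrows = next(left_groups, done)
            rk, rrows = next(right_groups, done)


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]   # sorted on column 0
    S = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]   # sorted on column 0
    print(list(merge(R, S, key=itemgetter(0))))
    print(list(remove_duplicate([1, 1, 2, 3, 3])))
    print(list(aggregate([(1, 10), (1, 5), (2, 7)], itemgetter(0),
                         lambda rows: sum(r[1] for r in rows))))
```

Each operator consumes its input once and buffers only the rows of the current key, which reflects why a sorted stream removes the need for a separate sort phase in projection, grouping and join.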

Operation | UB-Tree/Tetris CPU | UB-Tree/Tetris I/O | UB-Tree/Tetris SPACE | B-Tree/Mergesort CPU | B-Tree/Mergesort I/O | B-Tree/Mergesort SPACE
σ_con(R) | O(V a) | O(V) | O(1) | O((Ps + V log V)a) | O(Ps + V log V) | O(V)
π_attr(σ_con(R)) | O(aV log V) | O(V) | O(V / l_attr) | O((Ps + V log V)a) | O(Ps + V log V) | O(V)
ω_attr(σ_con(R)) | O(aV log V) | O(V) | O(V / l_attr) | O((Ps + V log V)a) | O(Ps + V log V) | O(V)
γ_attr,agg(σ_con(R)) | O(aV log V) | O(V) | O(V / l_attr) | O((Ps + V log V)a) | O(Ps + V log V) | O(V)
σ_con1(R) ⋈_attr σ_con2(S) | O(aV_R log V_R) + O(aV_S log V_S) | O(V_R + V_S) | O(V_R / l_attr + V_S / k_attr) | O((P_R s + V_R log V_R)a) + O((P_S s + V_S log V_S)a) | O(P_R s + V_R log V_R) + O(P_S s + V_S log V_S) | O(V)
σ_con1(R) ∪ σ_con2(S) | O(aV_R) + O(aV_S) | O(V_R + V_S) | O(1) | O((P_R s + V_R log V_R)a) + O((P_S s + V_S log V_S)a) | O(P_R s + V_R log V_R) + O(P_S s + V_S log V_S) | O(V)
σ_con1(R) ∩ σ_con2(S) | O(aV_R) + O(aV_S) | O(V_R + V_S) | O(1) | O((P_R s + V_R log V_R)a) + O((P_S s + V_S log V_S)a) | O(P_R s + V_R log V_R) + O(P_S s + V_S log V_S) | O(V)
σ_con1(R) \ σ_con2(S) | O(aV_R) + O(aV_S) | O(V_R + V_S) | O(1) | O((P_R s + V_R log V_R)a) + O((P_S s + V_S log V_S)a) | O(P_R s + V_R log V_R) + O(P_S s + V_S log V_S) | O(V)

Table 8-1: Complexities of relational operators

Thus UB-Trees and the Tetris algorithm have the potential to speed up any operation involving multi-attribute restrictions and sort operations. In Section 8.2 we will show performance measurements and comparisons for queries using some of these operations. In Section 8.3 we will investigate a special variant of a multiway join operation, the so-called star-join, and reduce it to multi-attribute restrictions with the technique of multidimensional hierarchical clustering as described in Section

8.2 Complex Queries on Generated Data: The TPC-D Benchmark

We analyzed the entire TPC-D benchmark [TPC97] for the usability of multidimensional access methods. From our analysis we expect that 12 out of 17 queries will benefit from multidimensional indexing techniques and the Tetris algorithm. The five queries that will not benefit from multidimensional indexes either restrict only a single attribute without complex join or sort operations, or retrieve a result set whose size makes an FTS preferable to any access method. To show the performance gain of UB-Trees and the Tetris algorithm we selected the TPC-D queries Q3, Q4 and Q6. We used a SUN ULTRA SPARC II with 5 MB main memory and an array of five 4 GB hard disks with an average positioning time of 8ms and a transfer rate of 0.7ms per page to generate the ORDER, LINEITEM, and CUSTOMER tables (cf. Figure -) for several scaling factors of the TPC-D benchmark. Actually the Tetris performance is even better than reported in this section: The measurements were conducted with UB-Trees emulated on top of DBMS and are compared against IOTs and FTS integrated into the DBMS kernel.

8.2.1 Joins and Restrictions

Query Q3 (cf. Figure 8-2) of the TPC-D benchmark is a shipping priority query, which retrieves the shipping priority and potential revenue of the orders having the largest revenue among those that had not been shipped as of a given date. This query consists of restrictions and join operations involving three relations and is efficiently processed by the Tetris algorithm.

    SELECT L_ORDERKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
           O_ORDERDATE, O_SHIPPRIORITY
    FROM CUSTOMER, ORDER, LINEITEM
    WHERE C_MKTSEGMENT = 'FOOD'
      AND C_CUSTKEY = O_CUSTKEY
      AND L_ORDERKEY = O_ORDERKEY
      AND O_ORDERDATE < DATE .5.98
      AND L_SHIPDATE > DATE .6.98
    GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
    ORDER BY REVENUE DESC, O_ORDERDATE

Figure 8-2: Query Q3 of the TPC-D benchmark

The operator tree for Q3 which is generated by a standard RDBMS like DBMS is illustrated in Figure 8-3a. The query is processed by first applying the restrictions on each table and then performing a hash join or a sort-merge join on the intermediate result. The join order of Figure 8-3a is due to the fact that the LINEITEM relation is four times larger than the ORDER relation and 40 times larger than the CUSTOMER relation. The intermediate result of the second join is used for grouping with aggregation and final ordering.

[Figure 8-3: Operator trees for Q3. (a) Standard plan: the selections σ_MKTSEGMENT on CUSTOMER, σ_ORDERDATE on ORDER and σ_SHIPDATE on LINEITEM are each followed by a sort ω on the join attribute; the sorted inputs are merged (µ) into the joins on CUSTKEY and ORDERKEY, followed by γ_ORDERKEY,ORDERDATE,SHIPPRIORITY and ω_REVENUE,ORDERDATE. (b) Tetris plan: the Tetris operators τ_ωCUSTKEY,σMKTSEGMENT on CUSTOMER, τ_ωCUSTKEY,σORDERDATE on ORDER and τ_ωORDERKEY,σSHIPDATE on LINEITEM feed the merge operators µ_CUSTKEY and µ_ORDERKEY, followed by the same grouping and ordering.]

With UB-Trees on CUSTOMER(CUSTKEY, MKTSEGMENT), ORDER(ORDERKEY, CUSTKEY, ORDERDATE), and LINEITEM(SHIPDATE, ORDERKEY), Figure 8-3b and Figure 8-4 illustrate the Tetris operator τ_σ,ω, which combines selection and sorting. Reading the restricted part of each relation in sort order of the join attribute produces a sorted stream of tuples. This stream is transferred to the merge operator µ and processed further.

[Figure 8-4: Processing Q3 with the Tetris algorithm. CUSTOMER is shown over C_CUSTKEY and C_MKTSEGMENT, ORDER over O_CUSTKEY, O_ORDERKEY and O_ORDERDATE, and LINEITEM over L_ORDERKEY and L_SHIPDATE, each with its sort direction.]
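The role of the merge operator µ can be illustrated with a small sketch. The following Python code is not the implementation used for the measurements; it is a minimal merge join over two streams that are already sorted on the join key, where ordinary sorted lists stand in for the output of the Tetris operator. The name `merge_join`, the key-extractor parameters and the tuple layout are assumptions made only for this illustration.

```python
from itertools import groupby

_END = object()  # sentinel marking an exhausted input stream


def merge_join(left, right, left_key, right_key):
    """Join two iterables that are sorted ascending on their join keys.

    Yields (left_row, right_row) pairs lazily, so the first results are
    available before either input has been read completely.
    """
    # Group consecutive rows with equal keys; each group is materialized
    # before the underlying iterator advances.
    lgroups = ((k, list(g)) for k, g in groupby(left, key=left_key))
    rgroups = ((k, list(g)) for k, g in groupby(right, key=right_key))
    lcur = next(lgroups, _END)
    rcur = next(rgroups, _END)
    while lcur is not _END and rcur is not _END:
        if lcur[0] < rcur[0]:
            lcur = next(lgroups, _END)
        elif lcur[0] > rcur[0]:
            rcur = next(rgroups, _END)
        else:
            # Equal keys: emit the cross product of the two groups.
            for lrow in lcur[1]:
                for rrow in rcur[1]:
                    yield lrow, rrow
            lcur = next(lgroups, _END)
            rcur = next(rgroups, _END)
```

Because the function is a generator, matching pairs are handed to the consumer as soon as the two streams meet on a key. Only the rows of the current key group are buffered, which is the property that makes sorted input streams attractive for pipelined processing.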

We measured the sorted table accesses of query Q3 for different TPC-D scaling factors (SF) from 0. to (SF means a size of GB for LINEITEM). We do not want to enter the debate whether sort-merge joins or hash joins perform better [Mer8, DKO+84]. We chose a large main memory for our test environment, since according to [CHH+9] sort-merge join and hash join have a similar performance for computer systems with large main memories. Consequently we used a sort-merge join because this method is easier to handle in our test environment.

Since the LINEITEM table is the major bottleneck for Q3, we focus on this relation for our performance comparison. We created four instances of LINEITEM: an IOT on SHIPDATE, an IOT on ORDERKEY and a relation with secondary indexes on each restricted or sorted attribute. The optimizer favored an FTS over secondary indexes, which our theoretical considerations and measurements proved to be the right decision (forcing DBMS to process Q3 with a secondary index on SHIPDATE or ORDERKEY took more than 6 hours for SF ). We therefore exclude secondary indexes from further consideration.

Figure 8-5 and Table 8-2 show that the Tetris algorithm for UB-Trees is the preferable method for answering this query. The 50% restriction on SHIPDATE is not selective enough for an IOT on SHIPDATE to be competitive. The presorted IOT on ORDERKEY does not require a merge sort and therefore shows response times similar to an FTS with merge sort. Using Tetris for sorting LINEITEM is more than three times faster than FTS or any IOT. The first response of Tetris is already produced after a few seconds, two to three orders of magnitude faster than with FTS or any IOT. While the intermediate storage requirements of Tetris are not exactly zero as for an IOT on ORDERKEY, they are extremely low: Compared to an FTS or an IOT on SHIPDATE they are several orders of magnitude lower.

[Figure 8-5: Response times and temporary storage for sorting 50 % of LINEITEM for Q3. Left panel: time in s over the TPC-D scaling factor for IOT SHIPDATE, IOT ORDERKEY, FTS and Tetris. Right panel: temporary storage in MB over the TPC-D scaling factor for IOT SHIPDATE and FTS, Tetris, and IOT ORDERKEY.]

Since for FTS and IOT on SHIPDATE the storage requirements grow linearly with the table size, the main memory is soon exceeded. To conduct our measurements we had to enlarge the temporary DBMS tablespaces several times. In contrast, the Tetris cache grows only sublinearly with the table size (cf. Section 6.4) and fits into the main memory of current computer systems even for table sizes of several terabytes.

Table Size | 33 MB | 8 MB | 3 MB | 63 MB | 36 MB | 65MB | 30MB
Scaling Factor (SF) | (0.05) | (0.065) | (0.) | (0.5) | (0.5) | (0.5) | ()
Tetris 1st response | 0.3s | 0.5s | 0.7s | ,s | ,3s | ,3s | 3,3s
Tetris Slices |
Time IOT ORDERKEY | 64.7s | 84.3s | 306.7s | 356.s | 834.3s | 753.6s | 3604.s
Time IOT SHIPDATE | 7.5s | 6.9s | 40.3s | s | 3.7s | 569.8s | 586.4s
Time FTS-Sort | 34.s | 6.7s | 34.0s | 38.s | 86.5s | 479.4s | 376.4s
Time Tetris | 3.s | 64.4s | 9.5s | 06.s | 57.5s | 44.s | 06.s
Cache Tetris | 0.3.MB | 0.3MB | 0.9MB | .MB | .4MB | .MB | .6MB
Temp Storage IOT/FTS | 7MB | 40MB | 65MB | 8MB | 83MB | 36MB | 75MB

Table 8-2: Interactive response times and cache sizes for sorting 50 % of LINEITEM

8.2.2 Joins, Grouping, and Restrictions

The query Q4 (cf. Figure 8-6) of the TPC-D benchmark is an order priority checking query, i.e., it counts the number of orders placed in a given quarter of a given year in which at least one line item was received by the customer later than its committed date. The query lists the count of such orders for each order priority sorted in ascending priority order. Since Q4 involves restrictions, joins, and grouping, it is efficiently supported by the Tetris algorithm.

    SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
    FROM ORDER
    WHERE O_ORDERDATE > DATE [date]
      AND O_ORDERDATE < DATE [date] + INTERVAL 3 MONTH
      AND EXISTS ( SELECT * FROM LINEITEM
                   WHERE L_ORDERKEY = O_ORDERKEY
                     AND L_COMMITDATE < L_RECEIPTDATE )
    GROUP BY O_ORDERPRIORITY
    ORDER BY O_ORDERPRIORITY

Figure 8-6: Q4 of the TPC-D benchmark

Assuming a three-dimensional organization of ORDER (ORDERDATE, ORDERPRIORITY, ORDERKEY) and LINEITEM (COMMITDATE, RECEIPTDATE, ORDERKEY), query processing with the Tetris algorithm is shown in Figure 8-7. Q4 groups the restricted ORDER table depending on tuple existence in the LINEITEM table. Efficiently processing this query means processing ORDER in ORDERKEY order while using the 3.5% restriction on ORDERDATE. To evaluate the existential restriction, LINEITEM is processed in ORDERKEY order and semi-joined to ORDER. The Tetris algorithm can be used to process the triangular search space defined by COMMITDATE < RECEIPTDATE in ORDERKEY order. Processing each ORDERDATE slice in ORDERPRIORITY order reduces the number of CPU operations, since groups can be built without comparisons. When processing the last ORDERDATE slice, completing an ORDERPRIORITY slice allows one to transfer the corresponding group immediately to the user.
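The benefit of receiving tuples in group order can be sketched as follows. The Python code below is only an illustration under the assumption that the input stream already arrives sorted on the grouping attribute (as a Tetris operator would deliver it); the function name `count_per_group` and the sample priority values are hypothetical. Unlike the slice-based processing described above, this sketch still compares adjacent keys; what it shows is that a group is complete, and can be emitted, as soon as the key changes.

```python
from itertools import groupby


def count_per_group(sorted_rows, key):
    """Stream COUNT(*) per group over rows that are sorted on `key`.

    No hash table and no sort are needed: each group is finished when
    the next key appears, so its count can be emitted immediately.
    """
    for group_key, rows in groupby(sorted_rows, key=key):
        yield group_key, sum(1 for _ in rows)
```

For example, `list(count_per_group(["1-URGENT", "1-URGENT", "2-HIGH"], key=lambda p: p))` produces one `(priority, count)` pair per priority, in ascending priority order, without buffering the input.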

[Figure 8-7: Processing Q4. LINEITEM is shown over L_RECEIPTDATE, L_COMMITDATE and L_ORDERKEY (60%), ORDER over O_ORDERPRIORITY, O_ORDERDATE and O_ORDERKEY (3.5%).]

We only report the response times and cache sizes of sorting ORDER in Figure 8-8 and Table 8-3, since the enhancement of the Tetris algorithm for non-rectangular query spaces has not been implemented yet. The restrictions on ORDER are selective enough for an IOT on ORDERDATE to be superior to FTS and IOT on ORDERKEY. In accordance with our cost functions the sudden increase of the FTS for SF 0.5 is due to the fact that at this point the main memory is no longer sufficient to sort the result internally. The Tetris algorithm is superior to FTS and any IOT, since it utilizes restrictions and sorts the data at the same time. Even for this quite selective ORDERDATE restriction the Tetris algorithm is more than three times faster than the IOT on ORDERDATE. Tetris is also considerably faster than an FTS and about 30 times faster than an IOT on ORDERKEY.

[Figure 8-8: Response times and temporary storage for sorting 3.5% of ORDER for Q4. Left panel: time in s over the TPC-D scaling factor for IOT ORDERKEY, FTS, IOT ORDERDATE and Tetris. Right panel: temporary storage in MB over the TPC-D scaling factor for IOT ORDERDATE and FTS, Tetris, and IOT ORDERKEY.]

As predicted by our cost functions (cf. Section 6.4), the Tetris cache of Figure 8-8 is more than 60 times lower than the intermediate storage of an IOT on ORDERDATE or FTS. Even for an ORDER table of .5GB (SF 4) the 0.3 MB Tetris cache easily fits into main memory.

Table Size | 3 MB | 79 MB | 3 MB | 6 MB | 3 MB | 750MB | 498MB
Scaling Factor (SF) | (0.) | (0.5) | (0.4) | (0.5) | () | () | (4)
Tetris 1st response | 0,s | 0,4s | 0,s | 0,s | 0.s | 0.s | 0.3s
Tetris Slices |
Time IOT ORDERKEY | 6.0s | 78.4s | 95.9s | 406.9s | 83.8s | 67.5s | 354.9s
Time IOT ORDERDATE | 3.s | 9.s | 6.8s | 34.3s | 95.4s | 94.s | 390.4s
Time FTS-Sort | 5.s | .5s | 9.9s | 46.4s | 335.s | 758.6s | 396.7s
Time Tetris | 3.3s | 7.8s | 0.5s | .s | 9.7s | 47.8s | 3.9s
Cache Tetris | 0.MB | 0.MB | 0.MB | 0.MB | 0.MB | 0.MB | 0.3MB
Temp Storage IOT/FTS | .3MB | 3.MB | 5.MB | 6.4MB | .9MB | 30.MB | 60.MB

Table 8-3: Interactive response times and cache sizes for sorting 3.5% of ORDER

8.2.3 Multi-attribute Restrictions

Query Q6 (cf. Figure 8-9) of the TPC-D benchmark is a forecasting revenue query, which lists the amount by which the total revenue would have increased if the discounts had been eliminated for line items with a quantity less than a given quantity in a given year with a discount deviating by at most 0.01 from a given discount. The UB-Tree range query algorithms may be used to efficiently process this query involving multi-attribute restrictions and aggregations.

    SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
    FROM LINEITEM
    WHERE L_SHIPDATE > [date]
      AND L_SHIPDATE < [date] + INTERVAL 1 YEAR
      AND L_DISCOUNT BETWEEN [discount] - 0.01 AND [discount] + 0.01
      AND L_QUANTITY < [quantity]

Figure 8-9: Query Q6 of the TPC-D benchmark

Q6 is processed either by using an IOT on SHIPDATE to materialize the result and then checking the conditions on DISCOUNT and QUANTITY, or by performing an FTS if no such index exists. Performing an index intersection on three secondary B*-Trees is not very efficient, since the selectivity of each individual attribute is relatively low (20% for SHIPDATE, 33% for DISCOUNT and 50% for QUANTITY). An intersection of bitmap indexes is not a good choice either, since the number of distinct values for SHIPDATE, DISCOUNT, and QUANTITY is quite high. Since 1/30th of all tuples of LINEITEM satisfy the restrictions of Q6, 200k tuples have to be retrieved to process the query for SF 1. Bitmap indexes and secondary B-Trees do not cluster the data. Therefore an FTS is preferable to both access methods.

Multidimensional indexes cluster data symmetrically with respect to all index attributes. With 8kB pages, 80 tuples of the LINEITEM relation are stored together on one page. Accessing 200k tuples then means 2.5k random disk accesses. Thus it makes sense to use a multidimensional index for this type of query.
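The page-access estimate above is simple arithmetic and can be checked with a few lines of Python. The sketch assumes, beyond what is stated in the text, that LINEITEM holds about 6,000,000 tuples at SF 1 (the nominal TPC-D cardinality) and that the qualifying tuples are perfectly clustered, i.e. every loaded page is completely filled with result tuples; the function names are hypothetical.

```python
import math


def qualifying_tuples(total_tuples: int, one_in: int) -> int:
    """Number of result tuples if one in `one_in` tuples qualifies."""
    return total_tuples // one_in


def clustered_page_accesses(result_tuples: int, tuples_per_page: int) -> int:
    """Pages to read if result tuples are perfectly clustered on pages."""
    return math.ceil(result_tuples / tuples_per_page)


# Assumed LINEITEM cardinality at SF 1; 1/30 of the tuples satisfy Q6.
result = qualifying_tuples(6_000_000, 30)
pages = clustered_page_accesses(result, 80)
```

Under these assumptions `result` is 200,000 tuples and `pages` is 2,500. An unclustered secondary index, in the worst case, touches one page per qualifying tuple, i.e. up to 200,000 random accesses, which is why clustering matters more than the index lookup itself.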

For Q6 we created five instances of LINEITEM, namely a UB-Tree, an IOT on each restricted attribute, and a table with three secondary B-Trees, one on each restricted attribute. As expected, it was not possible to make the optimizer perform an index intersection. The optimizer always preferred an FTS instead. Forcing the optimizer to use a single secondary index on SHIPDATE (the most selective attribute) was much less efficient than an FTS. Since this confirms our theoretical expectations, we exclude secondary indexes from our performance comparison for Q6 as before.

[Figure 8-10: Processing Q6 with UB-Tree, IOT and FTS. Three panels (LINEITEM Tetris, LINEITEM IOT, LINEITEM FTS), each over the axes L_SHIPDATE, L_DISCOUNT and L_QUANTITY.]

The shaded part of Figure 8-10 shows the part of LINEITEM that is retrieved by the Tetris algorithm, the IOT on SHIPDATE, and the FTS in order to process Q6. Although an FTS retrieves the entire relation, prefetching strategies substantially reduce the number of random accesses and make the FTS superior to any IOT.

Table Size | 33 MB | 8 MB | 3 MB | 63 MB | 36 MB | 65MB | 30MB
Scaling Factor (SF) | (0.05) | (0.065) | (0.) | (0.5) | (0.5) | (0.5) | ()
Time IOT QUANTITY | 43,6s | 09,s | 80,s | 5,s | 460,7s | 9,4s | 84,8s
Time IOT DISCOUNT | 3,s | 78,3s | 6,4s | 58,s | 339,s | 678,4s | 356,8s
Time IOT SHIPDATE | ,s | 53,7s | 8,6s | 0,s | 08,s | 46,3s | 83,5s
Time FTS | 5,s | ,s | 9,s | 3,8s | 47,7s | 93,9s | 87,6s
Time UB-Tree | ,s | ,5s | 4,5s | 5,8s | ,0s | ,3s | 30,5s

Table 8-4: Interactive response times for Q6

Table 8-4 and Figure 8-11 again show the superiority of a multidimensional organization over classical access methods by a sixfold speedup of the Tetris algorithm over an FTS and by a speedup of two to three orders of magnitude over any IOT. The results also show that an FTS is superior to any one-dimensional index, since the FTS uses page clustering whereas indexes only use tuple clustering. In accordance with our cost formulas from Section 6.3, the restriction in any single dimension is not selective enough for an IOT to make up for the page clustering advantage of the FTS. However, a multidimensional organization of the table with a UB-Tree utilizes the restrictions in all dimensions and thus clearly outperforms the FTS. In addition, an FTS puts an enormous load on the system in both I/O and CPU resources, since each tuple of the relation is retrieved and processed in main memory (cf. [Pe98]). Thus, especially in multi-user environments, indexes may be preferable to an FTS because of concurrency considerations.

[Figure 8-11: Performance of Q6. Time in s over the TPC-D scaling factor for IOT QUANTITY, IOT DISCOUNT, IOT SHIPDATE, FTS and UB-Tree.]

8.2.4 Summary: TPC-D Performance Measurements

Our performance measurements of three TPC-D queries have shown that UB-Trees and the Tetris algorithm are superior to one-dimensional access methods with respect to both response time and the system resources needed for storing intermediate results. With the Tetris algorithm a new operator may be introduced into query processing, the so-called Tetris operator. This operator combines the evaluation of a multi-attribute restriction with a sort operation in one processing step, if a relation is organized by a multidimensional index. Usually no more than 3 to 6 foreign keys are used to describe the foreign key relationships between tables. For these dimensionalities only I/O time linear in the size of the result set and sublinear temporary storage are necessary to perform the Tetris algorithm. In contrast to a merge-sort algorithm, results are produced in a continuous flow. Therefore sorting is no longer a blocking operation. Compared to existing techniques, the first results are available much earlier and thus allow better interactive response times and better internal pipelining of the data. The benchmark results for three queries of the TPC-D benchmark show speedups of up to two orders of magnitude in response time. Depending on the query, temporary storage requirements are reduced by several orders of magnitude.

8.3 A Real World Data Warehouse: The Juice & More Benchmark

The most established relational data models for data warehousing applications are the star schema and the snowflake schema. In both approaches there is a central fact table that contains the measures, and the dimension tables are situated around it. The connection between a fact tuple and the corresponding dimension members is realized via foreign key relationships. In the star schema the dimension tables are completely denormalized, while in the snowflake schema they may be normalized. Queries usually contain restrictions on multiple dimension tables (e.g., only sales for a specific customer group and for a specific time period are asked for) that are then used as restrictions on the usually very large fact table. This operation (the star join) is typical for such models. In ROLAP, hierarchies are usually modeled implicitly by a set of attributes A_1, ..., A_n where A_i corresponds to hierarchy level i.

In this section we investigate how our technique of multidimensional hierarchical clustering may be used to accelerate star joins, the most frequent operation of query processing for relational data warehouses. We use the schema of the beverages supplier Juice & More, a real customer of one of our project partners (8). In the data warehouse of Juice & More, data is organized along the following four dimensions: CUSTOMER, PRODUCT, DISTRIBUTION and TIME. Figure 8-12a shows the hierarchies over the dimensions (the number in parentheses specifies the maximal number of level members).

[Figure 8-12: Hierarchies in the Juice & More schema and the corresponding star schema. (a) Hierarchy levels: PRODUCT: Type, Brand, Category, Container; CUSTOMER: Region, Nation, Trade Type, Business Type; DISTRIBUTION: Sales Organization, Distribution Channel; TIME: Year, Month. (b) Star schema: PRODUCT(PRODKEY, TYPE, BRAND, CATEGORY, CONTAINER, ...), CUSTOMER(CUSTKEY, REGION, NATION, TRADE-TYPE, BUSINESS-TYPE, ...), DISTRIBUTION(DISTKEY, SALESORG, CHANNEL, ...), TIME(TIMEKEY, YEAR, MONTH, ...), FACT(PRODKEY, CUSTKEY, DISTKEY, TIMEKEY, SALES, DISTCOST, ...).]

The ROLAP data model for the Juice & More schema (Figure 8-12b) is a typical star schema with one fact table FACT and a table for each of the 4 dimensions. Let SALES and DISTCOST be some of the measures in the fact table. We use the methodologies of

(8) The company and the data presented here have been made anonymous.

surrogates and multidimensional hierarchical clustering, as described earlier in this thesis, for clustering the fact table of Juice & More with UB-Trees. In the following we describe some example queries involving star joins for the Juice & More schema. These queries were taken from the real-world decision support system of Juice & More. The database schema and data of Juice & More are real world data which we obtained from our project partners. Thus, next to the investigation of multidimensional hierarchical clustering, this section is interesting from a second point of view: The highly skewed data distribution of Juice & More will show that UB-Trees and the Tetris algorithm are not only applicable to laboratory tests with generated data, but also prove their efficiency in practical application scenarios.

In order to show the highly skewed data distribution we include an entire section displaying the one-dimensional data distribution of Juice & More for every dimension. We then present measurements performed with our prototype implementation of the UB-Tree on top of DBMS. For the evaluation of our clustering technique we defined a benchmark with 36 queries. For comparison we also conducted measurements with native DBMS access methods: full table scan (FTS) and bitmap indexes (BII). For these measurements we used a completely denormalized fact table, that is, no additional joins had to be performed to answer the queries. The bitmap indexes were created on each hierarchy level. We did not include secondary indexes in our comparison measurements because earlier experiments showed that they are competitive neither with the UB-Tree nor with FTS or BII [MZB99].

8.3.1 Queries on the Juice & More Schema

In the following we present typical queries that are taken from real applications for the schema given in the previous section. We will use these queries to illustrate our approach, and we will present performance measurements for exactly these queries later in this chapter.

Query 1 (Q1, cf. Figure 8-13) computes the sales for a given product group (TYPE and BRAND specified as (X1, X2)) and a given customer group (NATION and REGION specified as (Y1, Y2)) for the months from October to December of 1993.

    SELECT SUM(Sales)
    FROM Fact F, Customer C, Product P, Time T
    WHERE F.ProdKey = P.ProdKey AND F.CustKey = C.CustKey
      AND P.Type = X1 AND P.Brand = X2
      AND C.Region = Y1 AND C.Nation = Y2
      AND F.TimeKey = T.TimeKey AND T.Year = 1993
      AND T.Month > October AND T.Month < December

Figure 8-13: Time Interval (Q1)

Query 2 (Q2, cf. Figure 8-14) calculates the cost of distribution of the products of type X for each distribution channel.

    SELECT SALESORG, CHANNEL, SUM(DistCost)
    FROM Fact F, Distribution D, Product P
    WHERE F.DistKey = D.DistKey
      AND F.ProductKey = P.ProductKey
      AND P.Type = X
    GROUP BY D.SalesOrg, D.Channel

Figure 8-14: Distribution cost (Q2)

Query 3 (Q3, cf. Figure 8-15) restricts all dimensions on the first level of the hierarchies.

    SELECT SUM(SALES)
    FROM Fact F, Distribution D, Product P, Customer C, Time T
    WHERE F.DistKey = D.DistKey AND F.TimeKey = T.TimeKey
      AND F.CustKey = C.CustKey AND F.ProdKey = P.ProdKey
      AND P.Type = t AND D.SalesOrg = s
      AND T.Year = y AND C.Region = r

Figure 8-15: Partial match query in the first hierarchy level (Q3)

8.3.2 Data Distribution

The data of Juice & More is real world data from one of our project partners. In contrast to the data distributions used for most of the performance analyses and measurements in the previous sections, the Juice & More data distribution of both the fact table and the dimension tables is highly skewed: The dimensions are neither uniformly distributed nor independent. The original fact table consisted of tuples (about 75 MB). To get a realistically large data cube, the fact table was enlarged to tuples (about 5,6 GB). Our project partner implemented an augmentation algorithm with minimal impact on the data distribution (see [Pe98]).

In the following we show some charts which describe the one-dimensional data distribution for each dimension. However, we stress once more that the dimensions are not independent (e.g., some customers always order the same subset of products, some customers or products only exist for a certain time, etc.). Thus in general, the overall selectivity of a query restricting several dimensions is not the product of the selectivities of the one-dimensional restrictions (which are shown in the following charts). We will see this deviation in a later section. The fact table of Juice & More stores several measures (e.g., distribution cost, sales) aggregated on a daily basis with respect to the dimensions time, customer, product and distribution. Since the data is sensitive real-world business data, it is not possible to show the labels or names of the hierarchy members in the charts.

The Time Dimension

The time dimension of Juice & More consists of a two-level hierarchy of months and years. The test data stores the years from 1993 to 1995. As shown in Figure 8-12a, the days of the time dimension are organized by a two-level hierarchy (year and month). Figure 8-16 shows the cumulated data distribution of the fact table with respect to the time dimension grouped by year and month. The horizontal axis displays the hierarchy members, with All at the very bottom, the years 1993, 1994, and 1995 in the middle and the twelve months for each year above the year. The arrows in the horizontal axis indicate the relationship between the members of neighboring hierarchy levels. The fact table of Juice & More is almost uniformly distributed with respect to the time dimension. The minimum number of facts for one month is 2,70% of the fact table (in January 1993, December 1993 and January 1995), whereas the maximum number of facts per month is 2,83% (in March of each of the three years). Thus with a multidimensional clustering using 5 split levels (cf. Sections 3.0 and 6.), restrictions to one month in the time dimension (1/36) can be expected to reduce the amount of data to around 1/2^5 = 1/32.

[Figure 8-16: Data distribution of the time dimension. Percentage of fact tuples per month for the 36 months of the three years.]

The Product Dimension

The top level of the hierarchy on product has five entries. The data distribution is quite skewed: there are three product groups to which 93% of all tuples of the fact table belong. % of the data is unclassified. The distribution of the first level of the product hierarchy is illustrated in Figure 8-17a. Multidimensional hierarchical clustering as described in 5.3.4 ensures that a restriction in the first hierarchy level will result in a %, 7%, 6%, 3% respectively 34% reduction of the I/Os which are necessary to retrieve the result set. Without showing the exact relationship between the hierarchy members, we show the skew of the data distribution over the first four product levels in Figure 8-17b. The distribution of the first two hierarchy levels is illustrated in Figure 8-18. The horizontal axis again displays the relationship of the members of the first two hierarchy levels.

[Figure 8-17: Data distribution of the product dimension. (a) first hierarchy level, (b) first four hierarchy levels.]

[Figure 8-18: Data distribution of the first two hierarchy levels of the product dimension.]

The Customer Dimension

According to the business administration literature, 20% of the customers contribute to 80% of the business. The customer dimension of Juice & More is a typical example of a classification of customers in such a company. A high number of customers (in this case 88%, the leftmost entry of Figure 8-19) are not classified (maybe they are not interesting for the company because of small turnover, or it is not possible to find a classification). The classified customer groups contain 0% to 3% of the tuples stored in the table. (9) The hierarchical relationships on the horizontal axis show that the hierarchy of the customer dimension is not balanced, since several hierarchy members have just one child. The consequence of the small number of classified customers is that in queries the customer dimension will be restricted to a small range (3%) and therefore the result set will be small.

[Figure 8-19: Data distribution of the customer dimension.]

The Distribution Dimension

There are seven entries on the first level of the distribution hierarchy. The data distribution of the fact table with respect to the distribution dimension is highly skewed. Figure 8-20a shows the distribution of the first hierarchy level (i.e., sales organization), while Figure 8-20b shows the distribution of facts in the fact table for each distribution channel of each sales organization. Again the arrows indicate the hierarchical relationship of the members of neighboring hierarchy levels, with the hierarchy root All at the bottom of the figure.

(9) Note that the data in the Juice & More warehouse is aggregated on a daily basis; the amount of data is therefore usually compressed for large customers, which reduces the proportion of large customers.

[Figure 8-20: Data distribution of the distribution dimension. (a) first hierarchy level, (b) distribution channels of each sales organization.]

Multidimensional Hierarchical Clustering of Juice & More

Figure 8-21 shows the compound surrogates for the Juice & More data warehouse, which are calculated as fixed length compound surrogates. For any of the 4 hierarchies the length of the compound surrogate does not exceed 15 bits and thus can be stored in a single integer value. These compound surrogates are used as attributes for each of the four dimensions of Juice & More to calculate the Z-address for each tuple of the Juice & More fact table.

[Figure 8-21: Compound surrogates for each dimension of Juice & More. Bit layout of the compound surrogates cs for product (p15 ... p1), customer (c10 ... c1), time (t6 ... t1) and distribution, with the bits assigned to the hierarchy levels of each dimension.]

The UB-Tree for the Juice & More fact table consists of P pages, which corresponds to $l = \log_2 P$ hierarchical split levels. With bit interleaving in the order of the dimensions product, customer, time, and distribution, the Z-address α for a tuple of the Juice & More fact table is calculated as:

$$\alpha = \underbrace{p_{15} c_{10} t_6 d_5 \; p_{14} c_9 t_5 d_4 \; p_{13} c_8 t_4 d_3 \; p_{12} c_7 t_3 d_2 \; p_{11} c_6 t_2}_{\text{completely partitioned for any data distribution}} \; \underbrace{d_1 \; p_{10} c_5 t_1 \; p_9 c_4 \; p_8 c_3 \; p_7 c_2 \; p_6 c_1 \; p_5 p_4 p_3 p_2 p_1}_{\text{partly split, depending on the data distribution}}$$
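The bit interleaving described above can be sketched in a few lines of Python. This is an illustration only, not the code used in our prototype: the function name `z_address` is hypothetical, and the surrogate widths of 15, 10, 6 and 5 bits for product, customer, time and distribution are assumptions read off the bit names in the text (the width of the distribution surrogate in particular is assumed). Bits are taken round-robin from the most significant end of each surrogate, and a dimension drops out of the rotation once its bits are used up.

```python
def z_address(values, widths):
    """Interleave the bits of several surrogates into one Z-address.

    values -- one non-negative integer per dimension
    widths -- number of bits of each surrogate, in the same order
    Bits are consumed most significant first, round-robin over the
    dimensions; exhausted dimensions are skipped.
    """
    remaining = list(widths)
    z = 0
    while any(r > 0 for r in remaining):
        for dim, value in enumerate(values):
            if remaining[dim] > 0:
                remaining[dim] -= 1
                bit = (value >> remaining[dim]) & 1
                z = (z << 1) | bit
    return z


# Assumed surrogate widths: product, customer, time, distribution.
JUICE_WIDTHS = (15, 10, 6, 5)
```

With these widths a Z-address has 36 bits. The most significant product bit lands on the top bit of the address, followed by the top bits of customer, time and distribution, so that the first hierarchy level of every dimension influences the leading bits of the address.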

The first 19 bits of the Z-address are guaranteed to be used to partition the four-dimensional universe of the Juice & More fact table. This means that the binary strings p15 p14 p13 p12 p11 of the compound surrogate of product, c10 c9 c8 c7 c6 of customer, t6 t5 t4 t3 t2 of time and d5 d4 d3 d2 of distribution are used to partition the universe. For each of the four dimensions the first hierarchy level is completely used for the partitioning. The second hierarchy level is used to a large extent to partition the universe. Therefore a restriction in the first hierarchy level will result in a reduction of the number of pages as determined by the data distribution of Section 8.3., i.e., a restriction of the product main group to 90 will reduce the number of pages to be retrieved to 7%, a restriction to 9 will result in a reduction to 6,4%. This holds for the restriction of the first hierarchy level in any dimension. If the top hierarchy level is restricted in several dimensions and the independence assumption holds for these dimensions, the reduction is multiplicative. Table 8-5 shows the predicted selectivity calculated as the product of the selectivity in each dimension, the actual selectivity and the loaded pages in percent of the entire pages in the database for two queries, which restrict the first hierarchy level of three out of four dimensions.

s_customer | s_product | s_distribution | s_time | pages loaded | predicted selectivity | actual selectivity | loaded pages in %
0,78% | 6,4% | 37,5% | 100% | 78 | 0,08% | 0,099% | 0,007%
87,34% | 6,4% | 37,5% | 100% | 8996 | ,098% | ,68% | ,67%

Table 8-5: Restrictions in the first hierarchy level in 3 of 4 dimensions

However, the data is not independently distributed in the entire 4-dimensional universe of Juice & More. In this case the predicted selectivity no longer describes the actual selectivity. Some of the first 19 bits are correlated. This means that not all combinations of these bits occur, and some partitioning will take place in the bits below bit number 19 of the Z-address. In this case the second level may already be completely partitioned, and even a third hierarchy level partitioning may have started for some dimensions. A typical part of the multidimensional space where this will happen is the customer hierarchy member unclassified, which stores 87,34% of the customers. At most the three bits c10 c9 c8 of the customer hierarchy are needed to distinguish these customers from all other customers. Thus for the unclassified customers the bits c7 c6 of the first 19 bits of the Z-address are correlated to c10 c9 c8, and two further bits may be used for partitioning. Thus d1 will be split completely, p10 will be used for the partitioning and t1 will be partly split (c5 is also correlated to c10 c9 c8). Actually, this is a puff-pastry effect, which due to the surrogate calculation is beneficial for query performance, since it allows further partitioning steps for correlated hierarchy levels. Our measurements show that this effect also holds for other dimensions.

Table 8-6 shows queries where the first two hierarchy levels of customer, product and time are restricted, whereas the distribution dimension is not restricted. The selectivity predicted by the cost functions here differs from the actual selectivity of the query because of dependencies in the data distribution. However, the percentage of pages loaded is similar to the actual selectivity of each query.
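The predicted selectivity under the independence assumption is just the product of the one-dimensional selectivities. The following Python sketch illustrates the calculation; the function name `predicted_selectivity` is hypothetical, and the input values are the per-dimension selectivities of the second query of Table 8-5 as far as they are legible (87,34%, 6,4%, 37,5% and 100%).

```python
from math import prod


def predicted_selectivity(per_dimension):
    """Selectivity of a multi-attribute restriction, assuming that the
    restricted dimensions are statistically independent."""
    return prod(per_dimension)


# Per-dimension selectivities: customer, product, distribution, time.
query = [0.8734, 0.064, 0.375, 1.0]
estimate = predicted_selectivity(query)
```

For these inputs the estimate is roughly 2.1% of the fact table. Comparing such an estimate with the measured selectivity is a quick way to detect correlated dimensions: the larger the gap, the less the independence assumption holds for the restricted region.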

199 SECTION 8.3: A REAL WORLD DATA WAREHOUSE: THE JUICE & MORE BENCHMARK 83 s customer s prouct s strbuton s tme pages loae precte selectvty actual selectvty loae pages n %,95% 3,64% 00% 38,4% 3 0,044% 0,030% 0,0355%,95% 7,99% 00% 38,4% 78 0,038% 0,085% 0,0890% Table 8-6: Restrcton n the frst two herarchy levels n 3 of 4 mensons Each of the pages of the Juce & More fact table stores 30 tuples. All of the measurements showe that when restrctng the frst herarchy level n each menson n average 99,99% of the tuples on the pages contrbute to the result set. A stanar evaton of less than 0,00 for these measurements means that the multmensonal herarchcal clusterng s perfect for multmensonal restrctons n the frst herarchy level. When atonally restrctng the secon herarchy level n average 557 pages were loae, where 0,7% of the tuples not contrbute to the result set. The stanar evaton here was 0,06. Atonally restrctng the thr herarchy level of each menson usually create result sets wth only one page. Thus multmensonal herarchcal restrctons are very well processe by UB-Trees storng compoun surrogates whch are create by the multmensonal herarchcal clusterng technque ntrouce n Secton Agan, the basc conseraton about the UB-Tree n terms of mensonalty (cf. Secton 3.9) an restrcte mensons (cf. Secton 7.4.3) hol. For star schemas as use n present ata warehousng applcatons ths approach may sgnfcantly spee up query performance an reuce resource requrements n sk space an processng tme. We are currently refnng the technque of multmensonal herarchcal clusterng by usng hash clusters for the calculaton of each nvual surrogate. Ths refnement wll be mplemente to herarchcally cluster a ata warehouse of marketng ata prove by one of our project partners Performance Measurements The measurements for Juce & More were performe on a SUN Enterprse wth four 300 MHz UltraSPARC processors an GB RAM uner Solars.6. 
As secondary storage, a partition on a SPARCstorage array with RAID level 0 (6 disks striping, 5-6 MB/s transfer rate per disk) was used. All measurements were done in a single-user environment. It is important to note that our implementation still causes significant overhead, because we have implemented the UB-Tree on top of a DBMS and not in the kernel itself. First, the number of SQL statements that have to be processed (UB: one statement for each page in the result set, DBMS methods: one statement in total) leads to extensive inter-process communication (about 30% of the total processing time) and DBMS overhead (e.g., parsing of statements). Second, our table is larger than the one for the FTS and the bitmap indexes due to unimplemented compression techniques in the UB-Tree (for 8 KB pages: UB: pages, FTS: pages, BII: FTS+334 pages).

Figure 8- shows result set sizes and response times of the three example queries (Section 8.3.). Q shows that the UB-Tree with multidimensional clustering is over times faster than BII even for very small result sets. Q3, which is processed by the unoptimized UB-Tree at least 0 times faster than with any other access method, underlines this observation.

[Figure 8-: Query response times and result set sizes. Bar chart of the response time in s of UB, FTS and Bitmap for the three example queries, with a table of loaded tuples and percentage of the database per query.]

The result set of query Q is quite large, but the almost perfect clustering factor of the UB-Tree (on average more than 9 out of 30 tuples/page belong to the result set) still leads to a speedup of more than 30 % in comparison to BII. The time for FTS for Q differs from the times for Q and Q3 due to the less complex WHERE clause of the statement. The number of comparison operations is therefore much smaller than for the other queries, which causes the faster execution. All these results on real data show how well multidimensional hierarchical clustering with UB-Trees works in practice, and they confirm the accuracy of our theoretical cost model. In total more than 77% of all benchmark queries (28 out of 36) showed a speedup between a factor of .3 and 0 over traditional techniques [Pe98].

Figure 8-3 lists two instances for each of five further queries of that benchmark. For each query, Table 8-7 lists the number of hierarchy levels that the query restricts to a point in each dimension. Since the data is non-uniformly distributed, the selectivity of each query depends on the exact point restriction, not only on the number of restricted hierarchy levels. We thus present two instances of the queries, each of which has a different selectivity.

Query  Customer / Product / Distribution / Time  Selectivity Instance  Selectivity Instance
Q4     0                                         0,044%                0,376%
Q5                                               0,0070%               3,4383%
Q6     0                                         0,08%                 ,098%
Q7     0                                         0,346%                ,346%
Q                                                ,0000%                34,0000%

Table 8-7: Restricted hierarchies and selectivities for five queries against the Juice & More data warehouse

[Figure 8-3: Juice & More performance. Two bar charts, one per query instance, showing the time in seconds of UB, BM and FTS for the queries Q4 to Q8.]

Summary: Data Warehouse Performance Measurements

Our performance measurements have shown that multidimensional hierarchical clustering reduces the number of random accesses to the fact table for star joins and other queries with restrictions in multiple hierarchies. In addition, sort operations as necessary for grouping and aggregation are performed on the fly without additional I/O. For dimensionalities typical for data warehousing, only I/O time linear in the size of the result set prior to aggregation and sublinear temporary storage are necessary to aggregate parts of a data cube. Thus secondary storage space and pre-computation time for many aggregates and bitmap indexes can be avoided. In addition, the widely discussed view maintenance problem is minimized. The benchmark results for typical queries of a 7 GB real world retail data warehouse confirmed our analytical expectations and showed significant speedups of up to a factor 0 in response time. Depending on the query, temporary storage requirements for sorting are reduced by several orders of magnitude. Our clustering approach holds not only for ROLAP but also for MOLAP implementations of a data warehouse, since both ROLAP fact tables and MOLAP data cubes can be clustered in this way.
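The on-the-fly sorting summarized above can be illustrated with a strongly simplified sketch in the spirit of the Tetris algorithm. It is not the algorithm of this thesis: instead of Z-regions of a UB-Tree it works on plain in-memory pages, and the lower bound of each page in the sort attribute is computed from the page content rather than taken from region boundaries. The names `sorted_read`, `pages` and `key` are hypothetical. The sketch only shows why output can start before all pages are read: a cached tuple may be emitted as soon as no unread page can contain a smaller sort key.

```python
import heapq


def sorted_read(pages, key):
    """Yield all tuples of `pages` in ascending `key` order, page by page.

    Assumption: the smallest sort key of every page is known in advance
    (here it is simply computed from the page content).
    """
    # Visit pages in the order of their smallest sort key.
    ordered = sorted((p for p in pages if p),
                     key=lambda p: min(key(t) for t in p))
    lower = [min(key(t) for t in p) for p in ordered]

    cache = []  # min-heap of (sort key, tie breaker, tuple)
    seq = 0
    for i, page in enumerate(ordered):
        for t in page:
            heapq.heappush(cache, (key(t), seq, t))
            seq += 1
        # No unread page holds a key below `bound`, so everything in the
        # cache up to `bound` is final and can be passed on immediately.
        bound = lower[i + 1] if i + 1 < len(ordered) else None
        while cache and (bound is None or cache[0][0] <= bound):
            yield heapq.heappop(cache)[2]


if __name__ == "__main__":
    demo = [[(5, "a"), (1, "b")], [(9, "e"), (7, "f")], [(3, "c"), (2, "d")]]
    for row in sorted_read(demo, key=lambda t: t[0]):
        print(row)
```

In this sketch the cache only holds tuples whose keys lie above the lower bound of the next unread page, which is the reason why such a sorted read needs less temporary storage than a full external sort when the pages are well clustered in the sort attribute.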

A thousand things... suddenly added up like a column of figures in her mind. (William Humphrey)

Chapter 9

Summary

Before drawing conclusions and describing future research work, we briefly summarize the contribution of this thesis to the research in database management systems. We have analyzed the application of a multidimensional access method to RDBMS. After introducing a formal model for multidimensionally partitioned relations, we discussed several query types and identified the significance of multidimensional range queries and sort operations for query processing. Discussing current access methods, we motivated the need for a multidimensional partitioning of relations. We described the UB-Tree in combination with algorithms for insertion, deletion, point queries, and range queries. We further introduced two new algorithms: the spiral algorithm for nearest neighbor queries with UB-Trees and the Tetris algorithm for efficient access to a table in arbitrary sort order. We also gave a detailed analysis of the UB-Tree space partitioning. We defined a cost model and compared the cost of range queries and sort operations for UB-Trees to the state of the art in current DBMS. The practical applicability of our approach was shown by performance measurements on a prototype implementation of UB-Trees on top of the RDBMS DBMS and DBMS. These experiments with artificial data, for the TPC-D benchmark and for a star schema data warehouse confirmed the theoretically expected superiority of UB-Trees and the algorithms developed in this thesis over traditional access techniques with respect to both response times and storage requirements.

9. Conclusions

Next to a prototype implementation of most of the algorithms discussed in this thesis, our main contributions are the analytical cost model (Chapter 6), the Tetris algorithm (Section 4.7) and the technique of multidimensional hierarchical clustering (Section 5.3.4). The performance measurements of Chapters 7 and 8 show that all of these techniques are feasible and of practical relevance. Moreover, the expectations derived from our theoretical cost model in Chapter 6 are met by these performance measurements. In addition, some interesting algorithmic problems were solved during the work on this thesis, e.g., bit interleaving for address calculation (cf. Section 5.3) or the reduction of the complexity of the calculation of the next Z-region intersection from exponential to linear (cf. Section 5.7.).

The cost model may be used as a basis for cost-based query optimization. Using histograms [PIH+97] to determine the selectivity of the multidimensional restrictions, the selectivity-based cost formula (Section 6..4) will also give a reasonable cost estimation for non-uniformly distributed partitioned multidimensional universes. With enumeration types (cf. Section 5.3.3), multidimensional hierarchical clustering (cf. Section 5.3.4) and variable UB-Trees (cf. Section 5.3.5) we have introduced three techniques to overcome the puff-pastry effect, which is inherent in Z-ordered data spaces (cf. Section 3.0). Suitable data modeling should ensure that these techniques are applicable in order to obtain a suitable multidimensional partitioning of a relation.

By virtue of the cost model and by experiments we found that multidimensional indexes are useful for partitioning data according to up to 0 dimensions (cf. Section 3.9). For larger dimensionalities a subset of the attributes should be chosen by proper physical data modeling techniques. However, dimensionalities of 6 0 dimensions are usual in relational data models, since one relation seldom has more than 6 0 foreign keys. Note that the number of key attributes is not identical to the dimensionality of the data space, which is often much smaller if foreign keys are composite keys.
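The bit interleaving for address calculation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the optimized implementation of Section 5.3: all dimensions are assumed to have the same number of bits, the first coordinate is assumed to contribute the most significant bit of each interleaving step, and the function names `z_address` and `z_inverse` are hypothetical.

```python
def z_address(coords, bits):
    """Interleave the bits of all coordinates, most significant bit first."""
    addr = 0
    for b in range(bits - 1, -1, -1):      # from the highest bit down to bit 0
        for c in coords:                   # one bit per dimension and step
            addr = (addr << 1) | ((c >> b) & 1)
    return addr


def z_inverse(addr, dims, bits):
    """Recover the coordinates from an address built by z_address."""
    coords = [0] * dims
    total = dims * bits
    for i in range(total):                 # i-th bit counted from the top
        bit = (addr >> (total - 1 - i)) & 1
        d = i % dims                       # dimension that owns this bit
        coords[d] = (coords[d] << 1) | bit
    return coords


if __name__ == "__main__":
    a = z_address((3, 5), bits=3)          # x = 011, y = 101
    print(a, z_inverse(a, dims=2, bits=3))
```

Because the address is built step by step from the most significant bits of every dimension, a common address prefix corresponds to a common subcube of the multidimensional space, which is what makes interval restrictions on the address usable for range queries.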
The Tetris algorithm accelerates most operations of relational query processing if multidimensional range restrictions and/or sort operations are involved. For this reason one of our project partners has started to integrate UB-Trees and the Tetris algorithm into the kernel of its RDBMS. With suitable heuristics for query optimization, the Tetris algorithm may thus substantially speed up query processing in RDBMS. For dimensionalities typical for relational databases, only I/O time linear in the size of the result set and sublinear temporary storage are necessary to perform the Tetris algorithm. In contrast to a merge-sort algorithm, the sorted relation is produced in a continuous flow of operation. Therefore, using the Tetris algorithm, sorting is no longer a blocking operation. Internal pipelining is thus tremendously improved, since results are transferred earlier to other nodes of the operator tree or even to the user. The Tetris algorithm thereby offers the chance to efficiently process iceberg queries for ranking [FSG+98] if the desired measure is used as a further dimension of the UB-Tree. The Tetris algorithm then does not read the entire query box in the sorting dimension, but terminates after processing the first slices. Aggregations can be calculated on the fly and allow better interactive response times. When sorting a relation or joining relations, restrictions in multiple attributes can be utilized efficiently in order to reduce I/O cost and CPU cost. The benchmark results for three queries of the TPC-D benchmark show speedups of up to two orders of magnitude in response time. Depending on the query, temporary storage requirements are reduced by several orders of magnitude. Further analysis indicates that our approach is useful for an even broader range of queries.

A star join with a point restriction on some hierarchy level of each dimension table results in a range restriction on each compound surrogate, if the fact table is organized with multidimensional hierarchical clustering using surrogate encoding for the foreign keys. In the same way, intervals on the children of one hierarchy level result in a range of the corresponding compound surrogates. Thus, with multidimensional hierarchical clustering a star join on a schema with d dimensions creates a d-dimensional interval restriction on the fact table. Therefore star joins in data warehouses benefit from the Tetris algorithm and the range query algorithm for UB-Trees as described in the previous paragraph. For dimensionalities typical for data warehousing, only I/O time linear in the size of the result set prior to aggregation and sublinear temporary storage are necessary to aggregate parts of a data cube. Secondary storage space and pre-computation time for many aggregates and bitmap indexes can be avoided. In addition, the widely discussed view maintenance problem is minimized. The benchmark results for typical queries of a 7 GB real world retail data warehouse confirmed our analytical expectations and showed significant speedups of up to a factor 0 in response time. Depending on the query, temporary storage requirements for sorting are reduced by several orders of magnitude.
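The way a point restriction on a hierarchy level turns into a range on the compound surrogate can be made concrete with a small sketch. It assumes, for illustration only, that a compound surrogate is the concatenation of fixed-width ordinal numbers, one per hierarchy level from the top level down; the bit widths, the example hierarchy and the function names `compound_surrogate` and `surrogate_range` are hypothetical and not taken from the Juice & More schema.

```python
def compound_surrogate(path, widths):
    """Concatenate the per-level ordinals of a hierarchy path into one integer."""
    cs = 0
    for ordinal, width in zip(path, widths):
        if not 0 <= ordinal < (1 << width):
            raise ValueError("ordinal does not fit into its level width")
        cs = (cs << width) | ordinal
    return cs


def surrogate_range(prefix, widths):
    """Range of compound surrogates for a point restriction on the upper levels.

    `prefix` fixes the first len(prefix) hierarchy levels; all lower levels
    stay unrestricted, so their bits run from all zeros to all ones.
    """
    free_bits = sum(widths[len(prefix):])
    low = compound_surrogate(prefix, widths[:len(prefix)]) << free_bits
    high = low | ((1 << free_bits) - 1)
    return low, high


if __name__ == "__main__":
    widths = (2, 3, 4)                        # e.g. region, nation, customer
    print(surrogate_range((1,), widths))      # restrict the first level only
    print(surrogate_range((1, 5), widths))    # restrict the first two levels
```

Applying `surrogate_range` once per dimension yields one interval per dimension, i.e. a query box on the fact table, which is the kind of restriction the UB-Tree range query algorithm handles directly.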
Our clustering approach holds not only for ROLAP but also for MOLAP implementations of a data warehouse, since both ROLAP fact tables and MOLAP data cubes can be clustered in this way.

9. Future Work

In our future work we are particularly interested in a detailed study of relational query processing with multidimensional indexes. We are in the process of investigating a methodology for query optimization with multidimensional access methods, both for heuristics-based and cost-based query optimizers. We will do performance measurements in multi-user environments, where we expect even more significant speedups. In a joint research project with TransAction Software we are currently integrating the UB-Tree into the DBMS kernel in order to reduce the overhead of the current implementation.

An interesting enhancement of UB-Trees might be the use of Hilbert curves [Hl9] as space filling curves for the UB-addresses. As our theoretical considerations in Section 3.3 showed, Hilbert curves will result in a better space partitioning because spatial proximity for both neighbors on the Hilbert curve is guaranteed. However, the algorithms for dealing with Hilbert addresses (H-addresses) are more complicated than those for Z-addresses, since the subcube in space defined by each step of an H-address depends on the previous steps of the H-address. A prototype implementation and performance measurements of the overhead for H-address calculation and next intersection calculation of the range query algorithm will have to show whether the better space partitioning pays off.

Query optimization with multidimensional access methods is an area which has not been researched in detail yet. The ability to handle multi-attribute restrictions with tuple clustered access methods will have to be taken into account by query optimizers. The Tetris operator of Section 4.7 and Chapter 8 shows that query optimization with multidimensional access methods will allow more complex operations to be handled by a single operator. In order to take advantage of multidimensional access methods, it is not sufficient for cost-based optimizers to rely on one-dimensional statistical information. In addition, approaches like multidimensional histograms [PIH+96, Poo97] should be used here. For these applications our cost model of Chapter 6 could be further refined and adapted.

Multidimensional access methods can also be used for the organization of intermediate results. Very often a node of an operator tree gets an input stream sorted with respect to one attribute A_j, whereas sorted processing with respect to another attribute A_k is required. A refinement of the Tetris algorithm could be used to perform a sorted writing of the intermediate result in sort order of A_k while taking advantage of the sorted input in order to reduce the cache size and to avoid the external sorting process. The implementation of this refinement of the Tetris algorithm for sorted writing, as well as the implementation and analysis of the operators of the relational algebra with multidimensional access methods, will be an interesting task of future research.

Another issue which has not been addressed in this thesis is parallelization and data partitioning.
In our opinion many of the algorithms proposed in this thesis have a high potential for parallelization (e.g., the UB-Tree range query algorithm and the Tetris algorithm). Multidimensionally partitioning data with respect to several disks or RAID systems [PGK88] will also allow parallelism to be exploited. Another issue with respect to data partitioning is the organization of huge databases on tertiary storage. Indexing of tertiary storage archives might also take our approach of multidimensional clustering into account.

Further research can also be done in the field of locking, where locking strategies might take advantage of the multidimensional organization of a relation. Spatial locking is a special kind of predicate locking which efficiently takes the physical organization of the relation into account in order to reduce contention problems. Page locking then corresponds to locking regions in multidimensional space. Spatial locking with UB-Trees might result in a hierarchical spatial locking concept which might be an improvement over classical locking mechanisms in both the CPU time for lock operations and the space required for maintaining the locking information.

During the work on this thesis we also became aware of the lack of a universal physical data model. While a broad variety of data models has been introduced on the conceptual and logical level (e.g., E/R model, relational model, object-oriented data models, multidimensional data models), no standardized physical data model has been established. Since for any logical model the data must be mapped to primary, secondary and tertiary storage, a universal physical data model would allow a single physical DBMS engine to implement any logical model. With appropriate mapping strategies a (time optimal, space optimal, etc.) physical model for a certain system configuration (logical DBMS schema, set of queries, set of access methods, hardware configuration) could be derived automatically (in analogy to the Auto Admin tool of Microsoft's SQL Server 7 [MS98a]). This might result in cost savings for DBMS application development, since a DBMS could offer views of the same physical data in several logical models. In addition, the cost of DBMS administration might be reduced, since performance tuning is separated from the application data model and can be performed solely on the physical model.


Time present and time past are both perhaps present in time future. And time future contained in time past. (T. S. Eliot)

References

[AFS93] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, LNCS 730, 1993, pp
[AGM+90] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A Basic Local Alignment Search Tool. Journal of Molecular Biology 5(3), 1990, pp
[AHU74] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[AM90] D.J. Abel and D.M. Mark. A comparative Analysis of some two-dimensional Orderings. Intl. Journal Geographical Information Systems, 4(), 1990, pp. -3.
[APB96] OLAP Council. OLAP Council APB- Benchmark Information. URL:
[ARS97] K. Alsabti, S. Ranka, and V. Singh. A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB, 1997, pp
[AS83] D.J. Abel and J.L. Smith. A Data Structure and Algorithm based on a linear Key for a Rectangle Retrieval Problem. Computer Vision 4, 1983, pp. -3.

210 94 REFERENCES [AS95] R. Agrawal an A.N. Swam. A One-Pass Space-Effcent Algorthm for Fnng Quantles. COMAD, 995. [Bau97] M. Bauer. A Mass Loang Tool for UB-Trees. Internshp Report, FORWISS München, 997. [Bau98] M. Bauer. Varable UB-Trees. Master Thess, TU München, 998. [Bay96] [Bay97a] [Bay97b] [BCE77] [BDK9] [Ben75] R. Bayer. The unversal B-Tree for multmensonal Inexng. Techncal Report TUM-I9637, Insttut für Informatk, TU München, 996. R. Bayer. The unversal B-Tree for multmensonal Inexng: General Concepts. Worl-We Computng an Its Applcatons '97 (WWCA '97). Tsukuba, Japan, 0-, Lecture Notes on Computer Scence, Sprnger Verlag, March, 997. R. Bayer. UB-Trees an UB-Cache A new Processng Paragm for Database Systems. Techncal Report TUM-I97, Insttut für Informatk, TU München, 997. M.W. Blasgen, R.G. Casey, an K.P. Eswaran. An Encong Metho for Multfel Sortng an Inexng. Comm. of ACM 0(), 977, pp F. Banclon, C. Delobel, an P. Kanellaks (es.). Bulng an object-orente Database System: The Story of O. Ason-Wesley, 99. J.L. Bentley. Multmensonal Bnary Search Trees Use for Assocatve Searchng. Comm. of ACM 8(9), 975, pp [Ben79] J.L. Bentley. Multmensonal Bnary Search Trees n Database Applcatons. IEEE TSE 5(4), 979, pp [BGW+8] [BK89] [BKK96] [BKK97] [BKS+90] [BKS93] [BM7] [BM98] [Böh98] [BSH+98] P.A. Bernsten, N. Gooman, E. Wong, C.L. Reeve, an J.B. Rothne. Query Processng n a System for Dstrbute Databases (SSD-). ACM TODS 6(4), 98, pp E. Bertno an W. Km. Inexng Technque for Queres on Neste Objects. IEEE Transactons on Knowlege an Data Engneerng, 989, pp S.D. Berchtol, D. Kem, an H.-P. Kregel. The X-Tree. An Inex Structure for hghmensonal Data. Proc. of n VLDB Conf., 996. S. Berchtol, D. Kem, an H.-P. Kregel. Usng Extene Feature Objects for Partal Smlarty Retreval. VLDB Journal Vol. 6(4), 997, pp N. Beckmann, H.-P. Kregel, R. Schneer, an B. Seeger. The R*-Tree. An effcent an robust Access Metho for Ponts an Rectangles. Proc. of ACM SIGMOD Conf., 990, pp T. Brnkhoff, H.-P. 
Kregel, an B. Seeger. Effcent Processng of Spatal Jons usng R-Trees. Proc. of ACM SIGMOD Conf., 993, pp R. Bayer an E. McCreght. Organzaton an Mantanance of large orere Inexes. Acta Informatca, 97, pp R. Bayer an V. Markl. The UB-Tree. Performance of Multmensonal Range Queres. Techncal Report TUM-I984, Insttut für Informatk, TU München, 998. C. Böhm. Effcently Inexng Hgh-Dmensonal Data Spaces. Ph.D. Thess, LMU München, 998. M. Blaschka, C. Sapa, G. Höflng, an B. Dnter. Fnng Your Way through Multmensonal Data Moels. Proc. Intl. Workshop on Data Warehouse Desgn an OLAP Technology, Venna, August 998.

211 REFERENCES 95 [BSW97] J. van en Bercken, B. Seeger, an P. Wmayer. A General Approach to Bulk Loang Multmensonal Inex Structures. Proc. of 3 r VLDB Conf., Athens, Greece, 997. [BU77] R. Bayer an K. Unterauer. Prefx B-Trees. ACM TODS (), 977, pp. -6. [CD97] [CHH+9] [CI98] S. Chauhur an U. Dayal. An Overvew of Data Warehousng an OLAP Technologes. ACM SIGMOD Recor 6(), Marc 997. J. Cheng, D. Haerle, R. Heges, B.R. Iyer, T. Messnger, C. Mohan an Y. Wang. An effcent hybr hash jon algorthm: a DB prototype. Proc. of the 7 th ICDE, Kobe, 99, pp C. Chan an Y. Ioanns. Btmap Inex Desgn an Evaluaton. Proc. of ACM SIGMOD Conf., 998. [Co70] E.F. Co. A Relatonal Moel of Data for Large Share Databanks. Comm. of ACM 3(6), 970, pp [Com79] D. Comer. The ubqutous B-Tree. ACM Computng Surveys, (), 979, pp [CVS98] CVS, a verson control system n the free software communty URL: [Dat88] C.J. Date. A Gue to the SQL Stanar, n eton. Ason-Wesley, 988. [Dev97] [DKO+85] [Ell8] [Eva94] B. Devln. Data Warehouse from Archtecture to Implementaton. Ason-Wesley Longman, Inc D.J. DeWtt, R.H. Katz, F. Olken, L.D. Shapro, M.R. Stonebraker, an D. Woo. Implementaton Technques for Man Memory Database Systems. Proc. of ACM SIGMOD Conf., 984, pp. -8. C. Ells. Extenble Hashng for Concurrent Operatons an Dstrbute Data. Proc. of ACM SIGMOD Conf., 98, pp G. Evangels. The hb π -Tree: A Concurrent an Recoverable Mult-Attrbute Inex Structure. Ph.D. thess, North-Eastern Unversty, Boston, MA, 994. [Fal85] C. Faloutsos. Mult-attrbute Hashng Usng Gray Coes. Proc. of ACM SIGMOD Conf., 985, pp [Fal88] [FB74] [Fen98] [FG86] [FG89] [FG96] [FK94] C. Faloutsos. Gray Coes for Partal Match an Range Queres. IEEE TSE 4(0), 988, pp R. Fnkel an J.L. Bentley. Qua-Trees: A Data Structure for Retreval of Composte Keys. Acta Informatca 4(), 974, pp. -9. R. Fenk. Desgn an Implementaton of a UB-Tree Range Query Algorthm for a Set of Query Boxes. Master Thess, Technsche Unverstät München, 998. J.C. Freytag an N. Gooman. 
Rule-Base Translaton of Relatonal Queres nto Iteratve Programs. Proc. of ACM SIGMOD Conf., 986, pp J.C. Freytag an N. Gooman. On the Translaton of Relatonal Queres nto Iteratve Programs. TODS 4(), 989, pp. -7. C. Faloutsos an V. Gaee. Analyss of n-mensonal Qua-Trees usng the Hausorff Fractal Dmenson. Proc. of n VLDB Conf., Bombay, Ina, 996, pp C. Faloutsos an I. Kamel. Beyon Unformty an Inepenence: Analyss of R-Trees usng the Concept of the Fractal Dmenson. Proc. of ACM SIGMOD-PODS Conf., Mnneapols, Mnnesota, 994, pp. 4-3.

212 96 REFERENCES [FNP+79] R. Fagn, J. Nevergelt, N. Pppenger, an H. R. Strong. Extenble Hashng A Fast Access Metho for Dynamc Fles. ACM TODS 4(3), 979, pp [FR89] C. Faloutsos an S. Roseman. Fractals for Seconary Key Retreval. Proc. of 8 th ACM SIGMOD-PODS Conf., 989, pp [Fre87] [Fre95] [Fre89] [Fr97] [FSG+98] [FSR87] [Gae95] [Gar8] [GBL+96] M. Freeston. The BANG Fle: A new Kn of Gr Fle. Proc. of ACM SIGMOD Conf., San Francsco, CA, 987, pp M. Freeston. A General Soluton of the n-mensonal B-Tree Problem. Proc. of the ACM SIGMOD Conf., 995, pp J.C. Freytag. The Basc Prncples of Query Optmzaton n Relatonal Database Management Systems. IFIP, 989, pp N. Frelnghaus: Evaluerung er Ensatzfähgket es UB-Baums für as SAP-System R/3. Master Thess, TU München, 997. M. Fang, N. Shvakumar, H. Garca-Molna, R. Motwan, an J. Ullman. Computng Iceberg Queres Effcently. Proc. of VLDB Conf., 998, pp C. Faloutsos, T. Sells an N. Roussopoulos. Analyss of Object-Orente Spatal Access Methos. Proc. of ACM SIGMOD Conf., 987. V. Gaee. Optmal Reunancy n Spatal Database Systems. Proc. SSD 95, Portlan, LNCS Volume 95, 995, pp I. Gargantn. An effectve Way to Represent Qua-Trees. Comm. of ACM 5(), 98, pp J. Gray, A. Bosworth, A. Layman, H. Prahesh: Data Cube: A Relatonal Aggregaton Operator Generalzng Group-By, Cross-Tab, an Sub-Total. Proc. of ICDE 996, pp [GG97] V. Gaee an O. Günther. Multmensonal Access Methos. ACM Computng Surveys 30(), 997. [GHR+97] [GR97] [Gra9] H. Gupta, V. Harnarayan, A. Rajaraman, an D. Ullman. Inex Selecton for OLAP. Proc. of ICDE, 997. J. Gray an A. Reuter. Transacton Processng: Concepts an Technques. 3 r eton. Morgan Kaufmann Publshers, 997. J. Gray. The Benchmark Hanbook for Database an Transacton Processng Systems. Morgan Kaufmann, San Mateo, CA, 99. [Gra93] G. Graefe. Query Evaluaton Technques for Large Databases. ACM Computng Surveys 5, 993, pp [Gre89] D. Greene. An Implementaton an Performance Analyss of Spatal Data Access Methos. Proc. of 5 th ICDE, 989. 
[Gün93] O. Günther. Effcent Computatons of Spatal Jons. Proc. of 9th ICDE, Venna, 993. [Gup97] [Gut84] [HAM+97] H. Gupta. Selecton of Vews to Materalze n a Data Warehouse. Proc. of the Intl. Conference on Database Theory, Athens, Greece, January 997. A. Guttman. R-Trees: A ynamc Inex Structure for spatal Searchng. Proc. of ACM SIGMOD Conf., 984, pp C.T. Ho, R. Agrawal, N. Mego, an R. Srkant. Range Queres n OLAP Data Cubes. Proc. of ACM SIGMOD Conf., Tucson, Arzona, 997, pp

213 REFERENCES 97 [Hl9] D. Hlbert. Über e stetge Abblung ener Lne auf en Flächenstück. Math. Annln., 38, 89, pp [Hn85] K. Hnrchs. Implementaton of the Gr Fle: Desgn Concepts an Experence. BIT 5, 985, pp [HNK+90] L. Haraa, M. Nakano, M. Ktsuregawa, an M. Takag. Query Processng Methos for Mult- Attrbute Clustere Relatons. Proc. of 6th VLDB Conf., 990, pp [HR96] E.P. Harrs an K. Ramamohanarao. Jon algorthm costs revste. VLDB Journal, 5, 996. [HSW88a] [HSW88b] [HSW89] [IBM97] [Inf97] A. Hutflesz, H.-W. Sx, an P. Wmayer. Globally Orer Preservng Multmensonal Lnear Hashng. Proc. of 4 th ICDE, 988, pp A. Hutflesz, H.-W. Sx, an P. Wmayer. Twn Gr Fles: Space Optmzng Access Schemes. Proc. of ACM SIGMOD Conf., 988. A. Hutflesz, H.-W. Sx, an P. Wmayer. The LSD-Tree: Spatal Access to Multmensonal Pont an non-pont Objects. Proc. of 5 th VLDB Conf., Amsteram, Netherlans, 989, pp IBM Corporaton. IBM DB Unversal Database for UNIX Documentaton. IBM Corporaton, 997. Informx Software Incorporaton. A New Generaton of Decson Support Inexng for Enterprse-we Data Warehouses URL: [Inf98] Informx Corporaton. Informx Documentaton: Dynamc Server 7. an Later., 998. [Inm96] W.H. Inmon. Bulng the Data Warehouse. John Wley & Sons, Inc., n eton, 996. [Jag89] [Jag90] H.V. Jagash. Incorporatng Herarchy n a Relatonal Moel of Data. Proc. of ACM SIGMOD Conf., Portlan, Oregon, 989. H.V. Jagash. Lnear Clusterng of Objects wth multple Attrbutes. Proc. of ACM SIGMOD Conf., 990, pp [Jag9] H.V. Jagash. A Retreval Technque for Smlar Shapes. Proc. of ACM SIGMOD Conf., 99, pp [JC85] [JK84] [JL98] [KE97] [KF94] [Km96] R. Jan an I. Chlamtac. The P² Algorthm for Dynamc Calculaton of Quantles an Hstograms Wthout Storng Observatons. Comm. of ACM 8(0), 985, pp M. Jarke an J. Koch. Query Optmzaton n Database Systems. ACM Computng Surveys 6(), 984, pp. -5. M. Jürgens an H.J. Lenz. The R*a-Tree: An mprove R*-Tree wth Materalze Data for Supportng Range Queres on OLAP-Data. DWDOT Workshop, Venna, 998. A. Kemper, A. Eckler. 
Datenbanksysteme : Ene Enführung.. aktual. u. erw. Aufl., Olenbourg Verlag, München, 997, p. 504 I. Kamel an C. Faloutsos. Hlbert R-Tree: An Improve R-Tree usng Fractals. Proc. of 0 th VLDB Conf., 994, pp R. Kmball. The Data Warehouse Toolkt : PRACTICAL TECHNIQUES FOR BUILDING DIMENSIONAL DATA WAREHOUSES. JOHN WILEY & SONS, INC., New York, 996, p [Km96] R. Kmball. The Data Warehouse Toolkt. John Wley & Sons, New York. 996.

214 98 REFERENCES [KKD89] [Knu68] [Knu73] [KP88] [Kr84] [KS87] [KSF+96] [LS90] W. Km, C. Km, an A. Dale. Inexng technques for object-orente atabase. In Object- Orente Concepts, Databases an Applcatons. Ason-Wesley, 989, pp D.E. Knuth. The Art of Computer Programmng Volume : Funamental Algorthms. Ason- Wesley, Reang, MA, 968. D.E. Knuth. The Art of Computer Programmng Volume 3: Sortng an Searchng. Ason- Wesley, Reang, MA, 973. M.H. Km an S. Pramank. Optmal Fle Dstrbuton for Partal Match Retreval. Proc. of ACM SIGMOD Conf., Boston, NA, 984, pp H.-P. Kregel. Performance Comparson of Inex Structures for Mult-Key Retreval. Proc. of ACM SIGMOD Conf., Boston, MA, 984, pp H.-P. Kregel an B. Seeger. Multmensonal Dynamc Quantle Hashng s Very Effcent for Non-Unform Recor Dstrbutons. ICDE, pp. 0-7, 987. F. Korn, N. Sropoulos, C. Faloutsos, E. Segel, an Z. Protopapas. Fast Nearest Neghbor Search n Mecal Image Databases. Proc. of n VLDB Conf., Bombay, Ina, 996, pp D. Lomet an B. Salzberg. The hb-tree: A Multattrbute Inexng Metho wth goo guarantee Performance. ACM TODS, 5(4), 990, pp [Lum70] Y.V. Lum. Mult-Attrbute Retreval wth Combne nexes. Comm. of ACM 3(), November 970, pp [MB97a] [MB97b] [MB97c] [MB98] [McG96] [ME9] [Mer8] [MHW+90] V. Markl. The Tetrs-Algorthm for multmensonal Sorte Reang from UB-Trees. Internal Report, FORWISS München, 997. V. Markl an R. Bayer. A Cost Moel for multmensonal Queres n Relatonal Database Systems. Internal Report, FORWISS München, 997. V. Markl an R. Bayer. Varable UB-Trees for the effcent Inexng of arbtrarly strbute multmensonal Data. Internal Report, FORWISS München, 997. V. Markl an R. Bayer. The Tetrs-Algorthm for Sorte Reang from UB-Trees. In Grunlagen von Datenbanken, 0th GI Workshop, Konstanz, 998. F. McGuff. Data Moelng Patterns for Data Warehouses. Comprehensve Systems Inc., July 996. P. Mshra an M.H. Ech. Jon Processng n Relatonal Databases. ACM Computng Surveys, Vol. 4 No., 99, pp T.H. Merret. 
Why Sort-Merge gves the best Implementaton of the Natural Jon. ACM SIGMOD Recor 3(), 98, pp C. Mohan, D. Haerle, Y. Wang, an J. Cheng. Sngle Table Access Usng Multple Inexes: Optmzaton, Executon an Concurrency Control Technques. Internatonal Conference on Extenng Database Technology, 990, pp [MOD99] MISTRAL Onlne Documentaton. TU München, 999. [Moe98] URL: G. Moerkotte. Small Materalze Aggregates: A Lght Weght Inex Structure for Data Warehousng. Proc. of 4th VLDB Conf., New York, USA, 998.

215 REFERENCES 99 [MS98a] Mcrosoft Corporaton. SQL Server URL: [MS98b] Mcrosoft Corporaton. OLE DB for OLAP Programmer s Reference. February 998. [Mul7] [MZB99] [NHS84] URL: J.K. Mulln. Retreval-Upate Spee Traeoffs Usng Combne Inexes. Comm. of ACM 4(), Dec. 97, pp V. Markl, M. Zrkel, an R. Bayer. Processng Operatons wth Restrctons n Relatonal Database Management Systems wthout external Sortng. Proc. of ICDE, Syney, Australa, 999. J. Nevergelt, H. Hnterberger, an K. C. Sevck. The Gr-Fle. ACM TODS, 9(), March 984, pp [OLA98] The OLAP Councl. MDAPI TM The OLAP Applcaton Program Interface Verson.0 Specfcaton. January 998. [OM84] [OQ97] J. A. Orensten an T.H. Merret. A Class of Data Structures for Assocate Searchng. Proc. of ACM SIGMOD-PODS Conf., Portlan, Oregon, 984, pp P. O Nell an D. Quass. Improve Query Performance wth Varant Inexes. Proc. of ACM SIGMOD Conf., Tucson, Arzona,997, pp [Ora97] Oracle Corporaton. Oracle 8 Documentaton. Oracle Corporaton, 997. [Ore89] J. Orensten. Reunancy n spatal Databases. Proc. of ACM SIGMOD Conf., 989, pp [Ova99] D. Ovaya. Portng the UB-Tree to DB. Internshp Report, FORWISS München, 999. [PDF+98] [Pfa99] [PGK88] [PH90] P. Baumann, A. Dehmel, P. Furtao, R. Rtsch, an N. Wmann. The Multmensonal Database System RasDaMan. Proc. of ACM SIGMOD Conf., 998, pp M. Pfaenhauer. Portng the UB-Tree to Informx Unversal Server. Internshp Report, FORWISS München, 999. D.A. Patterson, G.A. Gbson, an R.H. Katz. A Case for Reunant Arrays of Inexpensve Dsks (RAID). Proc. of ACM SIGMOD Conf., 988, pp D.A. Patterson an J.L. Hennessy. Computer archtecture: a quanttatve approach. Kaufmann, San Mateo, Calf., 990. [Pe97] R. Pernger. Portng the UB-Tree to Oracle. Internshp Report, TU München, 997. [Pe98] [PIH+96] [Poo97] [PS85] [PST+93] R. Pernger. Evaluaton of the UB-Tree n the SAP Envronment. Master Thess, TU München, 998. V. Poosala, Y.E. Ioanns, P.J. Haas, an E.J. Shekta. Improve Hstograms for Selectvty Estmaton of Range Precates. Proc. 
of ACM SIGMOD Conf., 1996.
V. Poosala. Histogram-Based Estimation Techniques in Database Systems. Ph.D. thesis, Univ. of Wisconsin-Madison, 1997.
F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, 1985.
B.U. Pagel, H.W. Six, H. Toben, and P. Widmayer. Towards an Analysis of Range Query Performance in Spatial Data Structures. Proc. of 12th ACM SIGMOD-PODS Conf., Washington D.C., 1993, pp. 214-221.
[Red97] Redbrick Systems. Star Schema Processing for Complex Queries.
[Rit89] G. Ritter. Information Processing 89, Proc. of IFIP, World Computer Congress, San Francisco, USA, August 28 - September 1, 1989.
[Rob81] J.T. Robinson. The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proc. of ACM SIGMOD Conf., 1981.
[Rot91] D. Rotem. Spatial Join Indices. Proc. of ICDE, 1991.
[Sag94] H. Sagan. Space Filling Curves. Springer Verlag, Berlin/Heidelberg/New York, 1994.
[Sal88] B. Salzberg. File Structures: An Analytic Approach. Prentice Hall, 1988.
[Sam90] H. Samet. The Design and Analysis of Spatial Data Structures. Addison Wesley, 1990.
[SAP99] SAP AG. SAP Homepage. 1999.
[Sar97] S. Sarawagi. Indexing OLAP Data. Data Engineering Bulletin 20 (1), 1997.
[Sch98] M. Schramm. UB-Tree Insertion and Deletion Functions. Internship Report, TU München, 1998.
[SDN+96] A. Shukla, P. Deshpande, J. Naughton, and K. Ramasamy. Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. Proc. of 22nd VLDB Conf., Mumbai (Bombay), India, 1996.
[SDN+98] A. Shukla, P. Deshpande, and J. Naughton. Materialized View Selection for Multidimensional Datasets. Proc. of ACM SIGMOD Conf., 1998.
[SH94] H. Shawney and J. Hafner. Efficient Color Histogram Indexing. Proc. Int. Conf. on Image Processing, 1994.
[SK90] B. Seeger and H.-P. Kriegel. The Buddy Tree: An Efficient and Robust Access Methods for Spatial Database Systems. Proc. of 14th VLDB Conf., 1988.
[Spe9] G. Specht. LOLA, LDL, NAIL!, RDL, ADITI and STARBURST: A Comparison of Deductive Database Systems. Institut für Informatik, TU-München 99.
[SRF87] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. Proc. of 13th VLDB Conf., Brighton, England, 1987.
[Sto94] M. Stonebraker (ed.). Readings in Database Systems, 2nd edition. Morgan Kaufmann, 1994.
[Str99] M. Strechsber. Implementation and Analysis of the Spiral Algorithm for Nearest-Neighbor Queries with UB-Trees. Master Thesis, TU München, 1999.
[Tam82] M. Tamminen. The extendible cell method for closest point problems. BIT, 1982, pp. 27-41.
[TAS98] TransAction Software GmbH. TransBase Relational Database System Version 4.3, Manual. Transaction Software GmbH, München, Germany, 1998.
[TH81] H. Tropf and H. Herzog. Multidimensional Range Search in Dynamically Balanced Trees. Angewandte Informatik, 2/1981.
[TPC97] Transaction Processing Performance Council. TPC Benchmark D (Decision Support). Standard Specification, Revision 1.2.3. June 1997.
[Ull88] J.D. Ullman. Database and Knowledge Base Systems Volume I. Computer Science Press, Rockville, MD, 1988.
[Ull89] J.D. Ullman. Database and Knowledge Base Systems Volume II. Computer Science Press, Rockville, MD, 1989.
[WB97] M.C. Wu and A.P. Buchmann. Research Issues in Data Warehousing. BTW.
[WB98] M.C. Wu and A.P. Buchmann. Encoded Bitmap Indexing for Data Warehouses. Proc. of ICDE, Orlando, 1998.
[Wid95] J. Widom. Research Problems in Data Warehousing. Proc. of 4th CIKM, November 1995.
[WSB98] R. Weber, H.J. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. Proc. of VLDB Conf., New York, 1998.
[WZ98] R. Wunderling and M. Zöckler. DOC++, a free source code documentation tool.
[ZDN+98] Y. Zhao, R. Deshpande, J. Naughton, and A. Shukla. Simultaneous Optimization and Evaluation of Multiple Dimensional Queries. Proc. of ACM SIGMOD Conf., 1998.
[ZSL98] C. Zou, B. Salzberg, and R. Ladin. Back to the Future: Dynamic Hierarchical Clustering. Proc. of ICDE, 1998.
[Zwi96] D. Zwillinger (ed.). Standard Mathematical Tables and Formulae. 30th edition, CRC Press, 1996.


INDEXES

List of Figures
Figure 2-1: Query Categories
Figure 2-2: Schema of the TPC-D Benchmark
Figure 2-3: Partial match queries (a, b) and exact match queries (c, d)
Figure 2-4: Range queries (a, b) and partial range queries (c, d)
Figure 2-5: Sets of query boxes (a, b) and arbitrary query spaces (c, d)
Figure 2-6: Nearest neighbor query
Figure 2-7: Joins and range queries in multidimensional space
Figure 2-8: A hard disk
Figure 2-9: Clustering and dimensionality
Figure 2-10: Theoretical performance of random access, tuple clustering and page clustering
Figure 2-11: Using multiple B-Trees for answering range queries
Figure 3-1: Space filling curves
Figure 3-2: Distance shrinking and distance enlargement
Figure 3-3: C-values, Z-values and H-values
Figure 3-4: Relative degree of the symmetry σ_z/σ_c
Figure 3-5: C-regions and Z-regions
Figure 3-6: Z-regions and query spaces
Figure 3-7: Connected and disconnected Z-regions
Figure 3-8: Z-areas and Z-regions
Figure 3-9: Address representation
Figure 3-10: Uniformly and Gaussian distributed data
Figure 3-11: Idealized uniform partitioning
Figure 3-12: Dependent dimensions
Figure 3-13: Split depth per dimension for idealized uniformly distributed regions
Figure 3-14: Utilization of the multidimensional space
Figure 4-1: The UB-Tree
Figure 4-2: Insertion into UB-Trees
Figure 4-3: Processing a range query
Figure 4-4: Zoom into Figure
Figure 4-5: Transforming a query box into a set of Z-intervals
Figure 4-6: Three nearest neighbor queries
Figure 4-7: Sorted reading with the Tetris algorithm
Figure 4-8: Reading a query box in sort order
Figure 4-9: Z-Ordering / Tetris Ordering
Figure 5-1: Architecture of the pilot implementation
Figure 5-2: Cache requirements and UB-Tree secondary index table
Figure 5-3: CPU time to calculate Z(x) and Z^-1(α)
Figure 5-4: Dealing with arbitrary data types
Figure 5-5: Example hierarchy in member set representation
Figure 5-6: Part of a hierarchy
Figure 5-7: Complex hierarchy graphs
Figure 5-8: Slowly changing dimensions
Figure 5-9: Processing a query box in sort order with the Tetris algorithm
Figure 5-10: Calculation of split points for various data distributions
Figure 5-11: UB-Tree and VUB-Tree for Gaussian/uniform data distribution
Figure 5-12: UB-Tree and VUB-Tree for Gaussian data distribution
Figure 5-13: UB-Tree and VUB-Tree for centered Gaussian data distribution
Figure 5-14: Splitting Z-regions with ε 0%, ε % and ε 40%
Figure 5-15: Implementation of the point query algorithm
Figure 6-1: The cost function for perfect idealized uniform partitioning
Figure 6-2: The cost function for probabilistic uniform partitioning
Figure 6-3: Cost function and reality
Figure 6-4: ε-split and cost function
Figure 6-5: Bitmap index intersection
Figure 6-6: Access methods and clustering
Figure 6-7: Cost functions for retrieval of a multidimensional interval
Figure 6-8: prefetch(T, P, s_1, ..., s_d) and cluster(T, P, s_1, ..., s_d)
Figure 6-9: Simulation of a four dimensional range query
Figure 6-10: Simulation of sorting a four dimensional query box
Figure 6-11: Temporary storage required for the sorter cache (simulation)
Figure 7-1: Insert performance
Figure 7-2: Time distribution for UB-Tree insertion
Figure 7-3: Range queries in sparsely (a) and densely (b) populated parts of a universe
Figure 7-4: Query box volumes
Figure 7-5: Range queries and scalability
Figure 7-6: Characteristics of a c%-measurement series
Figure 7-7: Characteristics of a cube measurement series
Figure 7-8: DBMS
Figure 7-9: DBMS
Figure 7-10: Varying the number of restricted dimensions for 35% restrictions (DBMS)
Figure 7-11: Varying the number of restricted dimensions for cube restrictions (DBMS)
Figure 8-1: Transformation rules for the implementation of algebraic operations
Figure 8-2: Query Q3 of the TPC-D benchmark
Figure 8-3: Operator trees for Q
Figure 8-4: Processing Q3 with the Tetris algorithm
Figure 8-5: Response times and temporary storage for sorting 50 % of LINEITEM for Q
Figure 8-6: Q4 of the TPC-D benchmark
Figure 8-7: Processing Q
Figure 8-8: Response times and temporary storage for sorting 3.5% of ORDER for Q
Figure 8-9: Query Q6 of the TPC-D benchmark
Figure 8-10: Processing Q6 with UB-Tree, IOT and FTS
Figure 8-11: Performance of Q
Figure 8-12: Hierarchies in the Juice & More schema and the corresponding star schema
Figure 8-13: Time Interval (Q)
Figure 8-14: Distribution cost (Q)
Figure 8-15: Partial match query in the first hierarchy level (Q3)
Figure 8-16: Data distribution of the time dimension
Figure 8-17: Data distribution of the product dimension
Figure 8-18: Data distribution of the first two hierarchy levels of the product dimension
Figure 8-19: Data distribution of the customer dimension
Figure 8-20: Data distribution of the distribution dimension
Figure 8-21: Compound surrogates for each dimension of Juice & More
Figure 8-22: Query response times and result set sizes
Figure 8-23: Juice & More performance

List of Tables
Table 2-1: Number of dimensions restricted to points, intervals or unrestricted for several types of partial range queries
Table 2-2: Indexes in present RDBMS
Table 3-1: Transformation from Z_i-addresses to Z_s-addresses
Table 5-1: Schema of a UB-Tree index table
Table 5-2: Prototype implementation concepts
Table 5-3: Two surrogate mappings for an enumeration type
Table 5-4: Heights of a Split Point Tree for various Database Sizes and Dimensions
Table 6-1: Split levels
Table 7-1: Table Sizes of a million 8 Byte tuples relation on DBMS (kB pages)
Table 7-2: Table Sizes of a 0 million 8 Byte tuples relation on DBMS (kB pages)
Table 7-3: Table Sizes of several tables on DBMS (8kB pages)
Table 7-4: Response times for 35%-measurements on DBMS with A as variable attribute
Table 7-5: Response times for cube measurements on DBMS
Table 7-6: Response times for 35%-measurements on DBMS with A as variable attribute
Table 7-7: Response times for cube measurements on DBMS
Table 8-1: Complexities of relational operators
Table 8-2: Interactive response times and cache sizes for sorting 50 % of LINEITEM
Table 8-3: Interactive response times and cache sizes for sorting 3.5% of ORDER
Table 8-4: Interactive response times for Q
Table 8-5: Restrictions in the first hierarchy level in 3 of 4 dimensions
Table 8-6: Restriction in the first two hierarchy levels in 3 of 4 dimensions
Table 8-7: Restricted hierarchies and selectivities for five queries against the Juice & More data warehouse

List of Definitions
Definition 2-1 (multidimensional domain, Ω)
Definition 2-2 (order of Ω)
Definition 2-3 (<-neighbors)
Definition 2-4 (distance of two values)
Definition 2-5 (scalar normalization)
Definition 2-6 (tuple normalization)
Definition 2-7 (one-dimensional intervals)
Definition 2-8 (multi-dimensional interval)
Definition 2-9 (volume of a linear interval)
Definition 2-10 (volume of a multidimensional interval)
Definition 2-11 (volume of a set of multidimensional intervals)
Definition 2-12 (average)
Definition 2-13 (standard deviation)
Definition 2-14 (region)
Definition 2-15 (page; page capacity)
Definition 2-16 (correspondence between pages and regions)
Definition 2-17 (region partitioning)
Definition 2-18 (query, result set)
Definition 2-19 (selectivity)
Definition 2-20 (restriction)
Definition 2-21 (restriction interval, query box)
Definition 2-22 (index candidate, index attribute, result attribute)
Definition 2-23 (tuple clustering; page clustering)
Definition 3-1 (space filling function)
Definition 3-2 (C-value, C-address)
Definition 3-4 (compound curve and Z-curve)
Definition 3-5 (Z-ordering)
Definition 3-6 (continuity)
Definition 3-7 (monotonicity)
Definition 3-8 (successor and predecessor of a value)
Definition 3-9 (successor and predecessor of a tuple in one dimension)
Definition 3-10 (neighbors in one dimension of Ω)
Definition 3-11 (average neighbor distance for a point in one dimension)
Definition 3-12 (bit of an expression)
Definition 3-13 ( )
Definition 3-14 (cumulated neighbor distance for one dimension)
Definition 3-15 (degree of symmetry of a space filling curve)
Definition 3-16 (C-Region)
Definition 3-17 (Z-Region)
Definition 3-18 (connection in space)
Definition 3-19 (Z-area; increment Z-address (Z_i-address))
Definition 3-20 (step and length of a Z_i-address)
Definition 3-21 (tuple, Z-address of a tuple)
Definition 3-22 (mapping between tuples and addresses)
Definition 3-23 (Z-region)
Definition 3-24 (number of steps for an attribute)
Definition 3-25 (length of a step)
Definition 3-26 (standard representation of a Z-address (Z_s-address))
Definition 3-27 (contribution of a dimension to a Z_s-address)
Definition 3-28 (address incrementation and decrementation)
Definition 3-29 (dependence and independence of dimensions)
Definition 3-30 (idealized uniform partitioning)
Definition 3-31 (linear dependency)
Definition 3-32 (sine dependency)
Definition 3-33 (total split depth; ideal split depth per dimension)
Definition 3-34 (actual domain of a dimension)
Definition 3-35 (partition of an actual domain)
Definition 3-36 (prefix length of a partition)
Definition 3-37 (actual split depth)
Definition 3-38 (normalized actual split depth per dimension and normalized actual split depth)
Definition 3-39 (puff pastry, puff pastry degree)
Definition 4-1 (UB-Tree)
Definition 4-2 (subcube of an address)
Definition 5-1 (lexicographic character order, lexicographic byte order, lexicographic bit order)
Definition 5-2 (transformation function)
Definition 5-3 (enumeration type)
Definition 5-4 (hierarchical dependence)
Definition 5-5 (hierarchical independence)
Definition 5-6 (compound surrogate)
Definition 5-7 (path through a hierarchy)
Definition 5-8 (variable UB-Tree, VUB-Tree)
Definition 5-9 (split point tree)
Definition 6-1 (perfect idealized uniform partitioning)
Definition 6-2 (semi-perfect and probabilistic idealized uniform partitioning)
Definition 6-3 (probabilistic dimension)
Definition 7-1 (range method measurement for an access method)
Definition 7-2 (measurement for a set of access methods)
Definition 7-3 (measurement series)
Definition 7-4 (c%-measurement series)
Definition 7-5 (cube measurement series)
Definition 8-1 (CPU-complexity, asymptotic CPU-complexity CPU(n))
Definition 8-2 (I/O-complexity, asymptotic I/O-complexity IO(n))
Definition 8-3 (SPACE-complexity, asymptotic SPACE-complexity SPACE(n))

List of Theorems and Lemmas
Theorem 3-1 (symmetry theorem): For practical values for {,..., 0} and r {,..., 0 9 } the Z-curve has a higher degree of symmetry than the compound curve.
Theorem 3-2 (connection theorem): Any Z-region consists of at most two spatially disconnected sets of points and such Z-regions exist.
Theorem 3-3 (mapping between tuples and addresses): There exists a one-to-one map between Cartesian coordinates of a tuple and Z-addresses.
Theorem 3-4 (isomorphism between Z_s-addresses and Z_i-addresses): For Z-addresses ε the standard representation and the increment representation of Z-addresses are isomorphic.
Theorem 4-1 (range query theorem): The range query algorithm transforms the multidimensional interval [[y, z]] into a set of one-dimensional address intervals {]α, β], ]β, γ], ...}, which with respect to the given region partitioning is the smallest cover for [[y, z]].
Lemma 2-1 (neighbors): Two points x, y ∈ Ω with x ≠ y are neighbors, if and only if, there exists an index i so that x_j = y_j for all j ∈ D\{i} and x_i and y_i are <-neighbors.
Lemma 2-2: A point x in d-dimensional space has at most 2d neighbors.
Lemma 2-3: The volume of the difference of two point-sets S and Q ⊆ S is the difference of the volumes of S and Q.
Lemma 3-1: A space filling function defines a one-dimensional ordering for a multidimensional space.
Lemma 3-2: For a ∈ {0, ..., 2^(sd) - 1} with the binary representation a = a_(sd-1)...a_0 the inverse function x = C^-1(a) is calculated as:
Lemma 3-3: For a ∈ {0, ..., 2^(sd) - 1} with the binary representation a = a_(sd-1)...a_0 the inverse function x = Z^-1(a) is calculated as:
Lemma 3-4: C(Ω) Z(Ω) {0,..., s -} R
Lemma 3-5: The C-curve creates the ordering A... A A on the multidimensional space Ω.
Lemma 3-6 (C-distance of two points): For two points x, y ∈ Ω with x A A,..., A y their distance on the C-curve is:
Lemma 3-7 (Z-distance of two points): For two points x, y ∈ Ω with x y their distance on the Z-curve is:
Lemma 3-8: The compound curve and the Z-curve are not continuous.
Lemma 3-9: The compound curve and the Z-curve are monotonous.
Lemma 3-10: For a point x the average distance of a neighbor on the C-curve with respect to dimension i is:
Lemma 3-11: For a point x the average distance of a neighbor on the Z-curve with respect to dimension i is:
Lemma 3-12: The average neighbor distance for the Z-curve for dimension i is:
Lemma 3-13: The average neighbor distance for the C-curve for dimension i is:
Lemma 3-14: A Z-region [α : β] covers the multidimensional interval defined by the boundaries Z^-1(α) and Z^-1(β) with Z^-1(α) Z^-1(β).
Lemma 3-15: A C-region [α : C β] covers the multidimensional interval defined by the boundaries C^-1(α) and C^-1(β).
Lemma 3-16: If for any > 0 α, the Z -addresses α α.... α + and α* α.....α. define the same area.
Lemma 3-17: The lexicographic ordering of Z-addresses (denoted by ) and set containment of areas in space (denoted by ) are isomorphic:
Lemma 3-18: Since the two maps are inverses of each other we get:
Lemma 3-19: For Z_s-addresses the contribution of an attribute is just the attribute itself.
Lemma 3-20 (domain of a step of a standard address): Z_s-addresses result in a subcube numbering between 0 and
Lemma 3-21 (length of a Z_s-address): The length in bits of Z_s-addresses is identical for each tuple of a given universe.
Lemma 3-22 (open regions and closed regions): ]α : β] [α : β]
Lemma 3-23: The ordering of Z-values of Section 3. is identical to the ordering defined by Algorithm
Lemma 3-24: If β_1...β_k is the bit-sequence of a Z_s-address numbered from left to right, the volume of the corresponding area is calculated as:
Lemma 3-25: If α_1...α_k and β_1...β_k are the bit sequences of the addresses α and β, the volume of the Z-region ]α : β] is calculated as:
Lemma 5-1: For positive integer numbers with a fixed length binary representation the bit lexicographic order on the binary representation is identical to the <-order on integer numbers.
Lemma 5-2: For character strings the bit lexicographic order on the binary representation is identical to the <-order on integer numbers.
Lemma 6-1: Each probabilistic idealized uniform partitioning has exactly one probabilistic dimension.

GLOSSARY

Index Term | Description | Symbol | See also
area | geometric volume | | Section ..3
arity | number of attributes of a relation | ,, R | Section .
BII | bitmap index intersection | | Section .4., Section 8.3.4, Section 6.3
bit interleaving | efficient implementation of the Z-address calculation | | Section 5.3.
C-address | index of a point on the compound curve | | Definition 3-2
cardinality | number of elements in a set | | Section ..
C-curve | synonym to compound curve | |
clustering | data that is likely to be accessed together is placed physically close to each other | | Section .3.
clustering, page | pages are stored in index order on physical storage | | Definition 2-23, Section .3..3, Section 6.
clustering, tuple | tuples are stored in index order on a page | | Definition 2-23, Section .3.., Section 6.
column | synonym to attribute | | Section .
compound curve | space filling curve created by concatenating the coordinates of the points in some fixed order | | Definition 3-4
compound ordering | linear ordering of the multidimensional space created by the C-curve | | Section .3., Example -6
coordinate | value of a point in one dimension | x, y, z | Section .
correspondence between pages and regions | | | Definition 2-16
CSI | composite secondary index (see also IOT) | | Section 7.4., Section 6.3, Section .3
C-value | synonym to C-address | |
dimension | | | Section .
dimensionality | synonym to arity | | Section .
distance | | | Section ..3
domain | | ,, Ω | Section ..
domain, actual | subset of a domain actually used in a relation | | Definition 3-34
domain, maximum value | | υ | Section ..
domain, minimum value | | λ | Section ..
domain, multidimensional | cross product of one-dimensional domains | Ω | Section ..
FTS | full table scan | | Section .3, Section .4, Section 7.4., Section 6.3
H-address | index of a point on the Hilbert curve | | Section 3.
H-curve | synonym to Hilbert curve | |
Hilbert curve | space filling curve created by the leitmotif of D. Hilbert [Hil91] | | Section 3.
H-value | synonym to H-address | |
idealized uniform partitioning | special region distribution of a multidimensionally partitioned universe | | Section 3.7.3, Section 6.
index candidate | attribute restricted or sorted during the processing of a query | | Definition 2-22
interval, multidimensional | | [[x, y]], ]]x, y]], [[x, y[[, ]]x, y[[ | Definition 2-8
interval, one-dimensional | | [a, b], ]a, b], [a, b[, ]a, b[ | Definition 2-7
IOT | index organized table, clustering B*-Tree | | Section 7.4., Section .4.
Lebesgue curve | synonym to Z-curve | |
length | volume | vol | Section ..3
lexicographic ordering | synonym to compound ordering | | Section .3., Example -6
MSI | multiple secondary index intersection (see also BII) | | Section .4., Section 7.4., Section 6.3
Multidimensional Hierarchical Clustering | encoding technique for multidimensional hierarchies | | Section 5.3.4
neighbor of a point | point which differs from a given point in only one coordinate, where the difference in this coordinate is the limit of resolution | | Section ..
normalization, scalar | | â | Definition 2-5
normalization, tuple | | ^ | Definition 2-6
page | byte container or (ordered) set storing tuples of a relation | | Definition 2-15
point, d-dimensional | vector of coordinates which defines a location in d-dimensional space | x, y, z | Section .
predecessor, point | | | Definition 3-9
predecessor, value | | | Definition 3-8
puff pastry degree | degree of unsymmetry of a Z-ordered space | | Section 3.0
query | | | Definition 2-18
query box | | Q | Definition 2-21
query, arbitrary volume | | | Section ..3
query, exact match | | | Section ..
query, partial match | | | Section ..
query, partial range | | | Section ..
query, range | | | Section ..
region | subspace of Ω | | Definition 2-14
relation | set of tuples | R, S | Section .
relation, base space | multidimensional domain of a relation | | Section ..
relation, multidimensionally partitioned | set of tuples stored on a set of pages, where each page corresponds to a region in multidimensional space | P_R, P_S | Section ..5, Definition 2-17
relation, partitioned | set of tuples stored on a set of pages | P_R, P_S | Section ..5, Definition 2-17
restriction | logical predicate defined by a query | | Definition 2-20
restriction interval | subspace defined by a query | | Definition 2-21
result attribute | attribute projected into the result set of a query | | Definition 2-22
result set | tuples satisfying a query condition | | Definition 2-18
row | synonym to tuple | | Section .
selectivity | percentage of tuples in the database satisfying a restriction | | Definition 2-19
spiral algorithm | algorithm to process nearest neighbor queries on multidimensionally partitioned data spaces | | Section 4.6
SSI | single secondary index | | Section 6.3, Section .3
successor, point | | | Definition 3-9
successor, value | | | Definition 3-8
table | synonym to relation | R, S | Section .
Tetris algorithm | algorithm to process sort operations while processing multi-attribute restrictions on multidimensionally partitioned data spaces | | Section 4.7, Section 6.4
tuple, d-dimensional | | x, y, z | Section ..
type | type of an attribute | | Section .
UB-Tree | multidimensional access method based on Z-ordering | | Definition 4-1
variable UB-Tree | encoding technique for independent, arbitrarily distributed dimensions | | Section
volume | percentage of space covered by a subspace compared to the entire multidimensional space | vol | Section ..3
Z-address | index of a point on the Z-curve | | Definition 3-3, Definition 3-19, Definition 3-20
Z-area | subspace of the multidimensional space constructed by a Z-address | Λ_1, Λ_2, Λ_3, ... | Definition 3-19
Z-curve | space filling curve created by bit-interleaving of the coordinates of the points | | Definition 3-4
Z-ordering | linear ordering of multidimensional space created by the Z-curve | | Definition 3-5
Z-region (closed) | subspace covered by the part of the Z-curve starting with Z-address α and ending with Z-address β | [α : β] | Definition 3-17, Section 3.6
Z-region (open) | subspace covered by the part of the Z-curve starting with Z-address α and ending with Z-address β, where α is not included in the region | ]α : β] | Definition 3-23
Z-value | synonym to Z-address | |
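
The glossary entries for bit interleaving, Z-address and Z-curve can be illustrated with a minimal sketch. It assumes non-negative integer coordinates that all have the same fixed bit length. The names z_address and z_inverse, and the choice that the first coordinate contributes the most significant bit of each interleaving step, are assumptions made for this illustration and are not taken from the thesis.

```python
def z_address(point, bits):
    """Interleave the bits of all coordinates into one integer.

    Assumption: within each step the first coordinate supplies the
    most significant bit; other conventions are equally possible.
    """
    z = 0
    for bit in range(bits - 1, -1, -1):        # most significant bit first
        for coord in point:                    # one bit per dimension per step
            z = (z << 1) | ((coord >> bit) & 1)
    return z


def z_inverse(z, dims, bits):
    """Recover the coordinates from an address built by z_address."""
    coords = [0] * dims
    for bit in range(bits - 1, -1, -1):
        for i in range(dims):
            # position of coordinate i's bit for this step inside z
            shift = bit * dims + (dims - 1 - i)
            coords[i] = (coords[i] << 1) | ((z >> shift) & 1)
    return tuple(coords)


# Two dimensions, 3 bits each: x = 3 (011), y = 5 (101)
# interleaved as x2 y2 x1 y1 x0 y0 = 011011 = 27
example = z_address((3, 5), 3)
```

Under these assumptions every point of an 8 x 8 universe maps to a distinct address between 0 and 63, and z_inverse returns the original coordinates, which matches the one-to-one mapping between tuples and Z-addresses stated in the list of theorems.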


An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

On the computation of the capital multiplier in the Fortis Credit Economic Capital model

On the computation of the capital multiplier in the Fortis Credit Economic Capital model On the computaton of the captal multpler n the Forts Cret Economc Captal moel Jan Dhaene 1, Steven Vuffel 2, Marc Goovaerts 1, Ruben Oleslagers 3 Robert Koch 3 Abstract One of the key parameters n the

More information

+ + + - - This circuit than can be reduced to a planar circuit

+ + + - - This circuit than can be reduced to a planar circuit MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to

More information
