The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree


The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree

Lars Arge, Department of Computer Science, Duke University, Durham, NC, USA
Mark de Berg, Department of Computer Science, TU Eindhoven, Eindhoven, The Netherlands
Herman J. Haverkort, Institute of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
Ke Yi, Department of Computer Science, Duke University, Durham, NC, USA

Supported in part by the National Science Foundation through RI grant EIA, CAREER grant CCR, ITR grant EIA, and U.S.-Germany Cooperative Research Program grant INT. Supported by the Netherlands Organization for Scientific Research (NWO).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2004, June 13-18, 2004, Paris, France. Copyright 2004 ACM /04/06 ... $5.00.

ABSTRACT
We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((N/B)^(1-1/d) + T/B) I/Os, where N is the number of d-dimensional (hyper-) rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other R-tree variants, where a query may visit all N/B leaves in the tree even when T = 0. We also present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. This study shows that the PR-tree performs similarly to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data.

1. INTRODUCTION
Spatial data arise naturally in numerous applications, including geographical information systems, computer-aided design, computer vision, and robotics. Therefore spatial database systems designed to store, manage, and manipulate spatial data have received considerable attention over the years. Since these databases often involve massive datasets, disk-based index structures for spatial data have been researched extensively; see e.g. the survey by Gaede and Günther [11]. Especially the R-tree [13] and its numerous variants (see e.g. the recent survey by Manolopoulos et al. [19]) have emerged as practically efficient indexing methods. In this paper we present the Priority R-tree, or PR-tree, which is the first R-tree variant that is not only practically efficient but also provably asymptotically optimal.

1.1 Background and previous results
Since objects stored in a spatial database can be rather complex, they are often approximated by simpler objects, and spatial indexes are then built on these approximations. The most commonly used approximation is the minimal bounding box: the smallest axis-parallel (hyper-) rectangle that contains the object. The R-tree, originally proposed by Guttman [13], is an index for such rectangles. It is a height-balanced multi-way tree similar to a B-tree [5, 9], where each node (except for the root) has degree Θ(B). Each leaf contains Θ(B) data rectangles (each possibly with a pointer to the original data) and all leaves are on the same level of the tree; each internal node v contains pointers to its Θ(B) children, as well as, for each child, a minimal bounding box covering all rectangles in the leaves of the subtree rooted in that child. If B is the number of rectangles that fit in a disk block, an R-tree on N rectangles occupies Θ(N/B) disk blocks and has height Θ(log_B N). Many types of queries can be answered efficiently using an R-tree, including the common query called a window query: given a query rectangle Q, retrieve all rectangles that intersect Q.
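A window query of this kind can be answered by a simple recursive traversal of the tree. The sketch below is a minimal in-memory illustration, not the paper's implementation: the `Node` class is a hypothetical stand-in for disk blocks, and rectangles are flat `(xmin, ymin, xmax, ymax)` tuples.

```python
class Node:
    """Hypothetical in-memory R-tree node; a stand-in for a disk block."""
    def __init__(self, is_leaf, entries=None, rectangles=None):
        self.is_leaf = is_leaf
        self.entries = entries or []        # (bounding_box, child) pairs
        self.rectangles = rectangles or []  # data rectangles in a leaf

def intersects(a, b):
    """True if the axis-parallel rectangles a and b share at least one point."""
    return (a[0] <= b[2] and b[0] <= a[2] and
            a[1] <= b[3] and b[1] <= a[3])

def window_query(node, q, out):
    """Recursively visit every subtree whose minimal bounding box meets q."""
    if node.is_leaf:
        out.extend(r for r in node.rectangles if intersects(r, q))
    else:
        for bbox, child in node.entries:
            if intersects(bbox, q):
                window_query(child, q, out)
```

The I/O cost of this traversal is exactly the number of nodes whose bounding boxes intersect Q, which is why bounding-box overlap governs query performance.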
To answer such a query we simply start at the root of the R-tree and recursively visit all nodes with minimal bounding boxes intersecting Q; when encountering a leaf l we report all data rectangles in l intersecting Q. Guttman gave several algorithms for updating an R-tree in O(log_B N) I/Os using B-tree-like algorithms [13]. Since there is no unique R-tree for a given dataset, and because the window query performance intuitively depends on the amount of overlap between minimal bounding boxes in the nodes of the tree, it is natural to try to minimize bounding box overlap during updates. This has led to the development of many heuristic update algorithms; see for example [6, 16,
23] or refer to the surveys in [11, 19]. Several specialized algorithms for bulk-loading an R-tree have also been developed [7, 10, 12, 15, 18, 22]. Most of these algorithms use O((N/B) log_{M/B}(N/B)) I/Os (the number of I/Os needed to sort N elements), where M is the number of rectangles that fit in main memory, which is much less than the O(N log_B N) I/Os needed to build the index by repeated insertion. Furthermore, they typically produce R-trees with better space utilization and query performance than R-trees built using repeated insertion. For example, while experimental results have shown that the average space utilization of dynamically maintained R-trees is between 50% and 70% [6], most bulk-loading algorithms are capable of obtaining over 95% space utilization. After bulk-loading, an R-tree can of course be updated using the standard R-tree updating algorithms. However, in that case its query efficiency and space utilization may degenerate over time. One common class of R-tree bulk-loading algorithms works by sorting the rectangles according to some global one-dimensional criterion, placing them in the leaves in that order, and then building the rest of the index bottom-up, level by level [10, 15, 18]. In two dimensions, the so-called packed Hilbert R-tree of Kamel and Faloutsos [15], which sorts the rectangles according to the Hilbert values of their centers, has been shown to be especially query-efficient in practice. The Hilbert value of a point p is the length of the fractal Hilbert space-filling curve from the origin to p. The Hilbert curve is very good at clustering spatially close rectangles together, leading to a good index.
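The packed Hilbert bulk-loading just described can be sketched in a few lines. The Hilbert-index function below is the standard iterative bit-manipulation formulation (not taken from the paper); the grid resolution `n` and the assumption that coordinates lie in [0, 1] are illustrative choices.

```python
def hilbert_d(n, x, y):
    """Distance of cell (x, y) along the Hilbert curve on an n-by-n grid
    (n a power of two); the classic iterative rotate-and-accumulate scheme."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant so the
            if rx == 1:                  # base pattern recurses correctly
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def pack_hilbert_leaves(rects, B, n=1024):
    """Sort rectangles by the Hilbert value of their (discretized) centers
    and pack B of them into each leaf, as in the packed Hilbert R-tree."""
    def center_key(r):
        cx = int((r[0] + r[2]) / 2 * (n - 1))  # assumes coordinates in [0, 1]
        cy = int((r[1] + r[3]) / 2 * (n - 1))
        return hilbert_d(n, cx, cy)
    ordered = sorted(rects, key=center_key)
    return [ordered[i:i + B] for i in range(0, len(ordered), B)]
```

The upper levels of the index are then built by repeating the same packing on the leaves' bounding boxes.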
A variant of the packed Hilbert R-tree, which also takes the extent of the rectangles into account (rather than just the center), is the four-dimensional Hilbert R-tree [15]; in this structure each rectangle ((x_min, y_min), (x_max, y_max)) is first mapped to the four-dimensional point (x_min, y_min, x_max, y_max), and then the rectangles are sorted by the positions of these points on the four-dimensional Hilbert curve. Experimentally the four-dimensional Hilbert R-tree has been shown to behave slightly worse than the packed Hilbert R-tree for nicely distributed realistic data [15]. However, intuitively, it is less vulnerable to more extreme datasets because it also takes the extent of the input rectangles into account. Algorithms that bulk-load R-trees in a top-down manner have also been developed. These algorithms work by recursively trying to find a good partition of the data [7, 12]. The so-called Top-down Greedy Split (TGS) algorithm of García, López and Leutenegger [12] has been shown to result in especially query-efficient R-trees (TGS R-trees). To build the root of (a subtree of) an R-tree on a given set of rectangles, this algorithm repeatedly partitions the rectangles into two sets, until they are divided into B subsets of (approximately) equal size. Each subset's bounding box is stored in the root, and subtrees are constructed recursively on each of the subsets. Each of the binary partitions takes a set of rectangles and splits it into two subsets based on one of several one-dimensional orderings; in two dimensions, the orderings considered are those by x_min, y_min, x_max and y_max. For each such ordering, the algorithm calculates, for each of O(B) possible partitioning possibilities, the sum of the areas of the bounding boxes of the two subsets that would result from the partition. Then it applies the binary partition that minimizes that sum. (Footnote 1: García et al. describe several variants of the top-down greedy method; they found the one described here to be the most efficient in practice [12].)
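The greedy binary partition step described above can be sketched as follows. The function name and the `unit` parameter (the granularity to which subset sizes are restricted) are illustrative, and rectangles are flat `(xmin, ymin, xmax, ymax)` tuples.

```python
def bbox(rects):
    """Minimal bounding box of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def tgs_binary_split(rects, unit):
    """One greedy binary partition: try each of the four one-dimensional
    orderings (by xmin, ymin, xmax, ymax) and every split position that is
    a multiple of `unit`; return the pair of subsets whose bounding boxes
    have the smallest total area."""
    best = None
    for dim in range(4):
        ordered = sorted(rects, key=lambda r: r[dim])
        for cut in range(unit, len(rects), unit):
            left, right = ordered[:cut], ordered[cut:]
            cost = area(bbox(left)) + area(bbox(right))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best[1], best[2]
```

Applying this split recursively until subsets of the leaf size remain yields the leaf level of a TGS R-tree.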
While the TGS R-tree has been shown to have slightly better query performance than other R-tree variants, the construction algorithm uses many more I/Os, since it needs to scan all the rectangles in order to make a binary partition. In fact, in the worst case the algorithm may take O(N log_B N) I/Os. However, in practice, the fact that each partition decision is binary effectively means that the algorithm uses O((N/B) log_2 N) I/Os. While much work has been done on evaluating the practical query performance of the R-tree variants mentioned above, very little is known about their theoretical worst-case performance. Most theoretical work on R-trees is concerned with estimating the expected cost of queries under assumptions such as uniform distribution of the input and/or the queries, or assuming that the inputs are points rather than rectangles; see the recent survey by Manolopoulos et al. [19]. The first bulk-loading algorithm with a non-trivial guarantee on the resulting worst-case query performance was given only recently by Agarwal et al. [2]. In d dimensions their algorithm constructs an R-tree that answers a window query in O((N/B)^(1-1/d) + T log_B N) I/Os, where T is the number of reported rectangles. However, this still leaves a gap to the Ω((N/B)^(1-1/d) + T/B) lower bound on the number of I/Os needed to answer a window query [2, 17]. If the input consists of points rather than rectangles, then worst-case optimal query performance can be achieved with e.g. a kdB-tree [21] or an O-tree [17]. Unfortunately, it seems hard to modify these structures to work for rectangles. Finally, Agarwal et al. [2], as well as Haverkort et al. [14], also developed a number of R-trees that have good worst-case query performance under certain conditions on the input.

1.2 Our results
In Section 2 we present a new R-tree variant, which we call a Priority R-tree, or PR-tree for short.
We call our structure the Priority R-tree because our bulk-loading algorithm utilizes so-called priority rectangles, in a way similar to the recent structure by Agarwal et al. [2]. Window queries can be answered in O((N/B)^(1-1/d) + T/B) I/Os on a PR-tree, and the index is thus the first R-tree variant that answers queries with an asymptotically optimal number of I/Os in the worst case. To contrast this with previous R-tree bulk-loading algorithms, we also construct a set of rectangles and a query with zero output, such that all Θ(N/B) leaves of a packed Hilbert R-tree, a four-dimensional Hilbert R-tree, or a TGS R-tree need to be visited to answer the query. We also show how to bulk-load the PR-tree efficiently, using only O((N/B) log_{M/B}(N/B)) I/Os. After bulk-loading, a PR-tree can be updated in O(log_B N) I/Os using the standard R-tree updating algorithms, but then without maintaining its query efficiency. Alternatively, the external logarithmic method [4, 20] can be used to develop a structure that supports insertions and deletions in O(log_B(N/M) + (1/B)(log_{M/B}(N/B))(log_2(N/M))) and O(log_B(N/M)) I/Os amortized, respectively, while maintaining the optimal query performance. In Section 3 we present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. (Footnote 1, continued: In order to achieve close to 100% space utilization, the size of the subsets created is actually rounded up to the nearest power of B, except for one remainder set. As a result, one node on each level, including the root, may have fewer than B children.) We compare the performance of our
index on two-dimensional rectangles to the packed Hilbert R-tree, the four-dimensional Hilbert R-tree, and the TGS R-tree. Overall, our experiments show that all these R-trees answer queries in more or less the same number of I/Os on relatively square and uniformly distributed rectangles. However, on more extreme data (large rectangles, rectangles with high aspect ratios, or non-uniformly distributed rectangles) the PR-tree (and sometimes also the four-dimensional Hilbert R-tree) outperforms the others significantly. On a special worst-case dataset the PR-tree outperforms all of them by well over an order of magnitude.

2. THE PRIORITY R-TREE
In this section we describe the PR-tree. For simplicity, we first describe a two-dimensional pseudo-PR-tree in Section 2.1. The pseudo-PR-tree answers window queries efficiently but is not a real R-tree, since it does not have all leaves on the same level. In Section 2.2 we show how to obtain a real two-dimensional PR-tree from the pseudo-PR-tree, and in Section 2.3 we discuss how to extend the PR-tree to d dimensions. Finally, in Section 2.4 we show that a query on the packed Hilbert R-tree, the four-dimensional Hilbert R-tree, as well as the TGS R-tree can be forced to visit all leaves even if T = 0.

2.1 Two-dimensional pseudo-PR-trees
In this section we describe the two-dimensional pseudo-PR-tree. Like an R-tree, a pseudo-PR-tree has the input rectangles in the leaves, and each internal node ν contains a minimal bounding box for each of its children ν_c. However, unlike an R-tree, not all the leaves are on the same level of the tree, and internal nodes only have degree six (rather than Θ(B)). The basic idea in the pseudo-PR-tree is (similar to the four-dimensional Hilbert R-tree) to view an input rectangle ((x_min, y_min), (x_max, y_max)) as a four-dimensional point (x_min, y_min, x_max, y_max). The pseudo-PR-tree is then basically just a kd-tree on the N points corresponding to the N input rectangles, except that four extra leaves are added below each internal node.
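The four-dimensional viewpoint can be made concrete with a standard reformulation (not taken verbatim from the paper; function names are illustrative): a rectangle R intersects the query Q exactly when its 4-d image R* satisfies four one-sided constraints, one per coordinate.

```python
def to_point4(r):
    """Map rectangle ((xmin, ymin), (xmax, ymax)) to the 4-d point
    (xmin, ymin, xmax, ymax), as in the pseudo-PR-tree."""
    (xmin, ymin), (xmax, ymax) = r
    return (xmin, ymin, xmax, ymax)

def intersects_query(p, q):
    """R intersects Q = (xmin, ymin, xmax, ymax) exactly when its image
    p = R* satisfies: xmin(R) <= xmax(Q), ymin(R) <= ymax(Q),
    xmax(R) >= xmin(Q), and ymax(R) >= ymin(Q)."""
    return p[0] <= q[2] and p[1] <= q[3] and p[2] >= q[0] and p[3] >= q[1]
```

Each constraint is a half-space in 4-d, which is why the query analysis below reduces to counting kd-tree regions cut by axis-parallel hyperplanes.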
Intuitively, these so-called priority leaves contain the B extreme points (rectangles) in each of the four dimensions. Note that the four-dimensional kd-tree can easily be mapped back to an R-tree-like structure, simply by replacing the split value in each kd-tree node ν with the minimal bounding box of the input rectangles stored in the subtree rooted in ν. The idea of using priority leaves was introduced in a recent structure by Agarwal et al. [2]; they used priority leaves of size one rather than B. In Section 2.1.1 below we give a precise definition of the pseudo-PR-tree, and in Section 2.1.2 we show that it can be used to answer a window query in O(sqrt(N/B) + T/B) I/Os. In Section 2.1.3 we describe how to construct the structure I/O-efficiently.

2.1.1 The Structure
Let S = {R_1, ..., R_N} be a set of N rectangles in the plane, and assume for simplicity that no two of the coordinates defining the rectangles are equal. We define R*_i = (x_min(R_i), y_min(R_i), x_max(R_i), y_max(R_i)) to be the mapping of R_i = ((x_min(R_i), y_min(R_i)), (x_max(R_i), y_max(R_i))) to a point in four dimensions, and define S* to be the N points corresponding to S. A pseudo-PR-tree T_S on S is defined recursively: if S contains less than B rectangles, T_S consists of a single leaf; otherwise, T_S consists of a node ν with six children, namely four priority leaves and two recursive pseudo-PR-trees. For each child ν_c, we let ν store the minimal bounding box of all input rectangles stored in the subtree rooted in ν_c.

Figure 1: The construction of an internal node in a pseudo-PR-tree. (The figure shows the priority leaves ν_xmin, ν_ymin, ν_xmax and ν_ymax, and the split on x_min into S_< and S_> with recursive subtrees T_S< and T_S>.)
The node ν and the priority leaves below it are constructed as follows. The first priority leaf ν_xmin contains the B rectangles in S with minimal x_min-coordinates, the second, ν_ymin, the B rectangles among the remaining rectangles with minimal y_min-coordinates, the third, ν_xmax, the B rectangles among the remaining rectangles with maximal x_max-coordinates, and finally the fourth, ν_ymax, the B rectangles among the remaining rectangles with maximal y_max-coordinates. Thus the priority leaves contain the extreme rectangles in S, namely the ones with leftmost left edges, bottommost bottom edges, rightmost right edges, and topmost top edges. (Footnote 2: S may not contain enough rectangles to put B rectangles in each of the four priority leaves. In that case, we may assume that we can still put at least B/4 in each of them, since otherwise we could just construct a single leaf.) After constructing the priority leaves, we divide the set S_r of
remaining rectangles (if any) into two subsets, S_< and S_>, of approximately the same size, and recursively construct pseudo-PR-trees T_S< and T_S>. The division is performed using the x_min-, y_min-, x_max-, or y_max-coordinate in a round-robin fashion, as if we were building a four-dimensional kd-tree on S*_r; that is, when constructing the root of T_S we divide based on the x_min-values, on the next level of recursion based on the y_min-values, then based on the x_max-values, on the y_max-values, on the x_min-values again, and so on. Refer to Figure 1 for an example. Note that dividing according to, say, x_min corresponds to dividing based on a vertical line l such that half of the rectangles in S_r have their left edge to the left of l and half of them have their left edge to the right of l. We store each node or leaf of T_S in O(1) disk blocks, and since at least four out of every six leaves contain Θ(B) rectangles we obtain the following (in Section 2.1.3 we discuss how to guarantee that almost every leaf is full).

Lemma 1. A pseudo-PR-tree on a set of N rectangles in the plane occupies O(N/B) disk blocks.

2.1.2 Query complexity
We answer a window query Q on a pseudo-PR-tree exactly as on an R-tree, by recursively visiting all nodes with minimal bounding boxes intersecting Q. However, unlike for known R-tree variants, for the pseudo-PR-tree we can prove a non-trivial (in fact, optimal) bound on the number of I/Os performed by this procedure.

Lemma 2. A window query on a pseudo-PR-tree on N rectangles in the plane uses O(sqrt(N/B) + T/B) I/Os in the worst case.

Proof. Let T_S be a pseudo-PR-tree on a set S of N rectangles in the plane. To prove the query bound, we bound the number of nodes in T_S that are kd-nodes, i.e. not priority leaves, and are visited in order to answer a query with a rectangular range Q; the total number of leaves visited is at most a factor of four larger. We first note that O(T/B) is a bound on the number of nodes ν visited where all rectangles in at least one of the priority leaves below ν's parent are reported.
Thus we just need to bound the number of visited kd-nodes where this is not the case. Let µ be the parent of a node ν such that none of the priority leaves of µ are reported completely, that is, each priority leaf of µ contains at least one rectangle not intersecting Q. Each such rectangle E can be separated from Q by a line containing one of the sides of Q; refer to Figure 2. Assume without loss of generality that this is the vertical line x = x_min(Q) through the left edge of Q; that is, E's right edge lies to the left of Q's left edge, so that x_max(E) <= x_min(Q). This means that the point E* in four-dimensional space corresponding to E lies to the left of the axis-parallel hyperplane H that intersects the x_max-axis at x_min(Q). Now recall that T_S is basically a four-dimensional kd-tree on S* (with priority leaves added), and thus that a four-dimensional region R^4_µ can be associated with µ. Since the query Q visits µ, there must also be at least one rectangle F in the subtree rooted at µ that has x_max(F) > x_min(Q), so that F* lies to the right of H. It follows that R^4_µ contains points on both sides of H, and therefore H must intersect R^4_µ.

Figure 2: The proof of Lemma 2, with µ in the plane (upper figure), and µ in four-dimensional space (lower figure; the x_min and y_max dimensions are not shown). Note that X = H ∩ H' is a two-dimensional hyperplane in four-dimensional space. It contains a two-dimensional facet of the transformation of the query range into four dimensions.

Now observe that the rectangles in the priority leaf µ_xmax cannot be separated from Q by the line x = x_min(Q) through the left edge of Q: rectangles in µ_xmax are extreme in the positive x-direction, so if one of them lay completely to the left of Q, then all rectangles in µ's children, including ν, would lie to the left of Q; in that case ν would not be visited.
Since (by definition of ν) not all rectangles in µ_xmax intersect Q, there must be a line through one of Q's other sides, say the horizontal line y = y_max(Q), that separates Q from a rectangle G in µ_xmax. Hence, the hyperplane H' that cuts the y_min-axis at y_max(Q) also intersects R^4_µ. By the above arguments, at least two of the three-dimensional hyperplanes defined by x_min(Q), x_max(Q), y_min(Q) and y_max(Q) intersect the region R^4_µ associated with µ when viewing T_S as a four-dimensional kd-tree. Hence, the intersection X of these two hyperplanes, which is a two-dimensional plane in four-dimensional space, also intersects R^4_µ. With the priority leaves removed, T_S becomes a four-dimensional kd-tree with O(N/B) leaves; from a straightforward
generalization of the standard analysis of kd-trees we know that any axis-parallel two-dimensional plane intersects at most O(sqrt(N/B)) of the regions associated with the nodes in such a tree [2]. All that remains is to observe that Q defines O(1) such planes, namely one for each pair of sides. Thus O(sqrt(N/B)) is a bound on the number of nodes ν that are not priority leaves and are visited by the query procedure, where not all rectangles in any of the priority leaves below ν's parent are reported.

2.1.3 Efficient construction algorithm
Note that it is easy to bulk-load a pseudo-PR-tree T_S on a set S of N rectangles in O((N/B) log N) I/Os, by simply constructing one node at a time following the definition in Section 2.1.1. We will now describe how, under the reasonable assumption that the amount M of available main memory is Ω(B^{4/3}), we can bulk-load T_S using O((N/B) log_{M/B}(N/B)) I/Os. Our algorithm is a modified version of the kd-tree construction algorithm described in [1, 20]; it is easiest described as constructing a four-dimensional kd-tree T_S* on the points S*. In the construction algorithm we first construct, in a preprocessing step, four sorted lists L_xmin, L_ymin, L_xmax, L_ymax containing the points in S* sorted by their x_min-, y_min-, x_max-, and y_max-coordinates, respectively. Then we construct Θ(log M) levels of the tree, and recursively construct the rest of the tree. To construct Θ(log M) levels of T_S* efficiently we proceed as follows. We first choose a parameter z (which will be explained below) and use the four sorted lists to find the (kN/z)-th coordinate of the points S* in each dimension, for all k in {1, 2, ..., z - 1}. These coordinates define a four-dimensional grid of size z^4; we then scan S* and count the number of points in each grid cell. We choose z to be Θ(M^{1/4}), so that we can keep these counts in main memory.
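The grid step can be sketched in memory as follows. This is an illustrative stand-in, not the paper's external-memory code: the sorted lists L_xmin, ..., L_ymax are replaced by in-memory sorts, the function name is hypothetical, and a `Counter` plays the role of the in-memory cell counts.

```python
import bisect
from collections import Counter

def build_grid_counts(points, z):
    """Sketch of the grid step: take every (kN/z)-th coordinate as a grid
    boundary in each of the four dimensions, then count how many points of
    S* fall in each of the z^4 cells."""
    n = len(points)
    boundaries = []
    for dim in range(4):
        coords = sorted(p[dim] for p in points)  # stands in for L_xmin etc.
        boundaries.append([coords[k * n // z] for k in range(1, z)])
    counts = Counter()
    for p in points:
        cell = tuple(bisect.bisect_right(boundaries[d], p[d])
                     for d in range(4))
        counts[cell] += 1
    return boundaries, counts
```

With z = Θ(M^{1/4}) there are Θ(M) cells, so the counts indeed fit in main memory; each median search then only needs to rescan the points of one slice of cells.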
Next we build the Θ(log M) levels of T_S* without worrying about the priority leaves. To construct the root ν of T_S*, we first find the slice of z^3 grid cells with common x_min-coordinate such that there is a hyperplane orthogonal to the x_min-axis that passes through these cells and has at most half of the points in S* on one side and at most half of the points on the other side. By scanning the O(N/(zB)) blocks from L_xmin that contain the O(N/z) points in these grid cells, we can determine the exact x_min-value x to use in ν, such that the hyperplane H defined by x_min = x divides the points in S* into two subsets with at most half of the points each. After constructing ν, we subdivide the z^3 grid cells intersected by H, that is, we divide each of the z^3 cells in two at x, and compute their counts by rescanning the O(N/(zB)) blocks from L_xmin that contain the O(N/z) points in these grid cells. Then we construct a kd-tree on each side of the hyperplane defined by x recursively (cycling through all four possible cutting directions). Since we create O(z^3) new cells every time we create a node, we can ensure that the grid still fits in main memory after constructing z nodes, that is, log z = Θ(log M) levels of T_S*. After constructing the Θ(log M) kd-tree levels, we construct the four priority leaves for each of the z nodes. To do so we reserve main memory space for the B points in each of the priority leaves; we have enough main memory to hold all priority leaves, since by the assumption that M is Ω(B^{4/3}) we have 4 · Θ(B) · Θ(z) = O(M). Then we fill the priority leaves by scanning S* and filtering each point R*_i through the kd-tree, one by one, as follows. We start at the root ν of T_S*, and check its priority leaves ν_xmin, ν_ymin, ν_xmax, and ν_ymax, one by one in that order.
If we encounter a non-full leaf we simply place R*_i there; if we encounter a full leaf, and R*_i is more extreme in the relevant direction than the least extreme point R*_j in that leaf, we replace R*_j with R*_i and continue the filtering process with R*_j. After checking ν_ymax we continue to check the priority leaves of the child of ν in T_S* whose region contains the point we are processing; if ν does not have such a child (because we arrived at leaf level in the kd-tree) we simply continue with the next point in S*. It is easy to see that the above process correctly constructs the top Θ(log M) levels of the pseudo-PR-tree T_S on S, except that the kd-tree divisions are slightly different from the ones defined in Section 2.1.1, since the points in the priority leaves are not removed before the divisions are computed. However, the bound of Lemma 2 still holds: the O(T/B) term does not depend on the choice of the divisions, and the kd-tree analysis that brought the O(sqrt(N/B)) term only depends on the fact that each child gets at most half of the points of its parent. After constructing the Θ(log M) levels and their priority leaves, we scan through the four sorted lists L_xmin, L_ymin, L_xmax, L_ymax and divide them into four sorted lists for each of the Θ(z) leaves of the constructed kd-tree, while omitting the points already stored in priority leaves. These lists contain O(N/z) points each; after writing the constructed kd-tree and priority leaves to disk, we use them to construct the rest of T_S recursively. Note that once the number of points in a recursive call gets smaller than M, we can simply construct the rest of the tree in internal memory, one node at a time. This way we can make slightly unbalanced divisions, so that we have a multiple of B points on one side of each dividing hyperplane. Thus we can guarantee that we get at most one non-full leaf per subtree of size Θ(M), and obtain almost 100% space utilization.
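The simple in-memory, one-node-at-a-time construction mentioned above (used once a subproblem fits in main memory) can be sketched as follows. The dict-based node representation is illustrative, and under-full priority leaves are simply left short here rather than resized as discussed in footnote 2.

```python
def build_pseudo_pr_tree(rects, B, depth=0):
    """In-memory pseudo-PR-tree construction following the recursive
    definition: four priority leaves of (up to) B extreme rectangles, then
    a median split on xmin/ymin/xmax/ymax in round-robin order.
    Rectangles are (xmin, ymin, xmax, ymax) tuples."""
    if len(rects) <= B:
        return {'leaf': list(rects)}
    rest = list(rects)
    priority = []
    # Extreme directions, in order: min xmin, min ymin, max xmax, max ymax.
    for dim, take_max in ((0, False), (1, False), (2, True), (3, True)):
        rest.sort(key=lambda r: r[dim], reverse=take_max)
        priority.append(rest[:B])   # B most extreme remaining rectangles
        rest = rest[B:]
    node = {'priority': priority}
    if rest:
        split_dim = depth % 4       # round-robin through the 4 coordinates
        rest.sort(key=lambda r: r[split_dim])
        mid = len(rest) // 2
        node['left'] = build_pseudo_pr_tree(rest[:mid], B, depth + 1)
        node['right'] = build_pseudo_pr_tree(rest[mid:], B, depth + 1)
    return node
```

A real implementation would also attach the minimal bounding box of each child to the node, as the definition requires; that bookkeeping is omitted to keep the recursion visible.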
To avoid having an under-full leaf that may violate assumptions made by update algorithms, we may make the priority leaves under its parent slightly smaller, so that all leaves contain Θ(B) rectangles. This also implies that the bound of Lemma 1 still holds.

Lemma 3. A pseudo-PR-tree can be bulk-loaded with N rectangles in the plane in O((N/B) log_{M/B}(N/B)) I/Os.

Proof. The initial construction of the sorted lists takes O((N/B) log_{M/B}(N/B)) I/Os. To construct Θ(log M) levels of T_S* we use O(N/B) I/Os to construct the initial grid, as well as O(N/(zB)) I/Os to construct each of the z nodes, for a total of O(N/B) I/Os. Constructing the priority leaves by filtering also takes O(N/B) I/Os, and so does the distribution of the remaining points in S* to the recursive calls. Thus each recursive step takes O(N/B) I/Os in total. The lemma follows since there are O(log N / log M) = O(log_M N) levels of recursion.

2.2 Two-dimensional PR-tree
In this section we describe how to obtain a PR-tree (with degree Θ(B) and all leaves on the same level) from a pseudo-PR-tree (with degree six and leaves on all levels), while maintaining the O(sqrt(N/B) + T/B) I/O window query bound. The PR-tree is built in stages, bottom-up: in stage 0 we construct the leaves V_0 of the tree from the set S_0 = S of N input rectangles; in stage i >= 1 we construct the nodes V_i on level i of the tree from a set S_i of O(N/B^i) rectangles,
consisting of the minimal bounding boxes of all nodes in V_{i-1} (on level i - 1). Stage i consists of constructing a pseudo-PR-tree T_Si on S_i; V_i then simply consists of the (priority as well as normal) leaves of T_Si; the internal nodes are discarded (see footnote 3). The bottom-up construction ends when the set S_i is small enough so that the rectangles in S_i and the pointers to the corresponding subtrees fit into one block, which is then the root of the PR-tree.

Theorem 1. A PR-tree on a set S of N rectangles in the plane can be bulk-loaded in O((N/B) log_{M/B}(N/B)) I/Os, such that a window query can be answered in O(sqrt(N/B) + T/B) I/Os.

Proof. By Lemma 3, stage i of the PR-tree bulk-loading algorithm uses O((|S_i|/B) log_{M/B}(|S_i|/B)) I/Os, which is O((N/B^{i+1}) log_{M/B}(N/B)). Thus the complete PR-tree is constructed in sum_{i=0}^{O(log_B N)} O((N/B^{i+1}) log_{M/B}(N/B)) = O((N/B) log_{M/B}(N/B)) I/Os. To analyze the number of I/Os used to answer a window query Q, we analyze the number of nodes visited on each level of the tree. Let T_i (i >= 0) be the number of nodes visited on level i. Since the nodes on level 0 (the leaves) correspond to the leaves of a pseudo-PR-tree on the N input rectangles S, it follows from Lemma 2 that T_0 = O(sqrt(N/B) + T/B); in particular, for big enough N and B, there exists a constant c such that T_0 <= c sqrt(N/B) + c(T/B). There must be T_{i-1} rectangles in nodes on level i of the PR-tree that intersect Q, since these nodes contain the bounding boxes of the nodes on level i - 1. Since the nodes on level i correspond to the leaves of a pseudo-PR-tree on the O(N/B^i) rectangles in S_i, it follows from Lemma 2 that for big enough N and B, we have T_i <= (c/sqrt(B^i)) sqrt(N/B) + c(T_{i-1}/B). Summing over all O(log_B N) levels and solving the recurrence reveals that O(sqrt(N/B) + T/B) nodes are visited in total.

2.3 Multidimensional PR-tree
In this section we briefly sketch how our PR-tree generalizes to dimensions greater than two.
We focus on how to generalize pseudo-PR-trees, since a d-dimensional PR-tree can be obtained using d-dimensional pseudo-PR-trees in exactly the same way as in the two-dimensional case; that the d-dimensional PR-tree has the same asymptotic performance as the d-dimensional pseudo-PR-tree is also proved exactly as in the two-dimensional case. Recall that a two-dimensional pseudo-PR-tree is basically a four-dimensional kd-tree, where four priority leaves containing the extreme rectangles in each of the four directions have been added below each internal node. Similarly, a d-dimensional pseudo-PR-tree is basically a 2d-dimensional kd-tree, where each node has 2d priority leaves with extreme rectangles in each of the 2d standard directions.

(Footnote 3: There is a subtle difference between the pseudo-PR-tree algorithm used in stage 0 and the algorithm used in stages i > 0. In stage 0, we construct leaves with B input rectangles. In stages i > 0, we construct nodes with pointers to children and bounding boxes of their subtrees. The number of children that fit in a node might differ by a constant factor from the number of rectangles that fit in a leaf, so the number of children might be Θ(B) rather than B. For our analysis the difference does not matter, and it is therefore ignored for simplicity.)

For constant d, the structure can be constructed in O((N/B) log_{M/B}(N/B)) I/Os using the same grid method as in the two-dimensional case (Section 2.1.3); the only difference is that in order to fit the 2d-dimensional grid in main memory we have to decrease z (the number of nodes produced in one recursive stage) to Θ(M^{1/2d}). To analyze the number of I/Os used to answer a window query on a d-dimensional pseudo-PR-tree, we analyze the number of visited internal nodes as in the two-dimensional case (Section 2.1.2); the total number of visited nodes is at most a factor 2d higher, since at most 2d priority leaves can be visited per internal node visited.
As in the two-dimensional case, O(T/B) is a bound on the number of nodes ν visited where all rectangles in at least one of the priority leaves below ν's parent are reported. The number of nodes ν visited such that each priority leaf of ν's parent contains at least one rectangle not intersecting the query can then be bounded using an argument similar to the one used in two dimensions; it is equal to the number of regions, associated with the nodes of a 2d-dimensional kd-tree with O(N/B) leaves, that intersect the (2d - 2)-dimensional intersection of two orthogonal hyperplanes. It follows from a straightforward generalization of the standard kd-tree analysis that this is O((N/B)^(1-1/d)) [2].

Theorem 2. A PR-tree on a set of N hyper-rectangles in d dimensions can be bulk-loaded in O((N/B) log_{M/B}(N/B)) I/Os, such that a window query can be answered in O((N/B)^(1-1/d) + T/B) I/Os.

2.4 Lower bound for heuristic R-trees
The PR-tree is the first R-tree variant that always answers a window query worst-case optimally. In fact, most other R-tree variants can be forced to visit Θ(N/B) nodes to answer a query even when no rectangles are reported (T = 0). In this section we show that this is the case for the packed Hilbert R-tree, the four-dimensional Hilbert R-tree, and the TGS R-tree.

Theorem 3. There exist a set of rectangles S and a window query Q that does not intersect any rectangles in S, such that all Θ(N/B) nodes are visited when Q is answered using a packed Hilbert R-tree, a four-dimensional Hilbert R-tree, or a TGS R-tree on S.

Proof. We will construct a set of points S such that all leaves in a packed Hilbert R-tree, a four-dimensional Hilbert R-tree, and a TGS R-tree on S are visited when answering a line query that does not touch any point. The theorem follows since points and lines are special cases of rectangles. For convenience we assume that B >= 4, N/B = 2^k and N/B = B^m, for some positive integers k and m, so that each leaf of the R-tree contains B rectangles, and each internal node has fan-out B.
We construct S as a grid of N/B columns and B rows, where each column is shifted up a little, depending on its horizontal position (each row is in fact a Halton–Hammersley point set; see e.g. [8]). More precisely, S has a point p_ij = (x_ij, y_ij), for all i ∈ {0, ..., N/B - 1} and j ∈ {0, ..., B - 1}, such that x_ij = i + 1/2 and y_ij = j/B + h(i)/N. Here h(i) is the number obtained by reversing, i.e. reading backwards, the k-bit binary representation of i. An example with N = 64, B = 4 is shown in Figure 3. Now, let us examine the structure of each of the three R-tree variants on this dataset.
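The construction is easy to reproduce; the sketch below (our Python illustration; the helper names are ours) builds S for any N = B·2^k:

```python
# Build the worst-case point set S (illustration; assumes N = B * 2**k).

def h(i, k):
    """Reverse the k-bit binary representation of i."""
    return int(format(i, '0{}b'.format(k))[::-1], 2)

def worst_case_points(N, B):
    k = (N // B).bit_length() - 1      # N/B = 2**k columns
    return [(i + 0.5, j / B + h(i, k) / N)
            for i in range(N // B)     # column i, shifted up by h(i)/N
            for j in range(B)]         # row j
```

For the Figure 3 example (N = 64, B = 4) this yields 64 distinct points; since h(i) < 2^k = N/B, the shift h(i)/N is less than 1/B, so every point of row j stays inside the horizontal band [j/B, (j+1)/B).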
Figure 3: Worst-case example.

Figure 4: TGS partitioning the worst-case example (in the figure, i_1 = k·2^t, i_2 = (k + 1)·2^t - 1 and σ = 1/2^t). A vertical division creates two bounding boxes with a total area of less than i_2 - i_1 - 1. A horizontal division creates two bounding boxes with a total area of more than (i_2 - i_1)(1 - 2σ) > i_2 - i_1 - 1.

Two- and four-dimensional packed Hilbert R-tree: The Hilbert curve visits the columns in our grid of points one by one; when it visits a column, it visits all points in that column before proceeding to another column (we omit the details of the proof from this abstract). Therefore, the packed Hilbert R-tree makes a leaf for every column, and a horizontal line can be chosen to intersect all these columns while not touching any point.

TGS R-tree: The TGS algorithm will partition S into B subsets of equal size and partition each subset recursively. The partitioning is implemented by choosing a partitioning line that separates the set into two subsets (whose sizes are multiples of N/B), and then applying binary partitions to the subsets recursively until we have partitioned the set into B subsets of size N/B. Observe that on all levels in this recursion, the partitioning line will leave at least a fraction 1/B of the input on each side of the line. Below we prove that TGS will always partition by vertical lines; it follows that TGS will eventually put each column in a leaf. Then a line query can intersect all leaves but report nothing.

Suppose TGS is about to partition the subset S(i_1, i_2) of S that consists of columns i_1 to i_2 inclusive, with i_2 > i_1, i.e. S(i_1, i_2) = { p_ij | i ∈ {i_1, ..., i_2}, j ∈ {0, ..., B - 1} }. When the greedy split algorithm gets to divide such a set into two, it can look for a vertical partitioning line or for a horizontal partitioning line. Intuitively, TGS favors partitioning lines that create a big gap between the bounding boxes of the points on each side of the line.
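This intuition can be checked numerically. The sketch below (our illustration; it verifies the claimed inequality on the example of Figure 3 rather than reproving the omitted bound) compares the total bounding-box area of a vertical split with that of a horizontal median split on the worst-case grid:

```python
# Compare the bounding-box areas of a vertical and a horizontal split
# of the worst-case grid (illustration; function names are ours).

def split_areas(N, B):
    k = (N // B).bit_length() - 1
    rev = lambda i: int(format(i, '0{}b'.format(k))[::-1], 2)
    S = [(i + 0.5, j / B + rev(i) / N)
         for i in range(N // B) for j in range(B)]

    def bbox_area(pts):
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        return (max(xs) - min(xs)) * (max(ys) - min(ys))

    # Vertical split between columns c - 1 and c.
    c = N // (2 * B)
    A_v = (bbox_area([p for p in S if p[0] < c]) +
           bbox_area([p for p in S if p[0] > c]))

    # Horizontal split at the median y-coordinate (both halves have
    # size N/2, a multiple of N/B).
    S.sort(key=lambda p: p[1])
    A_h = bbox_area(S[:N // 2]) + bbox_area(S[N // 2:])
    return A_v, A_h
```

For N = 64, B = 4 and the whole set (i_1 = 0, i_2 = 15), the vertical split gives total area about 13.56 < i_2 - i_1 - 1 = 14, while the horizontal split gives about 14.53, so TGS prefers the vertical cut.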
As we will show below, we have constructed S such that the area of the gap created by a horizontal partitioning line is always roughly the same, as is the area of the gap created by a vertical line, with the latter always being bigger.

Partitioning with a vertical line would always leave a gap of roughly a square that fits between two columns; see Figure 4. More precisely, it would partition the set S(i_1, i_2) into two sets S(i_1, c - 1) and S(c, i_2), for some c ∈ {i_1 + 1, ..., i_2}. The bounding boxes of these two sets would each have height less than 1, and their total width would be (c - 1 - i_1) + (i_2 - c), so their total area A_v would be less than i_2 - i_1 - 1.

The width of a gap around a horizontal partitioning line depends on the number of columns in S(i_1, i_2). However, the more columns are involved, the bigger the density of the points in those columns when projected on the y-axis, and the lower the gap that can be created; see Figure 4 for an illustration. As a result, partitioning with a horizontal line can lead to gaps that are wide and low, or relatively high but not so wide; in any case, the area of the gap will be roughly the same. More precisely, when we partition this set by a horizontal line, the total area A_h of the resulting bounding boxes must be at least i_2 - i_1 - 4/B (we omit the details from this abstract). Recall that A_v is less than i_2 - i_1 - 1. Since B ≥ 4, we can conclude that A_h > A_v, and that partitioning with a vertical line will always result in a smaller total area of bounding boxes than partitioning with a horizontal line. As a result, TGS will always cut vertically between the columns.

3. EXPERIMENTS

In this section we describe the results of our experimental study of the performance of the PR-tree. We compared the PR-tree to several other bulk-loading methods known to generate query-efficient R-trees: the packed Hilbert R-tree (denoted H in the rest of this section), the four-dimensional Hilbert R-tree (denoted H4), and the TGS R-tree (denoted TGS).
Among these, TGS has been reported to have the best query performance, but it also takes many I/Os to bulk-load. In contrast, H is simple to bulk-load, but it has worse query performance because it does not take the extent of the input rectangles into account. H4 has been reported to be inferior to H [15], but since it takes the extent into account (like TGS), it should intuitively be less vulnerable to extreme datasets.

3.1 Experimental setup

We implemented the four bulk-loading algorithms in C++ using TPIE [3]. TPIE is a library that provides support for implementing I/O-efficient algorithms and data structures. In our implementation we used 36 bytes to represent each input rectangle: 8 bytes for each coordinate and 4 bytes to hold a pointer to the original object. Each bounding box in the internal nodes also used 36 bytes: 8 bytes for each coordinate and 4 bytes for a pointer to the disk block storing the root of the corresponding subtree. The disk block size was chosen to be 4KB, resulting in a maximum fan-out of ⌊4096/36⌋ = 113. This is similar to earlier experimental studies, which typically use block sizes ranging from 1KB to 4KB or fix the fan-out to a number close to 100.

As experimental platform we used a dedicated Dell PowerEdge 2400 workstation with one Pentium III/500MHz processor running FreeBSD 4.3. A local 36GB SCSI disk (IBM Ultrastar 36LZX) was used to store all necessary files: the input data, the R-trees, as well as temporary files. We restricted the main memory to 128MB and further restricted the amount of memory available to TPIE to 64MB; the rest was reserved for operating system daemons.

3.2 Datasets

We used both real-life and synthetic data in our experiments.

3.2.1 Real-life data

As the real-life data we used the TIGER/Line data [24] of geographical features in the United States. This data is
the standard benchmark data used in spatial databases. It is distributed on six CD-ROMs, and we chose to experiment with the road line segments from two of them: disk one, containing data for sixteen eastern US states, and disk six, containing data for five western US states; we use Eastern and Western to refer to these two datasets, respectively. To obtain datasets of varying sizes we divided the Eastern dataset into five regions of roughly equal size, and then put an increasing number of regions together to obtain datasets of increasing sizes; the largest set is just the whole Eastern dataset. For each dataset we used the bounding boxes of the line segments as our input rectangles. As a result, the Eastern dataset had 16.7 million rectangles, for a total size of 574MB, and the Western dataset had 12 million rectangles, for a total size of 411MB. Note that the biggest dataset is much larger than those used in previous work (which only used up to 100,000 rectangles) [15, 12]. Note also that our TIGER data is relatively nicely distributed; it consists of relatively small rectangles (long roads are divided into short segments) that are somewhat (but not too badly) clustered around urban areas.

3.2.2 Synthetic data

To investigate how the different R-trees perform on more extreme datasets than the TIGER data, we generated a number of synthetic datasets. Each synthetic dataset consisted of 10 million rectangles (or 360MB) in the unit square.

size(max side): We designed the first class of synthetic datasets to investigate how well the R-trees handle rectangles of different sizes. In the size(max side) dataset the rectangle centers were uniformly distributed, and the lengths of their sides were uniformly and independently distributed between 0 and max side. When generating the datasets, we discarded rectangles that were not completely inside the unit square (but made sure each dataset had 10 million rectangles). A portion of the dataset size(0.001) is shown in Figure 5.
Figure 5: Synthetic dataset SIZE(0.001)

aspect(a): The second class of synthetic datasets was designed to investigate how the R-trees handle rectangles with different aspect ratios. The areas of the rectangles in all these datasets were fixed to 10^-6, a reasonably small size. In the aspect(a) dataset the rectangle centers were uniformly distributed, but their aspect ratios were fixed to a and the longest sides chosen to be vertical or horizontal with equal probability. We also made sure that all rectangles fell completely inside the unit square. A portion of the dataset aspect(10) is shown in Figure 6. Note that if the input rectangles are bounding boxes of line segments that are almost horizontal or vertical, one will indeed get rectangles with very high aspect ratio, even infinite in the case of horizontal or vertical segments.

Figure 6: Synthetic dataset ASPECT(10)

skewed(c): In many real-life multidimensional datasets, different dimensions often have different distributions, some of which may be highly skewed compared to the others. We designed the third class of datasets to investigate how this affects R-tree performance. skewed(c) consists of uniformly distributed points that have been squeezed in the y-dimension, that is, each point (x, y) is replaced with (x, y^c). An example of skewed(5) is shown in Figure 7.

Figure 7: Synthetic dataset SKEWED(5)

cluster: Our final dataset was designed to illustrate the worst-case behavior of the H, H4 and TGS R-trees. It is similar to the worst-case example discussed in Section 2. It consists of clusters with centers equally spaced on a horizontal line, where each cluster consists of 1000 points uniformly distributed in a square surrounding its center. Figure 8 shows a part of the cluster dataset.
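The four distributions are straightforward to generate; here is a scaled-down sketch (our Python illustration; function and parameter names are ours, and the counts are far below the 10 million rectangles used in the experiments):

```python
import random

def size_dataset(n, max_side):
    """Uniform centers; side lengths uniform in (0, max_side);
    rectangles sticking out of the unit square are discarded."""
    out = []
    while len(out) < n:
        cx, cy = random.random(), random.random()
        w, h = random.uniform(0, max_side), random.uniform(0, max_side)
        r = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        if r[0] >= 0 and r[1] >= 0 and r[2] <= 1 and r[3] <= 1:
            out.append(r)
    return out

def aspect_dataset(n, a, area=1e-6):
    """Fixed area and aspect ratio a; the long side is axis-parallel,
    vertical or horizontal with equal probability."""
    long_side = (area * a) ** 0.5
    short_side = area / long_side
    out = []
    while len(out) < n:
        w, h = ((long_side, short_side) if random.random() < 0.5
                else (short_side, long_side))
        cx, cy = random.random(), random.random()
        r = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        if r[0] >= 0 and r[1] >= 0 and r[2] <= 1 and r[3] <= 1:
            out.append(r)
    return out

def skewed_dataset(n, c):
    """Uniform points squeezed in y: (x, y) -> (x, y**c)."""
    return [(x, y ** c)
            for x, y in ((random.random(), random.random())
                         for _ in range(n))]

def cluster_dataset(n_clusters, pts_per_cluster, side):
    """Uniform clusters in tiny squares centered on a horizontal line."""
    out = []
    for i in range(n_clusters):
        cx = (i + 0.5) / n_clusters
        for _ in range(pts_per_cluster):
            out.append((cx + random.uniform(-side / 2, side / 2),
                        0.5 + random.uniform(-side / 2, side / 2)))
    return out
```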
Figure 8: Synthetic dataset CLUSTER

Figure 9: Bulk-loading performance on TIGER data: I/O (upper figure) and time (lower figure). Million blocks read or written: Hilbert (H/H4) 1.2, PR-tree (PR) 3.1, Greedy (TGS) 14.7 on the Western data, and 1.7, 4.4 and 21.1 respectively on the Eastern data. Bulk-loading times for H/H4: 451 s (Western) and 583 s (Eastern).

3.3 Experimental results

Below we discuss the results of our bulk-loading and query experiments with the four R-tree variants.

3.3.1 Bulk-loading performance

We bulk-loaded each of the R-trees with each of the real-life TIGER datasets, as well as with the synthetic datasets for various parameter values. In all experiments and for all R-trees we achieved a space utilization above 99%.⁴ We measured the time spent and counted the number of 4KB blocks read or written when bulk-loading the trees. Note that all algorithms we tested read and write blocks almost exclusively by sequential I/O over large parts of the data; as a result, I/O is much faster than if blocks were read and written in random order.

Figure 9 shows the results of our experiments using the Eastern and Western datasets. Both experiments yield the same result: the H and H4 algorithms use the same number of I/Os, and roughly 2.5 times fewer I/Os than PR. This is not surprising: even though the three algorithms have the same O((N/B) log_{M/B}(N/B)) I/O bound, the PR algorithm is much more complicated than the H and H4 algorithms. The TGS algorithm uses roughly 4.5 times more I/Os than PR, which is also not surprising given that the algorithm makes binary partitions, so that the number of levels of recursion is effectively O(log_2 N). In terms of time, the H and H4 algorithms are still more than 3 times faster than the PR algorithm, but the TGS algorithm is only roughly 3 times slower than PR. This shows that H, H4 and PR are all more CPU-intensive than TGS. Figure 10 shows the results of our experiments with the five Eastern datasets.
These experiments show that the H, H4 and PR algorithms scale relatively linearly with dataset size; this is a result of the log_{M/B}(N/B) factor in the bulk-loading bound being the same for all datasets. The cost of the TGS algorithm seems to grow in an only slightly superlinear way with the size of the data set. This is a result of the log_2 N factor in the bulk-loading bound being almost the same for all data sets.

⁴When R-trees are bulk-loaded to subsequently be updated dynamically, near-100% space utilization is often not desirable [10]. However, since we are mainly interested in the query performance of the R-trees constructed with the different bulk-loading methods, and since the methods could be modified in the same way to produce non-full leaves, we only considered the near-100% utilization case.

Figure 10: Bulk-loading performances on Eastern datasets (I/Os), for datasets of up to 16.7 million rectangles.

In our experiments with the synthetic data we found that the performance of the H, H4 and PR bulk-loading algorithms was practically the same for all the datasets, that is, unaffected by the data distribution. This is not surprising, since their performance should only depend on the dataset size (and all the synthetic datasets have the same size). The PR algorithm's performance varied slightly, which can be explained by the small effect the data distribution can have on the grid method used in the bulk-loading algorithm (subtrees may have slightly different sizes due to the removal of priority boxes). On average, the H and H4 algorithms spent 381 seconds and 1.0 million I/Os on each of the synthetic datasets, while the PR algorithm spent 1289 seconds and 2.6 million I/Os. On the other hand, as expected, the performance of the TGS algorithm varied significantly over the synthetic datasets we tried; the binary partitions made by the algorithm depend heavily on the input data distribution.
The TGS algorithm was between 4.6 and 16.4 times slower than the PR algorithm in terms of I/O, and between 2.8 and 10.9 times slower in terms of time. Due to lack of space, we only show the performance of the TGS algorithm on the size(max side) and aspect(a) datasets in Figure 11. The point datasets, skewed(c) and cluster, were all built in between … and … seconds.

3.3.2 Query performance

After bulk-loading the four R-tree variants, we experimented with their query performance; in each of our experiments we performed 100 randomly generated queries and computed their average performance (a more exact description of the queries is given below). Following previous experimental studies, we utilized a cache (or buffer) to store internal R-tree nodes during queries. In fact, in all our experiments we cached all internal nodes, since they never occupied more than 6MB. This means that when reporting the number of I/Os needed to answer a query, we are in effect reporting the number of leaves visited in order to answer the query.⁵

⁵Experiments with the cache disabled showed that in our experiments the cache actually had relatively little effect on the window query performance.

For several reasons, and following previous
experimental studies [6, 12, 15, 16], we did not collect timing data. Two main reasons for this are (1) that I/O is a much more robust measure of performance, since the query time is easily affected by operating system caching and by disk block layout; and (2) that we are interested in heavy-load scenarios where not much cache memory is available or where caches are ineffective, that is, where I/O dominates the query time.

Figure 11: Bulk-loading time in seconds of Top-down Greedy Split on synthetic data sets of 10 million rectangles each, for SIZE(max side) and ASPECT(a).

Figure 12: Query performance for queries with squares of varying size on the Western TIGER data. The performance is given as the number of blocks read divided by the output size T/B.

Figure 13: Query performance for queries with squares of varying size on the Eastern TIGER data. The performance is given as the number of blocks read divided by the output size T/B.

Figure 14: Query performance for queries with squares of area 0.01 on Eastern TIGER data sets of varying size. The performance is given as the number of blocks read divided by the output size T/B.

TIGER data: We first performed query experiments using the Eastern and Western datasets. The results are summarized in Figures 12, 13, and 14. In Figures 12 and 13 we show the results of experiments with square window queries with areas that range from 0.25% to 2% of the area of the bounding box of all input rectangles.
We used smaller queries than previous experimental studies (for example, the maximum query in [15] occupies 25% of the area) because our datasets are much larger than the datasets used in previous experiments; without reducing the query size, the output would be unrealistically large and the reporting cost would thus dominate the overall query performance. In Figure 14 we show the results of experiments on the five Eastern datasets of various sizes with a fixed query size of 1%.

The results show that all four R-tree variants perform remarkably well on the TIGER data; their performance is within 10% of each other, and they all answer queries in close to T/B, the minimum number of necessary I/Os. Their relative performance generally agrees with earlier results [15, 12], that is, TGS performs better than H, which in turn is better than H4. PR consistently performs slightly better than both H and H4, but slightly worse than TGS.

Synthetic data: Next we performed experiments with our synthetic datasets, designed to investigate how the different R-trees perform on more extreme datasets than the TIGER data. For each of the datasets size, aspect and skewed, we performed experiments where we varied the parameter to obtain data ranging from fairly normal to rather extreme. Below we summarize our results.

Figure 15: Query performance for queries with squares of area 0.01 on synthetic data sets SIZE(max side), ASPECT(a) and SKEWED(c). The performance is given as the number of blocks read divided by the output size T/B.

The left side of Figure 15 shows the results of our experiments with the dataset size(max side) when varying max side from … to 0.2, that is, from relatively small to relatively large rectangles. As queries we used squares with area 0.01. Our results show that for relatively small input rectangles, like the TIGER data, all the R-tree variants perform very close to the minimum number of necessary I/Os. However, as the input rectangles get larger, PR and H4 clearly outperform H and TGS. H performs the worst, which is not surprising since it does not take the extent of the input rectangles into account. TGS performs significantly better than H, but still worse than PR and H4. Intuitively, PR and H4 can handle large rectangles better because they rigorously divide rectangles into groups of rectangles that are similar in all four coordinates. This may enable these algorithms to group likely answers, namely large rectangles, together so that they can be retrieved with few I/Os. It also enables these algorithms to group small rectangles nicely, while TGS, which strives to minimize the total area of bounding boxes, may be indifferent to the distribution of the small rectangles in the presence of large rectangles.

The middle of Figure 15 shows the results of our experiments with the dataset aspect(a), when we vary a from 10 to 10^5, that is, when we go from rectangles (of constant area) with small aspect ratio to rectangles with large aspect ratio. As query we again used squares with area 0.01. The results are very similar to the results of the size dataset experiments, except that as the aspect ratio increases, PR and H4 become significantly better than TGS and especially H.
Unlike with the size dataset, PR performs as well as H4, and they both perform close to the minimum number of I/Os necessary to answer a query. Thus this set of experiments re-emphasizes that both the PR-tree and the H4-tree are able to adapt to varying extent very well.

The right side of Figure 15 shows the result of our experiments with the dataset skewed(c), when we vary c from 1 to 9, that is, when we go from a uniformly distributed point set to a very skewed point set. As query we used squares with area 0.01 that are skewed in the same way as the dataset (that is, where the corner (x, y) is transformed to (x, y^c)), so that the output size remains roughly the same. As expected, the PR performance is unaffected by the transformations, since our bulk-loading algorithm is based only on the relative order of coordinates: x-coordinates are only compared to x-coordinates, and y-coordinates are only compared to y-coordinates; there is no interaction between them. On the other hand, the query performance of the three other R-trees degenerates quickly as the point set gets more skewed.

As a final experiment, we queried the cluster dataset with long skinny horizontal queries (of area …) through the clusters; the y-coordinate of the leftmost bottom corner was chosen randomly such that the query passed through all clusters. The results are shown in Table 1. As anticipated, the query performance of H, H4 and TGS is very bad; the cluster dataset was constructed to illustrate the worst-case behavior of these structures.

tree:                       H     H4    PR    TGS
% of the R-tree visited:    37%   94%   1.2%  25%

Table 1: Query performance on synthetic dataset CLUSTER.

Even though a query only returns around 0.3% of the input points on average, the
query algorithm visits 37%, 94% and 25% of the leaves in H, H4 and TGS, respectively. In comparison, only 1.2% of the leaves are visited in PR. Thus the PR-tree outperforms the other indexes by well over an order of magnitude.

3.4 Conclusions of the experiments

The main conclusion of our experimental study is that the PR-tree is not only theoretically efficient but also practically efficient. Our bulk-loading algorithm is slower than the packed Hilbert and four-dimensional Hilbert bulk-loading algorithms, but much faster than the TGS R-tree bulk-loading algorithm. Furthermore, unlike for the TGS R-tree, the performance of our bulk-loading algorithm does not depend on the data distribution. The query performance of all four R-trees is excellent on nicely distributed data, including the real-life TIGER data. On extreme data, however, the PR-tree is much more robust than the other R-trees (even though the four-dimensional Hilbert R-tree is also relatively robust).

4. CONCLUDING REMARKS

In this paper we presented the PR-tree, which is the first R-tree variant that can answer any window query in the optimal O((N/B)^{1/2} + T/B) I/Os. We also performed an extensive experimental study, which showed that the PR-tree is not only optimal in theory, but also performs excellently in practice: for normal data, it is quite competitive with the best known heuristics for bulk-loading R-trees, namely the packed Hilbert R-tree [15] and the TGS R-tree [12], while for data with extreme shapes or distributions it outperforms them significantly. The PR-tree can be updated using any known update heuristic for R-trees, but then its performance can no longer be guaranteed theoretically, and its practical performance might suffer as well. Alternatively, we can use the dynamic version of the PR-tree based on the logarithmic method, which has the same theoretical worst-case query performance and can be updated efficiently.
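The logarithmic method mentioned above can be sketched as follows (our illustration of the generic Bentley–Saxe scheme over a one-dimensional stand-in index; `BulkIndex` plays the role of a bulk-loaded PR-tree): level i holds either nothing or a static structure on 2^i elements, an insertion merges and rebuilds like a binary-counter increment, and a query is asked of all O(log N) structures.

```python
# Sketch of the logarithmic (Bentley-Saxe) method over a static,
# bulk-loadable index; `BulkIndex` is a stand-in for a PR-tree.

class BulkIndex:
    def __init__(self, items):
        self.items = sorted(items)      # placeholder for bulk-loading
    def query(self, lo, hi):
        return [x for x in self.items if lo <= x <= hi]

class LogMethod:
    def __init__(self):
        self.levels = []                # level i holds 0 or 2**i items
    def insert(self, x):
        carry = [x]
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i] = BulkIndex(carry)
                return
            # Merge and rebuild, like a binary-counter increment.
            carry += self.levels[i].items
            self.levels[i] = None
        self.levels.append(BulkIndex(carry))
    def query(self, lo, hi):
        out = []
        for t in self.levels:
            if t is not None:
                out += t.query(lo, hi)
        return out
```

Since a level is rebuilt only when it overflows, each element takes part in O(log N) rebuilds overall, which is what makes updates efficient while every individual structure remains bulk-loaded.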
In the future we wish to experiment to see what happens to the performance when we apply heuristic update algorithms and when we use the theoretically superior logarithmic method.

5. REFERENCES
[1] P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter. A framework for index bulk loading and dynamization. In Proc. International Colloquium on Automata, Languages, and Programming, pages …, ….
[2] P. K. Agarwal, M. de Berg, J. Gudmundsson, M. Hammar, and H. J. Haverkort. Box-trees and R-trees with near-optimal query time. Discrete and Computational Geometry, 28(3):…, ….
[3] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures using TPIE. In Proc. European Symposium on Algorithms, pages …, ….
[4] L. Arge and J. Vahrenhold. I/O-efficient dynamic planar point location. International Journal of Computational Geometry & Applications, to appear.
[5] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:…, ….
[6] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. SIGMOD International Conference on Management of Data, pages …, ….
[7] S. Berchtold, C. Böhm, and H.-P. Kriegel. Improving the query performance of high-dimensional index structures by bulk load operations. In Proc. Conference on Extending Database Technology, LNCS 1377, pages …, ….
[8] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, ….
[9] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):…, ….
[10] D. J. DeWitt, N. Kabra, J. Luo, J. M. Patel, and J.-B. Yu. Client-server Paradise. In Proc. International Conference on Very Large Databases, pages …, ….
[11] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):…, ….
[12] Y. J. García, M. A. López, and S. T. Leutenegger. A greedy algorithm for bulk loading R-trees. In Proc. 6th ACM Symposium on Advances in GIS, pages …, ….
[13] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc.
SIGMOD International Conference on Management of Data, pages 47–57, ….
[14] H. J. Haverkort, M. de Berg, and J. Gudmundsson. Box-trees for collision checking in industrial installations. In Proc. ACM Symposium on Computational Geometry, pages 53–62, ….
[15] I. Kamel and C. Faloutsos. On packing R-trees. In Proc. International Conference on Information and Knowledge Management, pages …, ….
[16] I. Kamel and C. Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In Proc. International Conference on Very Large Databases, pages …, ….
[17] K. V. R. Kanth and A. K. Singh. Optimal dynamic range searching in non-replicating index structures. In Proc. International Conference on Database Theory, LNCS 1540, pages …, ….
[18] S. T. Leutenegger, M. A. López, and J. Edgington. STR: A simple and efficient algorithm for R-tree packing. In Proc. IEEE International Conference on Data Engineering, pages …, ….
[19] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis. R-trees have grown everywhere. Technical report, available at http://www.rtreeportal.org/, 2003.
[20] O. Procopiuc, P. K. Agarwal, L. Arge, and J. S. Vitter. Bkd-tree: A dynamic scalable kd-tree. In Proc. International Symposium on Spatial and Temporal Databases, ….
[21] J. Robinson. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In Proc. SIGMOD International Conference on Management of Data, pages 10–18, ….
[22] N. Roussopoulos and D. Leifker. Direct spatial search on pictorial databases using packed R-trees. In Proc. SIGMOD International Conference on Management of Data, pages 17–31, ….
[23] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multidimensional objects. In Proc. International Conference on Very Large Databases, pages …, ….
[24] TIGER/Line(TM) Files, 1997 Technical Documentation. Washington, DC, September …. http://www.census.gov/geo/tiger/tiger97d.pdf.
More informationComplexity of UnionSplitFind Problems. Katherine Jane Lai
Complexity of UnionSplitFind Problems by Katherine Jane Lai S.B., Electrical Engineering and Computer Science, MIT, 2007 S.B., Mathematics, MIT, 2007 Submitted to the Department of Electrical Engineering
More informationChord: A Scalable Peertopeer Lookup Service for Internet Applications
Chord: A Scalable Peertopeer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT Laboratory for Computer Science chord@lcs.mit.edu
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More informationMaximizing the Spread of Influence through a Social Network
Maximizing the Spread of Influence through a Social Network David Kempe Dept. of Computer Science Cornell University, Ithaca NY kempe@cs.cornell.edu Jon Kleinberg Dept. of Computer Science Cornell University,
More informationApproximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta fandeng@cs.ualberta.ca Davood Rafiei University of Alberta drafiei@cs.ualberta.ca ABSTRACT
More information