R-tree: Indexing Structure for Data in Multidimensional

Similar documents
R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

Data Warehousing und Data Mining

CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB

External Memory Geometric Data Structures

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

Vector storage and access; algorithms in GIS. This is lecture 6

Big Data and Scripting. Part 4: Memory Hierarchies

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

Data Structures for Moving Objects

Binary Heap Algorithms

Performance of KDB-Trees with Query-Based Splitting*

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

From Last Time: Remove (Delete) Operation

Dynamic 3-sided Planar Range Queries with Expected Doubly Logarithmic Time

GiST. Amol Deshpande. March 8, University of Maryland, College Park. CMSC724: Access Methods; Indexes; GiST. Amol Deshpande.

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases

MIDAS: Multi-Attribute Indexing for Distributed Architecture Systems

Data storage Tree indexes

Computational Geometry. Lecture 1: Introduction and Convex Hulls

Analysis of Algorithms I: Binary Search Trees

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Internet Packet Filter Management and Rectangle Geometry

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Persistent Data Structures and Planar Point Location

Closest Pair Problem

Multi-dimensional index structures Part I: motivation

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Heaps & Priority Queues in the C++ STL 2-3 Trees

Colored Range Searching on Internal Memory

Indexing Spatio-Temporal archive As a Preprocessing Alsuccession

CSE 326: Data Structures B-Trees and B+ Trees

Converting a Number from Decimal to Binary

Physical Data Organization

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Using Data-Oblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science

Clustering & Visualization

Chapter 13: Query Processing. Basic Steps in Query Processing

Generalizing Database Access Methods

Binary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * *

root node level: internal node edge leaf node Data Structures & Algorithms McQuain

Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments

A binary search tree or BST is a binary tree that is either empty or in which the data element of each node has a key, and:

Cluster Analysis for Optimal Indexing

Providing Diversity in K-Nearest Neighbor Query Results

Binary Trees and Huffman Encoding Binary Search Trees

Arithmetic Coding: Introduction

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Load-Balancing Spatially Located Computations using Rectangular Partitions

Line Segment Intersection

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

Algorithms. Algorithms GEOMETRIC APPLICATIONS OF BSTS. 1d range search line segment intersection kd trees interval search trees rectangle intersection

Data Structures Fibonacci Heaps, Amortized Analysis

TheHow and Why of Having a Successful Home Office

ISSUES IN SPATIAL DATABASES AND GEOGRAPHIC INFORMATION SYSTEMS (GIS) HANAN SAMET

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

The DC-tree: A Fully Dynamic Index Structure for Data Warehouses

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications

THE concept of Big Data refers to systems conveying

Carnegie Mellon University. Extract from Andrew Moore's PhD Thesis: Ecient Memory-based Learning for Robot Control

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

5.4 Closest Pair of Points

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Accelerating network security services with fast packet classification

Caching Dynamic Skyline Queries

Lecture 1: Data Storage & Index

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Binary Space Partitions

Questions 1 through 25 are worth 2 points each. Choose one best answer for each.

International Journal of Advance Research in Computer Science and Management Studies

DATABASE DESIGN - 1DL400

B490 Mining the Big Data. 2 Clustering

CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting

An Evaluation of Self-adjusting Binary Search Tree Techniques

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Longest Common Extensions via Fingerprinting

Efficient Updates for OLAP Range Queries on Flash Memory

Data Structures. Jaehyun Park. CS 97SI Stanford University. June 29, 2015

Quality and Efficiency in Kernel Density Estimates for Large Data

Fast Matching of Binary Features

Databases and Information Systems 1 Part 3: Storage Structures and Indices

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Binary Heaps. CSE 373 Data Structures

EE602 Algorithms GEOMETRIC INTERSECTION CHAPTER 27

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite

Algorithms and Data Structures

Dynamic Programming. Lecture Overview Introduction

Binary Search Trees CMPSC 122

Large Databases. Abstract. Many indexing approaches for high dimensional data points have evolved into very complex

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. The Best-fit Heuristic for the Rectangular Strip Packing Problem: An Efficient Implementation

Data Mining Practical Machine Learning Tools and Techniques

A Comparison of Dictionary Implementations

Approximation Algorithms

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

CS711008Z Algorithm Design and Analysis

Data Structure and Algorithm I Midterm Examination 120 points Time: 9:10am-12:10pm (180 minutes), Friday, November 12, 2010

Transcription:

R-tree: Indexing Structure for Data in Multidimensional Space Feifei Li (Many slides made available by Ke Yi)

Until now: Data Structures q 4 q 3 q 1 q 2 General planer range searching (in 2-dimensional space): kdb-tree: O( N T B ) query, O( N space B ) B

Other results Many other results for e.g. Higher dimensional range searching Range counting, range/stabbing max, and stabbing queries Halfspace (and other special cases) of range searching Queries on moving objects Proximity queries (closest pair, nearest neighbor, point location) Structures for objects other than points (bounding rectangles) Many heuristic structures in database community

Point Enclosure Queries Dual of planar range searching problem Report all rectangles containing query point (x,y) y x Internal memory: Can be solved in O(N) space and O(log N + T) time Persistent interval tree

Rectangle Range Searching Report all rectangles intersecting query rectangle Q Q Often used in practice when handling complex geometric objects Store minimal bounding rectangles (MBR)

Rectangle Data Structures: R-Tree [Guttman, SIGMOD84] Most common practically used rectangle range searching structure Similar to B-tree Rectangles in leaves (on same level) Internal nodes contain MBR of rectangles below each child Note: Arbitrary order in leaves/grouping order

Example

Example

Example

Example

Example

(Point) Query: Recursively visit relevant nodes

Query Efficiency The fewer rectangles intersected the better

Intuitively Rectangle Order Objects close together in same leaves small rectangles queries descend in few subtrees Grouping in internal nodes? Small area of MBRs Small perimeter of MBRs Little overlap among MBRs

R-tree Insertion Algorithm When not yet at a leaf (choose subtree): Determine rectangle whose area increment after insertion is smallest (small area heuristic) Increase this rectangle if necessary and recurse At a leaf: Insert if room, otherwise Split Node (while trying to minimize area)

Node Split New MBRs

Linear Split Heuristic Determine the furthest pair R 1 and R 2 : the seeds for sets S 1 and S 2 While not all MBRs distributed Add next MBR to the set whose MBR increases the least

Quadratic Split Heuristic Determine R1 and R2 with largest area(mbr of R1 and R2)- area(r1) - area(r2): the seeds for sets S1 and S2 While not all MBRs distributed Determine of every not yet distributed rectangle R j : d 1 = area increment of S 1 R j d 2 = area increment of S 2 R j Choose Ri with maximal d 1 -d 2 and add to the set with smallest area increment

R-tree Deletion Algorithm Find the leaf (node) and delete object; determine new (possibly smaller) MBR If the node is too empty: Delete the node recursively at its parent Insert all entries of the deleted node into the R-tree

R*-trees [Beckmann et al. SIGMOD90] Why try to minimize area? Why not overlap, perimeter, R*-tree: Better heuristics for Choose Subtree and Split Node

R-Tree Variants Many, many R-tree variants (heuristics) have been proposed Often bulk-loaded R-trees are used Much faster than repeated insertions Better space utilization Can optimize more globally Can be updated using previous update algorithms

How to Build an R-Tree Repeated insertions [Guttman84] R + -tree [Sellis et al. 87] R*-tree [Beckmann et al. 90] Bulkloading Hilbert R-Tree [Kamel and Faloutos 94] Top-down Greedy Split [Garcia et al. 98] Advantages: * Much faster than repeated insertions * Better space utilization * Usually produce R-trees with higher quality 22

R-Tree Variant: Hilbert R-Tree Hilbert Curve To build a Hilbert R-Tree (cost: O(N/B log M/B N) I/Os) Sort the rectangles by the Hilbert values of their centers Build a B-tree on top 23

Z-ordering Basic assumption: Finite precision in the representation of each co-ordinate, K bits (2 K values) The address space is a square (image) and represented as a 2 K x 2 K array Each element is called a pixel

Z-ordering Impose a linear ordering on the pixels of the image 1 dimensional problem 11 10 01 00 A 00 01 10 11 B Z A = shuffle(xa, ya) = shuffle( 01, 11 ) = 0111 = (7) 10 Z B = shuffle( 01, 01 ) = 0011

Z-ordering Given a point (x, y) and the precision K find the pixel for the point and then compute the z-value Given a set of points, use a B+-tree to index the z-values A range (rectangular) query in 2-d is mapped to a set of ranges in 1- d

Queries Find the z-values that contained in the query and then the ranges 11 10 01 00 Q A 00 01 10 11 Q B Q A range [4, 7] Q B ranges [2,3] and [8,9]

Handling Regions A region breaks into one or more pieces, each one with different z-value We try to minimize the number of pieces in the representation: precision/space overhead trade-off Z R1 = 0010 = (2) Z R2 = 1000 = (8) 11 10 Z G = 11 ( 11 is the common prefix) 01 00 00 01 10 11

Z-ordering for Regions Break the space into 4 equal quadrants: level-1 blocks Level-i block: one of the four equal quadrants of a level-(i-1) block Pixel: level-k blocks, image level-0 block For a level-i block: all its pixels have the same prefix up to 2i bits; the z-value of the block

Hilbert Curve We want points that are close in 2d to be close in the 1d Note that in 2d there are 4 neighbors for each point where in 1d only 2. Z-curve has some jumps that we would like to avoid Hilbert curve avoids the jumps : recursive definition

Hilbert Curve- example It has been shown that in general Hilbert is better than the other space filling curves for retrieval [Jag90] Hi (order-i) Hilbert curve for 2 i x2 i array H1 H2... H(n+1)

R-trees - variations A: plane-sweep on HILBERT curve!

R-trees - variations A: plane-sweep on HILBERT curve! In fact, it can be made dynamic (how?), as well as to handle regions (how?)

R-trees - variations Dynamic ( Hilbert R-tree): each point has an h -value (hilbert value) insertions: like a B-tree on the h- value but also store MBR, for searches

R-trees - variations Data structure of a node? ~B-tree LHV x-low, ylow x-high, y-high ptr h-value >= LHV & MBRs: inside parent MBR

R-trees - variations Data structure of a node? ~ R-tree LHV x-low, ylow x-high, y-high ptr h-value >= LHV & MBRs: inside parent MBR

Theoretical Musings None of existing R-tree variants has worst-case query performance guarantee! In the worst-case, a query can visit all nodes in the tree even when the output size is zero R-tree is a generalized kdb-tree, so can we achieve O( N / B T / B)? Priority R-Tree [Arge, de Berg, Haverkort, and Yi, SIGMOD04] The first R-tree variant that answers a query by visiting O( N / B T / B) nodes in the worst case * T: Output size It is optimal! * Follows from the kdb-tree lower bound. 37

Roadmap Pseudo-PR-Tree Has the desired O( N / B T / B) worst-case guarantee Not a real R-tree Transform a pseudo-pr-tree into a PR-tree A real R-tree Maintain the worst-case guarantee Experiments PR-tree Hilbert R-tree (2D and 4D) TGS-R-tree 38

Pseudo-PR-Tree 1. Place B extreme rectangles from each direction in priority leaves 2. Split remaining rectangles by x min coordinates (round-robin using x min, y min, x max, y max like a 4d kd-tree) 3. Recursively build sub-trees Query in O( N T B B ) I/Os O(T/B) nodes with priority leaf completely reported O( N B ) nodes with no priority leaf completely reported

Pseudo-PR-Tree: Query Complexity Nodes v visited where all rectangles in at least one of the priority leaves of v s parent are reported: O(T/B) Let v be a node visited but none of the priority leaves at its parent are reported completely, consider v s parent u 2d 4d Q y min = y max (Q) x max = x min (Q)

Pseudo-PR-Tree: Query Complexity The cell in the 4d kd-tree of u is intersected by two different 3-dimensional hyperplanes defined by sides of query Q The intersection of each pair of such 3- dimensional hyper-planes is a 2- dimensional hyper-plane Lemma: # of cells in a d-dimensional kdtree that intersect an axis-parallel f- dimensional hyper-plane is O((N/B) f/d ) So, # such cells in a 4d kd-tree: O( N / B) u Total # nodes visited: O( N / B T / B)

PR-tree from Pseudo-PR-Tree

Query Complexity Remains Unchanged # nodes visited on leaf level B T B N / / Next level: 2 2 / / / / B T B B N B N 3 2 2 3 / / / / / / B T B B N B B N B N

PR-Tree PR-tree construction in O( log ) I/Os B M B B Pseudo-PR-tree in N N O( log ) I/Os B M B B Cost dominated by leaf level N N Updates O(log B N) I/Os using known heuristics * Loss of worst-case query guarantee O(log 2 N) I/Os using logarithmic method B * Worst-case query efficiency maintained Extending to d-dimensions Optimal O((N/B) 1-1/d + T/B) query