R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants




R-Trees: A Dynamic Index Structure For Spatial Searching
A. Guttman

R-trees
- Generalization of B+-trees to higher dimensions
- Disk-based index structure
- Occupancy guarantee
- Multiple search paths
- Insertions and splits can be complex
- Accommodates non-point data easily

R-Tree
- Balanced (similar to B+ tree)
- Index node I is an n-dimensional rectangle of the form (I_0, I_1, ..., I_{n-1}), where each interval I_i is a range [a, b]
- Leaf node entry: (I, tuple_id)
- Non-leaf node entry: (I, child_ptr)
- M is the maximum number of entries per node.
- m ≤ M/2 is a parameter specifying the minimum number of entries per node.

Invariants
1. Every leaf (non-leaf) has between m and M records (children), except for the root.
2. The root has at least two children unless it is a leaf.
3. Every index entry is the smallest rectangle that contains the children (MBR = Minimum Bounding Rectangle).
4. All leaves appear at the same level.
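The four invariants above can be checked mechanically. The following is a minimal sketch, not Guttman's code: the `Node` class, the `mbr` helper, and the `check_invariants` function are hypothetical names introduced here, and rectangles are assumed to be tuples of `(low, high)` intervals, one per dimension.

```python
from dataclasses import dataclass, field

def mbr(rects):
    """Smallest rectangle (tuple of (low, high) intervals) containing all rects."""
    dims = len(rects[0])
    return tuple(
        (min(r[d][0] for r in rects), max(r[d][1] for r in rects))
        for d in range(dims)
    )

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (rect, tuple_id). Non-leaf entries: (rect, child Node).
    entries: list = field(default_factory=list)

def check_invariants(root, m, M):
    """Return the common leaf depth; raise AssertionError on a violation."""
    def visit(node, depth, is_root):
        n = len(node.entries)
        if is_root:
            # Invariant 2: root has at least two children unless it is a leaf.
            assert node.is_leaf or n >= 2, "root needs at least two children"
            assert n <= M, "root over capacity"
        else:
            # Invariant 1: between m and M entries.
            assert m <= n <= M, "occupancy out of range"
        if node.is_leaf:
            return depth
        depths = set()
        for rect, child in node.entries:
            depths.add(visit(child, depth + 1, False))
            # Invariant 3: each index entry is the tight MBR of its child.
            assert rect == mbr([r for r, _ in child.entries]), "entry is not a tight MBR"
        # Invariant 4: all leaves at the same level.
        assert len(depths) == 1, "leaves at different levels"
        return depths.pop()
    return visit(root, 0, True)
```

A checker like this is mostly useful in tests: run it after every insert or delete to catch a broken split or a stale bounding rectangle early.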

Example
[Figures: two example R-trees]

Searching
Given a search rectangle S:
1. Start at the root and locate all child nodes whose rectangles intersect S (via linear search).
2. Search the subtrees of those child nodes.
3. When you get to the leaves, return the entries whose rectangles intersect S.
- Searches may require inspecting several paths.
- Worst-case running time is not so good...

[Figure: searching for R16]
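The three search steps translate almost directly into a recursive function. This is a sketch under assumed conventions: a node is a plain dict with a `"leaf"` flag and an `"entries"` list of `(rect, payload)` pairs, and a rectangle is a tuple of `(low, high)` intervals. Neither representation comes from the slides.

```python
def intersects(a, b):
    """True if rectangles a and b overlap in every dimension (closed intervals)."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def search(node, s):
    """Return the payloads of all leaf entries whose rectangle intersects s."""
    results = []
    for rect, payload in node["entries"]:   # linear scan of the node
        if not intersects(rect, s):
            continue
        if node["leaf"]:
            results.append(payload)         # payload is a tuple id
        else:
            results.extend(search(payload, s))  # payload is a child node
    return results
```

Because sibling rectangles may overlap, the loop can descend into more than one child, which is why a search may follow several paths.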

Insertion
Insertion is done at the leaves. Where to put a new entry with rectangle R?
1. Start at the root.
2. Go down the tree by choosing the child whose rectangle needs the least enlargement to include R. In case of a tie, choose the child with the smallest area.
3. If there is room in the correct leaf node, insert it. Otherwise split the node (to be continued...).
4. Adjust the tree...
5. If the root was split into nodes N_1 and N_2, create a new root with N_1 and N_2 as children.

[Figures: step-by-step insertion example with index rectangles I1, I2, I3 and data rectangles R1, R2, R3, R8, R9, R10, R11]
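Step 2 of the insertion procedure (least enlargement, ties broken by smallest area) can be sketched as follows. The dict-based node layout and the function names are assumptions made for illustration. Splitting and tree adjustment are not shown.

```python
from math import prod

def area(r):
    """Area (volume) of a rectangle given as a tuple of (low, high) intervals."""
    return prod(hi - lo for lo, hi in r)

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def enlargement(rect, new):
    """Extra area rect needs in order to include new."""
    return area(union(rect, new)) - area(rect)

def choose_leaf(node, new):
    """Descend from node to the leaf that should receive rectangle new."""
    while not node["leaf"]:
        # Least enlargement first; smallest area breaks ties.
        _, node = min(node["entries"],
                      key=lambda e: (enlargement(e[0], new), area(e[0])))
    return node
```

Using a tuple as the sort key keeps the tie-break rule next to the primary criterion, so both are visible in one place.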

[Figures: insertion example, continued]

Splitting Nodes
- Problem: Divide M+1 entries among two nodes so that it is unlikely that the nodes are needlessly examined during a search.
- Solution: Minimize the probability of accessing the nodes during a query.
- Exhaustive algorithm.
- Quadratic algorithm.
- Linear time algorithm.

Exhaustive Search
- Minimize the sum of the probabilities of accessing the two pages by a query (requires knowledge of the range or the number of nearest neighbors).
- Try all possible combinations.
- Optimal results.
- Bad running time!

Quadratic Algorithm
1. Find the pair of entries E_1 and E_2 that maximizes area(J) - area(E_1) - area(E_2), where J is the rectangle covering both. That is, pick the pair with the least affinity: the pair that wastes the most space.
2. Put E_1 in one group and E_2 in the other.
3. If one group has M-m+1 entries, put the remaining entries into the other group and stop. If all entries have been distributed, stop.
4. For each remaining entry E, calculate d_1 and d_2, where d_i is the area increase in the covering rectangle of group i when E is added.
5. Find the E with maximum |d_1 - d_2| and add E to the group whose area will increase the least. Repeat from step 3.

Time complexity
- Algorithm is quadratic in M.
- Linear in the number of dimensions.

[Figures: quadratic split example showing different choices of seed pairs (one seed is R7; the other labels were lost in extraction). The minimum occupancy guarantee may force the remaining entries to be assigned to a particular group.]
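The quadratic split above can be sketched in a few lines. This is an illustrative version: the function names are hypothetical, rectangles are assumed to be tuples of `(low, high)` intervals, and ties in step 5 simply go to the first group (Guttman's paper specifies further tie-breaks, which are omitted here).

```python
from itertools import combinations
from math import prod

def area(r):
    return prod(hi - lo for lo, hi in r)

def union(a, b):
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def pick_seeds(rects):
    """Step 1: the pair (i, j) wasting the most area when covered together."""
    return max(combinations(range(len(rects)), 2),
               key=lambda p: area(union(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))

def quadratic_split(rects, m):
    """Split M+1 rectangles into two lists of indices, each of size >= m."""
    i, j = pick_seeds(rects)
    groups, covers = [[i], [j]], [rects[i], rects[j]]
    remaining = [k for k in range(len(rects)) if k not in (i, j)]
    limit = len(rects) - m            # equals M - m + 1 for M + 1 entries

    def growth(k, g):                 # d_g for entry k
        return area(union(covers[g], rects[k])) - area(covers[g])

    while remaining:
        # Step 3: a full group forces the rest into the other group.
        for g in (0, 1):
            if len(groups[g]) == limit:
                groups[1 - g].extend(remaining)
                return groups
        # Steps 4-5: take the entry with the strongest preference.
        k = max(remaining, key=lambda k: abs(growth(k, 0) - growth(k, 1)))
        g = 0 if growth(k, 0) <= growth(k, 1) else 1
        groups[g].append(k)
        covers[g] = union(covers[g], rects[k])
        remaining.remove(k)
    return groups
```

Seed selection examines every pair of entries and each distribution round rescans the remaining entries, which is where the quadratic cost in M comes from.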

Linear Algorithm
- For each dimension, choose the pair of entries with the largest separation (highest low value and lowest high value). Normalize by dividing by the width of the entire set along that dimension.
- Choose the two entries (and dimension) with the largest normalized separation as the initial seeds.
- Randomly, but evenly, divide the rest of the entries between the two groups.
- Algorithm is linear in M (capacity); almost no attempt at optimality.

Deletion
1. Find the entry to delete and remove it from the appropriate leaf L.
2. Set N = L and Q = ∅. (Q contains to-be-inserted entries.)
3. If N is the root, go to step 6. Else, let P be N's parent and E_N be the entry in P that points to N.
   a. If N has fewer than m entries, delete E_N from P and add the contents of N to Q.
   b. If N has at least m entries, set the rectangle of E_N to tightly enclose N.
4. Set N = P and repeat from step 3.
5. Reinsert leaf entries from Q. Reinsert non-leaf entries from Q higher up so that all leaves are at the same level.
6. If the root has 1 child, make the child the new root.

Space requirements
- (2kd + p) bytes per index entry for d dimensions: k bytes per dimension, p bytes for the pointer.
- Same for data entries with spatial extent.
- (kd + p) bytes per data entry for point data.
- Trees tend to be very wide and shallow.

Performance Tests
- Data: CENTRAL circuit cell (1057 rectangles).
- Insertion test: performance on the last 10% of inserts.
- Search test: randomly generated rectangles that retrieve about 5% of the data.
- Deletion test: delete every 10th entry.
- Page size varies from 128 bytes to 2K.
- M varies from 6 to 102.
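The entry-size formulas connect page size to node capacity M. The sketch below works through the arithmetic. The values k = 4 bytes and p = 4 bytes, and the absence of any per-page header, are assumptions chosen for illustration rather than figures stated on the slides.

```python
def index_entry_bytes(d, k, p):
    """Bytes per index entry: a low and a high value per dimension, plus a pointer."""
    return 2 * k * d + p

def point_entry_bytes(d, k, p):
    """Bytes per data entry for point data: one value per dimension, plus a pointer."""
    return k * d + p

def max_entries(page_bytes, d, k, p):
    """Capacity M of a page, assuming the whole page holds index entries."""
    return page_bytes // index_entry_bytes(d, k, p)

# Assumed: 2 dimensions, 4-byte coordinates, 4-byte pointers.
print(index_entry_bytes(2, 4, 4))    # 20
print(max_entries(128, 2, 4, 4))     # 6
print(max_entries(2048, 2, 4, 4))    # 102
```

Under these assumed sizes the computed capacities match the reported range of M (6 to 102) for pages of 128 bytes to 2K. They also show why the trees are wide and shallow: a 2K page already fans out to about a hundred children.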

Insertion performance
- With linear-time splitting, inserts spend very little time doing splits.
- Growth with page size is as expected.
- Increasing m reduces insertion cost because the minimum occupancy requirement gets used earlier in the insertion algorithm.

Deletion performance
- Deletion cost is affected by m.
- For large m:
  - More nodes become under-full (occupancy < m).
  - More reinserts take place.
  - More possible splits.
- Running time is pretty bad for m = M/2.

Search performance
- Search is relatively insensitive to the splitting algorithm.
- Less I/O with larger pages.
- More CPU cost with larger pages.
- Smaller values of m reduce the average number of entries per node, so less time is spent on search within the node.

Space Efficiency
- A stricter node-fill criterion leads to a smaller index.

Conclusions
- Linear time splitting algorithm is almost as good as the others.
- A low node-fill requirement reduces space utilization but is not significantly worse than stricter node-fill requirements.
- R-trees can be added to relational databases. Took more than 10 years!

The R*-tree: An Efficient and Robust Access Method for Points and Rectangles
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger

R*-tree
- Optimization on the R-tree.
- Minimize area, overlap, and margin (sum of the sides of a rectangle).
- Insertion:
  - At levels above leaf-1, choose the subtree as before.
  - At the leaf-1 level, choose the subtree with minimum overlap.
  - overlap(E, node) = sum of area(E ∩ entry) over all other entries in node.
  - Only marginally better than the R-tree.

Split strategy
- M = max capacity, m = min capacity.
- For each dimension, sort the M+1 values by the lower value (use the upper value to break ties).
- Consider groups containing the first m-1+k and the remaining M+2-m-k entries, with k in [1, M-2m+2].
- Evaluate the area-value, margin-value, and overlap-value for each split point.
- Example: M = 7, m = 3.
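The candidate distributions for one axis can be enumerated directly from the formula for k. The following is a small sketch in which `split_distributions` is a hypothetical helper name and rectangles are assumed to be tuples of `(low, high)` intervals. It sorts only by lower value with the upper value as tie-break, as the slide describes.

```python
def split_distributions(entries, m, axis):
    """Yield (first_group, second_group) for every legal split point on axis.

    entries holds the M + 1 rectangles of the overflowing node.
    """
    M = len(entries) - 1
    ordered = sorted(entries, key=lambda r: (r[axis][0], r[axis][1]))
    for k in range(1, M - 2 * m + 3):        # k in [1, M - 2m + 2]
        cut = m - 1 + k                      # size of the first group
        yield ordered[:cut], ordered[cut:]   # second group has M + 2 - m - k

# The slide's example: M = 7, m = 3 gives three candidates per axis.
rects = [((i, i + 1), (0, 1)) for i in range(8)]
print([(len(a), len(b)) for a, b in split_distributions(rects, 3, axis=0)])
# [(3, 5), (4, 4), (5, 3)]
```

Every candidate leaves at least m entries on each side, so the minimum-occupancy invariant holds whichever split point is chosen.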

Split strategy (continued)
- Area-value(split) = area(first group) + area(second group). Smaller area reduces the probability of access.
- Margin-value(split) = margin(first group) + margin(second group). Small margin produces better packing and less overlap.
- Overlap-value(split) = common area of the two groups. Minimize the common search area.
- Choose the split axis as the one containing the split with the smallest margin-value.
- Along the split axis, choose the split point that gives the minimum overlap-value. Use area-value to resolve ties.

Forced reinserts
- When a split occurs at level k, sort the entries in the overflowing node in descending order of the distance of their centroid from the node centroid.
- Remove the first p entries and adjust the bounding rectangle of the overflowing node.
- Reinsert the p removed entries (data or index).
- Empirical value for p = 30%.
- This reduces overlap and leads to a better structure.

Test Data
- (F1) Uniform: 100,000 rectangles.
- (F2) Cluster: centers are distributed into 640 clusters of about 1600 objects each.
- (F3) Parcel: decompose the unit square into 100,000 disjoint rectangles and increase the area of each rectangle by a factor of 2.5.
- (F4) Real-Data: 120,576 rectangles from elevation lines from cartography data.
- (F5) Gaussian: centers follow a 2-dimensional independent Gaussian distribution.
- (F6) Mixed-Uniform: 99,000 uniformly distributed small rectangles and 1,000 uniformly distributed large rectangles.

Performance
- Rectangle intersection query: all data rectangles intersecting the query rectangle.
- Point enclosure query: all data rectangles containing the query point.
- Rectangle enclosure query: all data rectangles containing the query rectangle.
- Spatial joins (intersection).
- 1K page size, M = 50.
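The R*-tree split strategy described above (margin-value picks the axis, overlap-value picks the split point, area-value breaks ties) can be sketched as follows. This follows the slide wording, which selects the axis by its single smallest margin-value. All names are hypothetical, rectangles are assumed to be tuples of `(low, high)` intervals, and `margin` is computed as the sum of the extents, which is proportional to the sum of the sides.

```python
from math import prod

def area(r):
    return prod(hi - lo for lo, hi in r)

def margin(r):
    """Sum of the extents; proportional to the sum of the sides."""
    return sum(hi - lo for lo, hi in r)

def mbr(rects):
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def overlap(a, b):
    """Common area of two rectangles (0 if they do not overlap)."""
    sides = [min(ahi, bhi) - max(alo, blo)
             for (alo, ahi), (blo, bhi) in zip(a, b)]
    return prod(sides) if all(s > 0 for s in sides) else 0

def choose_split(entries, m):
    """Return (axis, first_group, second_group) for M + 1 entries."""
    M = len(entries) - 1
    candidates = {}
    for axis in range(len(entries[0])):
        ordered = sorted(entries, key=lambda r: (r[axis][0], r[axis][1]))
        splits = []
        for k in range(1, M - 2 * m + 3):
            g1, g2 = ordered[:m - 1 + k], ordered[m - 1 + k:]
            b1, b2 = mbr(g1), mbr(g2)
            splits.append((margin(b1) + margin(b2),   # margin-value
                           overlap(b1, b2),           # overlap-value
                           area(b1) + area(b2),       # area-value
                           g1, g2))
        candidates[axis] = splits
    # Axis containing the split with the smallest margin-value.
    axis = min(candidates, key=lambda a: min(s[0] for s in candidates[a]))
    # On that axis: minimum overlap-value, area-value resolves ties.
    best = min(candidates[axis], key=lambda s: (s[1], s[2]))
    return axis, best[3], best[4]
```

For two well-separated columns of rectangles, this picks the axis that separates the columns and a split point with zero overlap between the two groups.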

Typical Performance
[Tables lost in extraction: per data file, storage utilization and spatial join results; relative performance measured in disk accesses]

Spatial Join
Test files:
- (SJ1) 1000 random rectangles from (F3) joined with (F4).
- (SJ2) 7500 random rectangles from (F3) joined with 7,536 rectangles from elevation lines.
- (SJ3) Self-join of 20,000 random rectangles from (F3).

[Figure: point dataset and range queries]

Summary of experiments
- Significant improvement over the R-tree.
- No test data for more than two dimensions.
- R*-tree is robust even for bad data distributions.
- R*-tree reduces the number of splits and is more space efficient than other R-tree variants.
- R*-tree outperforms all other R-tree variants in page I/O.

Problems
- CPU cost not calculated.
- Comparison with linear scan performance?