GiST. Amol Deshpande. March 8, 2012. University of Maryland, College Park. CMSC724: Access Methods; Indexes; GiST. Amol Deshpande.



Similar documents
Data Warehousing und Data Mining

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

Generalizing Database Access Methods

Physical Data Organization

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB

DATABASE DESIGN - 1DL400

Databases and Information Systems 1 Part 3: Storage Structures and Indices

Large Databases. Abstract. Many indexing approaches for high dimensional data points have evolved into very complex

Lecture 1: Data Storage & Index

External Memory Geometric Data Structures

THE concept of Big Data refers to systems conveying

TheHow and Why of Having a Successful Home Office

Vector storage and access; algorithms in GIS. This is lecture 6

QuickDB Yet YetAnother Database Management System?

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

File Management. Chapter 12

Oracle8i Spatial: Experiences with Extensible Databases

Data Structures for Moving Objects

Chapter 13: Query Processing. Basic Steps in Query Processing

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

Cluster Analysis for Optimal Indexing

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

SQL Query Evaluation. Winter Lecture 23

Indexing and Processing Big Data

BIRCH: An Efficient Data Clustering Method For Very Large Databases

High-Performance Extensible Indexing

Overview of Storage and Indexing

Database Design Patterns. Winter Lecture 24

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

In-Memory Databases MemSQL

Big Data and Scripting. Part 4: Memory Hierarchies

Multi-dimensional index structures Part I: motivation

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

ISSUES IN SPATIAL DATABASES AND GEOGRAPHIC INFORMATION SYSTEMS (GIS) HANAN SAMET

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Data storage Tree indexes

Performance of KDB-Trees with Query-Based Splitting*

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

Unsupervised Data Mining (Clustering)

Expanding the CASEsim Framework to Facilitate Load Balancing of Social Network Simulations

Clustering & Visualization

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

10. Creating and Maintaining Geographic Databases. Learning objectives. Keywords and concepts. Overview. Definitions

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

Symbol Tables. Introduction

Partitioning and Divide and Conquer Strategies

Sorting Hierarchical Data in External Memory for Archiving

Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases

The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases

Index Internals Heikki Linnakangas / Pivotal

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

A Novel Index Supporting High Volume Data Warehouse Insertions

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Spatial Data Management over Flash Memory

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Storage and File Structure

InfiniteGraph: The Distributed Graph Database

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

MySQL Storage Engines

PETASCALE DATA STORAGE INSTITUTE. Petascale storage issues. 3 universities, 5 labs, G. Gibson, CMU, PI

MIDAS: Multi-Attribute Indexing for Distributed Architecture Systems

COS 318: Operating Systems. File Layout and Directories. Topics. File System Components. Steps to Open A File

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

From Last Time: Remove (Delete) Operation

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Accelerating network security services with fast packet classification

File Management Chapters 10, 11, 12

Graph Database Proof of Concept Report

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Concurrency Control. Chapter 17. Comp 521 Files and Databases Fall

Hypertable Architecture Overview

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Providing Diversity in K-Nearest Neighbor Query Results

The DC-tree: A Fully Dynamic Index Structure for Data Warehouses

Persistent Binary Search Trees

The Classical Architecture. Storage 1 / 36

Chapter 12 File Management

Chapter 12 File Management. Roadmap

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

CIS 631 Database Management Systems Sample Final Exam

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Data Structures for Databases

A Comparison of Dictionary Implementations

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Efficient Structure Oriented Storage of XML Documents Using ORDBMS

Chapter 12 File Management

Hierarchical Data Visualization. Ai Nakatani IAT 814 February 21, 2007

Learning Outcomes. COMP202 Complexity of Algorithms. Binary Search Trees and Other Search Trees

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

A Deduplication-based Data Archiving System

Large-Scale Spatial Data Management on Modern Parallel and Distributed Platforms

FAWN - a Fast Array of Wimpy Nodes

Transcription:

CMSC724: ; Indexes; : Generalized University of Maryland, College Park March 8, 2012

Outline : Generalized : Generalized

: Why? Most queries have predicates in them Accessing only the needed records key in performance How relations are stored? Heap files: sequential scans, very very fast Index structures: random accesses to the needed data Scan performance increasing much faster than seeks Must perform much better than Scan No point in building indexes on small relations Note the emphasis on queries Utility depends more on query workload than data Why not use in-memory indexes? Data exchange with disks in units of blocks : Generalized

Support iterator interface: open (possibly with selection condition) get_next, close, insert, delete, update_field Performance goals: Disk I/O (or time) for lookups, inserts, deletes cold vs hot lookups Compare to sequential (seek times improving much slower) : Generalized

At a high level: partition: partition a dataset or domain into buckets label: provide a label for each bucket Sometimes hierarchically (trees), sometimes not (hashing) Partitioning is critical for good performance In s, we take the sorted order as a given since natural For all other cases, unclear how to pack : Generalized

Outline : Generalized : Generalized

B+ Trees Balanced; Optimal for 1-d (O(log B n) search/update) (B = 100-500) Utilization kept around 70% or so In practice, deletes do not result in merging of siblings : Generalized

B+ Tree Inserts : Generalized

R-Tree (Points) : Generalized

Quad Tree : Generalized

R-Tree (Rectangles) : Generalized

sented by two predicates: a centroid point (weighted center of mass) and a bounding sphere radius. 2 Each SS-Tree tree (a) (A) 2 1 (d) (c) (e) (B) (A) 3 (B) (b) (a) (b) (c) (d) (e) centroid query datum (record) (a) (b) Figure 1. Similarity search using an SS-tree. (a) Spatial coverage diagram. (b) Tree structure diagram. 2 Even though the SS-tree does center its bounding sphere on the centroid, the bounding sphere need not be the (unique) minimum boundbound well a bor. F that w to rem 3.2. I Th canno rithm tree b ond, u beyon : Generalized and upd cates. 3 is base here be

Key Differentiating Factors Data (1-d vs 2-d vs n-d, points vs intervals vs spatial objects vs images etc...) Query types (equality, range, nearest-neighbor etc..) Balanced (B+-tre, R-Tree) vs Unbalanced (Quad-tree) Balanced predictable, uniform performance, but hard to guarantee Typically requires rearranging of labels, splits etc.. : Generalized

Key Differentiating Factors Data- vs Space-partitioning Data-partitioning: the buckets are disjoint, but the labels may not be May have to follow down the tree along multiple paths (e.g. R*-tree) Space-partitioniing: the labels are disjoint, buckets may not be e.g. Quad-trees, K-D-B trees May have to duplicate pointers to data items in the leaves (e.g. R+-tree) B+-trees: disjoint buckets and disjoint labels : Generalized

Imagine: The data is already stored on disk in some arbitrary order and you are not allowed to change it How would you best build a hierarchical index structure on top for equality queries? Use BloomFilters? No option is going to work well if the data is really arbitrary and you can t find something to order by But an interesting thought exercise E.g. you might discover the third byte is different across blocks, but same within a block Clustering of data is critical Obvious for 1-d data, not so clear otherwise Not academic question: Imagine building an index over a distributed Grid/P2P data : Generalized

Implementation Issues: Concurrency & recovery Very important issue Intertwined to a very complex degree Can t build access methods in vacuum for just querying Cost estimation Query optimizer needs this information Bulk loading Important have to be done very often : Generalized

Outline : Generalized : Generalized

Balanced, 50% utilization In practice, allow getting lower when doing deletes Inserts are more common, something will get inserted there soon O(log d (n)) search, update, delete costs d = order of the tree (number of keys per page) Optimizations Key compression Bulk loading algorithms Faster count queries Maintain counts of tuples in the subtrees at the inner nodes : Generalized

Concurrency: not 2PL - too slow Release locks on upper-level nodes as soon as possible Too many queries want to access them Tricky when doing inserts Higher-level pages may have to be split One Solution: Do preparatory splits when inserting We will talk about this in detail later B-Trees? The inner nodes store pointers to data all pointers to data are at the leaves s make many things significantly easier E.g. Can do a scan on the leaves for range queries : Generalized

Outline : Generalized : Generalized

Indexes B+-tree: Optimal for one-dimensional data (for range/equality queries) Linear hashing, extensible hashing: Only equality queries Multi-dimensional point data Range queries: (20 < age < 30) (10, 000 < salary) Space-filling curves: Impose a linear order on the multi-d data (limited applicability) Grid-files, Quad-trees, K-D-B trees etc... Nearest-Neighbor queries/similarity searches (very common) Many indexing structures designed, no real consensus Golden rule: Must beat sequential scans : Generalized

Indexes Multi-dimensional spatial data (regions, areas etc.) Queries: find all objects that contain this point, find objects that overlap this object variants Intervals (e.g. time periods associated with events) Queries: Find intervals containing this point, find overlapping intervals etc... Several optimality results exists (see work by Lars Arge, Jeff Vitter et al.) XML? Some work, but generally considered very hard : Generalized Search Tree : Generalized

Indexes: A Timeline Multidimensional ; Gaede, Gunther; ACM Surveys 1998 P-Tree (Jagadish 90c) Packed R-Tree (Roussopoulos et al.. 85) Cell Tree (Gunther 88) Cell Tree With Oversize Parallel R-tree Shelves (Gunther 91) (Kamel/Faloutsos 92) Sphere Tree TR*-Tree (Oosterom 90) (Schneider/Kriegel 91) X-tree (Berchtold et al. 96) B-Tree (Bayer et al. 72) R-Tree (Guttman 84) R+-Tree (Sellis et al. 87) R*-Tree (Beckmann et al. 90) TV-Tree (Lin et al. 94) Grid File (Nievergelt et al. 81) BANG File (Freeston 87) Hilbert R-tree P-Tree General Grid File (Schiwietz 93) (Kamel/Faloutsos 94) (Blanken et al. 90) Extendible Hashing (Fagin et al. 79) Linear (Larson 80, Litwin 80) Adaptive K-D-Tree (Bentley 79) Hashing EXCELL (Tamminen 82) Buddy Tree Two-Level Grid File (Seeger/Kriegel 90a) (Hinrichs 85) R-File Multi-Level Grid File Twin Grid File (Hutflesz et al. 90) (Whang/Krishnamurthy 85) (Hutflesz et al. 88b) Extended K-D-Tree (Matsuyama et al. 84) Multi-Layer Grid File (Six/Widmayer 88) Nested Interpolationbased Grid File MOLHPE (Kriegel/Seeger 86) (Ouksel/Mayer 92) Z-Hashing (Hutflesz et al. 88a) BV-Tree (Freeston 95) Filter Tree (Sevcik/Koudas 96) : Generalized K-D-Tree K-D-B-Tree (Robinson 81) Quantile Hashing (Kriegel/Seeger 87) Segment Indexes (Kolovson/Stonebraker 91) (Bentley 75) LSD-Tree lz-hashing Point Quadtree (Klinger 71) BSP-Tree (Fuchs et al. 80) SKD-Tree (Henrich et al. 89) (Hutflesz et a. 91) (Ooi et al. 87) KD2B-Tree (Oosterom 90) G-Tree BD-Tree PLOP-Hashing (Kumar 94a) Region Quadtree (Ohsawa/Sakauchi 83) (Kriegel/Seeger 88) GBD-Tree (Finkel/Bentley 74) (Ohsawa/Sakauchi 90) Space-Filling Curves (Morton 66) Z-Ordering (Orenstein/Merrett 84) Fieldtree Interpolation-Based hb-tree (Lomet/Salzberg 89) DOT hb! -Tree (Evangelidis et al. 95) (Frank 83) Grid File (Ouksel 85) (Faloutsos/Rong 91) 1966 71 75 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 Figure: A Timeline of Indexes (From Multidimensional Access Methods; Gaede, Gunther; ACM Surveys 1998)

Indexes Much work since then as well When reading these papers, ask yourself: Does it beat sequential scan sufficiently? Is the data/workload realistic? Are there other natural workloads on which it may not do well? Little rigor in this area Some theoretical work, but problems not easy Curse of Dimensionality : Generalized

Outline : Generalized : Generalized

R-Tree : Generalized Figure: R-Tree

R-Tree Multi-dimensional, spatial data (points, rectangles) Queries: point in polygon, polygon in polygon, overlaps polygon, contains polygon labels: bounding rectangles Bulk loading? Hard... Search: Follow all paths. Insert: Driven by minimizing area enlargement Split algorithms: exhaustive, quadratic, linear Delete: re-insert if too small (why?) : Generalized

R*-Tree R*-Tree: An improvement over R-Tree Analysis: four optimization metrics? Minimize area covered by a directory rectangle. Minimize overlap Minimize margin Maximize storage utilization Conflict with each other E.g., minimizing area covered conflicts with maximizing storage utilization. : Generalized

R*-Tree Changes: Insertion algorithm slightly different (minimizes overlap at leaf level) Aggressive re-insertion (30% entries re-inserted at the same level) Causes headaches with concurrency Lots of heuristics... backed by experimental analysis... Shown to outperform R-Trees in many experimental studies : Generalized

R+-Tree R+-Tree Space-partitioning version of R*-Tree Forces non-overlapping keys So same data item must be inserted into multiple leaf nodes BUT don t need to follow all paths down to the leaves : Generalized

Outline : Generalized : Generalized

Motivation: Extensibility New applications: GIS, multimedia (e.g. pictures), CAD, libraries, sequence datasets (Bioinformatics) etc... Object-relational systems allow defining new data types What about querying over them? Two proposed solutions: Option 1: Design new index structures Option 2: Try to use an existing index structure E.g. Can use space-filling curves and s to support querying multi-dimensional data Limited applicability (only equality/range queries) What if the app needs new type of query? Postgres paper had an initial discussion : Generalized

Heap B + -tree Heap B + -tree R-tree Query Processing Query Processing New AMs require custom CC&R code R-tree R*-tree SR-tree Storage, Buffer, Log,... Storage, Buffer, Log,... : Generalized (a) Standard ORDBMS (b) ORDBMS with Figure 1: Access method interfaces the database extender s perspective. Figure: From: High-Performance Extensible Indexing; ructure that is easily extensible in both the data types it n index and the query types itcan support. encapsutes core Kornacker; indexing functionality VLDB such1999 as search and update erations, concurrency and recovery. The interface, e the existing extensibility interfaces, defines a set of nctions for implementing an external AM. However, the ist interface raises the level of abstraction, only requiring e AM developer to implement the semantics of the data pe that is being indexed and those operational properties nal handler to catch segmentation violations and bus errors, 2 allocationofadditionalstack space, if necessary, and checking of parameters for NULL values. This makes a UDF call considerably more expensive than a regular function call. In Oracle and DB2, UDFs can be executed in a separate address space, which even adds to the cost. When dividing the full functionality of an AM between the database server and an external extension module, as does, UDF calls become inevitable, which can become a performance prob-

Generalized Search Tree Allows extending data types as well as queries A single data structure that can handle many different index structures So a single code-base How to use? Register six methods with the database system Start inserting/deleting/querying Allows indexing arbitrary types of data Question: Is it always a good idea to use a? No Some data and query workloads not amenable to indexing (scan preferred) Ideas later further developed in Theory of Indexability : Generalized

Key insight: An index structure partitions the input data hierarchically associates a predicate with each subtree, that is true for all data items in the subtree Predicates on a single path from root to a leaf may not agree with each other, but must agree with the leaf Nodes contain between 2 to M entries (except root) Leaf nodes: (p, ptr) ptr: pointer to actual record p: predicate satisfied by the record Non-leaf nodes: (p, ptr) ptr: pointer to another node p: predicate satisfied by all records in the subtree below : Generalized

Need to define 6 functions for a new search tree Consistent(E, q): given a E = (ptr, p), might q be satisfied by some tuple in the subtree below ptr search/querying (search also done when inserting) Union: Find new keys inserts (when add a new E to a page) Compress, Decompress: used for compressing the keys Required to implement common optimizations Penalty, PickSplit: Used for deciding where to insert a new object, and how to split a page if needed Very similar to R-Tree in many regards : Generalized

Algorithms Search: Query q Find all pairs E = (p, ptr) such that consistent(e, q) Follow down all the pointers Somewhat inefficient, can do better for linear orders Insert/Delete: Keep the tree balanced Use the methods Penalty, PickSplit etc, to decide where to insert/delete, how to rearrange Discussion of how to support R*-Tree illustrates the difficulties simulating an index precisely But as with all generalized/extensible approaches, you gain in simplicity what you sacrifice in performance : Generalized

: Analysis Why an index might perform poorly? Predicates at inner nodes not effective traverse down unnecessarily Reason 1: Too much overlap between the data items themselves (e.g. spatial data) Reason 2: Key compression not good, ie., the predicates can t approximate the subtree well (e.g. homework question) Predicates too large in size in number of bytes If predicates are allowed to be large, then search will be more efficient (fewer paths travelled) BUT large predicates tree height increases Trade-off between key compression and search effectiveness : Generalized

: Analysis; Why an index might perform poorly? Poor storage utilization (too much wasted space) Trade-off between this and above factors Better storage utilization increases key overlap Since we may have to force items together that shouldn t be BUT poor storage utilization tree height increases Complex trade-offs that can only be answered given a dataset and a query workload : Generalized

: Using Bloom Filters as Predicates The predicates are Bloom filters of the items in the subtree (as in homework) Only supports equality queries Consistent(E, q): Check if q the Bloom filter Union: Bit-wise union etc... Why bad? If the Bloom Filter size is small (say 10 bits): Too much key overlap All bits in the higher level nodes likely to be set to 1 Many predicates will satisy Consistent(E, q) If the Bloom Filter size large (say 1000 bits): Number of keys per page too low The height of the tree will be large Not sure if anybody has formally analyzed this : Generalized

: Other issues Much later work at Berkeley: Project Website Indexability theory Formalisms for analysis: different types of inefficiencies AmDB: A visual debugger and profiler Concurrency, recovery etc: Not addressed in this paper See High-Performance Extensible Indexing : Generalized

: How extensible is it? Generalizes many ideas, but some limitations Recall the discussion of R*-Trees in the paper From: Generalizing Search...; P. Aoki; ICDE 98 SS-Tree: Similarity search tree For nearest-neighbor queries Records organized in hierarchical clusters For each cluster: store centroid, bounding sphere radius Search: Traverse down the tree looking for the sphere closest to the query point Several Issues: e.g. Search is not depth-first Need a few modifications (see the paper above) : Generalized