Data Warehousing und Data Mining

Similar documents
Multi-dimensional index structures Part I: motivation

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

Lecture 1: Data Storage & Index

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

Big Data and Scripting. Part 4: Memory Hierarchies

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases

DATABASE DESIGN - 1DL400

CSE 326: Data Structures B-Trees and B+ Trees

Vector storage and access; algorithms in GIS. This is lecture 6

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig

Chapter 13. Disk Storage, Basic File Structures, and Hashing

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

Clustering & Visualization

Chapter 13: Query Processing. Basic Steps in Query Processing

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

GiST. Amol Deshpande. March 8, University of Maryland, College Park. CMSC724: Access Methods; Indexes; GiST. Amol Deshpande.

SQL Query Evaluation. Winter Lecture 23

Physical Data Organization

Analysis of Algorithms I: Binary Search Trees

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Databases and Information Systems 1 Part 3: Storage Structures and Indices

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

New Approach of Computing Data Cubes in Data Warehousing

Data Structures for Moving Objects

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Indexing Techniques for Data Warehouses Queries. Abstract

Binary Heap Algorithms

Data storage Tree indexes

The Advantages and Disadvantages of Network Computing Nodes

Binary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * *

Project Group High- performance Flexible File System 2010 / 2011

Indexing Spatio-Temporal archive As a Preprocessing Alsuccession

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Lecture 10: Regression Trees

File Management. Chapter 12

Big Data & Scripting Part II Streaming Algorithms

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Performance Tuning for the Teradata Database

Overview of Storage and Indexing

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Persistent Data Structures

CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB

Physical DB design and tuning: outline

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Persistent Binary Search Trees

Record Storage and Primary File Organization

Chapter 12 File Management

Principles of Data Mining by Hand&Mannila&Smyth

Unsupervised Data Mining (Clustering)

D B M G Data Base and Data Mining Group of Politecnico di Torino

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

- Easy to insert & delete in O(1) time - Don t need to estimate total memory needed. - Hard to search in less than O(n) time

Data Preprocessing. Week 2

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Data Structures and Data Manipulation

CIS 631 Database Management Systems Sample Final Exam

Unit Storage Structures 1. Storage Structures. Unit 4.3

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online

Authors. Data Clustering: Algorithms and Applications

SMALL INDEX LARGE INDEX (SILT)

Fast Sequential Summation Algorithms Using Augmented Data Structures

Carnegie Mellon University. Extract from Andrew Moore's PhD Thesis: Ecient Memory-based Learning for Robot Control

Common Core Unit Summary Grades 6 to 8

External Memory Geometric Data Structures

Advanced Oracle SQL Tuning

Partitioning and Divide and Conquer Strategies

Oracle8i Spatial: Experiences with Extensible Databases

A Comparison of Dictionary Implementations

COS 318: Operating Systems

International Journal of Advance Research in Computer Science and Management Studies

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Data Structure with C

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

Transcription:

Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik

Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data Warehousing und Data Mining 2

Multidimensional Queries Conditions on more than one attribute Combined through AND (intersection) or OR (union) Partial queries: Conditions on some but not all dimensions Selects sub-cubes 2D: All beverage sales in March 2000 4D: All beverage sales in 2000 in Berlin to male customers Ulf Leser: Data Warehousing und Data Mining 3

Composite Indexes month_id product_id Point X Y P1 2 2 P2 2 2 P3 5 7 P4 5 6 P5 8 6 P6 8 9 P7 9 3 Imagine composite index on (X, Y) Box queries: efficiently supported Partial queries All points/rectangles with X coordinate between Efficiently supported All points/rectangles with Y coordinate between Not efficiently supported Ulf Leser: Data Warehousing und Data Mining 4

Composite Index One index with two attributes (X, Y) 1 2 1 4 8 2 8 3 9 1 Prefix of attribute list in index must be present in query The longer the prefix in the query, the better Alternatives Also index (Y, X) 5 attributes: O(5!) orders Combinatorial explosion, infeasible for more than 2 attributes Use independent indexes on each attribute Ulf Leser: Data Warehousing und Data Mining 5

Independent Indexes One index per attribute Index on X Index on Y Partial match query on one attribute: supported Partial match query on many attributes Compute TID lists for each attribute Intersect Ulf Leser: Data Warehousing und Data Mining 6

Independent versus Composite Index Consider 3 dimensions of range 1,...,100 1.000.000 points, uniformly (and randomly) distributed Index blocks hold 50 keys or records Index on each attribute has height 4 Find points with 40 x 50, 40 y 50, 40 z 50 Using independent indexes Using x-index, we generate TID list X ~100.000 Using y-index, we generate TID list Y ~100.000 Using z-index, we generate TID list Z ~100.000 For each index, we have 4+100.000/50=2004 IO TIDs are sorted in sequential blocks, each holding 50 TIDs Hopefully, we can keep the three lists in main memory Intersection yields ~1.000 points with 6012 IO Ulf Leser: Data Warehousing und Data Mining 7

Independent versus Composite Index Consider 3 dimensions of range 1,...,100 1.000.000 points, uniformly (and randomly) distributed Index blocks hold 50 keys or records Index on each attribute has height 4 Find points with 40 x 50, 40 y 50, 40 z 50 Using composite index (X,Y,Z) Number of indexed points doesn t change Key length increases assume blocks hold only 30 (10) keys or records Index has height 5 (6) This is worst case index blocks only 50% filled Total: 5 (6) +1000/30 (10) ~ 38 IO (104) Ulf Leser: Data Warehousing und Data Mining 8

Conclusion 1 We want composite indexes Much less IO Things get worse for bigger d TID lists don t fit into main memory paging, more IO Reading large TIDs lists again and again is more work than scanning relation once Linear scanning of relation might be faster Advantage of composite indexes grows exponentially with number of dimensions and selectivity of predicates Things get complicated if data is not uniformly distributed Dependent attributes (age weight, income, height, ) Clustering of points Histograms Ulf Leser: Data Warehousing und Data Mining 9

Conclusion 2 But: For partial queries, we would need to index all combinations Solution: Use multidimensional indexes Generally an improvement, but many problems prevail Curse of dimensionality remains Most md indexes degrade for many dimensions All md indexes have some worst-case data distribution Usually far from normal or equal distribution Bad space usage, excessive management cost, Alternative: Bitmap indexes Need to load values for all tuples and all dimensions But: Very small memory footprint, fast BIT operations Ulf Leser: Data Warehousing und Data Mining 10

Multidimensional Indexes All dimensions are equally important Neighbors are (hopefully) stored on nearby blocks Locality property Difficult to achieve, very important Supported queries Exact match point queries Partial match point queries, box queries (range queries) Nearest neighbor queries Difference to B*-trees All B-trees need a total order on the keys For more than one dimension, no such order exists Ulf Leser: Data Warehousing und Data Mining 11

Content of this Lecture Multidimensional Indexing Grid-Files General Structure Splits Kd-trees Ulf Leser: Data Warehousing und Data Mining 12

Grid-File??? Referenz Classical multidimensional index structure Simple: searching, indexing, deleting Good for uniformly distributed data Does not handle skewed data very well Many variations Design goals Index structure for point objects Support exact, partial match, range, and neighborhood queries Guaranteed two IO access Under some assumptions All dimensions are treated the same Adapts dynamically to the number of points Ulf Leser: Data Warehousing und Data Mining 13

Fixed Grid-File [Does not adapt to data distribution at all] Idea Split space into equal-spaced cuboids or cells We need maximal and minimal values in all dimensions Directory stores one pointer for each cell Points to block storing pointers to records Directory Index blocks Records / table Ulf Leser: Data Warehousing und Data Mining 14

Operations Problem: one index block can hold only m pointers Deleting a point Compute cell using coordinates Search cell in directory and load index block Search point and delete, if present Index block may become empty Eventually, the entire structure may consist of many empty blocks (unused space) Ulf Leser: Data Warehousing und Data Mining 15

Operations Inserting a point Locate and load index block If free space: insert point If no free space: generate overflow block No adaptation to skewed data distributions May degenerate to linear search if all points fall in the same (set of) block Ulf Leser: Data Warehousing und Data Mining 16

Principle of Grid-Files Partition each dimension into disjoint intervals (scales) The intersection of all intervals defines all grid cells Convex d-dimensional cuboids Grid directory holds pointer to index blocks per cell When cell overflows split cell (no overflow blocks) Each point falls into exactly one grid cell Many grid cells (a region) may point to the same index block Ulf Leser: Data Warehousing und Data Mining 17

Exact Point Search Finding a block for a given point Keep scales for each dimension in memory Look-up point coordinates in scales and derive grid coordinates Map grid coordinates to index block on disk Complexity We assume that the directory is in main memory Other techniques exist, i.e., B*-tree over grid coordinates Load index block (1st IO) Search point in index block Access record following pointer (2nd IO) Guaranteed 2 IO Ulf Leser: Data Warehousing und Data Mining 18

Range Query, Partial Match Query Range query Compute grid cell coordinate for each end point All grid directory entries in that range may contain qualifying points Partial match query Compute partial grid cell coordinates All grid directory entries with these coordinates may contain points Ulf Leser: Data Warehousing und Data Mining 19

Content of this Lecture Multidimensional Indexing Grid-Files General Structure Splits Kd-trees Ulf Leser: Data Warehousing und Data Mining 20

Inserting Points If index block has free space no problem Otherwise (1st split): Split cuboid Choose a dimension and a scale to split Create new scale, create new index block, distribute points according to the chosen split Insert point This implicitly splits all other grid cells with the same scale Choice of dimension and scale to split is difficult Optimally, we would like to split as many very full index blocks as evenly as possible This is an optimization problem in itself Ulf Leser: Data Warehousing und Data Mining 21

Example Imagine one index block holds 3 pointers maximum Note: Usually we have unevenly spaced intervals New point causes overflow Where should we split? Vertical split Splits 2 (3,4)-point blocks Leaves one 3-point block Horizontal split Splits 2 (3,4)-point blocks Leaves one 3-point block Need to consider O(d i n-1 ) regions Ulf Leser: Data Warehousing und Data Mining 22

Inserting Points -2- Otherwise (none 1st-split) Check borders of index block wrt to scales Borders form a minimal cuboid? Proceed as if it was a 1st-time split Otherwise Split region of index block into smaller cells Consider only not yet realized scales as potential split Create new index block, redistribute points Ulf Leser: Data Warehousing und Data Mining 23

Grid-File Example 1 (from J. Gehrke) (N=6) A 1 6 A 2 5 A 1 2 3 4 5 6 3 4 Ulf Leser: Data Warehousing und Data Mining 24

Grid-File Example 2 (N=6) A B A B 1 6 7 3 2 8 9 5 12 10 11 4 A B 1 23 35 47 85 10 6 2 4 6 9 11 12 Ulf Leser: Data Warehousing und Data Mining 25

Grid-File Example 3 (N=6) A 14 B A B 1 13 7 6 C B C 3 15 2 8 9 5 12 10 11 4 A B C 1 37 58 13 7 14 8 10 15 2 4 6 9 11 12 3 5 10 Ulf Leser: Data Warehousing und Data Mining 26

Grid-File Example 4 (N=6) A D 14 B 1 6 A D B B 13 7 C B C B C 16 3 15 2 8 9 5 12 10 11 4 A 1 23 78 13 35 8 13 16 47 14 85 10 15 6 B C D 2 4 6 9 11 12 3 5 10 7 14 15 Ulf Leser: Data Warehousing und Data Mining 27

Grid-File Example 5 (N=6) y 4 y 3 A H I D F B A A H I D D F F B B y 2 y 1 E C G A I G F B E E G F B C C C C B x 1 x 2 x 3 x 4 Ulf Leser: Data Warehousing und Data Mining 28

Deleting Points Search point and delete If regions become almost empty, merge index blocks A merge is the removal of a split Must build larger convex regions This can become very difficult Potentially, more than two regions need to be merged to keep convexity condition Example: Where can we merge regions?? A A H I D D F F B B A I G F B E E G F B C C C C B Ulf Leser: Data Warehousing und Data Mining 29

Conclusions Grid-Files always split parallel to the dimension axes This is not always optimal Use others than rectangles as cells: circles, polygons, etc. Might not disjointly fill the space any more Allow overlaps - R trees Good: Good bucket fill degrees Bad: Grid directory grows very fast Good: Two IO guarantee (if directory fits into memory) Each split finally becomes valid for the entire grid Need not be realized immediately, but restricts later choices Bad adaptation to unevenly skewed data The more dimensions, the worse Ulf Leser: Data Warehousing und Data Mining 30

Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data Warehousing und Data Mining 31

kd-tree??? REF Grid-File disadvantages All hyperregions of the n-dimensional space are eventually split at the same dimension / scale First cell that overflows determines split This choice is global and never undone kd-trees Multidimensional variation of binary search trees Hierarchical splitting of space into regions Regions in different subtrees may use different split positions Better adaptation to clustering of data than Grid-Files kd-tree is a main memory data structure kdb-trees: IO-optimization for layout of inner nodes (not here) Ulf Leser: Data Warehousing und Data Mining 32

General Idea Binary, rooted tree Paths are selected by dimension / value x<3 x 3 Dimensions are not statically assigned to levels of the tree Data points are only stored in leaves A leaf stores all points in a n-dim hypercube with m border planes (m n) y<1 (2,0) y (0,4) (1,1) y<3 (3,1) 1 x<5 (4,6) (3,3) y 3 y< y < 2 7 x 5 (5,6) y 2 (6,4) Ulf Leser: Data Warehousing und Data Mining 33

Example the Brick wall x<3 x 3 y<1 (2,0) y (0,4) (1,1) 1 x<5 y< 7 x 5 (4,9) 10 y<3 (3,1) y 3 (4,6) (3,3) y < 2 y 2 (6,4) 10 Ulf Leser: Data Warehousing und Data Mining 34

Local Adaptation Ulf Leser: Data Warehousing und Data Mining 35

Search Operations Exact point search? Partial match query? Range query? Nearest Neighborhood? Ulf Leser: Data Warehousing und Data Mining 36

Search Operations Exact point search In each inner node, decide direction based on split condition Search leaf for searched point Partial match query If dimension of condition in inner node is part of the query proceed as for exact match Otherwise, follow all children in parallel (multiple search paths) Range Query Ulf Leser: Data Warehousing und Data Mining 37

kd-tree Insertion Search data block of leaf If free space available insert, done Otherwise, chose split dimension and position for this block This is a local decision; remains stable for the future of the subtree Find dimension and split that divides set of points into two sets Consider current points and split in sets of approximately equal size Consider known distributions of values in different dimensions Use alternation scheme for dimensions Finding optimal split points is expensive for high dimensional data (point set needs to be sorted in each dimension) use heuristics Wrong decisions in early splits lead to tree degradation CS students at HU: Don t split at sex, place of birth, But we don t know which points will be inserted in future Ulf Leser: Data Warehousing und Data Mining 38

Summary We only scratched the surface Other: Partitioned hashing, R-tree, Quad-Teee, X tree, hb tree, R+ tree, UB tree, Store objects more than once; other than rectangular shapes; map coordinates into integers; Curse of dimensionality Your intuition plays tricks on you The more dimensions, the more difficult Balancing the tree, finding MBBs, split decisions, etc. All structures begin to degenerate somehow Often, linear scanning of objects is quicker Or: Compute lower-dimensional, relationship-preserving approximations of objects and filter on those Ulf Leser: Data Warehousing und Data Mining 39

Literatur [GG98] Gaede, V. and Günther, O. (1998). "Multidimensional Access Methods." ACM Computing Surveys 30(2): 170-231.??? Gehrke??? KD Tree??? Grid-Files Ulf Leser: Data Warehousing und Data Mining 40