DATABASE DESIGN - 1DL400

1 DATABASE DESIGN - 1DL400, Spring 2015
A course on modern database systems
Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
02/03/15

2 Introduction to Indexing
Elmasri/Navathe ch. 16 and 17; Padron-McCarthy/Risch ch. 21 and 22

3 Content
- Files of records
- Operations on files
- Scans
- Operations on scans
- Ordered and unordered files
- Index sequential and hash files
- Index concepts
- Types of indexes
- Clustering indexes
- Ordered indexes, B-trees, B+-trees
- Unordered indexes, hash indexes
- Index properties and evaluation metrics

4 Files of records
- A file managed by a DBMS is a sequence of records, where each record is a collection of data values representing a tuple.
- A file descriptor (or file header) holds meta-information that describes the file and the records it stores, such as the attribute names and data types of the fields in the records.
- The records in a file are usually uniform, i.e. they are of the same size and contain the same kind of data values in each position.
- File records are grouped into blocks of records. The file descriptor contains the disk addresses of the file blocks.
- The blocking factor bfr for a file is the (average) number of file records stored in a disk block.
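As a small worked illustration of the blocking factor, the sketch below assumes fixed-length records that never span two blocks, so that bfr is the block size divided by the record size, rounded down. The function names and the numbers (8192-byte blocks, 100-byte records, one million records) are made up for the example and do not come from the slides.

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Records per block, assuming fixed-length, unspanned records."""
    return block_size // record_size

def blocks_needed(num_records: int, bfr: int) -> int:
    """Number of blocks needed to store num_records records."""
    return math.ceil(num_records / bfr)

bfr = blocking_factor(8192, 100)        # 81 records fit in one block
blocks = blocks_needed(1_000_000, bfr)  # 12346 blocks for one million records
```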

5 Operations on disk files
Typical file operations include:
- OPEN: Readies the file for access and associates with it a pointer, called a cursor, that refers to a current file record representing a current tuple.
- FIND: Searches for the first record in the file that satisfies a certain condition and makes it the current cursor position.
- FINDNEXT: Searches for the next record after the current cursor position that satisfies a certain condition and makes it the current cursor position.
- READ: Reads the record at the current cursor position into program variables.
- DELETE: Removes the record at the current cursor position from the file, usually by marking the record to indicate that it is no longer valid.
- MODIFY: Changes the values of some fields in the record at the current cursor position, i.e. copies values from program variables to the current record.
- INSERT: Inserts a new record into the file, i.e. copies values from program variables into a new record, and makes it the current cursor position.
- CLOSE: Terminates access to the file.
- REORGANIZE: Reorganizes the file records. For example, records marked as deleted are physically removed from the file, or batches of new records are merged into the file.

6 Streamed access to database operators
- The result of an internal database operator (e.g. a join) is usually represented as a scan (~stream) of tuples.
- Cursors represent positions in scans. A cursor can be moved iteratively forward over a scan until end-of-scan is reached.
- SQL queries are translated by the query optimizer into programs called execution plans.
- Execution plans call physical relational algebra operators, which are programs that iterate over scans of tuples and iteratively produce new scans of tuples as results.
- Scans can be defined over:
  - file records
  - index records (reverse scans over files and indexes, where the scan moves backwards, are also possible)
  - records produced (emitted) by some physical relational algebra operator.

7 Streamed access to database operators: SCAN operators
The following basic operators are defined over scans:
- OPEN: opens the scan for reading tuples and sets the cursor to the first tuple.
- NEXT: reads the next tuple in the scan into some program variables and makes it the current cursor position of the scan.
- EOS: true if there are no more tuples in the scan.
- CLOSE: closes the scan and releases all its resources.
Scans over physical files are represented by file pointers, where a new record is read for each NEXT call.
Scans over the result of a physical operator are usually represented as iterator objects with next, eos, and close methods.
Intermediate results may either be materialized as a list of blocks of tuples or generated by some code (e.g. performing a join) when next is called.
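The four scan operators can be pictured as a small iterator interface. The following is only a sketch: the class name ListScan and the helper read_all are hypothetical, and an in-memory list stands in for a file of records.

```python
class ListScan:
    """Scan over an in-memory list that stands in for a file of records."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = None                  # cursor is undefined until open()

    def open(self):
        self._pos = 0                     # cursor at the first tuple

    def eos(self):
        return self._pos >= len(self._records)

    def next(self):
        record = self._records[self._pos]
        self._pos += 1                    # advance the cursor
        return record

    def close(self):
        self._records = []                # release resources
        self._pos = None

def read_all(scan):
    """Typical consumer loop: open, call next until end-of-scan, close."""
    result = []
    scan.open()
    while not scan.eos():
        result.append(scan.next())
    scan.close()
    return result
```

A consumer never needs to know whether the tuples come from a file, an index, or another operator, as long as the object offers these four methods.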

8 Scan-based database operators
Examples of physical algebra operators (code) using and producing scans:
- FINDALL: emits all tuples in the result of a query. Such a scan operator call is produced, for example, by the query processor to iteratively emit the result of a query. Execution plans usually contain many FINDALL operators.
- STOPAFTER n: iteratively emits the first n tuples of another scan.
- DISTINCT: iteratively emits the tuples of another scan with duplicate tuples removed.
- SORT: emits the tuples of another scan in sorted order.
- INDEXSCAN: iteratively emits the tuples matching the key of an index.
- REVERSEINDEXSCAN: iteratively emits matching index tuples in reverse order.
- MATERIALIZE: stores each tuple of a scan in a file.
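Some of these operators can be sketched as Python generators that consume one scan and produce another. This is an illustration only, not how any particular DBMS implements them; the function names are chosen for the example, and any Python iterable plays the role of a scan.

```python
from itertools import islice

def stop_after(scan, n):
    """STOPAFTER n: emit only the first n tuples of the input scan."""
    return islice(scan, n)

def distinct(scan):
    """DISTINCT: emit each tuple the first time it is seen."""
    seen = set()
    for tup in scan:
        if tup not in seen:
            seen.add(tup)
            yield tup

def sort(scan, key=None):
    """SORT: must consume the whole input before emitting anything."""
    yield from sorted(scan, key=key)

def materialize(scan):
    """MATERIALIZE: store every tuple (here in a list instead of a file)."""
    return list(scan)

# Operators compose into a pipeline, much like an execution plan.
plan = stop_after(sort(distinct(iter([3, 1, 3, 2, 1]))), 2)
result = materialize(plan)   # [1, 2]
```

Note that stop_after and distinct are streaming, while sort is blocking: it cannot emit its first tuple until it has read its entire input.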

9 Unordered files
- Also called a heap file.
- New records are inserted at the end of the file.
- A linear search through the file records is necessary to search for a record. This requires reading and searching half the file blocks on average, and hence does not scale as the file grows.
- Record insertion is quite efficient.
- Reading the records in order of a particular field requires sorting the file records, which does not scale.

10 Ordered files
- Also called an index sequential (ISAM) file.
- File records are kept sorted by the values of an ordering field.
- Insertion is rather expensive: records must be inserted in the correct order. It is common to keep a separate unordered overflow (or transaction) file for batching new records to improve insertion efficiency; this is periodically merged with the main ordered file.
- A binary search can be used to search for a record on its ordering field value. For a file of b blocks this requires reading and searching about log2(b) blocks on average, an improvement over linear search.
- Reading the records in order of the ordering field is quite efficient.
- Efficiency is measured in the number of disk blocks read.
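To make the difference in block reads concrete, the sketch below counts how many blocks a linear search and a binary search touch. It is a simplified model: a file is a Python list of non-empty blocks, each block a sorted list of keys, and the function names are chosen for the example.

```python
def linear_search_blocks(blocks, key):
    """Heap-file style search: read blocks one by one until the key is found."""
    reads = 0
    for block in blocks:
        reads += 1
        if key in block:
            return True, reads
    return False, reads

def binary_search_blocks(blocks, key):
    """Ordered-file search: halve the range of candidate blocks per read."""
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]            # one block read
        reads += 1
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            return key in block, reads
    return False, reads

# 1024 blocks with 4 keys each: keys 0..4095 in sorted order.
blocks = [[i * 4 + j for j in range(4)] for i in range(1024)]
found_lin, reads_lin = linear_search_blocks(blocks, 2049)   # 513 block reads
found_bin, reads_bin = binary_search_blocks(blocks, 2049)   # at most 11 block reads
```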

11 Hash files
- Hashing for disk files is called external hashing.
- The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1. Typically, a bucket corresponds to one disk block in a file.
- Each bucket holds one or several records (i.e. tuples).
- One of the fields of the records is designated to be the hash key of the file.
- The record with hash key value K is stored in bucket i, where i = h(K) and h is the hashing function.
- Search and update are very efficient on equality of the hash key.
- Collisions occur when a new record hashes to a bucket that is already full. An overflow file is kept for storing such records. Overflow records that hash to each bucket can be linked together.
- Main disadvantages of static external hashing:
  - The fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
  - Ordered access on the hash key is quite inefficient (requires sorting the records).

12 Hash files (contd.)

13 Hash files - overflow handling using chaining
- Overflow locations are usually kept by extending the array with a number of overflow positions and by adding, for each bucket, a pointer to the chain of overflow records.
- In addition, a pointer field is added to each overflow record.
- A collision is resolved by placing the new record and its key in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
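Static external hashing with overflow chains can be sketched as follows. Several simplifications are assumed here and are not from the slides: keys are integers, the hash function is h(K) = K mod M, and each overflow chain is a Python list rather than a linked chain of overflow locations. The class name StaticHashFile is hypothetical.

```python
class StaticHashFile:
    """M fixed buckets of limited capacity, plus one overflow chain per bucket."""

    def __init__(self, num_buckets, bucket_capacity):
        self.m = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = [[] for _ in range(num_buckets)]   # chain per bucket

    def _h(self, key):
        return key % self.m            # assumed hash function h(K) = K mod M

    def insert(self, key, record):
        i = self._h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:                          # collision: the bucket is already full
            self.overflow[i].append((key, record))

    def find(self, key):
        i = self._h(key)
        for k, record in self.buckets[i] + self.overflow[i]:
            if k == key:
                return record
        return None

f = StaticHashFile(num_buckets=4, bucket_capacity=2)
for k in (1, 5, 9):                    # all three keys hash to bucket 1
    f.insert(k, f"record {k}")
```

With a capacity of two records per bucket, key 9 ends up in the overflow chain of bucket 1, and a lookup has to follow that chain after searching the bucket itself.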

14 Basic index concepts
- Indexes are data structures used to speed up access to sets of records stored in a database (on disk or in main memory), e.g. the author catalog in a library.
- An index (file) typically requires much less storage space than the original data (file).
- An index consists of records, called index entries, of the form (key, value).
- The (search) key of an index is the attribute(s) of the indexed set of tuples used to look up records, e.g. SSN, ISBN.
- The key is a record of one or several key values, which are usually stored directly in the index entry (e.g. SSN + ACCOUNT#).
- The value is a tuple that stores the corresponding data values. The value field is usually a pointer to a data record storing the values.
- Two basic kinds of indices:
  - Ordered indices: search keys are stored in sorted order.
  - Hash indices: search keys are distributed uniformly across buckets using a hash function.

15 Types of indexes
Primary index
- Defined on an ordered data file.
- The data file is ordered on one or several key field(s).
- Includes one index entry for each block in the data file; the index entry has the key field value of the first record in the block, which is called the block anchor.
- This makes primary indexes very compact.
Clustering index
- Defined on an ordered data file.
- The data file is ordered on one or several non-key field(s), unlike a primary index, which requires that the ordering field of the data file has a distinct value for each record.
- The index entry for each distinct value of the search key points to the first data block that contains records with that field value.
- For example, think of an index on four-character strings used to index a file ordered on long strings.

16 Types of indexes (cont.)
Secondary index
- A secondary index provides a secondary means of accessing a file for which some primary access already exists.
- It is a non-clustering index, since the indexed records are not ordered by the index keys.
- Retrieving all records pointed to from a secondary index can be very slow if the table is large.
Unique index
- A unique index indexes a key and thus has a single value for each key value.
- A non-unique index indexes a non-key attribute and has a set of values for each key value.

17 Clustered and unclustered indexes
(Figure: a clustered index next to an unclustered (or non-clustered) index.)

18 More index concepts
- The index is often specified on one attribute of the indexed tuples, but can also be specified on several attributes.
- The index is sometimes called an access path to the indexed attribute.
- A search over an index yields a scan whose cursor points to file records.
- Scans over ordered indexes are usually represented by data structures holding the upper and lower limits of the key values in a search, along with the cursor.
- Indexes can also be characterized as dense or sparse:
  - A dense index has an index entry for every search key value (and hence every record) in the data file. A secondary index is usually dense.
  - A sparse (or nondense) index, on the other hand, has index entries for only some of the search values. A primary index is usually sparse.
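A sparse primary index with one entry per block (the block anchor) can be sketched with Python's bisect module. The data layout is an assumption for illustration: the ordered data file is a list of blocks, each block a list of (key, record) pairs sorted by key, and the function names are hypothetical.

```python
from bisect import bisect_right

def build_sparse_index(blocks):
    """One index entry per block: the key of the block's first record (anchor)."""
    return [block[0][0] for block in blocks]

def lookup(blocks, anchors, key):
    """Find the last block whose anchor is <= key, then search inside it."""
    pos = bisect_right(anchors, key) - 1
    if pos < 0:
        return None                    # key is smaller than every anchor
    for k, record in blocks[pos]:
        if k == key:
            return record
    return None

blocks = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e")]]
anchors = build_sparse_index(blocks)   # [10, 30, 50]
```

Only three index entries are needed for five records here, which is why a sparse index stays compact; a dense index would hold all five keys.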

19 Ordered indexes - search trees
(Figure 18.8, Elmasri/Navathe 6 ed US: a node in a search tree of order n, with q ≤ n pointers to subtrees below.)
Problems with general search trees:
- Insertion and deletion of data file records result in changes to the structure of the search tree.
- There are algorithms that guarantee that the search-tree conditions continue to hold after modifications.
- After an update of a search tree one can get the following problems:
  1. imbalance in the search tree, which usually results in slower search
  2. empty nodes (after deletions).
- To avoid these problems one can use different types of B-trees (balanced trees) that fulfil stricter requirements than general search trees.

20 Ordered indexes using B-trees and B+-trees
- Most ordered multilevel indexes use a variation of the highly scalable B-tree data structure (Bayer, Acta Informatica 1(2), 1972).
- B-trees and B+-trees are automatically rebalanced trees with many children for each node (large fan-out).
- B-trees and B+-trees are variations of search trees that allow efficient insertion and deletion of search values.
- In B-trees and B+-trees, each node occupies one disk block.
  - One disk block at a time is read into main memory by the DBMS.
  - The DBMS maintains a pool of disk blocks in main memory.
  - When the pool is full, disk blocks are flushed to disk.
- In main memory each B-tree node should have a size close to the cache line size used, to avoid memory cache misses.
- B-trees have also been shown to perform very well in modern main memories, where there are big differences in speed between data in caches and data in the rest of the memory (G. Graefe & P-Å. Larson, B-tree Indexes and CPU Caches, ICDE 2001).

21 B-tree and B+-tree indexes
- Each node is kept between half full and completely full.
  - On average the nodes are 63% full.
- An insertion into a node that is not full is quite efficient.
  - Just fill an empty key/value slot in the node.
- If a node is full, the insertion causes a split into two nodes.
  - Splitting may propagate to neighboring tree levels.
- A deletion is quite efficient if the node does not become less than half full.
- If a deletion causes a node to become less than half full, it must be merged with neighboring nodes.
- Differences between B-trees and B+-trees:
  - In a B-tree, pointers to data records exist at all levels of the tree.
  - In a B+-tree, all pointers to data records exist at the leaf-level nodes.
  - A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree.

22 B+-tree index files
A B+-tree of order n is a rooted tree satisfying the following properties:
- All paths from root to leaf are of the same length.
- Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
- A leaf node has between ⌈(n-1)/2⌉ and n-1 values.
- Special cases: if the root is not a leaf, it has at least 2 children. If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and n-1 values.
Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.

23 B+-tree node structure
A typical node:
  P_1 | K_1 | P_2 | ... | P_{n-1} | K_{n-1} | P_n
- K_i are the search-key values.
- P_i are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search-keys in a node are ordered: K_1 < K_2 < K_3 < ... < K_{n-1}.

24 Leaf nodes in B+-trees
- All leaf nodes are at the same level, and each leaf node has between ⌈(n-1)/2⌉ and n-1 values.
- For i = 1, 2, ..., n-1, pointer P_i points either to a file record with search-key value K_i, or to a bucket of pointers to file records, each record having search-key value K_i. The bucket structure is only needed if the search-key does not form a primary key.
- If L_i and L_j are leaf nodes and i < j, L_i's search-key values are less than L_j's search-key values.
- P_n points to the next leaf node in search-key order.
(Figure: a leaf node with search-key values Brighton and Downtown, whose pointers lead to the matching records in the account file.)

25 Non-leaf nodes in B+-trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with n pointers:
- Each internal node (except the root) has between ⌈n/2⌉ and n children (tree pointers).
- If the root is an internal node, it has at least two children.
- An internal node with q pointers, q ≤ n, has q-1 search keys (values).
- All the search-keys in the subtree to which P_1 points are less than K_1.
- For 2 ≤ i ≤ n-1, all the search-keys in the subtree to which P_i points have values greater than or equal to K_{i-1} and less than K_i.
- All the search-keys in the subtree to which P_m points are greater than or equal to K_{m-1}, where m is the number of pointers in the node.
Node layout:
  P_1 | K_1 | P_2 | ... | P_{n-1} | K_{n-1} | P_n

26 Example of a B+-tree
(Figure: B+-tree for account file (n = 3), with non-leaf search-keys Perryridge, Mianus, Redwood and leaf search-keys Brighton, Downtown, Mianus, Perryridge, Redwood, Round Hill.)

27 Example of a B+-tree
(Figure: B+-tree for account file (n = 5), with root search-key Perryridge and leaf search-keys Brighton, Downtown, Mianus, Perryridge, Redwood, Round Hill.)
- Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n = 5).
- Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
- The root must have at least 2 children.

28 Observations about B+-trees
- Since the inter-node connections are made by pointers, there is no assumption that logically close blocks in the B+-tree are physically close.
- The non-leaf levels of the B+-tree form a hierarchy of sparse indexes.
- The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), so searches can be conducted efficiently.
- Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time.

29 Queries on B+-trees
In processing a query, a path is traversed in the tree from the root to some leaf node.
Find all records with a search-key value of k:
1. Start with the root node.
   - Examine the node for the smallest search-key value > k.
   - If such a value exists, assume it is K_i. Then follow P_i to the child node.
   - Otherwise k ≥ K_{m-1}, where there are m pointers in the node. Then follow P_m to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on that node and follow the corresponding pointer.
3. Eventually a leaf node is reached. If key K_i = k, follow pointer P_i to the desired record or bucket. Otherwise no record with search-key value k exists.
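The lookup procedure can be sketched in a few lines of Python. The Node class and the find function are hypothetical names, and the small tree built at the end is one reading of the n = 3 account-file example (root Perryridge, internal nodes Mianus and Redwood); record pointers are represented by plain strings and the next-leaf pointer is left out.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys            # K_1 < K_2 < ... in sorted order
        self.pointers = pointers    # children (non-leaf) or record pointers (leaf)
        self.is_leaf = is_leaf

def find(root, k):
    node = root
    while not node.is_leaf:
        # Index of the smallest key > k; equals len(keys) if there is none,
        # in which case the last pointer P_m is followed.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, pointer in zip(node.keys, node.pointers):
        if key == k:
            return pointer
    return None                     # no record with search-key value k

def leaf(*keys):
    return Node(list(keys), [f"record:{key}" for key in keys], True)

left = Node(["Mianus"], [leaf("Brighton", "Downtown"), leaf("Mianus")], False)
right = Node(["Redwood"], [leaf("Perryridge"), leaf("Redwood", "Round Hill")], False)
root = Node(["Perryridge"], [left, right], False)
```

Each step of the while loop corresponds to one node (block) read, so the number of reads equals the height of the tree.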

30 Updates on B+-trees: insertion
- Find the leaf node in which the search-key value would appear.
- If the search-key value is already in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
- If the search-key value is not there, add the record to the main file and create a bucket if necessary. Then:
  - if there is room in the leaf node, insert the (search-key value, record/bucket pointer) pair into the leaf node at the appropriate position;
  - if there is no room in the leaf node, split it and insert the (search-key value, record/bucket pointer) pair as discussed on the next slide.

31 Updates on B+-trees: insertion (cont.)
Splitting a node:
- Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node and the rest in a new node.
- Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up.
The splitting of nodes proceeds upwards until a node that is not full is found. In the worst case the root node is split, increasing the height of the tree by 1.
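The leaf split rule can be sketched as a single function. The name split_leaf and the representation of a leaf as a list of (key, pointer) pairs are assumptions for the example; propagating the (k, p) pair into the parent is left out.

```python
import math

def split_leaf(entries, new_entry, n):
    """Split a full leaf of a B+-tree of order n.

    entries: the n-1 sorted (key, pointer) pairs already in the leaf.
    Returns (original node entries, new node entries, key to insert in parent).
    """
    all_entries = sorted(entries + [new_entry])   # n pairs in sorted order
    cut = math.ceil(n / 2)                        # first ceil(n/2) pairs stay
    original, new_node = all_entries[:cut], all_entries[cut:]
    separator = new_node[0][0]                    # least key value in the new node
    return original, new_node, separator

# Inserting Clearview into the full leaf (Brighton, Downtown) with n = 3.
full_leaf = [("Brighton", "r1"), ("Downtown", "r2")]
original, new_node, separator = split_leaf(full_leaf, ("Clearview", "r3"), n=3)
```

With these inputs the original node keeps Brighton and Clearview, the new node holds Downtown, and Downtown is the key handed to the parent.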

32 Updates on B+-trees: insertion (cont.)
(Figure: B+-tree before and after insertion of Clearview. Before: non-leaf search-keys Perryridge, Mianus, Redwood; leaf search-keys Brighton, Downtown, Mianus, Perryridge, Redwood, Round Hill. After: non-leaf search-keys Perryridge, Downtown, Mianus, Redwood; leaf search-keys Brighton, Clearview, Downtown, Mianus, Perryridge, Redwood, Round Hill.)

33 Updates on B+-trees: deletion
- Find the record to be deleted, and remove it from the main file and from the bucket (if present).
- Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty.
- If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then:
  - Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
  - Delete the pair (K_{i-1}, P_i), where P_i is the pointer to the deleted node, from its parent, recursively using the above procedure.

34 Updates on B+-trees: deletion (cont.)
- Otherwise, if the node has too few entries due to the removal, and the entries in the node and a sibling do not fit into a single node, then:
  - Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
  - Update the corresponding search-key value in the parent of the node.
- The node deletions may cascade upwards until a node that has ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

35 Examples of B+-tree deletion
(Figure: result after deleting Downtown from account. Before: non-leaf search-keys Perryridge, Downtown, Mianus, Redwood; leaf search-keys Brighton, Clearview, Downtown, Mianus, Perryridge, Redwood, Round Hill. After: non-leaf search-keys Perryridge, Mianus, Redwood; leaf search-keys Brighton, Clearview, Mianus, Perryridge, Redwood, Round Hill.)
The removal of the leaf node containing Downtown did not result in its parent having too few pointers, so the cascaded deletions stopped with the deleted leaf node's parent.

36 Examples of B+-tree deletion (cont.)
(Figure: deletion of Perryridge instead of Downtown. Before: non-leaf search-keys Perryridge, Downtown, Mianus, Redwood; leaf search-keys Brighton, Clearview, Downtown, Mianus, Perryridge, Redwood, Round Hill. After: non-leaf search-keys Mianus, Downtown, Redwood; leaf search-keys Brighton, Clearview, Downtown, Mianus, Redwood, Round Hill.)
The deleted Perryridge node's parent became too small, but its sibling did not have space to accept one more pointer, so redistribution is performed. Observe that the root node's search-key value changes as a result.

37 B-tree index files
- Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates redundant storage of search keys.
- Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.
- Generalized B-tree leaf node:
    P_1 | K_1 | P_2 | ... | P_{n-1} | K_{n-1} | P_n
- Nonleaf node in B-trees, where the extra pointers B_i are the bucket or file record pointers:
    P_1 | B_1 | K_1 | P_2 | B_2 | K_2 | ... | P_{m-1} | B_{m-1} | K_{m-1} | P_m

38 B-trees example
- Nodes in a B-tree are uniform: ((key, value, subtree), ...)
  - Each node in a B-tree contains an array of key/value pairs and pointers to subtrees.
  - There is no difference between leaf nodes and intermediate nodes.
- Assume there are 10^8 tuples in the index.
- Assume a block size of 8192 bytes and that each B-tree node is on average 63% full.
- Assume each (key, value, subtree) triple uses 12 bytes.
  - 8192 * 0.63 / 12 ≈ 430 triples per node.
  - The average depth of the B-tree will be log_430(10^8) ≈ 3.0.
  - A key/value pair can be accessed with 3 block reads (disk accesses) on average.
  - Root node kept in main memory => 2 blocks to read.

39 B+-trees example
- In a B+-tree, intermediate nodes contain keys and pointers to subtrees (no values).
  - Intermediate nodes: ((key, subtree), ...)
- In a B+-tree, only the leaf-level nodes contain pointers to data records.
  - Leaf nodes: ((key, value), ...). The leaf nodes are linked to enable fast scans.
- Because there are fewer data pointers in B+-trees, the fan-out (average number of children) becomes larger with B+-trees than with B-trees.
- Assume a block size of 8192 bytes and that each B+-tree node is on average 63% full.
- Assume each (key, value) or (key, subtree) pair uses 8 bytes.
  - 8192 * 0.63 / 8 ≈ 645 pairs per node.
  - The average depth of the B+-tree will be log_645(10^8) ≈ 2.8.
  - With the root node kept in main memory, a key/value pair can be accessed with 2.8 - 1 = 1.8 block reads on average.
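The arithmetic of the two examples (10^8 tuples, 8192-byte blocks, 63% fill, 12-byte triples versus 8-byte pairs) can be checked with a few lines of Python. The function names are chosen for this sketch.

```python
import math

def entries_per_node(block_size, fill_factor, entry_size):
    """Average number of entries in a node, rounded down."""
    return int(block_size * fill_factor / entry_size)

def average_depth(num_tuples, fan_out):
    """Depth estimate: logarithm of the tuple count with the fan-out as base."""
    return math.log(num_tuples, fan_out)

btree_fanout = entries_per_node(8192, 0.63, 12)       # 430 triples per node
bplus_fanout = entries_per_node(8192, 0.63, 8)        # 645 pairs per node

btree_depth = average_depth(10**8, btree_fanout)      # about 3.0
bplus_depth = average_depth(10**8, bplus_fanout)      # about 2.8
```

The smaller entries of the B+-tree give a larger fan-out and therefore a slightly shallower tree for the same number of tuples.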

40 Hash indexes
- Hash indexes organize the keys, with their associated value pointers, into a hash table structure.
- Hash indexes are very fast for accessing a value (or a set of values) for a given key.
- Hash indexes do not support range searches.
- Hash indexes require dynamic hash tables that gracefully grow or shrink without significant delays as the database is updated.
  - Regular hashing is problematic when files grow and shrink dynamically.
  - Dynamic and graceful scale-up and scale-down of tables is needed in DBMSs.
- Special hashing techniques have been developed to allow incremental and dynamic growth and shrinking of the number of file records in hash files.
  - The most used one is called LH, Linear Hashing (Litwin, VLDB 1980).
  - Linear Hashing has also been shown to be preferable when storing dynamic hash tables in main memory (P-Å Larson, Dynamic Hash Tables, CACM 31(4), 1988).

41 Example of a hash index

42 Outline of Linear Hashing (LH)
- No directory is created.
- One starts with M buckets [0, 1, ..., M-1], the hash function h_0(k) = k mod 2^0 M, and the number of bucket divisions n = 0.
- If the file load factor becomes too high, a new bucket M+n is created, and bucket number n is split between bucket n and bucket M+n via a new hash function h_1(k) = k mod 2^1 M.
- Thus we can keep the file load factor l within a wanted interval by letting it trigger splits and merges.
- l = r / (bfr * N), where l = file load factor, r = number of records, bfr = blocking factor, N = number of buckets.

43 Linear Hashing (LH)
- LH is a dynamic hash algorithm.
- A hash file has a number of buckets with some capacity b >> 1.
- There is not one hash function but a series of hash functions h_i(k) of keys k as the hash file evolves.
  - Typically hash by division: h_i(k) = k mod 2^i, i = 0, 1, 2, 3, 4, ...
- Buckets are split by replacing h_i with h_{i+1}, i = 0, 1, ..., as the file grows.
- As the file grows and the load (number of keys) of the hash table thereby increases beyond some threshold, buckets are successively split.
- When a bucket is split, about b/2 keys move to the new bucket.
- Shrinking the file is similarly possible by joining buckets when the table load decreases below some threshold.
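The addressing and splitting rules of Linear Hashing can be sketched as follows. This is a minimal illustration under several assumptions that are not in the slides: keys are integers, buckets are Python lists without a hard capacity (so overflow handling is implicit), the split threshold of 0.8 is arbitrary, and only growth is implemented. The names i (level) and p (next bucket to split) follow the file-evolution example; the class name LinearHashFile is hypothetical.

```python
class LinearHashFile:
    def __init__(self, m=1, bucket_capacity=4, max_load=0.8):
        self.m = m                    # initial number of buckets M
        self.b = bucket_capacity      # nominal bucket capacity b
        self.max_load = max_load      # assumed split threshold
        self.i = 0                    # level: h_i and h_{i+1} are in use
        self.p = 0                    # next bucket to split
        self.buckets = [[] for _ in range(m)]
        self.count = 0

    def _h(self, level, key):
        return key % (2 ** level * self.m)      # h_level(k) = k mod 2^level M

    def address(self, key):
        a = self._h(self.i, key)
        if a < self.p:                # bucket a has already been split this round
            a = self._h(self.i + 1, key)
        return a

    def load_factor(self):
        return self.count / (self.b * len(self.buckets))   # l = r / (bfr * N)

    def insert(self, key):
        self.buckets[self.address(key)].append(key)
        self.count += 1
        if self.load_factor() > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.p]
        self.buckets[self.p] = []
        self.buckets.append([])       # new bucket number p + 2^i * M
        for key in old:               # redistribute with h_{i+1}
            self.buckets[self._h(self.i + 1, key)].append(key)
        self.p += 1
        if self.p == 2 ** self.i * self.m:      # every bucket of this round is split
            self.p = 0
            self.i += 1

    def contains(self, key):
        return key in self.buckets[self.address(key)]

lh = LinearHashFile(m=1, bucket_capacity=4)
for key in range(100):
    lh.insert(key)
```

Note that the bucket that is split is always bucket p, not necessarily the bucket that received the new key; this is what lets the file grow one bucket at a time without a directory.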

44 LH example - file evolution
(Figure: b = 4, i = 0, p = 0; one bucket, addressed with h_0.)

45 LH file evolution
(Figure: b = 4, i = 1, p = 0; buckets addressed with h_1, h_1.)

46 LH file evolution
(Figure: b = 4, i = 1, p = 0; buckets addressed with h_1, h_1.)

47 LH file evolution
(Figure: b = 4, i = 1, p = 1; buckets addressed with h_2, h_1, h_2.)

48 LH file evolution
(Figure: b = 4, i = 1, p = 1; buckets addressed with h_2, h_1, h_2.)

49 LH file evolution
(Figure: b = 4, i = 1, p = 1; buckets addressed with h_2, h_2, h_2, h_2.)

50 LH file evolution
(Figure: b = 4, i = 2, p = 0; buckets addressed with h_2, h_2, h_2, h_2.)

51 Main-memory LH
- LH has been shown to perform very well also for main-memory hash tables (P-Å Larson 1988).
- No need to specify the size when the hash table is allocated.
- Dynamic growing and shrinking.
- In main memory it is possible to have no buckets but just overflow chains (i.e. b = 1). As for B-trees, a bucket size close to the cache line size is preferred.
- When the table is, say, 200% full (twice as many keys as the size of the table) after an insert, move p forward and add a bucket.
- When the table is, say, 50% full after a delete, move p backwards and delete a bucket.
- Problem: a dynamically extensible array is needed to hold the hash table.