Lecture 1: Data Storage & Index (R&G Chapters 8-11)
[Figure: the DBMS architecture stack — Query Execution and Optimization, Relational Operators, File & Access Methods, Buffer Management, Disk Space Management, with Concurrency Control and the Recovery Manager alongside]
Where are we? [DBMS architecture stack, repeated from the title slide]
Magnetic disk: reads, writes, and transfers data in units of blocks (pages). (Figure courtesy of R. Burns)
[Figure: a real disk from Seagate Technology Corporation, showing the arm, platter, actuator, and spindle]
Data access on a disk:
Access time = seek time + rotational delay + transfer time
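The formula can be turned into a back-of-the-envelope calculator. All drive parameters below (seek time, RPM, page size, transfer rate) are hypothetical example values, not figures from the slides:

```python
# Back-of-the-envelope disk access-time estimate.
# Access time = seek time + rotational delay + transfer time.

def access_time_ms(seek_ms, rpm, page_bytes, transfer_mb_per_s):
    """Average time (ms) to read one page from a spinning disk."""
    rotational_ms = 0.5 * 60_000 / rpm                  # average delay: half a revolution
    transfer_ms = page_bytes / (transfer_mb_per_s * 1_000_000) * 1_000
    return seek_ms + rotational_ms + transfer_ms

# e.g. 9 ms seek, 7200 RPM (about 4.17 ms average rotational delay),
# an 8 KB page at 100 MB/s: roughly 13.25 ms in total
t = access_time_ms(9.0, 7200, 8192, 100)
```

Note how seek and rotational delay dominate: the 8 KB transfer itself takes under 0.1 ms, which is why sequential I/O is so much cheaper per page than random I/O.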
The disk space manager allocates and deallocates pages on disk, provides the abstraction of pages, and maintains the list of free blocks. Basic interface:
- allocate_page: allocate one or more new pages, removing them from the list of free pages.
- deallocate_page: deallocate one or more pages, returning them to the list of free pages.
- read_page
- write_page
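The interface above can be sketched as follows. This is a minimal in-memory stand-in (a dict instead of a real disk); the method names follow the slide, everything else is an illustrative assumption:

```python
# Minimal sketch of the disk space manager interface, backed by an
# in-memory dict standing in for the disk.

class DiskSpaceManager:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # ids of free pages
        self.pages = {}                      # page_id -> bytes

    def allocate_page(self, n=1):
        """Allocate n new pages, removing them from the free list."""
        if len(self.free) < n:
            raise RuntimeError("disk full")
        ids = [self.free.pop() for _ in range(n)]
        for pid in ids:
            self.pages[pid] = b""
        return ids

    def deallocate_page(self, page_ids):
        """Deallocate pages, returning them to the free list."""
        for pid in page_ids:
            self.pages.pop(pid, None)
            self.free.append(pid)

    def read_page(self, pid):
        return self.pages[pid]

    def write_page(self, pid, data):
        self.pages[pid] = data
```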
Where are we? [DBMS architecture stack, repeated from the title slide]
To avoid reading/writing pages from disk on every access, the available memory is used as a buffer pool, divided into frames that each hold a page from disk. Page read/write requests go through the buffer pool. Note: data must be in RAM for the DBMS to operate on it.
Page maintenance in a buffer pool (initially pin_count = 0):
1) Pin a frame when its page is requested (pin_count++).
2) Unpin a frame when its page is released (pin_count--).
A page is dirty if it has been modified but not yet written back to disk.
How to process a page request?
1. If the page is already in a frame f_i: increment the pin_count of f_i and return f_i.
2. Otherwise, if an unused frame f_j exists, use it; else choose a frame f_j for replacement (this choice is critical to performance).
3. If f_j is dirty, write its page to disk.
4. Read page p into f_j and return f_j.
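The four steps above can be sketched as a small buffer pool. The `FakeDisk` class and the frame layout are illustrative assumptions, and the victim choice here is just "first unpinned frame" — a real pool would plug in LRU or Clock:

```python
# Sketch of the page-request flow: hit -> pin; miss -> evict if full,
# write back if dirty, then read the page in.

class FakeDisk:
    """Stand-in disk: page_id -> bytes."""
    def __init__(self, pages):
        self.pages = dict(pages)
    def read_page(self, pid):
        return self.pages[pid]
    def write_page(self, pid, data):
        self.pages[pid] = data

class BufferPool:
    def __init__(self, num_frames, disk):
        self.disk = disk
        self.capacity = num_frames
        self.frames = {}   # page_id -> {"pin_count", "dirty", "data"}

    def request_page(self, pid):
        if pid in self.frames:                    # 1. hit: pin and return
            self.frames[pid]["pin_count"] += 1
            return self.frames[pid]
        if len(self.frames) >= self.capacity:     # 2. no free frame: evict
            victim = next(p for p, f in self.frames.items()
                          if f["pin_count"] == 0)
            v = self.frames.pop(victim)
            if v["dirty"]:                        # 3. write back dirty page
                self.disk.write_page(victim, v["data"])
        f = {"pin_count": 1, "dirty": False,      # 4. read the page in, pinned
             "data": self.disk.read_page(pid)}
        self.frames[pid] = f
        return f

    def release_page(self, pid, dirty=False):
        self.frames[pid]["pin_count"] -= 1
        self.frames[pid]["dirty"] |= dirty
```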
A page replacement policy determines which frame to replace. General rule: keep pages that might be accessed soon in the future. A frame is considered for replacement only if its pin_count == 0.
LRU (Least Recently Used) policy:
- Choose the frame that hasn't been used for the longest time.
- Implemented as a queue of frames with pin_count == 0: a frame is inserted when its pin_count drops to 0, removed when its pin_count goes above 0, and the frame chosen for replacement is taken from the front of the queue.
What is the assumption of LRU?
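The LRU queue described above can be sketched with an `OrderedDict` (insertion-ordered, so the oldest unpinned frame sits at the front); the class and method names are illustrative:

```python
from collections import OrderedDict

# LRU as a queue of unpinned frames: enter when pin_count drops to 0,
# leave when repinned; the victim is the frame unpinned longest ago.

class LRUPolicy:
    def __init__(self):
        self.queue = OrderedDict()       # frame_id -> None, oldest first

    def frame_unpinned(self, fid):
        """Call when a frame's pin_count just reached 0."""
        self.queue[fid] = None

    def frame_pinned(self, fid):
        """Call when a frame's pin_count went above 0."""
        self.queue.pop(fid, None)

    def choose_victim(self):
        fid, _ = self.queue.popitem(last=False)   # pop the front of the queue
        return fid
```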
The Clock policy approximates LRU. Every frame is associated with a reference bit (R); R is set to 1 when a frame's pin_count goes down to 0. [Figure: frames A-L arranged in a circle with a clock hand.] On a replacement request:
1. Advance the pointer.
2. If R == 0 and pin_count == 0, choose the frame.
3. Else if R == 1, set R to 0 and go to step 1.
Clock has a lower cost than LRU. (Why?)
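The three steps above can be sketched directly; the frame representation (a list of dicts standing in for the circular arrangement) is an illustrative assumption, and the two-sweep bound guards against a pool where every frame is pinned:

```python
# Clock (second-chance) replacement: sweep the circle, clearing reference
# bits, until an unpinned frame with R == 0 is found.

def clock_choose(frames, hand):
    """frames: list of {"R": int, "pin_count": int}; hand: current position.
    Returns the index of the victim frame."""
    n = len(frames)
    for _ in range(2 * n):                 # at most two full sweeps needed
        hand = (hand + 1) % n              # 1. advance the pointer
        f = frames[hand]
        if f["pin_count"] == 0:
            if f["R"] == 0:                # 2. second chance used up: victim
                return hand
            f["R"] = 0                     # 3. clear R and keep going
    raise RuntimeError("all frames pinned")
```

Unlike LRU, no queue is updated on every unpin; the policy only touches reference bits lazily during the sweep, which is why its bookkeeping cost is lower.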
Where are we? [DBMS architecture stack, repeated from the title slide]
Data are abstracted as files of records for higher-level DBMS components: a relation (table) is represented as a file of records, which is stored as pages. How to keep track of:
- pages in a file?
- free space in each page?
- records in each page?
Directory format: use a directory to indicate the data pages used by a file. A header page leads to the directory, whose entries point to data pages 1..N. Free space within a data page can be recorded in its directory entry. Where to find the header page? The system catalog!
How are records organized within a page? [Figure: slotted page i — records fill the page from the front, free space sits in the middle, and a slot directory grows from the back. The slot directory stores the number of slots, a pointer to the start of free space, and one entry per slot giving the offset of its record (e.g. 20, 16, 24).] How to identify a record? Record id (RID) = <page id, slot id>, e.g. Rid = (i, 1), (i, 2), ..., (i, N).
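A much-simplified slotted page can be sketched as follows. The byte-level layout (offsets, free-space pointer) is abstracted away: a Python list stands in for the slot directory, which is enough to show why RIDs stay stable across deletions. Class and method names are assumptions:

```python
# Simplified slotted page: the list index plays the role of the slot id,
# so RID = (page_id, slot_id) keeps working even after deletions.

class SlottedPage:
    def __init__(self, page_id):
        self.page_id = page_id
        self.slots = []                  # slot_id -> record bytes, or None

    def insert(self, record):
        self.slots.append(record)
        return (self.page_id, len(self.slots) - 1)   # the new record's RID

    def fetch(self, rid):
        page_id, slot_id = rid
        assert page_id == self.page_id
        return self.slots[slot_id]

    def delete(self, rid):
        _, slot_id = rid
        self.slots[slot_id] = None       # slot kept, so other RIDs stay valid
```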
How are fields organized within a record? [Figure: a record with Fields 1-4.] Fields of fixed size: just store them contiguously. Fields of variable size: use special characters (e.g. $) to delimit each field — or, again, a directory of field offsets!
Summary
How do disks work? Disks read/write/transfer data in units of a page; data transfer is the dominant cost of data access.
How to reduce disk I/O? Keep pages that will be accessed in the future in memory; replacement policies include LRU, Clock, MRU, etc.
How to organize data on disk? Data are abstracted as files of records; directories can be used to locate the pages of a file on disk, the records in a page, and the fields in a record.
The heap file abstraction enables retrieving records by their RID or scanning records sequentially. Record id (RID) = <page id, slot id>. Sequential scan: look up the file's header page in the catalog, follow the directory to data pages 1..N, and read each record in each page in turn.
What if we want to look up records by their values? Examples: find all students in IMADA; find all students with score > 10.
Solution 1: sequential scan, checking the values of each record — but this reads all the pages: slow!
Solution 2: organize the data in the file by their values:
- sorted file (sorted on one field): use binary search to speed up lookups
- but searching by the value of another field falls back to sequential scan again!
- and updates are costly, since the sort order must be maintained!
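Solution 2 and its limitation can be shown in a few lines: binary search works on the sort field, while any other field still forces a full scan. The student tuples are made-up example data:

```python
import bisect

# A "sorted file" of (id, name) records, sorted on id.
students = sorted([(3, "Ann"), (1, "Bo"), (7, "Cy"), (5, "Di")])
ids = [s[0] for s in students]

def find_by_id(key):
    """Binary search on the sort field: O(log n) probes."""
    i = bisect.bisect_left(ids, key)
    if i < len(ids) and ids[i] == key:
        return students[i]
    return None

def find_by_name(name):
    """Any other field: back to a full sequential scan, O(n)."""
    return [s for s in students if s[1] == name]
```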
An index is a data structure used to speed up value-based search of records. Input: conditions on the values of one or more fields; output: the records (or locations of the records) satisfying the conditions. An index contains a collection of data entries plus a data structure to find the data entries matching the search key:
- tree (B+ tree index): supports both equality and range search
- hash table (hash index): supports only equality search
An index is stored as a file. An index supports search on one or more fields, which are called the search key of the index.
Alternatives for a data entry k* in an index. Three alternatives:
1. The actual data record (with key value k).
2. <k, rid of matching data record>.
3. <k, list of rids of matching data records>.
The choice is orthogonal to the indexing technique; examples of indexing techniques: B+ trees, hash-based structures, R-trees, ... Typically, an index contains auxiliary information that directs searches to the desired data entries. A file can have multiple (different) indexes, e.g. a file sorted by age, with a hash index on salary and a B+ tree index on name.
Alternatives for data entries (contd.)
Alternative 1: the actual data record (with key value k). If this is used, the index structure is a file organization for the data records (like heap files or sorted files). At most one index on a given collection of data records can use Alternative 1. This alternative saves pointer lookups but can be expensive to maintain under insertions and deletions.
Alternatives for data entries (contd.)
Alternative 2, <k, rid of matching data record>, and Alternative 3, <k, list of rids of matching data records>, are easier to maintain than Alternative 1. If more than one index is required on a given file, at most one index can use Alternative 1; the rest must use Alternative 2 or 3. Alternative 3 is more compact than Alternative 2, but leads to variable-sized data entries even if the search keys have fixed length; even worse, for large rid lists the data entry may have to span multiple blocks!
Index classification: clustered vs. unclustered. If the order of data records is the same as, or 'close to', the order of the index's data entries, the index is called clustered. A file can be clustered on at most one search key. The cost of retrieving data records through an index varies greatly depending on whether the index is clustered! Alternative 1 implies clustered, but not vice versa.
Clustered vs. unclustered index. Suppose Alternative 2 is used for data entries and the data records are stored in a heap file. To build a clustered index, first sort the heap file (leaving some free space on each block for future inserts); overflow blocks may still be needed for inserts, so the order of data records is 'close to', but not identical to, the sort order. [Figure: in a clustered index, index entries direct search to data entries whose order matches the data records in the data file; in an unclustered index, the data entries point to data records scattered across the file.]
Unclustered vs. clustered indexes: what are the tradeoffs?
Clustered pros: efficient for range searches.
Clustered cons: expensive to maintain (either on the fly, or sloppily with periodic reorganization).
A B+ tree is a balanced tree structure in which each node occupies a page.
Entries in non-leaf nodes → called index entries: <key value, page_id>.
Entries in leaf nodes → called data entries: <key value, RID>, or <key value, list of RIDs>, or <key value, data record>.
Example tree: root with keys [13, 17, 24, 30]; leaves [2* 3* 5* 7*], [14* 16*], [19* 20* 22*], [24* 27* 29*], [33* 34* 38* 39*].
Search for 5*, 15*, or all data entries >= 24*.
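Descending the example tree can be sketched with one binary search per node. For brevity this models only the single internal level shown above (the slide's root plus its leaves); the dict layout is an illustrative assumption:

```python
import bisect

# The example tree: root keys separate the five leaves.
# Entries with key < 13 go to child 0, 13 <= key < 17 to child 1, etc.
root = {"keys": [13, 17, 24, 30],
        "children": [[2, 3, 5, 7], [14, 16], [19, 20, 22],
                     [24, 27, 29], [33, 34, 38, 39]]}

def search_leaf(node, key):
    """Return the leaf that would contain `key` (one-level tree for brevity)."""
    i = bisect.bisect_right(node["keys"], key)   # keys equal to a separator go right
    return node["children"][i]
```

A range search such as "all entries >= 24" would find this leaf and then follow the sibling links between leaves (not modeled here) to the right.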
Insert 8*: go to the correct leaf, then do recursively: if the node is non-full, insert the entry; else split, and copy/push up the middle key to the parent node.
[Tree before the insert: root with keys 13, 17, 24, 30; leaves [2* 3* 5* 7*], [14* 16*], [19* 20* 22*], [24* 27* 29*], [33* 34* 38* 39*]. The leaf for 8* is full.]
Insert 8* (contd.): the leaf [2* 3* 5* 7*] is full, so it splits into [2* 3*] and [5* 7* 8*], and the middle key 5 is copied up to the parent, which now holds keys [5, 13, 17, 24, 30].
Insert 8* (contd.): the parent node is now over-full, so it splits in turn, and the middle key 17 is pushed up into a new root; the root's two children hold keys [5, 13] and [24, 30]. Note the difference between copy up (leaf split) and push up (internal-node split).
Delete 19* and 20*: go to the correct leaf and delete the entry. If the leaf is then not at least half full, redistribute with a sibling; if the sibling doesn't have enough entries, merge with the sibling. Keep each page at least half full, except the root.
[Tree before deletion: root 17; internal nodes [5, 13] and [24, 30]; leaves [2* 3*], [5* 7* 8*], [14* 16*], [19* 20* 22*], [24* 27* 29*], [33* 34* 38* 39*].]
19* and 20* deleted; now delete 24*. After deleting 19* and 20*, the leaf [22*] fell below half full and redistributed with its sibling, borrowing 24* to become [22* 24*]; note the copy up of the new middle key 27 into the internal node, which now holds [27, 30].
24* deleted: the leaf [22*] merges with its sibling into [22* 27* 29*]; note the deletion of key 27 from the internal node, which now holds only [30]. A merge can cause redistribution or merging of ancestor nodes: here the internal nodes [5, 13] and [30] merge, pulling the root key 17 down, leaving a single root [5, 13, 17, 30] over leaves [2* 3*], [5* 7* 8*], [14* 16*], [22* 27* 29*], [33* 34* 38* 39*].
Rethink the cost of accessing all the records with index key 24: if the records live in many different pages, the cost is high! A clustered index stores the real data records in an order close to the order of the data entries in the index, so such lookups touch few pages.
A hash-based index uses a hash function to look up data entries. The hash function h(key) outputs an integer; h(key) = (a * key + b) usually works well. A static hash index uses N primary pages: a data entry is stored in page h(key) mod N. If a primary page is full, an overflow page is chained to it. Problem: too many overflow pages.
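A static hash index can be sketched in a few lines; each Python list stands in for a primary page plus its overflow chain, and the constants a, b, N are arbitrary example values:

```python
# Static hash index: N primary buckets, overflow modeled by letting each
# bucket list grow (a real index would chain overflow pages).

N = 4
A, B = 31, 7                          # h(key) = a * key + b, as on the slide

def h(key):
    return (A * key + B) % N

buckets = [[] for _ in range(N)]

def insert(key, rid):
    buckets[h(key)].append((key, rid))

def lookup(key):
    """Equality search only: probe exactly one bucket."""
    return [rid for k, rid in buckets[h(key)] if k == key]
```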
Increase the number of buckets when overflow occurs. How about simply increasing the number of buckets of a static hash index? That requires reading and rewriting all the pages of the index! Can we split only the overflowed bucket instead of all of them?
Extendible hashing: use a directory with one entry per bucket, pointing to the primary page of the bucket. If a bucket overflows, split it into two, and double the directory if needed.
Insert h(key) = 20 (binary 10100). [Figure, before: a 2-bit directory (00, 01, 10, 11) points to Bucket A [4* 12* 32* 16*], Bucket B [1* 5* 21* 13*], Bucket C [10*], Bucket D [15* 7* 19*]; Bucket A is full. After: the directory doubles to 3 bits (000-111), and Bucket A splits into A and its 'split image' A2 [4* 12* 20*].]
[Figure, final state after inserting h(key) = 20 (10100): directory entries 000 and 100 distinguish Bucket A [32* 16*] from its 'split image' A2 [4* 12* 20*], while Buckets B [1* 5* 21* 13*], C [10*], and D [15* 7* 19*] are each still shared by two directory entries.]
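The split-and-double mechanism can be sketched as follows. This is a simplified model: keys stand directly for their hash values, buckets are indexed by the low bits of the hash, and a single split is assumed to make enough room (no recursive splits). Replaying the slide's example reproduces the figures above:

```python
# Extendible hashing sketch: a directory of 2^global_depth entries points to
# buckets; each bucket records its local depth.

BUCKET_SIZE = 4

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.dir = [{"depth": 1, "items": []}, {"depth": 1, "items": []}]

    def _bucket(self, h):
        return self.dir[h & ((1 << self.global_depth) - 1)]  # low bits index the directory

    def insert(self, h):
        b = self._bucket(h)
        if len(b["items"]) < BUCKET_SIZE:
            b["items"].append(h)
            return
        if b["depth"] == self.global_depth:   # bucket already uses all bits:
            self.dir = self.dir + self.dir    #   double the directory
            self.global_depth += 1
        b["depth"] += 1                       # split b into b and its "split image"
        mask = 1 << (b["depth"] - 1)          # the newly significant hash bit
        image = {"depth": b["depth"], "items": []}
        old = b["items"] + [h]
        b["items"] = [x for x in old if not x & mask]
        image["items"] = [x for x in old if x & mask]
        for i in range(len(self.dir)):        # re-point half of b's directory entries
            if self.dir[i] is b and i & mask:
                self.dir[i] = image
```

Inserting 4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19 builds the "before" state of the figure; inserting 20 then doubles the directory to 3 bits and splits Bucket A into [32, 16] and its split image [4, 12, 20], exactly as shown.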
Summary
An index can speed up search by value.
A B+ tree index is good for range search; it maintains balance on insert/delete.
A hash index is good for equality search. Static hashing suffers from long overflow chains; extendible hashing avoids bucket overflow by doubling the directory; linear hashing avoids the directory by splitting buckets round-robin and using overflow pages.