Architecture and Implementation of Database Management Systems


1 Architecture and Implementation of Database Management Systems Prof. Dr. Marc H. Scholl Summer 2004 University of Konstanz, Dept. of Computer & Information Science

2 Module 1: Introduction & Overview. Module Outline: 1.1 What it's all about, 1.2 Outline of the Course, 1.3 Organizational Matters. [Figure: DBMS architecture overview. Web Forms, Applications, SQL Interface; SQL Commands; Query Processor with Parser, Optimizer, Plan Executor, Operator Evaluator; Files and Index Structures, Buffer Manager, Disk Space Manager; Transaction Manager, Lock Manager, Concurrency Control, Recovery Manager; Index Files, Data Files, System Catalog; Database.]

3 1.1 What it's all about This is a systems-oriented course, with focus on the necessary infrastructure to build a DBMS. This will help to thoroughly analyze, compare, and tune DBMSs for performance-critical applications. While introductory courses have presented the functionality, i.e., the interface, of database management systems (DBMSs), this course will dive into the internals. We will, for example, learn how a DBMS can efficiently organize and access data on disk, knowing that I/O is way more expensive than CPU cycles; translate SQL queries into efficient execution plans, including query rewrite optimization and index exploitation; sort/combine/filter large data volumes exceeding main memory size by far; allow many users to consistently access and modify the database at the same time; and take care of failures, guaranteeing recovery into a consistent state of operation after crashes.

4 1.1.1 Overall System Architecture A DBMS is typically run as a back-end server in a (local or global) network, offering services to clients directly or to application servers. [Figure: Users and clients send requests to application servers (application programs 1, 2, ...), which in turn send requests to the data server; the data server manages encapsulated data (objects), exposed data, and the stored data (pages); replies flow back the same way.] Generally, we call this the 3-tier reference architecture.

5 1.1.2 Layered DBMS Architecture Typically, a DBMS implements its functionality in a layered architecture that builds up by incrementally adding more abstractions from the low level of block I/O devices up to the high level of a declarative (SQL) user interface. [Figure: Clients issue requests to the database server, where request execution threads pass through the Language & Interface Layer, the Query Decomposition & Optimization Layer, the Query Execution Layer, the Access Layer, and the Storage Layer, which performs the data accesses against the database.]

6 1.1.3 Storage Structures Whether the DBMS offers relational, object-relational, or other data structures at the user interface, internally they have to be mapped into fixed-length blocks that serve as the basic I/O unit of transfer between main and secondary memory. [Figure: A database page with a page header, records (e.g., Ben, 55, Las Vegas; Sue, 23, Seattle; Joe, 29, San Antonio), free space, a forwarding RID, and a slot array; an extent table maps database pages to database extents.]

7 1.1.4 Access Paths A DBMS typically provides a number of indexing techniques that allow for fast content-based searching of records, such as tree-structured or hash-based methods. Often, the suite of such indexing techniques can be extended to match the requirements of particular applications. [Figure: A B+-tree with a root node (Bob, Eve, Tom) and leaf nodes (Adam, Bill, Bob, Dick, Eve, Hank, Jane, Jill, Tom) pointing to RIDs.]

8 1.1.5 Query Execution Declarative query specifications, e.g. expressed in SQL, need to be optimized and transformed into efficient query execution plans (QEPs), i.e., sequential or even parallelized programs that compute the results. [Figure: Two alternative QEPs. One uses index scans on AgeIndex and CityIndex, RID list intersection, RID access (fetch Person record), filtering, and projection; the other uses an index scan on CityIndex, RID access (fetch Person record), and projection.]

9 1.1.6 Implementing a Lock Manager Most DBMSs use a locking protocol (e.g., 2PL) for concurrency control. Efficiently implementing the lock manager and exploiting the synchronization primitives offered by the underlying operating system is crucial for a high degree of parallelism. [Figure: Lock manager data structures. A hash table indexed by resource id points to resource control blocks (RCBs: resource id, hash chain, FirstInQueue); transaction control blocks (TCBs: transaction id, update flag, transaction status, number of locks, LCB chain) and lock control blocks (LCBs: transaction id, resource id, lock mode, lock status, NextInQueue, LCB chain) are linked into the RCB queues.]
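The figure's data structures translate almost directly into code. Below is a minimal, illustrative Python sketch of such a lock table; all names and the simple grant rule are assumptions for illustration, not the course's reference implementation. Resources are looked up in a dictionary playing the role of the hash table, each resource keeps a FIFO queue of lock control blocks, and a lock is granted if it is compatible with all locks already granted on that resource.

    from collections import defaultdict, deque

    COMPATIBLE = {("S", "S")}          # only shared locks are mutually compatible

    class LCB:                         # lock control block
        def __init__(self, tx_id, resource_id, mode):
            self.tx_id, self.resource_id, self.mode = tx_id, resource_id, mode
            self.granted = False

    class LockTable:
        def __init__(self):
            self.rcbs = defaultdict(deque)   # resource id -> queue of LCBs (the RCB queue)
            self.tcbs = defaultdict(list)    # transaction id -> its LCBs (the TCB's LCB chain)

        def request(self, tx_id, resource_id, mode):
            lcb = LCB(tx_id, resource_id, mode)
            queue = self.rcbs[resource_id]
            # grant iff compatible with every lock already granted on this resource
            lcb.granted = all(not other.granted
                              or (other.mode, mode) in COMPATIBLE
                              or other.tx_id == tx_id
                              for other in queue)
            queue.append(lcb)
            self.tcbs[tx_id].append(lcb)
            return lcb.granted               # False means: the caller must block and wait

        def release(self, tx_id, resource_id):
            queue = self.rcbs[resource_id]
            for lcb in list(queue):
                if lcb.tx_id == tx_id:
                    queue.remove(lcb)
            # a real lock manager would now re-examine the queue and wake up waiters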

10 1.2 Outline of the Course We will pursue a bottom-up strategy, starting from the block-I/O devices used for secondary storage management and work our way up to the SQL interface. Most of the lecture is based on the book (Ramakrishnan and Gehrke, 2003). Additional references to other textbooks and related literature will be given when appropriate. [Figure: DBMS architecture overview, as on the Module 1 outline slide.]

11 1.3 Organizational Matters Register with the Account Tool. Actively participate in lectures and assignments. There will be a written exam at the end of the semester. Let us know when you have problems or suggestions. 10 copies of the book underlying this course are available in the U KN library. 11

12 Bibliography
Elmasri, R. and Navathe, S. (2000). Fundamentals of Database Systems. Addison-Wesley, Reading, MA, 3rd edition. Title of the 2002 German edition: Grundlagen von Datenbanken.
Härder, T. (1987). Realisierung von operationalen Schnittstellen, chapter 3 in (Lockemann and Schmidt, 1987). Springer.
Härder, T. (1999). Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer.
Heuer, A. and Saake, G. (1999). Datenbanken: Implementierungstechniken. Int'l Thompson Publishing, Bonn.
Lockemann, P. and Dittrich, K. (1987). Architektur von Datenbanksystemen, chapter 2 in (Lockemann and Schmidt, 1987). Springer.
Lockemann, P. and Schmidt, J., editors (1987). Datenbank-Handbuch. Springer-Verlag.
Mitschang, B. (1995). Anfrageverarbeitung in Datenbanksystemen - Entwurfs- und Implementierungsaspekte. Vieweg.
Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.

13 Module 2: Storing Data: Disks and Files. Module Outline: 2.1 Memory hierarchy, 2.2 Disk space management, 2.3 Buffer manager, 2.4 File and record organization, 2.5 Page formats, 2.6 Record formats, 2.7 Addressing schemes. [Figure: DBMS architecture overview, as on the Module 1 outline slide.]

14 2.1 Memory hierarchy Memory in off-the-shelf computer systems is arranged in a hierarchy: [Figure: requests travel from the CPU through the CPU cache (L1, L2) and main memory (RAM) (primary storage) to magnetic disk (secondary storage) and to tape, CD-ROM, DVD (tertiary storage).] The cost of primary memory is roughly 100 times the cost of secondary storage space of the same size. The size of the address space in primary memory (e.g., 2^32 bytes = 4 GB) may not be sufficient to map the whole database (we might even have more than 2^32 records). The DBMS needs to make data persistent across DBMS (or host) shutdowns or crashes; only secondary/tertiary storage is nonvolatile. The DBMS needs to bring in data from lower levels in the memory hierarchy as needed for processing.

15 2.1.1 Magnetic disks Tapes store vast amounts of data (roughly 20 GB; more for robotic tape farms), but they are sequential devices. Magnetic disks (hard disks) allow direct access to any desired location; hard disks dominate database system scenarios by far. [Figure: disk arm, disk heads, platters, tracks, cylinder, rotation, arm movement.] (1) Data on a hard disk is arranged in concentric rings (tracks) on one or more platters, (2) tracks can be recorded on one or both surfaces of a platter, (3) the set of tracks with the same diameter forms a cylinder, (4) an array (disk arm) of disk heads, one per recorded surface, is moved as a unit, (5) a stepper motor moves the disk heads from track to track; the platters steadily rotate.

16 [Figure: track, sector, and block layout on a platter surface.] (1) Each track is divided into arc-shaped sectors (a characteristic of the disk's hardware), (2) data is written to and read from disk block by block (the block size is set to a multiple of the sector size when the disk is formatted), (3) typical disk block sizes are 4 KB or 8 KB. Data blocks can only be written and read if disk heads and platters are positioned accordingly. This has implications on the disk access time: (1) disk heads have to be moved to the desired track (seek time), (2) the disk controller waits for the desired block to rotate under the disk head (rotational delay), (3) the disk block data has to be actually written/read (transfer time). Hence: access time = seek time + rotational delay + transfer time.

17 Access time for the IBM Deskstar 14GPX: 3.5 inch hard disk, 14.4 GB capacity, 5 platters of 3.35 GB of user data each, platters rotate at 7200/min, average seek time 9.1 ms (min: 2.2 ms [track-to-track], max: 15.5 ms), average rotational delay 4.17 ms, data transfer rate 13 MB/s. Access time for an 8 KB block: 9.1 ms + 4.17 ms + 8 KB / (13 MB/s) ≈ 9.1 ms + 4.17 ms + 0.6 ms ≈ 13.9 ms. N.B. Accessing a main memory location typically takes < 60 ns.
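To make the arithmetic above concrete, here is a small Python sketch that recomputes the access time from the slide's parameters (a worked example only; the constants are taken from the slide):

    # Disk parameters for the IBM Deskstar 14GPX (taken from the slide)
    avg_seek_ms = 9.1                # average seek time
    rpm = 7200                       # platter rotation speed
    transfer_rate = 13 * 10**6       # 13 MB/s, in bytes per second
    block_size = 8 * 1024            # 8 KB block

    avg_rot_delay_ms = 0.5 * 60_000 / rpm            # half a revolution, in ms
    transfer_ms = block_size / transfer_rate * 1000  # time to ship one block

    access_ms = avg_seek_ms + avg_rot_delay_ms + transfer_ms
    print(f"rotational delay: {avg_rot_delay_ms:.2f} ms")   # ~4.17 ms
    print(f"transfer time:    {transfer_ms:.2f} ms")        # ~0.6 ms
    print(f"access time:      {access_ms:.2f} ms")          # ~13.9 ms

Note that the transfer time for the block itself is almost negligible; seek time and rotational delay dominate, which is why accessing neighboring blocks sequentially pays off so handsomely.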

18 The unit of a data transfer between disk and main memory is a block; if a single item (e.g., record, attribute) is needed, the whole containing block must be transferred. Reading or writing a disk block is called an I/O operation. The time for I/O operations dominates the time taken for database operations. DBMSs take the geometry and mechanics of hard disks into account. Current disk designs can transfer a whole track in one platter revolution; the active disk head can be switched after each revolution. This implies a closeness measure for data records r1, r2 on disk: (1) place r1 and r2 inside the same block (single I/O operation!), (2) place r2 inside a block adjacent to r1's block on the same track, (3) place r2 in a block somewhere on r1's track, (4) place r2 in a track of the same cylinder as r1's track, (5) place r2 in a cylinder adjacent to r1's cylinder.

19 2.1.2 Accelerating Disk-I/O Goals: reduce the number of I/Os (DBMS buffer, physical DB design) and reduce the duration of I/Os. (1) Access neighboring disk blocks (clustering) and bulk I/O. Advantage: optimized seek time, optimized rotational delay, minimized overhead (e.g. interrupt handling); disadvantage: the I/O path is busy for a long time (concurrency!). Bulk I/Os can be implemented on top of or inside the disk controller; used for mid-sized data access (prefetching, sector buffering). (2) Use different I/O paths (declustering) with parallel access. Advantage: parallel I/Os, minimized transfer time thanks to multiplied bandwidth; disadvantage: average seek time and rotational delay increase, more hardware is needed, blocking of parallel transactions. Requires advanced hardware or disk arrays (RAID systems); used for large-size data access.

20 Performance gains with parallel I/Os Partition files into equally sized areas of consecutive blocks ("striping"). [Figure: access parallelism (intra-I/O parallelism, a single request served by several disks in parallel) vs. request parallelism (inter-I/O parallelism, several independent requests served by different disks at the same time).] The striping unit (the number of logically consecutive bytes placed on one disk) determines the degree of parallelism for a single I/O and the degree of parallelism between different I/Os: small chunks give high intra-access parallelism, but many devices are busy per request, so not many I/Os can run in parallel; large chunks give low intra-access parallelism, but many I/Os can run in parallel.
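The mapping from a logical block number to a physical (disk, block) location under striping is short to write down; the following Python sketch (an illustration with made-up parameters, not taken from the course) shows it and makes the trade-off between the two kinds of parallelism visible:

    def locate(logical_block, num_disks, stripe_unit_blocks=1):
        """Map a logical block number to (disk number, block number on that disk)
        for round-robin striping with the given striping unit (in blocks)."""
        stripe = logical_block // stripe_unit_blocks           # which stripe unit overall
        disk = stripe % num_disks                               # disks are used round-robin
        block_on_disk = (stripe // num_disks) * stripe_unit_blocks \
                        + logical_block % stripe_unit_blocks
        return disk, block_on_disk

    # A request for 8 consecutive logical blocks on 4 disks:
    # with a small striping unit (1 block) it touches all 4 disks (intra-I/O parallelism),
    # with a large one (8 blocks) it stays on a single disk, leaving the other disks
    # free for other requests (inter-I/O parallelism).
    print([locate(b, 4, 1) for b in range(8)])
    print([locate(b, 4, 8) for b in range(8)])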

21 RAID Systems: Improving Availability Goal: maximize availability = MTTF / (MTTF + MTTR), where MTTF = mean time to failure and MTTR = mean time to repair. Problem: with N disks, we have an N times higher probability of problems! Thus MTTDL = MTTF / N, where MTTDL = mean time to data loss. Solution: RAID = Redundant array of inexpensive (independent) disks. Now we get MTTDL = (MTTF / N) · (MTTF / ((N - 1) · MTTR)), i.e., we only suffer data loss if a second disk fails before the first failed disk has been replaced.

22 ... typically reserve one extra disk as a hot spare to replace the failed one immediately. Principle of Operation Use data redundancy to be able to reconstruct lost data, e.g., compute parity information during normal operation: P = B1 ⊕ B2 ⊕ ... ⊕ Bn, where ⊕ denotes logical xor (exclusive or). When one of the disks fails, use the parity to reconstruct the lost data during failure recovery: Bi = P ⊕ B1 ⊕ ... ⊕ B(i-1) ⊕ B(i+1) ⊕ ... ⊕ Bn.

23 Executing I/Os from/to a RAID System
Read Access: to read block number k from disk j, execute a read(B_k^j, disk_j) operation.
Write Access: to write block number k back to disk j, we have to update the parity information, too (let p be the number of the parity disk for block k):
  for all i ≠ j: read(B_k^i, disk_i);
  compute the new parity block B_k^p from the contents of all B_k^i;
  write(B_k^j, disk_j); write(B_k^p, disk_p);
We can do better (i.e., more efficient), though:
  read(B_k^p, disk_p);
  compute the new parity block B_k^p := B_k^p ⊕ B_k^j(old) ⊕ B_k^j(new);
  write(B_k^j, disk_j); write(B_k^p, disk_p);
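The "small write" optimization in the last step relies only on the xor identity P_new = P_old ⊕ D_old ⊕ D_new. A tiny Python sketch (purely illustrative, operating on in-memory byte strings instead of real disks) demonstrates that the shortcut yields the same parity as recomputing it from all blocks:

    def xor_blocks(a, b):
        # byte-wise xor of two equally sized blocks
        return bytes(x ^ y for x, y in zip(a, b))

    # a stripe of three data blocks plus one parity block
    blocks = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = xor_blocks(parity, blk)

    # small write: update block 1 and patch the parity without touching blocks 0 and 2
    old, new = blocks[1], b"\xff\x00"
    blocks[1] = new
    parity = xor_blocks(xor_blocks(parity, old), new)

    # full recomputation gives the same parity
    check = blocks[0]
    for blk in blocks[1:]:
        check = xor_blocks(check, blk)
    assert parity == check

    # losing any one block, we can rebuild it from the parity and the survivors
    rebuilt = xor_blocks(xor_blocks(parity, blocks[0]), blocks[2])
    assert rebuilt == blocks[1]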

24 Write Access to blocks B_k on all disks i ≠ p (a full-stripe write): compute the new parity block B_k^p from the contents of all B_k^i, i ≠ p; for all i: write(B_k^i, disk_i). Reconstruction of block B_k on a failed disk j (let r be the number of the replacement disk): for all i ≠ j: read(B_k^i, disk_i); reconstruct B_k^j as the parity (xor) of all B_k^i, i ≠ j; write(B_k^j, disk_r).

25 Recovery Strategies
off-line: if a disk has failed, suspend normal I/O; reconstruct all blocks of the failed disk and write them to the replacement disk; only afterwards resume normal I/O traffic.
on-line: (needs a hot spare disk) resume normal I/O immediately and start reconstructing all blocks not yet reconstructed since the crash (in the background); allow parallel normal writes: write the block to the replacement disk and update the parity; allow parallel normal read I/O: if the block has not yet been repaired, reconstruct it; if the block has already been reconstructed, read it from the replacement disk or reconstruct it (a load balancing decision).
N.B. we can even wait with all reconstruction until the first normal read access!

26 2.1.3 RAID Levels There are a number of variants ("RAID levels") differing w.r.t. the following characteristics: the striping unit (data interleaving): how to scatter the (primary) data across disks? fine (bits/bytes) or coarse (blocks) grain? And how to compute and distribute redundant information: what kind of redundant information (parity, ECCs)? where to allocate the redundant information (separate/few disks, or all disks of the array)? Originally, 5 RAID levels were introduced; later, more levels have been defined.

27 RAID Level 0: no redundancy, just striping. Least storage overhead; no extra effort for write access; not the best read performance! RAID Level 1: mirroring. Doubles the necessary storage space; doubles write accesses; optimized read performance due to the alternative I/O path. RAID Level 2: memory-style ECC. Compute error-correcting codes for the data of n disks and store them onto n-1 additional disks; failure recovery: determine the lost disk by using the n-1 extra disks, and correct (reconstruct) its contents from 1 of those.

28 More recent levels combine aspects of the ones listed here, or add multiple parity blocks, e.g. RAID 6: two parity blocks per group. RAID Level 3: bit-interleaved parity. One parity disk suffices, since the controller can easily identify the faulty disk! Distribute the (primary) data bit-wise onto the data disks; read and write accesses go to all disks, therefore no inter-I/O parallelism, but high bandwidth. RAID Level 4: block-interleaved parity. Like RAID 3, but distribute the data block-wise (variable block size); a small read I/O goes to only one disk; bottleneck: all write I/Os go to the one parity disk. RAID Level 5: block-interleaved striped parity. Like RAID 4, but distribute the parity blocks across all disks (load balancing); best performance for small and large reads as well as large write I/Os; variants exist w.r.t. the distribution of the parity blocks.

29 [Figure: schematic comparison of the RAID levels: non-redundant (RAID 0), mirroring (RAID 1), memory-style ECC (RAID 2), bit-interleaved parity (RAID 3), block-interleaved parity (RAID 4), block-interleaved striped parity (RAID 5), and RAID 6; data interleaving on the byte level vs. on the block level; shading = redundant info.]

30 Parity groups Parity is not necessarily computed across all disks within an array; it is possible to define parity groups (of the same or different sizes). [Figure: example layout across five disks; parity groups of four blocks each (three data blocks plus one parity block), with the parity blocks rotating across the disks.]

31 Selecting RAID levels RAID level 0: improves overall performance at lowest cost; no provision against data loss; best write performance, since there is no redundancy. RAID level 0+1 (aka. level 10): superior to level 1; its main application area is small storage subsystems, sometimes also write-intensive applications. RAID level 1: most expensive version; typically the two necessary I/Os for writes are serialized to avoid data loss in case of power failures, etc. RAID levels 2 and 4: are always inferior to levels 3 and 5, respectively. Level 3 is appropriate for workloads with large requests for contiguous blocks; bad for many small requests of a single block. RAID level 5: is a good general-purpose solution; best performance (with redundancy) for small and large reads as well as large write requests. RAID level 6: the choice for a higher level of reliability.

32 2.2 Disk space management [Figure: DBMS architecture overview; you are here: the Disk Space Manager.] The disk space manager (DSM) encapsulates the gory details of hard disk access for the DBMS: the DSM talks to the disk controller and initiates I/O operations; once a block has been brought in from disk it is referred to as a page (disk blocks and pages are of the same size). Sequences of data pages are mapped onto contiguous sequences of blocks by the DSM. The DBMS issues allocate/deallocate and read/write commands to the DSM, which, internally, uses a mapping block-# to page-# to keep track of page locations and block usage.

33 2.2.1 Keeping track of free blocks During database (or table) creation it is likely that blocks can indeed be arranged contiguously on disk. Subsequent deallocations and new allocations, however, will in general create holes. To reclaim space that has been freed, the disk space manager uses either a free block list: (1) keep a pointer to the first free block in a known location on disk, (2) when a block is no longer needed, append/prepend this block to the free block list for future use, (3) next pointers may be stored in the disk blocks themselves; or a free block bitmap: (1) reserve a block whose bytes are interpreted bit-wise (bit n = 0: block n is free), (2) toggle bit n whenever block n is (de-)allocated. Free block bitmaps allow for fast identification of contiguous sequences of free blocks.
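A free block bitmap is easy to sketch in code. The following Python fragment (an illustrative toy, not the DSM interface used later in the course) keeps one flag per block and shows why bitmaps make it cheap to find contiguous runs of free blocks:

    class FreeBlockBitmap:
        def __init__(self, num_blocks):
            self.free = [True] * num_blocks        # bit n = True: block n is free

        def allocate(self, n):
            self.free[n] = False                   # toggle bit on allocation

        def deallocate(self, n):
            self.free[n] = True                    # toggle bit on deallocation

        def find_contiguous(self, count):
            """Return the start of the first run of `count` contiguous free blocks, or None."""
            run_start, run_len = 0, 0
            for i, is_free in enumerate(self.free):
                if is_free:
                    if run_len == 0:
                        run_start = i
                    run_len += 1
                    if run_len == count:
                        return run_start
                else:
                    run_len = 0
            return None

    bitmap = FreeBlockBitmap(16)
    for b in (0, 1, 2, 5, 6):
        bitmap.allocate(b)
    print(bitmap.find_contiguous(4))               # 7: the first run of 4 free blocks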

34 2.3 Buffer manager [Figure: DBMS architecture overview; you are here: the Buffer Manager.] The size of the database on secondary storage far exceeds the size of the available primary memory to hold user data. To scan the entire pages of a 20 GB table (SELECT * FROM ...), the DBMS needs to (1) bring in pages as they are needed for in-memory processing, (2) overwrite (replace) such pages when they become obsolete for query processing and new pages require in-memory space. The buffer manager manages a collection of pages in a designated main memory area, the buffer pool; once all slots (frames) in this pool have been occupied, the buffer manager uses a replacement policy to decide which frame to overwrite when a new page needs to be brought in.

35 N.B. Simply overwriting a page in the buffer pool is not sufficient if this page has been modified after it has been brought in (i.e., the page is so-called dirty). [Figure: the buffer pool in main memory, with frames holding disk pages, free frames, and pinpage/unpinpage calls; pages travel between the buffer pool and the database on disk.]

36 Simple interface for a typical buffer manager
Indicate that page p is needed for further processing:
function pinpage(p):
  if buffer pool contains p already then
    pincount(p) ← pincount(p) + 1;
    return address of frame for p;
  select a victim frame p' to be replaced using the replacement policy;
  if dirty(p') then write p' to disk;
  read page p from disk into the selected frame;
  pincount(p) ← 1; dirty(p) ← false;
  return address of frame for p;
Indicate that page p is no longer needed, as well as whether p has been modified by a transaction (d):
function unpinpage(p, d):
  pincount(p) ← pincount(p) - 1;
  dirty(p) ← d;
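For concreteness, here is a compact executable Python version of this interface (a sketch under simplifying assumptions: pages live in a dict standing in for the disk, the victim is chosen with a trivial policy, and error handling for a pool full of pinned pages is omitted):

    class BufferManager:
        def __init__(self, num_frames, disk):
            self.disk = disk                    # dict: page id -> page contents
            self.num_frames = num_frames
            self.frames = {}                    # page id -> {"data", "pincount", "dirty"}

        def pinpage(self, p):
            if p in self.frames:                        # already buffered
                self.frames[p]["pincount"] += 1
                return self.frames[p]
            if len(self.frames) >= self.num_frames:     # pool full: evict a victim
                victim = next(q for q, f in self.frames.items() if f["pincount"] == 0)
                if self.frames[victim]["dirty"]:
                    self.disk[victim] = self.frames[victim]["data"]   # write back
                del self.frames[victim]
            self.frames[p] = {"data": self.disk[p], "pincount": 1, "dirty": False}
            return self.frames[p]

        def unpinpage(self, p, dirty):
            self.frames[p]["pincount"] -= 1
            self.frames[p]["dirty"] = dirty             # as on the slide; no I/O here

    disk = {1: "page one", 2: "page two", 3: "page three"}
    bm = BufferManager(num_frames=2, disk=disk)
    frame = bm.pinpage(1)                                # read page 1 into the pool
    frame["data"] = "page one, modified"
    bm.unpinpage(1, dirty=True)                          # marked dirty, written back lazily
    bm.pinpage(2); bm.pinpage(3)                         # page 1 becomes the victim
    print(disk[1])                                       # 'page one, modified'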

37 N.B. The pincount of a page indicates how many users (e.g., transactions) are working with that page; clean victim pages are not written back to disk; a call to unpinpage does not trigger any I/O operation, even if the pincount for this page goes down to 0 (the page might become a suitable victim, though); a database transaction is required to properly bracket any page operation using pinpage and unpinpage, i.e.
  a ← pinpage(p); ... read data (records) on page at address a; ... unpinpage(p, false);
or
  a ← pinpage(p); ... read and modify data (records) on page at address a; ... unpinpage(p, true);
A buffer manager typically offers at least one more interface call: flushpage(p), to force page p (synchronously) back to disk (for transaction mgmt. purposes).

38 Two strategic questions (1) How much precious buffer space to allocate to each of the active transactions (Buffer Allocation Problem)? Two principal approaches: static assignment and dynamic assignment. (2) Which page to replace when a new request arrives and the buffer is full (Page Replacement Problem)? Again, two approaches can be followed: decide without knowledge of the reference pattern, or presume knowledge of the (expected) reference pattern. Additional complexity is introduced when we take into account that the DBMS may manage segments of different page sizes: one buffer pool gives good space utilization, but a fragmentation problem; many buffer pools avoid fragmentation, but have worse utilization, and a global replacement/assignment strategy may get complicated. A possible solution could be to allow for set-oriented pinpages({p}) calls.

39 2.3.1 Buffer allocation policies Problem: shall we allocate parts of the buffer pool to each transaction (TX), or let the replacement strategy alone decide who gets how much buffer space? Properties of a local policy: one TX cannot hurt others; TXs are treated equally; possibly bad overall utilization of buffer space: some TXs may have vast amounts of buffer space occupied by old pages, while others experience internal page thrashing, i.e., suffer from too little space. Problem with a global policy: consider a TX executing a sequential read on a huge relation: all page accesses are references to newly loaded pages; hence, almost all other pages are likely to be replaced (following a standard replacement strategy); other TXs cannot proceed without loading in their pages again ("external page thrashing").

40 Typical allocation strategies include: global (one buffer pool for all transactions); local, based on different kinds of data (e.g., catalog, index, data, ...); local, where each transaction gets a certain fraction of the buffer pool: static partitioning (assign the buffer budget once for each TX) or dynamic partitioning (adjust a TX's buffer budget according to its past reference pattern or some kind of semantic information). It is also possible to apply mixed strategies, e.g., have different pools working with different approaches. This complicates matters significantly, though.

41 Examples for dynamic allocation strategies (1) Local LRU (cf. LRU replacement, later): keep a separate LRU stack for each active TX, and a global freelist for pages not pinned by any TX. Strategy: (i) replace a page from the freelist; (ii) otherwise, replace a page from the LRU stack of the requesting TX; (iii) otherwise, replace a page from the TX with the largest LRU stack. (2) Working Set Model (cf. operating systems' virtual memory management). Goal: avoid thrashing by allocating just enough buffer space to each TX. Approach: observe the number of different page requests by each TX within a certain interval of time (window size τ); deduce an optimal buffer budget from this observation; allocate buffer budgets according to the ratio between those optimal sizes.

42 Implementation of the Working Set Model Let WS(T, τ) be the working set of TX T for window size τ, i.e., WS(T, τ) = {pages referenced by T in the interval [now - τ, now]}. The strategy is to keep, for each transaction T_i, its working set WS(T_i, τ) in the buffer. Possible implementation: keep two counters, per TX and per page, respectively: trc(T_i) ... TX-specific reference counter; lrc(T_i, P_j) ... TX-specific last reference counter for each referenced page P_j. Idea of the algorithm: whenever T_i references P_j, increment trc(T_i) and copy trc(T_i) to lrc(T_i, P_j). If a page has to be replaced for T_i, select among those pages with trc(T_i) - lrc(T_i, P_j) ≥ τ (i.e., pages that have dropped out of the working set).
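A minimal Python sketch of this bookkeeping (illustrative only; the eviction rule follows the description above, and all class and variable names are made up):

    class WorkingSetTracker:
        def __init__(self, tau):
            self.tau = tau
            self.trc = {}        # transaction id -> total reference counter
            self.lrc = {}        # (transaction id, page id) -> last reference counter

        def reference(self, tx, page):
            self.trc[tx] = self.trc.get(tx, 0) + 1       # increment trc(T_i)
            self.lrc[(tx, page)] = self.trc[tx]          # copy trc(T_i) to lrc(T_i, P_j)

        def replacement_candidates(self, tx, buffered_pages):
            """Pages of tx that have dropped out of its working set."""
            return [p for p in buffered_pages
                    if self.trc[tx] - self.lrc.get((tx, p), 0) >= self.tau]

    ws = WorkingSetTracker(tau=3)
    for page in ["A", "B", "A", "C", "D"]:
        ws.reference("T1", page)
    # After 5 references with window size 3, only the last 3 referenced pages (A, C, D)
    # are inside T1's working set; B is a replacement candidate.
    print(ws.replacement_candidates("T1", ["A", "B", "C", "D"]))   # ['B']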

43 2.3.2 Buffer replacement policies The choice of the victim frame selection (or buffer replacement) policy can considerably affect DBMS performance. A large number of policies has been proposed for operating systems and database mgmt. systems. Criteria for victim selection used in some strategies, classified by which references are taken into account and by the age of the page in the buffer (none, age since last reference, or total age): no references considered: Random, FIFO; last reference considered: LRU, CLOCK, GCLOCK(V1); all references considered: LFU, GCLOCK(V2), LRD(V1), DGCLOCK, LRD(V2).

44 [Figure: schematic overview of buffer replacement policies, illustrated on a reference to page A (already in the buffer) and a reference to page C (not in the buffer, so a victim page must be chosen): LRU, LFU, CLOCK/GCLOCK (counters possibly initialized with weights, "used" bit), and LRD (reference count, age).]

45 Two policies found in a number of DBMSs: (1) LRU ("least recently used"): keep a queue (often described as a stack) of pointers to frames. In unpinpage(p, d), append p to the tail of the queue if pincount(p) is decremented to 0. To find the next victim, search through the queue from its head and find the first page p with pincount(p) = 0. (2) Clock ("second chance"): number the N frames in the buffer pool 0 ... N-1, initialize a counter current ← 0, and maintain a bit array referenced[0 ... N-1], initialized to all 0. In pinpage(p), set referenced[p's frame] ← 1. To find the next victim, consider frame current: if pincount(current) = 0 and referenced[current] = 0, current is the victim; otherwise, set referenced[current] ← 0, current ← (current + 1) mod N, and repeat. Generalization: LRU(k) takes the timestamps of the last k references into account; standard LRU is LRU(1).
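The Clock policy fits in a few lines of Python; the sketch below (illustrative, with assumed data structures; it presumes that some unpinned, unreferenced frame is eventually found) implements exactly the victim search described above:

    def clock_victim(frames, current):
        """frames: list of dicts with 'pincount' and 'referenced' flags.
        Returns (victim frame index, new position of the clock hand)."""
        N = len(frames)
        while True:
            f = frames[current]
            if f["pincount"] == 0 and f["referenced"] == 0:
                return current, (current + 1) % N     # found a victim
            f["referenced"] = 0                       # give the page a second chance
            current = (current + 1) % N

    frames = [{"pincount": 0, "referenced": 1},
              {"pincount": 1, "referenced": 0},       # pinned, never a victim
              {"pincount": 0, "referenced": 0}]
    victim, hand = clock_victim(frames, current=0)
    print(victim)                                     # 2: frame 0 got its second chance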

46 N.B. LRU as well as Clock are heuristics only. Any heuristic can fail miserably in certain scenarios. A challenge for LRU: a number of transactions want to scan the same sequence of pages (e.g., SELECT * FROM R) one after the other. Assume a buffer pool with a capacity of 10 pages. (1) Let the size of relation R be 10 or fewer pages. How many I/Os do you expect? (2) Let the size of relation R be 11 pages. What about the number of I/O operations in this case? Other well-known replacement policies are, e.g., FIFO ("first in, first out"), LIFO ("last in, first out"), LFU ("least frequently used"), MRU ("most recently used"), GCLOCK ("generalized clock"), DGCLOCK ("dynamic GCLOCK"), LRD ("least reference density"), WS and HS ("working set", "hot set", see above), and Random.

47 LRD (least reference density) Record the following three parameters: trc(t) ... total reference count of transaction t; age(p) ... value of trc(t) at the time of loading p into the buffer; rc(p) ... reference count of page p. Update these parameters during a transaction's page references (pinpage calls). From those, compute the mean reference density of a page p at time t as rd(p, t) := rc(p) / (trc(t) - age(p)), where trc(t) ≥ rc(p) ≥ 1. Strategy for victim selection: choose the page with the least reference density rd(p, t). There are many variants, e.g., for gradually disregarding old references.
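For illustration, the reference density computation as a tiny Python helper (the argument names follow the slide's parameters; the sample numbers are made up):

    def rd(rc_p, trc_t, age_p):
        """Mean reference density of page p at time t."""
        return rc_p / (trc_t - age_p)

    # a page loaded when trc was 40, referenced 5 times, with trc now at 100:
    print(rd(5, 100, 40))     # ~0.083; a replacement candidate if this is the minimum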

48 Exploiting semantic knowledge The query compiler/optimizer selects the access plan, e.g., sequential scan vs. index, and estimates the number of page I/Os for cost-based optimization. Idea: use this information to determine a query-specific, optimal buffer budget: the Query Hot Set model. Goals: optimize overall system throughput; avoiding thrashing is the most important goal.

49 Hot Set with disjoint page sets (1) Only those queries are activated whose Hot Set buffer budget can be satisfied immediately. (2) Queries with higher demands have to wait until their budget becomes available. (3) Within its own buffer budget, each transaction applies a local LRU policy. Properties: no sharing of buffered pages between transactions; risk of internal thrashing when Hot Set estimates are wrong; queries with large Hot Sets block subsequent small queries (or, if bypassing is permitted, many small queries can lead to starvation of large ones).

50 Hot Set with non-disjoint page sets (1) Queries allocate their budget stepwise, up to the size of their Hot Set. (2) Local LRU stacks are used for replacement. (3) Request for a page p: (i) if found in the own LRU stack: update the LRU stack; (ii) if found in another transaction's LRU stack: access the page, but don't update the other LRU stack; (iii) if found in the freelist: push the page onto the own LRU stack. (4) unpinpage: push the page onto the freelist stack. (5) Filling empty buffer frames: taken from the bottom of the freelist stack. N.B. As long as a page is in a local LRU stack, it cannot be replaced. If a page drops out of a local LRU stack, it is pushed onto the freelist stack. A page is replaced only if it reaches the bottom of the freelist stack before some transaction pins it again.

51 Priority Hints Idea: with unpinpage, a transaction gives one of two possible indications to the buffer manager: preferred page (managed in a TX-local partition) or ordinary page (managed in a global partition). Strategy: when a page needs to be replaced, (1) try to replace an ordinary page from the global partition using LRU; (2) otherwise, replace a preferred page of the requesting TX according to MRU. Advantages: much simpler than DBMIN ("Hot Set"), similar performance, easy to deal with partitions that are too small.

52 Prefetching When the buffer manager receives requests for (single) pages, it may decide to (asynchronously) read ahead: on-demand, asynchronous read-ahead, e.g., when traversing the sequence set of an index or during a sequential scan of a relation; heuristic (speculative) prefetching, e.g., sequential n-block lookahead (cf. drive or controller buffers in hard disks), semantically determined supersets, index prefetch, ...

53 2.3.3 Buffer management in DBMSs vs. OSs Buffer management for a DBMS curiously tastes like the virtual memory concept of modern operating systems (generally implemented using a hardware interrupt mechanism called page faulting). Both techniques provide access to more data than will fit into primary memory. So: why don't we use OS virtual memory facilities to implement DBMSs? A DBMS can predict certain reference patterns for pages in a buffer a lot better than a general-purpose OS. This is mainly because page references in a DBMS are initiated by higher-level operations (sequential scans, relational operators) by the DBMS itself. Reference pattern examples in a DBMS: (1) sequential scans call for prefetching; (2) nested-loop joins call for page fixing and hating. Moreover, concurrency control protocols often prescribe the order in which pages are written back to disk; operating systems usually do not provide hooks for that.

54 Double Buffering If the DBMS uses its own buffer manager (within the virtual memory of the DBMS server process), independently from the OS VM manager, we may experience the following: Virtual page fault: the page resides in the DBMS buffer, but its frame has been swapped out of physical memory by the OS VM manager; an I/O operation is necessary that is not visible to the DBMS. Buffer fault: the page does not reside in the DBMS buffer, the frame is in physical memory; regular DBMS page replacement, requiring an I/O operation. Double page fault: the page does not reside in the DBMS buffer, and the frame has been swapped out of physical memory by the OS VM manager; two I/O operations are necessary: one to bring in the frame (OS; the OS VM does not know dirty flags, hence it brings in pages that could simply be overwritten), and another one to replace the page in that frame (DBMS). Consequence: the DBMS buffer needs to be memory resident in the OS.

55 2.4 File and record organization [Figure: DBMS architecture overview; you are here: Files and Index Structures.] We will now turn away from page management and will instead focus on page usage in a DBMS. On the conceptual level, a relational DBMS manages tables of tuples (more precisely, "table" actually means bag here: a set of elements with multiplicity ≥ 0), e.g. a table with columns A, B, C and rows like (true, foo, ...). On the physical level, such tables are represented as files of records (tuple ≙ record); each page holds one or more records (in general, record size ≪ page size). A file is a collection of records that may reside on several pages.

56 2.4.1 Heap files The simplest file structure is the heap file, which represents an unordered collection of records. As in any file structure, each record has a unique record identifier (rid). A typical heap file interface supports the following operations: create/destroy heap file f named n: createfile(n) / deletefile(f); insert record r and return its rid: insertrecord(f, r); delete a record with a given rid: deleterecord(f, rid); get a record with a given rid: getrecord(f, rid); initiate a sequential scan over the whole heap file: openscan(f). N.B. Record ids (rids) are used like record addresses (or pointers). Internally, the heap file structure must be able to map a given rid to the page containing the record.

57 To support openscan(f), the heap file structure has to keep track of all pages in file f; to support insertrecord(f, r) efficiently, we need to keep track of all pages with free space in file f. Let us have a look at two simple structures which can offer this support. 2.4.2 Linked list of pages When createfile(n) is called, (1) the DBMS allocates a free page (the file header) and writes an appropriate entry <n, header page> to a known location on disk; (2) the header page is initialized to point to two doubly linked lists of pages: a linked list of pages with free space and a linked list of full pages [Figure: header page pointing to both lists of data pages]; (3) initially, both lists are empty.

58 Remarks: For insertrecord(f, r), (1) try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p; (2) record r is written to page p; (3) since generally |r| ≪ |p|, p will belong to the list of pages with free space; (4) a unique rid for r is computed and returned to the caller. For openscan(f), both page lists have to be traversed. A call to deleterecord(f, rid) may result in moving the containing page from the full to the free page list, or even lead to page deallocation if the page is completely free after the deletion. Finding a page with sufficient free space is an important problem to solve inside insertrecord(f, r). How does the heap file structure support this operation? (How many pages of a file do you expect to be in the list of free pages?)

59 2.4.3 Directory of pages An alternative to the linked list approach is to maintain a directory of pages in a file. The header page contains the first page of a chain of directory pages; each entry in a directory page identifies a page of the file. [Figure: header page followed by a chain of page directory pages, whose entries point to the data pages.]

60 Remarks: Free space management is also done via the directory: each directory entry is actually of the form <page addr p, nfree>, where nfree indicates the actual amount of free space (e.g. in bytes) on page p. I/O operations and free space management: for a file of 10,000 pages, give lower and upper bounds for the number of page I/O operations during an insertrecord(f, r) call for a heap file organized using (1) a linked list of pages, (2) a directory of pages (1000 directory entries/page). Linked list: lower bound = header page + first page in free list + write r = 3 page I/Os; upper bound = header page + all pages in free list + write r page I/Os. Directory (1000 entries/page): lower bound = directory header page + write r = 2 page I/Os; upper bound = 10 directory pages + write r = 11 page I/Os.

61 2.5 Page formats Locating the containing data page for a given rid is not the whole story when the DBMS needs to access a specific record: the internal structure of pages plays a crucial role. For the following discussion, we consider a page to consist of a sequence of slots, each of which contains a record. A complete record identifier then has the unique form <page addr, nslot>, where nslot denotes the slot number on the page.

62 2.5.1 Fixed-length records Life is particularly easy if all records on the page (in the file) are of the same size s. getrecord(f, <p, n>): given the rid <p, n>, we know that the record is to be found at (byte) offset n · s on page p. deleterecord(f, <p, n>): copy the bytes of the last occupied slot on page p to offset n · s and mark the last slot as free; all occupied slots thus appear together at the start of the page (the page is packed). insertrecord(f, r): find a page p with free space ≥ s (see previous section); copy r to the first free slot on p and mark the slot as occupied. Packed pages and deletions: one problem with packed pages remains, though: calling deleterecord(f, <p, n>) modifies the rid of a different record <p, n'> on the same page. If any external reference to this record exists, we need to chase the whole database and update rid references <p, n'> to <p, n>. Bad!

63 To avoid record copying (and thus rid modifications), we could simply use a free slot bitmap on each page. [Figure: a packed page (slots 0 ... N-1 followed by free space; the page header stores the number N of records) vs. an unpacked page with a bitmap (slots 0 ... M-1, some free; the page header stores the slot bitmap and the number M of slots).] Calling deleterecord(f, <p, n>) simply means setting bit n in the bitmap to 0; no other rids are affected. Page header or trailer? In both page organization schemes we have positioned the page header at the end of its page. How would you justify this design decision?

64 2.5.2 Variable-length records If records on a page are of varying size (cf. the SQL datatype VARCHAR(n)), we have to deal with page fragmentation: in insertrecord(f, r) we have to find an empty slot of size ≥ |r|; at the same time we want to try to minimize waste. To get rid of the holes produced by deleterecord(f, rid), compact the remaining records to maintain a contiguous area of free space on the page. A solution is to maintain a slot directory on each page (compare this with a heap file directory!). [Figure: page p with records addressed via rids <p, 0>, <p, 1>, ..., <p, N-1>; the slot directory at the end of the page stores, per slot, the offset of the record from the start of the data area (e.g., 24 bytes) and its length, plus a pointer to the start of the free space and the number N of entries in the slot directory.]

65 Remarks: The slot directory contains entries <offset, length>, where offset is measured in bytes from the data page start. In deleterecord(f, <p, n>), simply set the offset of directory entry n to -1; such an entry can be reused during subsequent insertrecord(f, r) calls which hit page p. Directory compaction is not allowed in this scheme (again, this would modify the rids of all records <p, n'>, n' > n)! If insertions are much more common than deletions, the directory size will nevertheless be close to the actual number of records stored on the page. N.B. Record compaction (defragmentation) is performed, of course.
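To see how the pieces fit together, here is a toy slotted page in Python (an illustration only; a real page packs everything into one fixed-size byte array, whereas this sketch keeps the slot directory as a Python list and never reclaims the data area): records are stored contiguously in a data area, the slot directory maps slot numbers to <offset, length> entries, and deletion marks a slot with offset -1 while keeping all other rids stable.

    class SlottedPage:
        def __init__(self):
            self.data = bytearray()          # contiguous data area
            self.slots = []                  # slot directory: (offset, length); offset -1 = free

        def insert(self, record: bytes):
            offset = len(self.data)
            self.data += record
            # reuse a free slot if possible, otherwise grow the directory
            for slot_no, (off, _) in enumerate(self.slots):
                if off == -1:
                    self.slots[slot_no] = (offset, len(record))
                    return slot_no           # rid = <this page, slot_no>
            self.slots.append((offset, len(record)))
            return len(self.slots) - 1

        def get(self, slot_no):
            offset, length = self.slots[slot_no]
            assert offset != -1, "record was deleted"
            return bytes(self.data[offset:offset + length])

        def delete(self, slot_no):
            self.slots[slot_no] = (-1, 0)    # other slot numbers (rids) stay valid

    page = SlottedPage()
    r0 = page.insert(b"Smith, 44, 3000")
    r1 = page.insert(b"Jones, 40, 6003")
    page.delete(r0)
    print(page.get(r1))                      # rid <p, 1> is unaffected by the deletion

Defragmentation of the data area (moving record bytes while leaving the slot directory entries in place and only updating their offsets) could be added without invalidating any rid, which is exactly the point of the indirection.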

66 2.6 Record formats This section zooms into the record internals themselves, thus discussing access to single record fields (conceptually: attributes). Attribute values are considered atomic by an RDBMS. Depending on the field types, we are dealing with fixed-length or variable-length fields in a record, e.g. (datatype lengths valid for DB2):

SQL datatype   fixed length?   length (# of bytes)
INTEGER        yes             4
BIGINT         yes             8
CHAR(n)        yes             n, 1 ≤ n ≤ 254
VARCHAR(n)     no              1 ... n
CLOB(n)        no              1 ... n, 1 ≤ n ≤ 2^31
DATE           yes             4

The DBMS computes and then saves the field size information for the records of a file in the system catalog when it processes the corresponding CREATE TABLE command.

67 2.6.1 Fixed-length fields If all fields in a record are of fixed length, offsets for field access can simply be read off the DBMS system catalog (field f_i of size l_i): for base address b, the address of field f3, for example, is b + l1 + l2. 2.6.2 Variable-length fields If a record contains one or more variable-length fields, other record representations are needed: (1) use a special delimiter symbol ($) to separate the record fields; accessing field f_i then means scanning over the bytes of fields f1 ... f_(i-1); (2) for a record of n fields, use an array of n + 1 offsets pointing into the record (the last array entry marks the end of field f_n). [Figure: record layouts f1 $ f2 $ f3 $ f4 $ (delimiters) vs. an offset array pointing to f1, f2, f3, f4.]
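A quick Python illustration of the two variable-length layouts (a sketch with assumed encodings, not a DBMS wire format): the delimiter variant forces a scan, while the offset-array variant gives direct access to any field.

    fields = [b"Smith", b"44", b"3000"]

    # Variant 1: delimiter-separated record; field access scans from the left.
    record_delim = b"$".join(fields) + b"$"
    def field_delim(record, i):
        return record.split(b"$")[i]                # touches all bytes up to field i

    # Variant 2: offset array (n + 1 offsets); field i is record[offsets[i]:offsets[i+1]].
    offsets, pos = [], 0
    for f in fields:
        offsets.append(pos)
        pos += len(f)
    offsets.append(pos)                             # last entry marks the end of the last field
    record_body = b"".join(fields)

    def field_offsets(body, offsets, i):
        return body[offsets[i]:offsets[i + 1]]      # direct access, no scanning

    print(field_delim(record_delim, 2))             # b'3000'
    print(field_offsets(record_body, offsets, 2))   # b'3000'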

68 Final remarks: Variable-length record formats seem to be more complicated but, among other advantages, they allow for a compact representation of SQL NULL values: with the offset array, a NULL field (field f3 below) is simply represented by two identical consecutive offsets. [Figure: f1 $ f2 $ $ f4 vs. the offset array pointing to f1, f2, f4.] Growing a record: consider an update on a field (e.g. of type VARCHAR(n)) which lets the record grow beyond the size of the free space on its containing page. How could the DBMS handle this situation efficiently? Really growing a record: for fields of type VARCHAR(n) or CLOB(n) with n > page size, we are in trouble whenever a record actually grows beyond the page size (the record won't fit on any one page). How could the DBMS file manager cope with this?

69 2.7 Addressing schemes What makes a good record id (rid)? Given a rid, it should ideally not take more than 1 page I/O to get to the record itself; rids should be stable under all circumstances, such as a record being moved within a page or a record being moved across pages. Why are these goals important to achieve? Consider the fact that rids are used as persistent pointers in a DBMS (indexes, directories, implementation of CODASYL sets, ...). Conflicting goals! Efficiency calls for a direct disk address, while stability calls for some kind of indirection.

70 Direct addressing RBA (relative byte address): consider the disk file as a persistent virtual address space and use the byte offset as rid. Pro: very efficient access to the page and to the record within the page. Con: no stability at all w.r.t. moving records. PP (page pointers): use disk page numbers as rid. Pro: very efficient access to the page; locating the record within the page is cheap (an in-memory operation). Con: stable w.r.t. moving records within a page, but not when moving across pages.

71 Indirect addressing LSN (logical sequence numbers): assign logical numbers to records. An address translation table maps them to PPs (or even RBAs). Pro: full stability w.r.t. all relocations of records. Con: additional I/O to the translation table (often in the buffer). CODASYL systems call this the DBTT (database key translation table). [Figure: DBTT example; logical database keys are mapped via the translation table to the physical addresses of the data pages.]

72 Indirect addressing, fancy variant LSN/PPP (LSN with probable page pointers): try to avoid the extra I/O by adding a probable PP (PPP) to LSNs. The PPP is the PP at the time of insertion into the database. If a record is moved across pages, PPPs are not updated! Pro: full stability w.r.t. all record relocations; the PPP can save the extra I/O to the translation table, iff it is still correct. Con: 2 additional page I/Os in case the PPP is no longer valid: one to the old page to notice that the record has moved, and a second I/O to the translation table to look up the new page number.

73 TID addressing TID (tuple identifier with forwarding): use a <PP, slot#> pair as rid (see above). To guarantee stability, leave a forward address on the original page if the record has to be moved across pages. For example, accessing the record with rid = <17, 2> may find, in slot 2 of page 17, a forward address pointing to the record's new page, and follow it. [Figure: page 17, slot 2 contains the forward address; the record itself resides on the target page.] Avoid chains of forward addresses! When the record has to be moved again, do not leave another forward address; rather, update the forward address on the original page. Pro: full stability w.r.t. all relocations of records; no extra I/O due to indirection in the normal case. Con: 1 additional page I/O in case of a forward pointer on the original page.
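A small Python sketch of TID lookup with forwarding (illustrative data structures only; all page/slot numbers other than <17, 2> are made up): each page is a dictionary of slots, a slot holds either a record or a forward address, and a lookup follows at most one forward pointer, which is exactly the one-extra-I/O guarantee described above.

    # pages[page_no][slot_no] is either ("record", payload) or ("forward", (page, slot))
    pages = {
        17: {2: ("forward", (25, 1))},       # record moved away; forward address left behind
        25: {1: ("record", "Smith, 44, 3000")},
    }

    def fetch(tid):
        page_no, slot_no = tid
        kind, value = pages[page_no][slot_no]          # 1st page access
        if kind == "forward":
            page_no, slot_no = value
            kind, value = pages[page_no][slot_no]      # at most one more page access
        assert kind == "record", "forward chains are not allowed"
        return value

    def move(tid, new_tid, payload):
        """Move a record again; keep the forward address on the original page up to date."""
        page_no, slot_no = new_tid
        pages.setdefault(page_no, {})[slot_no] = ("record", payload)
        orig_page, orig_slot = tid
        pages[orig_page][orig_slot] = ("forward", new_tid)   # the original rid stays valid
        # (a real system would also free the record's previous location)

    print(fetch((17, 2)))                              # 'Smith, 44, 3000'
    move((17, 2), (31, 0), "Smith, 44, 3000")          # moved again: update forward on page 17
    print(fetch((17, 2)))                              # still resolvable with one extra access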

74 Bibliography
Brown, K., Carey, M., and Livny, M. (1996). Goal-oriented buffer management revisited. In Proc. ACM SIGMOD Conference on Management of Data.
Chen, P., Lee, E., Gibson, G., Katz, R. H., and Patterson, D. (1994). RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2).
Denning, P. (1968). The working-set model for program behaviour. Communications of the ACM, 11(5).
Elmasri, R. and Navathe, S. (2000). Fundamentals of Database Systems. Addison-Wesley, Reading, MA, 3rd edition. Title of the 2002 German edition: Grundlagen von Datenbanken.
Härder, T. (1987). Realisierung von operationalen Schnittstellen, chapter 3 in (Lockemann and Schmidt, 1987). Springer.
Härder, T. (1999). Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer.
Härder, T. and Rahm, E. (2001). Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer Verlag, Berlin, Heidelberg, 2nd edition.
Heuer, A. and Saake, G. (1999). Datenbanken: Implementierungstechniken. Int'l Thompson Publishing, Bonn.
Lockemann, P. and Dittrich, K. (1987). Architektur von Datenbanksystemen, chapter 2 in (Lockemann and Schmidt, 1987). Springer.

75 Lockemann, P. and Schmidt, J., editors (1987). Datenbank-Handbuch. Springer-Verlag.
Mitschang, B. (1995). Anfrageverarbeitung in Datenbanksystemen - Entwurfs- und Implementierungsaspekte. Vieweg.
O'Neil, E., O'Neil, P., and Weikum, G. (1999). An optimality proof of the LRU-k page replacement algorithm. Journal of the ACM, 46(1).
Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.
Stonebraker, M. (1981). Operating systems support for database management. Communications of the ACM, 24(7).

76 Module 3: File Organizations and Indexes. Module Outline: 3.1 Comparison of file organizations, 3.2 Overview of indexes, 3.3 Properties of indexes, 3.4 Indexes and SQL. (This is a repetition of material from Information Systems.) [Figure: DBMS architecture overview; you are here: Files and Index Structures.]

77 A heap file provides just enough structure to maintain a collection of records (of a table). The heap file supports sequential scans (openscan) over the collection, e.g.
  SELECT A, B FROM R
No further operations receive specific support from the heap file. For queries like
  SELECT A, B FROM R WHERE C > 42
or
  SELECT A, B FROM R ORDER BY C ASC
it would definitely be helpful if the SQL query processor could rely on a particular organization of the records in the file for table R. File organization for table R: which organization of records in the file for table R could speed up the evaluation of the two queries above?

78 This section presents a comparison of 3 file organizations: (1) files of randomly ordered records (heap files), (2) files sorted on some record field(s), (3) files hashed on some record field(s). It also introduces the index concept: a file organization is tuned to make a certain query (class) efficient, but if we have to support more than one query class, we may be in trouble. Consider:
  Q: SELECT A, B, C FROM R WHERE A > 0 AND A < 100
If the file for table R is sorted on C, this does not buy us anything for query Q. If Q is an important query but is not supported by R's file organization, we can build a support data structure, an index, to speed up (queries similar to) Q.

79 3.1 Comparison of file organizations We will now enter a competition in which 3 file organizations are assessed in 5 disciplines: (1) Scan: fetch all records in a given file. (2) Search with equality test: needed to implement SQL queries like
  SELECT * FROM R WHERE C = 42
(3) Search with range selection: needed to implement SQL queries like (the upper or lower bound might be unspecified)
  SELECT * FROM R WHERE A > 0 AND A < 100
(4) Insert a given record in the file, respecting the file's organization. (5) Delete a record (identified by its rid), fixing up the file's organization if needed.

80 3.1.1 Cost model Performing these 5 database operations clearly involves block I/O, the major cost factor. However, we have to additionally pay for CPU time used to search inside a page, compare a record field to a selection constant, etc. To analyze cost more accurately, we introduce the following parameters:

Parameter   Description
b           # of pages in the file
r           # of records on a page
D           time needed to read/write a disk page
C           CPU time needed to process a record (e.g., compare a field value)
H           CPU time taken to apply a hash function to a record

Remarks: D ≈ 15 ms, C ≈ H ≈ 0.1 µs. This is a coarse model to estimate the actual execution time (we do not model network access, cache effects, burst I/O, ...).

81 Aside: Hashing A hashed file uses a hash function h to map a given record onto a specific page of the file. Example: h uses the lower 3 bits of the first field (of type INTEGER) of the record to compute the corresponding page number:
  h(<42, true, "foo">) = 2   (42 = 101010 in binary, lower 3 bits: 010)
  h(<14, true, "bar">) = 6   (14 = 1110 in binary, lower 3 bits: 110)
  h(<26, false, "baz">) = 2  (26 = 11010 in binary, lower 3 bits: 010)
The hash function determines the page number only; record placement inside a page is not prescribed by the hashed file. If a page p is filled to capacity, a chain of overflow pages is maintained (hanging off page p) to store additional records with h(...) = p. To avoid immediate overflowing when a new record is inserted into a hashed file, pages are typically filled to 80 % only when a heap file is initially (re)organized into a hashed file. (We will come back to hashing later.)
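The example hash function in Python (a direct transcription of the "lower 3 bits" rule; representing a record as a tuple is an assumption made for illustration):

    def h(record):
        first_field = record[0]          # an INTEGER
        return first_field & 0b111       # lower 3 bits give the page number, 0..7

    print(h((42, True, "foo")))          # 2
    print(h((14, True, "bar")))          # 6
    print(h((26, False, "baz")))         # 2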

82 3.1.2 Scan (1) Heap file: scanning the records of a file involves reading all b pages as well as processing each of the r records on each page: Scan_heap = b · (D + r · C). (2) Sorted file: the sort order does not help much here; however, the scan retrieves the records in sorted order (which can be a big plus): Scan_sort = b · (D + r · C). (3) Hashed file: again, the hash function does not help. We simply scan from the beginning (skipping over the spare free space typically found in hashed files): Scan_hash = (100/80) · b · (D + r · C) = 1.25 · b · (D + r · C). Scanning a hashed file: in which order does a scan of a hashed file retrieve its records?

83 3.1.3 Search with equality test (A = const) (1) Heap file: the equality test is (a) on a primary key, (b) not on a primary key: (a) Search_heap = 1/2 · b · (D + r · C); (b) Search_heap = b · (D + r · C). (2) Sorted file (sorted on A): we assume the equality test to be on the field determining the sort order. The sort order enables us to use binary search: Search_sort = log2(b) · D + log2(r) · C. (If more than one record qualifies, all other matches are stored right after the first hit.) (3) Hashed file (hashed on A): hashed files support equality searching best. The hash function directly leads us to the page containing the hit (overflow chains ignored here): (a) Search_hash = H + D + 1/2 · r · C; (b) Search_hash = H + D + r · C. (All qualifying records are on the same page or, if present, in its overflow chain.)

84 3.1.4 Search with range selection (A ≥ lower AND A ≤ upper) (1) Heap file: qualifying records can appear anywhere in the file: Range_heap = b · (D + r · C). (2) Sorted file (sorted on A): use equality search (with A = lower), then sequentially scan the file until a record with A > upper is found: Range_sort = log2(b) · D + log2(r) · C + n/r · D + n · C (n denotes the number of hits in the range). (3) Hashed file (hashed on A): hashing offers no help here, as hash functions are designed to scatter records all over the hashed file (e.g., h(<7, ...>) = 7, h(<8, ...>) = 0): Range_hash = 1.25 · b · (D + r · C).

85 3.1.5 Insert (1) Heap file: we can add the record to some arbitrary page (e.g., the last page). This involves reading and writing the page: Insert_heap = 2 · D + C. (2) Sorted file: on average, the new record will belong in the middle of the file. After insertion, we have to shift all subsequent records (in the latter half of the file): Insert_sort = log2(b) · D + log2(r) · C [search] + 1/2 · b · (2 · D + r · C) [shift latter half]. (3) Hashed file: we pretend to search for the record, then read and write the page determined by the hash function (we assume the spare 20 % space on the page is sufficient to hold the new record): Insert_hash = H + D [search] + C + D.

86 3.1.6 Delete (record specified by its rid) (1) Heap file: if we do not try to compact the file (because the file uses free space management) after we have found and removed the record, the cost is Delete_heap = D [search by rid] + C + D. (2) Sorted file: again, we access the record's page and then (on average) shift the latter half of the file to compact it: Delete_sort = D + 1/2 · b · (2 · D + r · C) [shift latter half]. (3) Hashed file: accessing the page using the rid is even faster than the hash function, so the hashed file behaves like the heap file: Delete_hash = D + C + D.
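All of these cost formulas are easy to play with in code. The small Python sketch below (parameter values from the cost model slide; the formulas are the ones derived above) reproduces the kind of numbers plotted on the next slide:

    from math import log2

    D = 15e-3        # time to read/write a disk page (15 ms)
    C = 0.1e-6       # CPU time per record (0.1 microseconds)
    H = 0.1e-6       # CPU time per hash computation
    r = 100          # records per page

    def range_sorted(b, n):      # range selection on a sorted file, n hits
        return log2(b) * D + log2(r) * C + n / r * D + n * C

    def range_heap(b):           # range selection on a heap file (hashed: multiply by 1.25)
        return b * (D + r * C)

    def delete_sorted(b):        # delete on a sorted file: shift the latter half
        return D + 0.5 * b * (2 * D + r * C)

    for b in (1000, 10_000, 100_000):
        print(f"b = {b:>7}: range sorted {range_sorted(b, 10):8.3f} s, "
              f"range heap {range_heap(b):8.1f} s, delete sorted {delete_sorted(b):8.1f} s")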

87 There is no single file organization that serves all 5 operations equally fast. Dilemma: more advanced file organizations make a real difference in speed! [Plot: range selection time vs. file size b for a sorted file vs. a heap/hashed file (D = 15 ms, C = 0.1 µs, r = 100, n = 10).] [Plot: deletion time vs. file size b for a sorted file vs. a heap/hashed file (as above, n = 1).] There exist index structures which offer all the advantages of a sorted file and support insertions/deletions efficiently (at the cost of a modest space overhead): B+ trees.

88 3.2 Overview of indexes If the basic organization of a file does not support a particular operation, we can additionally maintain an auxiliary structure, an index, which adds the needed support. We will use indexes like guides. Each guide is specialized to accelerate searches on a specific attribute A (or a combination of attributes) of the records in its associated file: (1) query the index for the location of a record with A = k (k is the search key); (2) the index responds with an associated index entry k* (k* contains enough information to access the actual record in the file); (3) read the actual record by using the guiding information in k*; the record will have an A-field with value k. (This is true for so-called exact match indexes. In the more general case of similarity indexes, the records are not guaranteed to contain the value k; they are only candidates for having this value.) [Figure: search key k goes into the index, the index yields index entry k*, which leads to the record <..., A = k, ...> in the data file.]

89 We can design the index entries, i.e., the k*, in various ways:

Variant   Index entry k*
(a)       <k, ..., A = k, ...>  (the actual data record with A = k)
(b)       <k, rid>
(c)       <k, [rid1, rid2, ...]>

Remarks: With variant (a), there is no need to store the data records in addition to the index; the index itself is a special file organization. If we build multiple indexes for a file, at most one of these should use variant (a) to avoid redundant storage of records. Variants (b) and (c) use rid(s) to point into the actual data file. Variant (c) leads to fewer index entries if multiple records match a search key k, but index entries are of variable length.

90 Example: The data file contains <name, age, sal> records; the file itself (index entry variant (a)) is hashed on field age (hash function h1). The index file contains <sal, rid> index entries (variant (b)), pointing into the data file. This file organization plus index efficiently supports equality searches on the age and sal keys. [Figure: the data file hashed on age via h1, with buckets h(age) = 0, 1, 2 holding records such as Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004; Ashby, 25, 3000; Basu, 33, 4003; Bristow, 29, 2007; Cass, 50, 5004; Daniels, 22, 6003; and the index file of <sal, rid> entries hashed on sal via h2, whose rids point into the data file.] (The steps refer to the index lookup scheme at the beginning of this section.)

91 3.3 Properties of indexes 3.3.1 Clustered vs. unclustered indexes Suppose we have to support range selections on records such that lower ≤ A ≤ upper for field A. If we maintain an index on the A-field, we can (1) query the index once for a record with A = lower, and then (2) sequentially scan the data file from there until we encounter a record with field A > upper. This will work provided that the data file is sorted on field A. [Figure: a clustered B+ tree index; the index entries (k*) in the index file point to data records stored in the same order in the data file.]

92 If the data file associated with an index is sorted on the index search key, the index is said to be clustered. In general, the cost for a range selection grows tremendously if the index on A is unclustered. In this case, proximity of index entries does not imply proximity in the data file. As before, we can query the index for a record with A = lower. To continue the scan, however, we have to revisit the index entries, which point us to data pages scattered all over the data file. [Figure: an unclustered B+ tree index; the index entries (k*) point to data records spread across the data file.] Remarks: If the index entries (k*) are of variant (a), the index is obviously clustered by definition. A data file can have at most one clustered index (but any number of unclustered indexes).

93 3.3.2 Dense vs. sparse indexes A clustered index comes with more advantages than the improved speed for range selections presented above. We can additionally design the index to be space efficient: To keep the size of the index file small, we maintain one index entry k* per data file page (not one index entry per data record). Key k is the smallest search key on that page. Indexes of this kind are called sparse (otherwise indexes are dense). To search a record with field A = k in a sparse A-index, we 1 locate the index entry <k', ...> with the largest k' such that k' ≤ k, then 2 access the page pointed to by this entry, and 3 scan this page (and the following pages, if needed) to find records with <..., A = k, ...>. Since the data file is clustered (i.e., sorted) on field A, we are guaranteed to find matching records in the proximity. 93
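A minimal sketch of this sparse-index lookup, assuming an in-memory list of pages sorted on A and one (smallest key, page) entry per page; the data layout and names are illustrative, not the slides' implementation:

    from bisect import bisect_left

    def sparse_lookup(page_keys, pages, k):
        """page_keys[j] is the smallest A-value on page j (data file sorted on A);
        pages[j] is the list of records on page j, each record starting with its A-value."""
        # step 1: locate the last page whose smallest key is still <= k
        j = max(bisect_left(page_keys, k) - 1, 0)
        hits = []
        # steps 2/3: scan this page and following pages while they may still hold A = k
        while j < len(pages) and page_keys[j] <= k:
            hits.extend(r for r in pages[j] if r[0] == k)
            j += 1
        return hits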

94 Example: Again, the data file contains <name, age, sal> records. We maintain a clustered sparse index on field name and an unclustered dense index on field age. Both use index entry variant b to point into the data file: [Figure: sparse index on name with entries Ashby, Cass, Smith (one per data page) and dense index on age (one entry per record) pointing into the data file of records Ashby, 25, 3000; Basu, 33, 4003; Bristow, 30, 2007; Cass, 50, 5004; Daniels, 22, 6003; Jones, 40, 6003; Smith, 44, 3000; Tracy, 44, ...] Remarks: Sparse indexes need 2-3 orders of magnitude less space than dense indexes. We cannot build a sparse index that is unclustered (i.e., there is at most one sparse index per file). SQL queries and index exploitation: How do you propose to evaluate the query SELECT MAX(age) FROM employees? How about SELECT MAX(name) FROM employees? 94

95 3.3.3 Primary vs. secondary indexes Terminology In the literature, you may often find a distinction between primary (mostly used for indexes on the primary key) and secondary (mostly used for indexes on other attributes) indexes. The terminology, however, is not very uniform, so some text books may use those terms for different properties. You might as well find primary denoting variant 1 of indexes according to Section 3.2, while secondary may be used to characterize the other two variants 2 and 3. 95

96 3.3.4 Multi-attribute indexes Each of the index techniques sketched so far and discussed in the sequel can be applied to a combination of attribute values in a straightforward way: concatenate the indexed attributes to form an index key, e.g. (lastname, firstname) becomes the searchkey; define the index on searchkey; the index will support lookups based on both attribute values, e.g. ... WHERE lastname = 'Doe' AND firstname = 'John' ...; possibly it will also support lookups based on a prefix of the values, e.g. ... WHERE lastname = 'Doe' .... There are more sophisticated index structures that can provide support for symmetric lookups on each attribute alone (or, more generally, on all subsets of the indexed attributes). These are often called multi-dimensional indexes. A large number of such indexes have been proposed, especially for geometric applications. 96

97 3.4 Indexes and SQL The SQL-92 standard does not include any statement for the specification (creation, dropping) of index structures. In fact, SQL does not even require SQL systems to provide indexes at all! Nonetheless, practically all SQL implementations support one or more kinds of indexes. A typical SQL statement to create an index would look like this: I Index specification in SQL 1 CREATE INDEX IndAgeRating ON Students 2 WITH STRUCTURE = BTREE, 3 KEY = (age,gpa) N.B. SQL-99 does not include indexes either. 97

98 Bibliography Elmasri, R. and Navathe, S. (2000). Fundamentals of Database Systems. Addison-Wesley, Reading, MA., 3 edition. Titel der deutschen Ausgabe von 2002: Grundlagen von Datenbanken. Härder, T. (1999). Springer. Datenbanksysteme: Konzepte und Techniken der Implementierung. Heuer, A. and Saake, G. (1999). Datenbanken: Implementierungstechniken. Int l Thompson Publishing, Bonn. Mitschang, B. (1995). Anfrageverarbeitung in Datenbanksystemen - Entwurfs- und Implementierungsaspekte. Vieweg. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. 98

99 Module 4: Tree-Structured Indexing Module Outline Web Forms Applications SQL Interface 4.1 B + trees 4.2 Structure of B + trees 4.3 Operations on B + trees 4.4 Extensions 4.5 Generalized Access Path Plan Executor Operator Evaluator 4.6 ORACLE Clusters Files and Index Structures Transaction Manager Lock Manager Concurrency Control SQL Commands Buffer Manager Disk Space Manager Parser Optimizer Query Processor You are here! Recovery Manager DBMS Index Files Data Files System Catalog Database 99

100 4.1 B + trees Here we review an index structure which especially shines if we need to support range selections (and thus sorted file scans): B + trees. B + trees refine the idea underlying binary search on a sorted file by introducing a high fan-out, multi-level path selection mechanism. B + trees provide a balanced index structure that is resistant to data skew and automatically adapts to dynamic inserts and deletes. non leaf level leaf level (sequence set) 100

101 4.2 Structure of B+ trees Properties of B+ trees: Order: d. Occupancy: each non-leaf node holds at least d and at most 2d keys (exception: the root may hold as few as 1 key); each leaf node holds between d and 2d index entries 1. Fan-out: each non-leaf node holding m keys has m + 1 children (subtrees). Sorted order: all nodes contain entries in ascending key order; child pointer p_i (1 ≤ i ≤ m - 1) of an internal node with m keys k_1 ≤ ... ≤ k_m leads to a subtree in which all keys k satisfy k_i ≤ k < k_{i+1}; p_0 points to a subtree with keys k < k_1 and p_m to a subtree with keys k ≥ k_m. Balance: all leaf nodes are on the same level. Height: because of the high fan-out, B+ trees have a low height, namely ≈ log_F N, for N the total number of index entries/records and F the average fan-out. 1 Depending on the kind of index entries, we sometimes use a separate order d' for leaf nodes. 101

102 Structure of non-leaf nodes: [Figure: non-leaf node layout <p_0, k_1, p_1, k_2, p_2, ..., k_m, p_m>, i.e., m keys separated by m + 1 child pointers.] A leaf node entry with key value k is denoted as k* as usual. Note that we can use all index entry variants a ... c to implement the leaf entries: for variant a, the B+ tree represents the index as well as the data file itself (i.e., a leaf node contains the actual data records): k_i* = <k_i, ... data values ...>. For variants b and c, the B+ tree lives in a file distinct from the actual data file; the leaf entries contain (one or more) rid(s) pointing into the data file: k_i* = <k_i, rid> or k_i* = <k_i, {rid}>. Leaf nodes are chained together in a doubly linked list, the so-called sequence set, to support range queries and sorted sequential access efficiently. 102

103 4.3 Operations on B + trees Search: Given search key k and a B + tree of height (= number of levels) h, we need h page accesses to find the index entry for k (successful search) or to determine that there is no corresponding record (unsuccessful search). Range selection starts the search with one of the interval boundaries and uses the sequence set to proceed. Insert: First, we find the right leaf node, insert the entry, if space is left and are done. If no space left, first try to move entries to neighbors, if they have space left. Otherwise, split the leaf node, redistribute entries, insert new separator one level higher. Split may propagate upwards, ultimately tree may grow in height by one. Delete: Search index entry and delete it. If underflow: first try to redistribute entries from neighbors. If they would also underflow: merge two neighbors and delete separator from one level higher. Merge may propagate upwards, ultimately tree may shrink in height by one. 103
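The search and range-selection procedures just described can be sketched for an in-memory B+ tree as follows; the Node layout is an assumption for illustration only, and insert/delete with splitting and merging are omitted:

    from bisect import bisect_right

    class Node:
        """B+ tree node: inner nodes carry separator keys and children,
        leaves carry (key, rid) entries and a next link (the sequence set)."""
        def __init__(self, leaf):
            self.leaf = leaf
            self.keys, self.children, self.entries, self.next = [], [], [], None

    def search(root, k):
        """Descend from the root to the leaf that may contain key k (h page accesses)."""
        node = root
        while not node.leaf:
            i = bisect_right(node.keys, k)   # follow p_i where k_i <= k < k_{i+1}
            node = node.children[i]
        return node

    def range_scan(root, lo, hi):
        """Range selection: locate the leaf for lo, then walk the sequence set."""
        leaf, out = search(root, lo), []
        while leaf is not None:
            for key, rid in leaf.entries:
                if key > hi:
                    return out
                if key >= lo:
                    out.append((key, rid))
            leaf = leaf.next
        return out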

104 4.4 Extensions 4.4.1 Key compression Recall the analysis of the search I/O effort s in a B+ tree for a file of N pages. The fan-out F plays a major role: [Plot: s = log_F N (in I/Os) against N (up to 10^7 pages) for several fan-outs between F = 10 and F = 1000; larger fan-outs yield dramatically fewer I/Os.] It clearly pays off to invest effort and try to maximize the fan-out F of a given B+ tree implementation. 104

105 Index entries in inner (i.e., non-leaf) B+ tree nodes are pairs <k_i, pointer p_i>. The representation of the pointers is prescribed by the DBMS or hardware specifics, and especially for key field types like CHAR(·) or VARCHAR(·) we will have |p_i| ≪ |k_i|, i.e., the keys dominate the node size. To minimize the size of keys, observe that key values in inner index nodes are used only to direct traffic to the appropriate leaf page: During the search procedure, we need to find the index i in each visited node such that k_i ≤ k < k_{i+1}, and then we follow link p_i. We do not need the key values k_i in their entirety to use them as guides. Rather, we could arbitrarily choose any suitable value k_i' such that k_i' separates the values to its left from those to its right. In particular, we can choose as short a value as possible. For text attributes, a good choice can be prefixes of key values. 105

106 Example: To guide a search across this B+ tree node [Figure: inner node holding the two keys 'ironical' and 'irregular'] it is sufficient to store the prefixes iro and irr. We must preserve the B+ tree semantics, though: all index entries stored in the subtree left of iro have keys k < iro, and index entries stored in the subtree right of iro have keys k ≥ iro (and k < irr). Key prefix compression: How would a B+ tree key prefix compressor alter the key entries in the inner node of this B+ tree snippet? [Figure: inner node with keys 'ironical' and 'irregular' above leaves containing irish, irksome, iron, ironage, irrational, irreducible, ...] 106
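A sketch of how such a prefix compressor could pick separators, assuming string keys and the convention 'left subtree: k < separator, right subtree: k ≥ separator'; this is an illustration, not a prescribed algorithm:

    def shortest_separator(left_max: str, right_min: str) -> str:
        """Shortest prefix s of right_min with left_max < s <= right_min.
        left_max is the largest key in the left subtree,
        right_min the smallest key in the right subtree."""
        for i in range(1, len(right_min) + 1):
            s = right_min[:i]
            if left_max < s:
                return s
        return right_min

    # example calls with strings from the slide:
    # shortest_separator("irksome", "ironical")   == "iro"
    # shortest_separator("ironical", "irrational") == "irr"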

107 4.4.2 Bulk loading a B+ tree Consider the following database session log (this might as well be commands executed on behalf of a database transaction):
1 connect to database
2
3 db2 => create table foo (A int, bar varchar(10))
4 DB20000I The SQL command completed successfully.
5
6 insert rows into table foo
7
8 db2 => create index foo_A on foo (A asc)
The SQL command in line 8 initiates one call to insert(·) per row already in foo, a so-called bulk load. At least this is not as bad as swapping lines 6 and 8. Why? Anyway, we are going to traverse the growing B+ tree index from its root down to the leaf pages once per inserted index entry. Many DBMS provide a B+ tree bulk loading utility to reduce the cost of operations like the above. 107

108 B+ tree bulk-loading algorithm: 1 For all records (call their keys k) in the data file, create a sorted list of pages of index leaf entries k*. Note: if we are using index entry variants b or c, this does not imply sorting the data file itself. (For variant a, we effectively create a clustered index.) 2 Allocate an empty index root page and let its p_0 page pointer point to the first page of sorted k* entries. Example: State of the bulk-load utility after step 2 (order of the B+ tree d = 1): [Figure: empty root page whose p_0 pointer references the first of the sorted leaf pages 3*, 4* | 6*, 9* | 10*, 11* | 12*, 13* | 20*, 22* | 23*, 31* | 35*, 36* | 38*, 41* | 44*; index leaf pages not yet in the B+ tree are framed.] Bulk-loading continued: Can you anticipate how the bulk loading utility will proceed from this point on? 108

109 We now use the fact that the k* are sorted. Any insertion will thus hit the right-most index node (just above the leaf level). Use a specialized bulk insert procedure that avoids B+ tree root-to-leaf traversals altogether. 3 For each leaf-level page p, insert the index entry <minimum key on p, pointer to p> into the right-most index node just above the leaf level. The right-most node is filled left-to-right. Splits occur only on the right-most path from the leaf level up to the root. Example (continued): [Figures: two successive states of the growing tree; the right-most inner node receives the separators of the next leaf pages and, when it overflows, it is split and a new separator is pushed up towards the root.] 109

110 Example (continued): root page * 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44* root page * 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44* 110

111 Observations Bulk-loading is more (time-) efficient this way, because tree traversals are saved. Furthermore, less page I/Os are necessary (or, in other words: the buffer pool is utilized more effectively). Finally, as seen in the example, bulk-loading is also more space-efficient: all the leaf nodes in the example have been filled up completely. Space efficiency of bulk-loading How would the resulting tree in the above example look like, if you used the standard insert( ) routine on the sorted list of index entries k? Inserting sorted data into a B + tree yields minimum occupancy of (only) d entries in all nodes. 111
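For illustration, here is a sketch of a bottom-up bulk load that packs the sorted leaf entries and then builds each inner level from <minimum key on child, pointer to child> pairs, reusing the Node class from the search sketch above. It produces the same fully packed tree, though it is not literally the right-most-path insertion procedure of the slides; the capacity parameter is an assumption:

    def min_key(node):
        return node.entries[0][0] if node.leaf else min_key(node.children[0])

    def bulk_load(sorted_entries, capacity):
        """sorted_entries: list of (key, rid) pairs, already sorted;
        capacity: maximum number of entries/children per node."""
        # pack the sorted entries into completely filled leaves (the sequence set)
        leaves = []
        for i in range(0, len(sorted_entries), capacity):
            leaf = Node(leaf=True)
            leaf.entries = sorted_entries[i:i + capacity]
            if leaves:
                leaves[-1].next = leaf
            leaves.append(leaf)
        if not leaves:
            return None
        # build inner levels bottom-up until a single root remains
        level = leaves
        while len(level) > 1:
            parents = []
            for i in range(0, len(level), capacity + 1):
                group = level[i:i + capacity + 1]
                parent = Node(leaf=False)
                parent.children = group
                parent.keys = [min_key(child) for child in group[1:]]
                parents.append(parent)
            level = parents
        return level[0]

Note that, as in the example, all leaf nodes come out completely filled (the last node of each level may be under-full).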

112 4.4.3 A note on order We defined B + trees using the concept of order (parameter d of a B + tree). This was useful for presenting the algorithms, but is hardly ever used in practical implementations, because key values may often be of a variable length datatype, duplicates may lead to variable numbers of rids in an index entry k according to variant c, leaf and non-leaf nodes may have different capacities due to index entries of variant a, key compression may introduce variable length separator values,... Therefore, in practice we relax the order concept and replace it with a physical space criterion, such as, each node needs to be at least half-full. 112

113 4.4.4 A note on clustered indexes A clustered index stores the actual data records inside the index structure (variant a entries). If the index is a B + tree, splitting and merging leaf nodes moves data records from one page to another. Depending on the addressing scheme used, the rid of a record may change, if it is moved to another page! Even with the TID addressing scheme, that allows to move records within a page without any effect and that leaves proxies in case a records moves between pages, this may incur an intolerable performance overhead. To avoid having to update other indexes or to avoid many proxies, some systems use the search key of the clustered index as (location independent) record addresses for other, non-clustered indexes. 113

114 4.5 Generalized Access Path A B+ tree structure can also be used to index the records of multiple files at the same time, provided that those files share common attributes. All we have to do is to use index entries that allow for two (or more) data records (variant a), rids (variant b), or rid-lists (variant c), respectively, one for each data file that is being indexed by this so-called generalized access path. Example: If the following B+ tree were built on top of two files, R and S, [Figure: B+ tree whose leaf level holds the entries 2*, 3*, 5*, 7*, 14*, 16*, 19*, 20*, 22*, 24*, 27*, 29*, 33*, 34*, 38*, 39*] the leaf-level entries, e.g. 19*, would be of the form (using variant c here): <19, {rid_1^R, rid_2^R, ...}, {rid_1^S, rid_2^S, ...}>, that is, for each key value, there are two distinct address lists pointing to records of the two files R and S, respectively. 114

115 Generalized Access Path pointing into two files R and S: [Figure: leaf entry for key 19 with two separate rid lists, one pointing into file R and one into file S.] This structure has been proposed in (Härder, 1978). 115

116 4.6 ORACLE Clusters Commercial systems sometimes also provide mechanisms to use one index for multiple data files. ORACLE clusters are one example of such co-clustered indexes. The following sequence of SQL commands (rough sketch) will create a cluster with an index and use it for the storage of records originating from both tables, R and S: create cluster C... create index C-INDEX on C using A... create table R ( A,... ) in cluster C create table S ( A,... ) in cluster C Records from both tables will be placed into cluster pages according to their A- values, one page per A-value. A-values are part of the page header. R A A S

117 Properties of ORACLE clusters Selections on each table contained in the cluster are efficiently supported through the clustered index. Joins (equi-joins on A) between any of the clustered tables are already materialized. Sequential scans of any of the tables suffer from less compact storage. Index could also be a hash table. Up to 32 tables can be defined in a cluster. 117

118 Bibliography Bayer, R. and McCreight, E. (1972). Organization and maintenance of large ordered indices. Acta Informatica, 1: Bayer, R. and Unterauer, K. (1977). Prefix b-trees. ACM Transactions on Database Systems, 2(1): Comer, D. (1979). The ubiquituous b-tree. ACM Computing Surveys, 11(2): Härder, T. (1978). Implementing a generalized access path structure for a relational database system. ACM Transactions on Database Systems, 3(3): Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley. Ollmert, H. (1989). Datenstrukturen und Datenorganisation. Oldenbourg Verlag. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. Teory, T. (1994). Database Modelling & Design, The Fundamental Principles. Morgan Kaufman. 118

119 Module 5: Hash-Based Indexing Module Outline Web Forms Applications SQL Interface 5.1 General Remarks on Hashing 5.2 Static Hashing 5.3 Extendible Hashing 5.4 Linear Hashing Plan Executor Operator Evaluator SQL Commands Parser Optimizer Transaction Manager Lock Manager Concurrency Control Files and Index Structures Buffer Manager Disk Space Manager Query Processor You are here! Recovery Manager DBMS Index Files Data Files System Catalog Database 119

120 5.1 General Remarks on Hashing We now turn to a different family of index structures: hash indexes. Hash indexes are unbeatable when it comes to equality selections, e.g. SELECT ... FROM R WHERE A = k. If we carefully maintain the hash index while the underlying data file (for relation R) grows or shrinks, we can answer such an equality query using a single I/O operation. (More precisely: it is rather easy to achieve an average of 1.2 I/Os.) Other query types, like joins, internally initiate a whole flood of such equality tests. Hash indexes provide no support for range searches, however (hash indexes are also known as scatter storage). In a typical DBMS, you will find support for B+ trees as well as hash-based indexing structures. 120

121 Hashing: Compute addresses from data values In a B + tree world, to locate a record with key k means to compare k with other keys organized in a (tree-shaped) search data structure. Hash indexes use the bits of k itself (independent of all other stored records and their keys) to find (i.e., compute) the location of the associated record. We will briefly review static hashing to illustrate the basic ideas behind hashing. Static hashing does not handle updates well (much like ISAM). Later, dynamic hashing schemes have been proposed, e.g. extendible and linear hashing, which refine the hashing principle and adapt well to record insertions and deletions. In a DBMS context, typically bucket-oriented hashing is used, rather than record-oriented hashing that prevails in in-memory applications. A bucket can contain several records, and it has its own overflow chain. A bucket is a (set of) page(s) on secondary memory. 121

122 5.2 Static Hashing To build a static hash index for an attribute A we need to 1 allocate an area of N (successive) disk pages, the so-called primary buckets (or the hash table), 2 in each bucket, install a pointer to a chain of overflow pages (initially, set this pointer to nil), 3 define a hash function h with range [0 ... N - 1] (the domain of h is the type of A, e.g., if A has the SQL type INTEGER): h : INTEGER → [0 ... N - 1]. Evaluating the hash function h on a given data value is cheap. It only involves a few CPU instructions. 122

123 The resulting setup looks like this: [Figure: hash table of N primary buckets 0 ... N - 1; a key k is mapped by h to one bucket; some primary buckets have chains of overflow pages attached.] A primary bucket and its associated chain of overflow pages is referred to as a bucket. Each bucket contains data entries k* (implemented using any of the variants a ... c, see Section 3). To perform hsearch(k) (or hinsert(k*)/hdelete(k*)) for a record with key A = k, 1 apply hash function h to the key value, i.e., compute h(k), 2 access the primary bucket page with number h(k), 3 then search (insert/delete) the record on this page or, if necessary, access the overflow chain of bucket h(k). 123

124 If we are lucky or (somehow) avoid chains of overflow pages altogether, hsearch(k) needs one I/O operation, while hinsert(k*) and hdelete(k*) need two I/O operations. At least for static hashing, overflow chain management is important: Generally, we do not want hash function h to avoid collisions altogether, i.e., situations where h(k) = h(k') even though k ≠ k' (otherwise we would need as many primary bucket pages as there are different key values in the data file, or even in A's domain). However, we do want h to scatter the domain of the key attribute evenly across [0 ... N - 1] (to avoid the development of extremely long overflow chains for few buckets). Such good hash functions are hard to discover, unfortunately (see next slide). 124
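A minimal in-memory sketch of a static hash index; here a bucket is simply a Python list standing in for a primary page plus its overflow chain, and the hash function and entry format are assumptions:

    class StaticHashIndex:
        """Static hash index with N primary buckets."""
        def __init__(self, n_buckets):
            self.N = n_buckets
            self.buckets = [[] for _ in range(n_buckets)]
        def _h(self, key):
            return hash(key) % self.N          # stand-in for the hash function h
        def hinsert(self, key, rid):
            self.buckets[self._h(key)].append((key, rid))
        def hsearch(self, key):
            return [rid for (k, rid) in self.buckets[self._h(key)] if k == key]
        def hdelete(self, key):
            b = self.buckets[self._h(key)]
            self.buckets[self._h(key)] = [(k, r) for (k, r) in b if k != key]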

125 The birthday paradox: If you consider the people in a group as the domain and use their birthday as hash function h (i.e., h : Person → [1 ... 365]), chances are already > 50% that two people share the same birthday (a collision) if the group has 23 people. Check yourself: 1 Compute the probability that n people all have different birthdays: different_birthday(n): if n = 1 then return 1, else return different_birthday(n - 1) × (365 - (n - 1)) / 365, where the first factor is the probability that n - 1 persons have different birthdays and the second factor is the probability that the n-th person has a birthday different from the first n - 1 persons. 2 ... or try to find birthday mates at the next larger party. 125
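The recursion above, transcribed to Python (365 equally likely birthdays assumed):

    def different_birthday(n: int) -> float:
        """Probability that n people all have different birthdays."""
        if n == 1:
            return 1.0
        return different_birthday(n - 1) * (365 - (n - 1)) / 365

    # probability of at least one shared birthday in a group of 23 people
    print(1 - different_birthday(23))   # ~0.507, i.e. already > 50%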

126 If key values were purely random, we could arbitrarily extract a few bits and use these for the hash function. Real key distributions found in DBMS are far from random, though. Fairly good hash functions may be found by the following two simple approaches: 1 By division. Simply define h(k) = k mod N. This guarantees the range of h(k) to be [0 ... N - 1]. N.B.: If you choose N = 2^d for some d, you effectively consider the least d bits of k only. Prime numbers were found to work best for N. 2 By multiplication. Extract the fractional part of Z·k (for a specific Z) 1 and multiply by the hash table size N (N is arbitrary here): h(k) = ⌊N · (Z·k - ⌊Z·k⌋)⌋. However, for Z = Z'/2^w and N = 2^d (w: no. of bits in a CPU word) we simply have h(k) = msb_d(Z'·k), where msb_d(x) denotes the d most significant bits of x (e.g., msb_3(42) = 5). 1 Z = (√5 - 1)/2 ≈ 0.618 is a good choice. See (Knuth, 1973). 126
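Both constructions can be written down directly; a sketch for integer keys (parameter names are illustrative):

    import math

    def h_div(k: int, N: int) -> int:
        """Hashing by division: h(k) = k mod N (N best chosen prime)."""
        return k % N

    def h_mul(k: int, N: int, Z: float = (math.sqrt(5) - 1) / 2) -> int:
        """Hashing by multiplication: scale the fractional part of Z*k to [0 .. N-1]."""
        return int(N * ((Z * k) % 1.0))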

127 Characteristics of static hashing Clearly, if the underlying data file grows, the development of overflow chains spoils the otherwise predictable hash I/O behaviour (1 2 I/Os). Similarly, if the file shrinks significantly, the static hash table may be a waste of space (data entry slots in the primary buckets remain unallocated). As a rule of thumb, allocating a hash table of size 125% of the expected data capacity, i.e., such that it is only 80% full, will typically give very good results. In the worst case, a hash table can degrade into a linear list (one long chain of overflow buckets). Dynamic hashing schemes have been devised to overcome this problem by adapting the hash function and by combining the use of hash functions and directories guarding the way to the data records. 127

128 5.3 Extendible Hashing Extendible hashing is prepared to adapt to growing (or shrinking) data files. To keep track of the actual primary buckets which are part of our current hash table, we hash via an in-memory bucket directory (ignore the depth fields labelled 2 for now) 2: [Figure: directory of global depth 2 with four pointers (for the two-bit suffixes 00, 01, 10, 11) to buckets A-D, each of local depth 2; bucket A holds 4*, 12*, 32*, 16*, bucket B holds 5*, 21*, ..., bucket C holds ..., bucket D holds 15*, 7*, 19*.] 2 N.B.: In this figure we have depicted the data entries as h(k)* (not as k*). 128

129 To search for a record with key k: 1 apply h, i.e., compute h(k), 2 consider the last 2 bits of h(k) and follow the corresponding directory pointer to find the bucket. Example: To find a record with key k such that h(k) = 5 = 101_2, follow the second directory pointer (suffix 01) to bucket B, then use entry 5* to access the record. The meaning of the depth fields might now become clear: n at the hash directory (global depth): use the last n bits of h(k) to look up a bucket pointer in this directory (the directory size is 2^n). d at a bucket (local depth): the hash values h(k) of all data entries in this bucket agree on their last d bits. To insert a record with key k: 1 apply h, i.e., compute h(k), 2 use the last 2 bits of h(k) to look up the bucket pointer in the directory, 3 if the bucket still has capacity, store h(k)* in it, otherwise ...? 129

130 Example (insert, no bucket overflow): To insert a record with key k such that h(k) = 13 = 1101_2, follow the second directory pointer (suffix 01) to bucket B (which still has empty slots) and place 13* there: [Figure: as before, but bucket B now additionally holds 13*.] Example (continued, bucket overflow): Inserting a record with key k such that h(k) = 20 = 10100_2 lets bucket A overflow. We thus initiate a bucket split for bucket A. 130

131 1 Split bucket A (creating a new bucket A2) and use bit number 3 (= 2 + 1) to redistribute the entries: 32 = 100000_2 and 16 = 10000_2 remain in bucket A, while 4 = 100_2, 12 = 1100_2, and 20 = 10100_2 go to bucket A2. We now need 3 bits to discriminate between the old bucket A and the new split bucket A2. 2 In this case we double the directory by simply copying its original pages (we now use 2 + 1 = 3 bits to look up a bucket pointer). 3 Let the bucket pointer for suffix 100 point to A2 (the directory pointer for suffix 000 still points to A): [Figure: directory of global depth 3; bucket A (local depth 3) holds 32*, 16*, the new bucket A2 (local depth 3) holds 4*, 12*, 20*, and buckets B, C, D keep local depth 2.] 131

132 If we split a bucket with local depth d < n (global depth), directory doubling is not necessary: Consider the insertion of a record with key k and hash value h(k) = 9 = 1001_2. The associated bucket B is split (creating a new bucket B2) and the entries are redistributed. The new local depth of bucket B is 3 (and thus does not exceed the global depth of 3). Modifying the directory's bucket pointer for suffix 101 suffices: [Figure: bucket B (local depth 3) now holds 9* and the remaining entries whose hash values end in 001; the new bucket B2 (local depth 3) holds 5*, 21*, 13*; all other buckets are unchanged.] 132

133 Algorithm hsearch(k): search for a hashed record with key value k. Output: pointer to the hash bucket containing potential hit(s).
n ← global depth of the hash directory;
b ← h(k) & (2^n - 1);
return bucket[b]; 133

134 Algorithm hinsert(k*): entry k* to be inserted. Output: new global depth of the extendible hash directory.
n ← global depth of the hash directory;
b ← hsearch(k);
if b has capacity then
  place h(k)* in bucket b;
  return n;
else  // bucket b overflows, we need to split
  d ← local depth of bucket b;
  create a new empty bucket b2;
  // redistribute the entries of bucket b (including h(k)*)
  foreach h(k')* in bucket b do
    if h(k') & 2^d ≠ 0 then
      move h(k')* to bucket b2;
  d ← d + 1;  // new local depth of buckets b and b2
  if n < d then  // we need to double the directory
    allocate 2^n new directory entries bucket[2^n ... 2^(n+1) - 1];
    copy bucket[0 ... 2^n - 1] into bucket[2^n ... 2^(n+1) - 1];
    n ← n + 1;
  bucket[(h(k) & (2^(n-1) - 1)) | 2^(n-1)] ← addr(b2);
  store n as the directory's global depth;
  return n;
& and | denote bit-wise and and bit-wise or (cf. C, C++). The directory entries are accessed via the array bucket[0 ... 2^n - 1], whose entries point to the hash buckets. 134

135 Overflow chains? Extendible hashing uses overflow chains hanging off a bucket only as a last resort. Under which circumstances will extendible hashing create an overflow chain? ... when too many data entries are hashed to the very same hash value (more than fit into one bucket). Deleting an entry h(k)* from a bucket with local depth d may leave this bucket completely empty. Extendible hashing then merges the empty bucket with its associated partner bucket (note that 2^(n-d) directory entries point to a bucket of local depth d). You should work out the details on your own. 135

136 5.4 Linear Hashing Linear hashing can, just like extendible hashing, adapt its underlying data structure to record insertions and deletions: Linear hashing does not need a hash directory in addition to the actual hash table buckets,... but linear hashing may perform bad if the key distribution in the data file is skewed. We will now investigate linear hashing in detail and come back to the two points above as we go along. 136

137 The core idea behind linear hashing is to use an ordered family of hash functions, h_0, h_1, h_2, ... (traditionally the subscript is called the hash function's level). We design the family so that the range of h_{level+1} is twice as large as the range of h_level (for level = 0, 1, 2, ...). [Figure: h_level covers the range [0 ... N - 1] (size N), while h_{level+1} covers [0 ... 2N - 1], i.e., the same range plus an image of equal size.] Given an initial hash function h and an initial hash table size N, one approach to define such a family h_0, h_1, h_2, ... (for level = 0, 1, 2, ...) is h_level(k) = h(k) mod (2^level · N). 137
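The hash function family can be generated directly from h and N; a small sketch:

    def make_family(h, N):
        """Return h_level(level, k) = h(k) mod (2**level * N)."""
        def h_level(level, k):
            return h(k) % (2 ** level * N)
        return h_level

    # example: with h = identity and N = 4, the range doubles with each level
    h_level = make_family(lambda k: k, 4)
    print(h_level(0, 43), h_level(1, 43))   # 3 and 3 (last 2 vs. last 3 bits of 101011)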

138 The basic linear hashing scheme then goes like this: Start with level = 0, next = 0. The current hash function in use for searches (insertions/deletions) is h_level; the active hash table buckets are those in h_level's range [0 ... 2^level · N - 1]. Whenever we realize that the current hash table overflows, e.g., insertions filled a primary bucket beyond c % capacity, or the overflow chain of a bucket grew longer than l pages, or ..., we split the bucket at hash table position next (not the bucket which triggered the split!): 138

139 1 Allocate a new bucket and append it to the hash table (its position will be 2^level · N + next). 2 Redistribute the entries in bucket next by rehashing them via h_{level+1} (some entries remain in bucket next, some go to bucket 2^level · N + next; which ones?). 3 Increment next by 1. [Figure: bucket next is rehashed via h_{level+1}; its entries are distributed between positions next and 2^level · N + next.] 139

140 As we proceed, next walks down the table. Hashing via h_level has to take care of next's position: if h_level(k) < next, we hit an already split bucket and rehash, i.e., find the record in bucket h_{level+1}(k); if h_level(k) ≥ next, we hit a yet unsplit bucket: bucket found. [Figure: hash table layout, with buckets before position next already split (addressed via h_{level+1}), next marking the next bucket to be split, buckets from next up to 2^level · N - 1 still unsplit (addressed via h_level), and the buckets appended at the end being the images of the already split buckets (addressed via h_{level+1}).] 140

141 If next is incremented beyond the hash table size ...? A bucket split increments next by 1 to mark the next bucket to be split. How would you propose to handle the situation if next is incremented beyond the last current hash table position, i.e., next > 2^level · N - 1? Answer: If next > 2^level · N - 1, all buckets in the current hash table are hashed via h_{level+1} (see previous slide). Linear hashing thus proceeds in a round-robin fashion: if next > 2^level · N - 1, then 1 increment level by 1, and 2 reset next to 0 (start splitting from the hash table top again). Remark: In general, an overflowing bucket is not split immediately, but due to round-robin splitting it is split no later than in the following round. 141

142 Running example: Linear hash table with primary bucket capacity 4, initial hash table size N = 4, level = 0, next = 0: [Figure: buckets addressed via h_0; bucket 00: 32*, 44*, 36*; bucket 01: 9*, 25*, 5*; bucket 10: 14*, 18*, 10*, 30*; bucket 11: 31*, 35*, 7*, 11*; next points to bucket 00.] Insert a record with key k such that h(k) = 43: [Figure: 43* goes to an overflow page of bucket 11; this triggers a split of bucket next = 0 via h_1: 32* stays in bucket 000, while 44* and 36* move to the new bucket 100; next advances to 1.] 142

143 Insert a record with key k such that h(k) = 37: [Figure: 37* is placed in bucket 01, which now holds 9*, 25*, 5*, 37*; no split is triggered; next still points to bucket 01.] Insert a record with key k such that h(k) = 29: [Figure: bucket 01 overflows, so bucket next = 1 is split via h_1; 9* and 25* stay in bucket 001, while 5*, 37*, and 29* move to the new bucket 101; next advances to 2.] 143

144 Insert 3 records with keys k such that h(k) = 22 (then 66, 34): [Figure: bucket 10 overflows; the resulting round-robin split distributes its contents over buckets 010 and 110 via h_1, and next advances further.] Insert a record with key k such that h(k) = 50: [Figure: this insertion causes the last original bucket to be split as well; level becomes 1, next is reset to 0, and all 8 buckets (000 ... 111) are now addressed via h_1.] N.B.: Rehashing a bucket means rehashing its overflow chain, too. 144

145 Sketch of implementation: bucket[b] denotes the b-th bucket in the hash table. Function full(·) is a tunable parameter: whenever full(bucket[b]) evaluates to true, we trigger a split.
Algorithm hsearch(k): search for a hashed record with key value k. Output: pointer to the hash bucket containing potential hit(s).
b ← h_level(k);
if b < next then
  // bucket b has already been split,
  // the record for key k may be in bucket b or in bucket 2^level · N + b,
  // rehash:
  b ← h_{level+1}(k);
return bucket[b]; 145

146 Algorithm hinsert(k*): entry k* to be inserted. Output: none.
b ← h_level(k);
if b < next then  // bucket b has already been split, rehash:
  b ← h_{level+1}(k);
place h(k)* in bucket[b];
if full(bucket[b]) then
  // the last insertion triggers a split of bucket next
  allocate a new bucket b';
  bucket[2^level · N + next] ← b';
  // rehash the entries of bucket next
  foreach entry with key k' in bucket[next] do
    place the entry in bucket[h_{level+1}(k')];
  next ← next + 1;
  // did we split every bucket in the original hash table?
  if next > 2^level · N - 1 then
    // the hash table size has doubled, start a new round now
    level ← level + 1;
    next ← 0;
return; 146

147 hdelete(k) for a linear hash table can essentially be implemented as the inverse of hinsert(k*):
Algorithm hdelete(k): key k of the entry to be deleted. Output: none.
...
if empty(bucket[2^level · N + next]) then
  // the last bucket in the hash table is empty, remove it
  remove bucket[2^level · N + next] from the hash table;
  next ← next - 1;
  if next < 0 then
    // round-robin scheme for deletion
    level ← level - 1;
    next ← 2^level · N - 1;
... 147

148 Bibliography Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible hashing a fast access method for dynamic files. TODS, 4(3): Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley. Larson, P.-Å. (1980). Linear hashing with partial expansions. In Proc. Intl. Conf. on Very Large Databases, pages , Montreal, Quebec, Canada. IEEE Computer Society. Litwin, W. (1980). Linear hashing: A new tool for file and table addressing. In Proc. Intl. Conf. on Very Large Databases, pages , Montreal, Quebec, Canada. IEEE Computer Society. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. 148

149 Module 6: External Sorting Module Outline Web Forms Applications SQL Interface 6.1 Sorting as a building block of query execution 6.2 Two-Way Merge Sort 6.3 External Merge Sort 6.4 Using B + trees for sorting SQL Commands Plan Executor Operator Evaluator You are here! Parser Optimizer Query Processor Transaction Manager Lock Manager Concurrency Control Files and Index Structures Buffer Manager Disk Space Manager Recovery Manager DBMS Index Files Data Files System Catalog Database 149

150 6.1 Sorting as a building block of query execution Now that we have files, buffers, and indexes in our hands, let us proceed and approach real query execution. The DBMS does not execute a query as a large monolithic block, but rather provides a number of specialized routines, the query operators. Each operator is tailored to perform a specific task especially well (fast, timeand/or space-efficient). Operators may be plugged together to form a network of operators, a plan, that is capable of computing a specific query. (in this chapter, we will not discuss how to find a plan for a given SQL query this is a challenging problem of its own, covered later.) This chapter uncovers the details of the implementation of (one of) the most basic and important operator(s): the sort operator. 150

151 Sorting really stands out as a most useful operation. A whole variety of situations benefit from the fact that a file of records is sorted: An SQL query might explicitly request sorted output of records: SELECT ... FROM R ORDER BY A, B, C. Sorting the records of a file is the first step of the B+ tree bulk-loading procedure (see Section 4.4.2). Sorting is most useful if an SQL query explicitly requests duplicate elimination (why that?): SELECT DISTINCT A, B, C FROM R. Some operators rely on their input files being already sorted, or, more often than not, sorted input files boost some operators' performance. 151

152 A file of records is sorted with respect to sort key k and ordering θ if, for any two records r_1, r_2 with r_1 preceding r_2 in the file, their corresponding keys are in θ-order: r_1 before r_2 ⇒ r_1.k θ r_2.k. A key may be a single attribute as well as an ordered list of attributes. In the latter case, the order is defined lexicographically. Example: k = (A, B), θ = <: r_1 < r_2 ⇔ r_1.A < r_2.A ∨ (r_1.A = r_2.A ∧ r_1.B < r_2.B). As it is one of our primary goals not to restrict the file sizes our DBMS can handle, we face a fundamental problem: How can we sort a file of records whose size exceeds the available main memory size by far (let alone the available buffer manager space)? 152

153 We will approach this task in a two-phase fashion: 1 Sorting a file of arbitrary size is possible even if three pages of buffer space is all that is available. 2 Refine this algorithm to make effective use of larger and thus more realistic buffer sizes. As we go along we will consider a number of further optimizations in order to reduce the overall number of needed page I/O operations. 153

154 6.2 Two-Way Merge Sort The basic idea behind two-way merge sort may be sketched as follows. Let N = 2^s be the size of the input file to be sorted (which is too large to fit in available memory). Pass 0: 1 Read each of the N pages page-by-page, 2 sort the records on each page individually, 3 write the sorted page to disk (the sorted page is referred to as a run). (N.B.: Pass 0 writes N = 2^s sorted runs to disk; only one page of buffer space is used.) Pass 1: 1 Select and read two runs written in Pass 0, 2 merge their records with respect to θ, 3 write the new two-page run to disk (page-by-page). (N.B.: Pass 1 writes 2^s/2 = 2^(s-1) runs to disk; three pages of buffer space are used.) 154

155 ... Pass n: 1 Select and read two runs written in Pass n - 1, 2 merge their records with respect to θ, 3 write the new 2^n-page run to disk (page-by-page). (N.B.: Pass n writes 2^(s-n) runs to disk; three pages of buffer space are used.) Pass s writes a single sorted run (i.e., the complete sorted file) of size 2^s = N to disk. Remarks on the algorithm shown below: The run file run_n_r contains the r-th run of Pass n. The in-memory sort and merge steps may use standard sorting technology, e.g., QuickSort. 155

156 The algorithm for two-way merge sort:
Algorithm: two-way-merge-sort(f, N, θ)
Input: number of pages N = 2^s in input file f, ordering θ
Output: θ-sorted file run_s_0 written to disk (side-effect)
// Pass 0: write 2^s sorted single-page runs
for r in 0 ... N - 1 do
  pp ← pinpage(f, r);
  f_0 ← createfile("run_0_r");
  sort the records on the page pointed to by pp using θ;
  write the page pointed to by pp into file f_0;
  closefile(f_0);
  unpinpage(f, r, false);
// Passes 1 ... s: see next slide 156

157 // Passes 1 ... s: ... continued from the previous slide
for n in 1 ... s do
  // pairwise merge all runs written in Pass n - 1
  for r in 0 ... 2^(s-n) - 1 do
    f_1 ← openfile("run_(n-1)_(2r)");
    f_2 ← openfile("run_(n-1)_(2r+1)");
    f_0 ← createfile("run_n_r");
    for p in 0 ... 2^(n-1) - 1 do
      pp_1 ← pinpage(f_1, p);
      pp_2 ← pinpage(f_2, p);
      // size of f_0 = size of f_1 + size of f_2
      merge the records on the pages pointed to by pp_1, pp_2 using θ and append the resulting two sorted pages to file f_0;
      unpinpage(f_2, p, false);
      unpinpage(f_1, p, false);
    closefile(f_0);
    deletefile(f_1);
    deletefile(f_2);
return; 157

158 Example: We are supposed to sort a 7-page file whose pages contain up to two INTEGER values each (records of a single attribute), θ = <:
input file:            [3,4] [6,2] [9,4] [8,7] [5,6] [3,1] [2]
Pass 0 (1-page runs):  [3,4] [2,6] [4,9] [7,8] [5,6] [1,3] [2]
Pass 1 (2-page runs):  [2,3|4,6] [4,7|8,9] [1,3|5,6] [2]
Pass 2 (4-page runs):  [2,3|4,4|6,7|8,9] [1,2|3,5|6]
Pass 3 (8-page run):   [1,2|2,3|3,4|4,5|6,6|7,8|9]
The black boxes (pages) in the original figure illustrate what would happen for an 8-page file (these are not read/written in the 7-page file case). 158

159 Analysis of the I/O behavior of two-way merge sort: In each pass we read all N pages of the file, sort/merge, and write the N pages out again. The number of passes is 1 + ⌈log_2 N⌉ (Pass 0 plus Passes 1 ... s). Overall number of I/O operations: 2 · N · (1 + ⌈log_2 N⌉). (As expected, this is in O(N log N).) As described here, the algorithm uses no more than three pages of buffer space at any point in time (consider the two pinpage(·) calls and the nested merge in algorithm two-way-merge-sort(·)). In reality, many more free buffer pages will be available, and we want external sorting to make efficient use of these. We will discuss this refinement next. Using more than three pages? Can you envision how external sorting could make use of, say, B available pages in the database buffer? 159

160 6.3 External Merge Sort External merge sort aims at two improvements over plain two-way merge sort: Try to reduce the number of initial runs (avoid creating one-page runs in Pass 0), try to reduce the number of passes (merge more than 2 runs at a time). As before, let N denote the number of pages in the file to be sorted. B buffer pages shall be available for sorting. Pass 0: 1 Read B pages at a time, 2 use in-memory sort to sort the records on these B pages, 3 write the sorted run to disk. (N.B.: Pass 0 writes N/B runs to disk, each run contains B pages except the last run which may contain less.) 160

161 Passes 1, ... (until only a single run is left): 1 Select B - 1 runs from the previous pass and read one page from each run, 2 perform a (B - 1)-way merge and use the B-th page as a temporary output buffer: [Figure: B main memory buffers, namely B - 1 input buffers (one per run read from disk) and one output buffer written back to disk.] 161

162 Analysis of the I/O behavior of external merge sort: As for two-way merge sort, in each pass we read, process, and then write all N pages. In Pass 0 we write ⌈N/B⌉ runs. The number of additional passes thus is ⌈log_{B-1} ⌈N/B⌉⌉. Overall number of I/O operations: 2 · N · (1 + ⌈log_{B-1} ⌈N/B⌉⌉). (The same order of magnitude, O(N log N), but now with base B - 1 as opposed to 2!) 162
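The pass and I/O counts can be computed directly from N and B; a small calculator that mirrors the formula above (a sketch, counting ceilings by iteration to avoid rounding issues):

    from math import ceil

    def merge_passes(N: int, B: int) -> int:
        """Pass 0 plus the number of (B-1)-way merge passes."""
        runs, passes = ceil(N / B), 1
        while runs > 1:
            runs = ceil(runs / (B - 1))
            passes += 1
        return passes

    def total_io(N: int, B: int) -> int:
        """Overall number of page I/Os: 2 * N per pass."""
        return 2 * N * merge_passes(N, B)

    # e.g. a 1,000,000-page file with B = 257 buffer pages needs 3 passes:
    print(merge_passes(10**6, 257), total_io(10**6, 257))   # 3  6000000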

163 The I/O savings in comparison to two-way merge sort (B = 3) can be substantial: [Table/plot: number of passes against file size N (up to 10^9 pages) for buffer sizes B = 3 (two-way), 5, 9, 17, 129, ...; the number of passes drops sharply as B grows.] 163

164 External merge sort reduces the I/O load, but is considerably more CPU intensive 1. Consider the (B - 1)-way merge during passes 1, 2, ...: To pick the next record we need to copy into the output buffer, we have to do B - 2 comparisons. Example (let B - 1 = 4, θ = <): [Figure: the current head records of the four input runs are compared pairwise, costing B - 2 comparisons per output record.] We can do better if we use a bottom-up selection tree to support the merging: [Figure: a selection (tournament) tree built over the heads of the four runs; after a record is output, only the path from the affected leaf to the root has to be recomputed.] This optimization cuts the number of comparisons down to ≈ log_2(B - 1) (for buffer sizes of B ≥ 100 this is a considerable improvement). 1 Which is a price we are willing to pay because I/O clearly dominates the overall cost. 164
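In Python, this selection-tree style of merging is what the heapq module provides out of the box; a sketch of a (B - 1)-way merge of sorted runs (the run contents below are made up for illustration):

    import heapq

    def k_way_merge(runs):
        """Merge already sorted runs; the heap plays the role of the selection tree,
        so each output record costs about log2(B-1) comparisons."""
        return list(heapq.merge(*runs))

    # merging B - 1 = 4 sorted runs
    print(k_way_merge([[2, 6, 9], [4, 12], [5, 8], [1, 7]]))
    # [1, 2, 4, 5, 6, 7, 8, 9, 12]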

165 6.3.1 Minimizing the number of initial runs Remember that the number of initial runs (the files run_0_r written in Pass 0, i.e., r = 0 ... ⌈N/B⌉ - 1) determines the number of passes we need to make: 2 · N · (1 + ⌈log_{B-1} ⌈N/B⌉⌉). Reducing the number of initial runs thus makes for a very desirable optimization. Let us briefly review such an optimization, namely replacement sort (for Pass 0). The buffer space shall have B spare pages for sorting. We dedicate two pages to be input and output buffers, respectively. The remaining B - 2 buffer pages are called the current set. The buffer setup then looks something like this: [Figure: one input buffer, the B - 2 pages of the current set, and one output buffer.] 165

166 Replacement sort then proceeds as follows: 1 Open a run file. 2 Load the next page of the file to be sorted into the input buffer. If the input file is exhausted, go to 4. 3 If there is remaining space in the current set, move a record from the input buffer into the current set (if the input buffer is exhausted, reload it at 2), then go to 3. 4 In the current set, pick the record with the smallest key value k such that k is equal to or larger than the last key value we have picked 2. Move this record to the output buffer. If the output buffer is full, append it to the current run. 5 If there is no such record in the current set, close this run and start a new one. 6 If the input file is exhausted, stop, else go to 3. Example: [Figure: current state of input buffer, current set, and output buffer; the record with key k = 8 will be appended to the output next.] 2 If this is the first key we pick for the current run, we assume a last key of k = -∞. 166
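A compact sketch of replacement sort (replacement selection) that uses a heap for the current set; buffer handling is abstracted away and the record stream is a plain Python iterable, so this only illustrates the run-formation logic:

    import heapq

    def replacement_selection(stream, capacity):
        """Produce sorted runs; runs tend to be about twice as long as 'capacity'."""
        it, heap = iter(stream), []
        for _ in range(capacity):             # fill the current set
            try:
                heap.append((0, next(it)))    # entries are (run number, key)
            except StopIteration:
                break
        heapq.heapify(heap)
        runs, current_run, out = [], 0, []
        while heap:
            run, key = heapq.heappop(heap)
            if run != current_run:            # no eligible record left: close the run
                runs.append(out)
                out, current_run = [], run
            out.append(key)
            try:
                nxt = next(it)
                # nxt may join the current run only if it is not smaller than
                # the key just written; otherwise it must wait for the next run
                heapq.heappush(heap, (run if nxt >= key else run + 1, nxt))
            except StopIteration:
                pass
        if out:
            runs.append(out)
        return runs

    # e.g. replacement_selection([5, 3, 8, 9, 2, 6, 1, 7, 4], 3)
    # yields the runs [3, 5, 8, 9] and [1, 2, 4, 6, 7]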

167 Replacement sort Let B = 6, i.e., the current set can hold 4 records at a time. The input file contains records with INTEGER key values: Write a protocol of replacement sort by filling out the table below, mark the end of the current run by EOR (the current set has already been populated at step 3 ): current set output

168 Length of initial runs? Our replacement sort protocol suggests that the length of the initial runs indeed increases (in the example, the first run has length 8 = 2 · (B - 2)!). Implement replacement sort (see assignment of the week) to empirically determine the run length if replacement sort is used in Pass 0. Remarks: Step 4 of the replacement sort process will, of course, benefit from techniques like the selection tree, especially if the current set size is large. Blocked I/O 3 is more efficient than simple page-by-page I/O (which we assumed for simplicity). The presented algorithms may be adjusted to use blocked I/O in a rather straightforward manner. To keep the CPU busy while the input buffer is reloaded (or the output buffer is appended to the current run), use double buffering: create shadows for the input as well as the output buffer. Let the CPU switch to the shadow input buffer as soon as the input buffer is empty and asynchronously initiate an I/O operation to reload the original input buffer (the output buffer may be treated similarly). 3 Read/write a sequence of pages at a time. This minimizes seek time and rotational delay. 168

169 6.4 Using B+ trees for sorting If our current sorting task matches a B+ tree index in the system (i.e., the B+ tree uses key k and ordering θ), we may be better off to abandon external sorting and use the index instead. If the B+ tree index is clustered, then 1 we know that the data file itself is already θ-sorted, 2 and all we have to do is to read the N pages of the data file. If the B+ tree index is unclustered, then 1 in the worst case, we have to initiate one I/O operation per record (not per page in the file)! Remember: [Figure: unclustered B+ tree, whose index entries (k*) point to data records scattered across the data file.] 169

170 Let p denote the number of data records per page (typical values are p = 10, 100, 1000). The expected number of I/O operations to sort via an unclustered B+ tree will thus be p · N. 4 [Plot: I/O operations against file size N (up to 10^7 pages) for sorting via a clustered B+ tree, via external merge sort, and via an unclustered B+ tree for two values of p; the plot assumes available buffer space for sorting of B = 257 pages.] For even modest file sizes, therefore, sorting by using an unclustered B+ tree index is clearly inferior to external sorting. 4 Ignoring the I/O effort to traverse the B+ tree as well as its sequence set, but also ignoring buffer cache hits. 170

171 Bibliography Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. 171

172 Module 7: Evaluation of Relational Operators Module Outline Web Forms Applications SQL Interface 7.1 The DBMS s runtime system 7.2 General remarks 7.3 The selection operation 7.4 The projection operation 7.5 The join operation 7.6 A Three-Way Join Operator 7.7 Other Operators 7.8 The impact of buffering 7.9 Managing long pipelines of relational operators Transaction Manager Lock Manager Plan Executor Operator Evaluator Concurrency Control SQL Commands You are here! Files and Index Structures Buffer Manager Disk Space Manager Parser Optimizer Query Processor Recovery Manager DBMS Index Files Data Files System Catalog Database 172

173 7.1 The DBMS s runtime system In some sense we can consider the implementation of the relational operators as a database s runtime system: The query plan (network of relational operators), constitutes the program to execute, 1 the relational operators act on files on disk (relations) and implement the behaviour of the plan. 2 The efficient evaluation of the relational operators should be carefully studied and tuned: Each operator implements only a small step of the overall query plan (thus, a plan for a query of modest complexity may easily contain up to 100 operators), the set of relational operators is designed to be small, each operator fulfills multiple tasks. 1 Compare this, e.g., to Java byte codes. 2 Again, in the Java world, this would be comparable to the Java VM. 173

174 Representation of Query Plans As in internal representation of queries, a DBMS typically uses an operator tree, whose internal nodes represent logical (e.g., algebra-style) or physical (e.g., concrete implementation algorithms) operators. Directed arcs connect arguments (inputs) to operators and operators to their output. As a result of query optimization, arguments that are used in multiple places may be connected to several operators, so we may end up with networks of operators, such as: R S sort T 174

175 Logical vs. Physical Operators A typical DBMS provides several implementations for a single relational operator (i.e., instead of one join operator ⋈ we have several variants, e.g., nested-loops, sort-merge, and hash-based ones). For equivalent input file(s), all variants produce an equivalent output file. Equivalent? What do you think is precisely meant by equivalent here? Why don't we just say identical? Terminology: these variants are the different physical operators implementing the logical operator ⋈. We will discuss physical operators in this chapter. The query optimizer analyzes a given query plan based on its knowledge of the system internals, statistics, and ongoing bookkeeping, and selects the best specific variant for each operator. During query optimization, logical operators are replaced by physical ones. 175

176 Physical Properties However, a specific variant may be tailored to exploit several physical properties of the system: the presence or absence of indexes on the input file(s), the sortedness of the input file(s), the size of the input file(s), the available space in the buffer pool (cf., external sorting in Chapter 6), the buffer replacement policy,... Example: The optimizer has marked each edge of the plan to indicate if the records flowing over this edge are sorted with respect to some sort key k or not (sorted:, unsorted: ): s u R s u u u S u u sort u T s s s 176

177 In general, the query optimizer may quite heavily transform the original plan to enable the use of the most efficient physical operators variants. Example (assume that physical operators op can exploit sortedness of their input(s), e.g., might be sort-merge join): R s s s S u sort s s s T s s 177

178 7.2 General remarks The system catalog A relational table can be stored in different file structures, we can create one or more indexes on each table, which are also stored in files. Conversely, a file may contain data from one (or more) table(s) or the entries in an index. Such data is refered to as (primary) data in the database. A relational DBMS maintains information about every table and index it contains. Such descriptive information is itself stored in a collection of special tables, the so-called catalog tables, aka. the system catalog, the data dictionary, the system catalog, or just the catalog. Catalog information includes relation and attribute names, attribute domains, integrity constraints, access privileges, and much more. Also, the query processor (or the query optimizer) draws a lot of information from the system catalog, such as, e.g., file structure for each table, availability of indexes, number of tuples in each relation, number of pages in each file,... We ll come back to some of these later. 178

179 7.2.2 Principal approaches to operator evaluation Algorithms for evaluating relational operators have a lot in common, they are based upon one of the following principles: 1 Indexing. If some form of (selection, join) condition is given, use an index to examine just the tuples that satisfy the condition. In more generality:... to examine a superset of candidate tuples that may satisfy the condition. 2 Iteration. Examine all tuples in an input table, one after the other. Index-only plans:... if there is an index covering all required attributes, we can scan the index instead of the data file. 3 Partitioning. By partitioning on a subset of attributes values, we can often decompose an operation into a less expensive collection of operations on partitions. Sorting and hashing are commonly used partitioning techniques. Devide-and-conquer:... partitioning is an instance of this principle of algorithm design. 179

180 7.3 The selection operation 7.3.1 No index, unsorted data The selection σ_p reads an input file r_in of records and writes those records satisfying predicate p into the output file:
Algorithm: σ(p, r_in, r_out)
Input: predicate p, input file r_in
Output: output file r_out written (side-effect)
out ← createfile(r_out);
in ← openscan(r_in);
while (r ← nextrecord(in)) ≠ EOF do
  if p(r) then
    appendrecord(out, r);
closefile(out);
Observations: Reading the special record EOF from a file indicates the end of the input file. This simple procedure does not require r_in to come with any special physical properties (the procedure is exclusively defined in terms of heap files, see Section 2.4.1). Predicate p may be arbitrary. 180

181 Query execution cost We summarize the characteristics of this implementation of the selection operator as follows:
σ_p(r_in):
  input access 3: file scan (openscan) of r_in
  prerequisites: none (p arbitrary, r_in may be a heap file)
  I/O cost: |r_in| (input cost) + sel(p) · |r_in| (output cost)
|r_in| denotes the number of pages in file r_in, ||r_in|| denotes the number of records (if b records fit on one page, we have |r_in| = ⌈||r_in||/b⌉). 3 Sometimes also called access path in the literature and text books. 181

182 Selectivity sel(p), the selectivity of predicate p, is the fraction of records satisfying predicate p: sel(p) = ||σ_p(r_in)|| / ||r_in||, with 0 ≤ sel(p) ≤ 1. Selectivity: What can you say about the following selectivities? 1 sel(true) 2 sel(false) 3 sel(A = 0) 182

183 7.3.2 No index, sorted data If the input file r in is sorted with respect to a sort key k, we can use binary search on r in to find the first record matching predicate p more quickly. To find more hits, scan the sorted file. Obviously, predicate p must match the sort key k in some way. Otherwise we won t benefit from the sortedness of r in. When does a predicate match a sort key? Assume r in is sorted on attribute A in ascending order. Which of the selections below can benefit from the sortedness of r in? 1 A=42 (r in ) 2 A>42 (r in ) 3 A<42 (r in ) 4 A>42 AND A<100 (r in ) 5 A>42 OR A>100 (r in ) 6 A>42 OR A<32 (r in ) 7 A>42 AND A<32 (r in ) 8 A>42 AND B=10 (r in ) 9 A>42 OR B=10 (r in ) 183

184 We defer the treatment of disjunctive predicates (e.g., A > 42 OR A < 32) until later. The characteristics of selection via binary search are:
σ_p(r_in):
  input access: binary search, then sorted file scan of r_in
  prerequisites: r_in sorted on key k, p matches sort key k
  I/O cost: ⌈log_2 |r_in|⌉ (binary search) + sel(p) · |r_in| (sorted scan) + sel(p) · |r_in| (output cost)
184

185 7.3.3 B+ tree index A clustered B+ tree index on r_in whose key matches the selection predicate p is clearly the superior method to evaluate σ_p(r_in): Descend the B+ tree to retrieve the first index entry that satisfies p, then scan the sequence set to find more matching records. If the index is unclustered and sel(p) indicates a large number of qualifying records, it pays off to 1 read the index entries <k, rid> in the sequence set, 2 sort those entries on their rid field, 3 and then access the pages of r_in in sorted rid order. Note that lack of clustering is a minor issue if sel(p) is close to 0. Why?
σ_p(r_in):
  input access: access of the B+ tree on r_in, then sequence set scan
  prerequisites: clustered B+ tree on r_in with key k, p matches key k
  I/O cost: 3 (B+ tree access) + sel(p) · |r_in| (sorted scan) + sel(p) · |r_in| (output cost)
185

186 7.3.4 Hash index, equality selection A selection predicate p matches a hash index only if p contains a term of the form A = c (assuming the hash index is over key attribute A). We are directly led to the bucket of qualifying records and pay I/O cost only for the access of this bucket 4. Note that sel(p) is likely to be close to 0 for equality predicates.
σ_p(r_in):
  input access: hash table on r_in
  prerequisites: r_in hashed on key k, p has a term k = c
  I/O cost: sel(p) · |r_in| (hash access) + sel(p) · |r_in| (output cost)
4 Remember that this may include access cost for the pages of an overflow chain hanging off the primary bucket page. 186

187 7.3.5 General selection conditions Indeed, selection operations with simple predicates like σ_{A θ c}(r_in) are a special case only. We somehow need to deal with complex predicates, built from simple comparisons and the boolean connectives AND and OR. Conjunctive predicates and index matching: Our simple notion of matching a selection predicate with an index can be extended to cover the case where predicate p has a conjunctive form: A_1 θ_1 c_1 AND A_2 θ_2 c_2 AND ... AND A_n θ_n c_n. Here, each conjunct A_i θ_i c_i is a simple comparison (θ_i ∈ {=, <, >, ≤, ≥}). An index with a multi-attribute key may match the entire complex predicate. 187

188 Matching a multi-attribute hash index. Suppose a hash index is maintained for the 3-attribute key k = (A, B, C) (i.e., all three attributes are input to the hash function). Which types of conjunctive selection predicates p would match this index? p = ? p = (A = c₁ AND B = c₂ AND C = c₃), p = (A = c₁ AND B = c₂ AND C = c₃ AND D = c₄ AND ...), nothing else.
Predicate matching rule for hash indexes: A conjunctive predicate p matches a (multi-attribute) hash index with key k = (A₁, A₂, ..., A_n) if p covers the key k, i.e.
1 p ≡ A₁ = c₁ AND A₂ = c₂ AND ... AND A_n = c_n, or
2 p ≡ A₁ = c₁ AND A₂ = c₂ AND ... AND A_n = c_n AND φ (conjunct φ is not supported by the index itself and has to be evaluated separately after index retrieval). 188

189 Hint: Matching a multi-attribute B+ tree index. We have a B+ tree index available on the multi-attribute key (A, B, C), i.e., the B+ tree nodes are inserted/searched using a lexicographic order on the three attributes. What this means is that inside the B+ tree two keys k₁ = (A₁, B₁, C₁) and k₂ = (A₂, B₂, C₂) are ordered according to k₁ < k₂ ⇔ A₁ < A₂ ∨ (A₁ = A₂ ∧ B₁ < B₂) ∨ (A₁ = A₂ ∧ B₁ = B₂ ∧ C₁ < C₂). Which types of conjunctive selection predicates p would match this B+ tree index? Given only the conjunct C = 42, how could B+ tree search identify the subtree into which to descend in the following situation (B+ tree snippet with separator key (50,20,30) shown)? 189

190 Predicate matching rule for B+ tree indexes A conjunctive predicate p matches a (multi-attribute) B+ tree index with key k = (A₁, A₂, ..., A_n) if p is a prefix of key k, i.e.
1 p ≡ A₁ θ₁ c₁, or p ≡ A₁ θ₁ c₁ AND A₂ θ₂ c₂, or ..., or p ≡ A₁ θ₁ c₁ AND A₂ θ₂ c₂ AND ... AND A_n θ_n c_n, or
2 p ≡ A₁ θ₁ c₁ AND φ, or p ≡ A₁ θ₁ c₁ AND A₂ θ₂ c₂ AND φ, or ..., or p ≡ A₁ θ₁ c₁ AND A₂ θ₂ c₂ AND ... AND A_n θ_n c_n AND φ. 190

191 Intersecting rid sets If we find that a conjunctive predicate does not match a single index, its (smaller) conjuncts may nevertheless match distinct indexes. Example: The conjunctive predicate in σ_{p AND q}(r_in) does not match an index, but both conjuncts, p and q, do. A typical optimizer might thus decide to transform the original plan σ_{p AND q}(r_in) into σ_p(r_in) ∩_rid σ_q(r_in), where ∩_rid denotes a set intersection operator defined over rid equality: the two index scans deliver rid sets that are intersected before the actual records are fetched. 191
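A minimal Python sketch of this plan shape; the rid sets and the fetch-by-rid helper are hypothetical stand-ins for the two index scans and the record access:

def select_via_rid_intersection(rids_p, rids_q, fetch):
    # sigma_{p AND q}(r) when p and q each match a different index:
    # intersect the rid sets produced by the two index scans, then
    # fetch only the qualifying records.
    rids = set(rids_p) & set(rids_q)              # rid-based set intersection
    return [fetch(rid) for rid in sorted(rids)]   # fetch in rid order (fewer page switches)

records = {1: "Ben", 2: "Sue", 3: "Joe", 4: "Ann"}     # toy "file", keyed by rid
print(select_via_rid_intersection([1, 2, 3], [2, 3, 4], records.get))   # ['Sue', 'Joe']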

192 The selectivity of conjunctive predicates What can you say about the selectivity of the conjunctive predicate p q? sel(p q) =? 192

193 Disjunctive predicates Choosing an intelligent execution plan for disjunctive selection predicates of the general form A₁ θ₁ c₁ OR A₂ θ₂ c₂ OR ... OR A_n θ_n c_n is much harder: We are forced to fall back to a naive file-scan-based evaluation (see Section 7.3.1) as soon as only a single term does not match an index. Why? If all terms are supported by indexes, we can exploit a rid-based set union ∪_rid to improve the plan: σ_{A₁ θ₁ c₁}(r_in) ∪_rid ... ∪_rid σ_{A_n θ_n c_n}(r_in). 193

194 The selectivity of disjunctive predicates What can you say about the selectivity of the disjunctive predicate p q? sel(p q) =? Predicates involving attribute attribute comparisons Can you think of a clever query plan for a selection operation like the one shown below? A=B (r in ). 194

195 Bypass Selections Problem: parts of a selection condition may be expensive to check (typically, we assumed this was not the case!), or may be very unselective. It is useful to evaluate cheap (and selective) predicates first. Boolean laws used for this include: true ∨ P ≡ true (evaluating P is not necessary) and false ∨ P ≡ P (only now evaluate P). Example: Q := σ_{(F₁ ∧ F₂) ∨ F₃}(R), where the selectivities and cost of each part of the selection condition are as follows:
formula  selectivity  cost
F₁       s₁ = 0.6     C₁ = 18
F₂       s₂ = 0.4     C₂ = 3
F₃       s₃ = 0.7     C₃ = 40 195

196 Evaluation Alternative 1: Bring the selection condition into disjunctive normal form (DNF); it is already in DNF in our case. Push each tuple from the input through each disjunct in parallel and collect matching tuples from each disjunct (eliminating duplicates!). [Plan sketch: 1000 input tuples are routed through both disjuncts; the upper path F₃ passes 700 tuples, the lower path F₂ passes 400 tuples of which F₁ passes 240; both outputs are merged with duplicate elimination, leaving 772 result tuples.] Mean cost per tuple (ignoring the cost of duplicate elimination!): C₃ + C₂ + s₂ · C₁ = 50.2 (upper path + lower path F₂ + lower path F₁). 196

197 Evaluation Alternative 2: Bring the selection condition into conjunctive normal form (CNF): CNF[(F₁ ∧ F₂) ∨ F₃] = (F₁ ∨ F₃) ∧ (F₂ ∨ F₃). Push each tuple from the input through the conjuncts one after the other; matching tuples survive all conjuncts (no duplicate elimination necessary!). [Plan sketch: 1000 input tuples, 820 survive F₂ ∨ F₃, 772 survive F₁ ∨ F₃.] Mean cost per tuple: C₂ + (1 − s₂) · (C₃ + s₃ · (C₁ + (1 − s₁) · C₃)) + s₂ · (C₁ + (1 − s₁) · C₃) = 54.88. Problem: F₃ is evaluated multiple times; its result could be cached! Mean cost per tuple with caching: C₂ + C₃ + s₂ · (1 − s₃) · C₁ = 45.16. 197

198 Evaluation Alternative 3: Bypass Plan Goal: eliminate tuples early, avoid duplicates. Introduce a bypass selection operator, which produces two (disjoint!) outputs: a true stream and a false stream. Bypass plans are derived from the CNF, i.e., (F₁ ∨ F₃) ∧ (F₂ ∨ F₃) in our example. Boolean factors and the disjuncts within factors are sorted by cost. [Plan sketch: 1000 tuples enter F₂; its 400 true-tuples are routed to F₁, whose 240 true-tuples bypass F₃ and go directly to the result, while its 160 false-tuples go to F₃ (112 survive); the 600 false-tuples of F₂ go to F₃ directly (420 survive); the disjoint union ∪̇ yields 240 + 420 + 112 = 772 result tuples.] Mean cost per tuple (∪̇ ... disjoint union): C₂ + (1 − s₂) · C₃ + s₂ · (C₁ + (1 − s₁) · C₃) = 40.6. Many variations are possible, e.g., for tuning in parallel environments. 198
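The three per-tuple cost figures follow directly from the selectivities and costs given above (C₃ = 40 is the value consistent with all three stated results). A small Python sketch that reproduces the arithmetic:

s1, s2, s3 = 0.6, 0.4, 0.7      # selectivities of F1, F2, F3
C1, C2, C3 = 18.0, 3.0, 40.0    # per-tuple evaluation costs

dnf       = C3 + C2 + s2 * C1                                                  # alternative 1
cnf       = C2 + (1 - s2) * (C3 + s3 * (C1 + (1 - s1) * C3)) + s2 * (C1 + (1 - s1) * C3)  # alternative 2
cnf_cache = C2 + C3 + s2 * (1 - s3) * C1                                       # alternative 2, F3 cached
bypass    = C2 + (1 - s2) * C3 + s2 * (C1 + (1 - s1) * C3)                     # alternative 3

print(dnf, cnf, cnf_cache, bypass)   # 50.2  54.88  45.16  40.6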

199 7.4 The projection operation Projection π_l modifies each record in its input file and cuts off any field not listed in the attribute list l. Example (π_{A,B} applied to a file with fields A, B, C):
A  B      C            A  B             A  B
1  "foo"  3            1  "foo"         1  "foo"
1  "bar"  2    =(1)=>  1  "bar"  =(2)=> 1  "bar"
1  "foo"  2            1  "foo"
1  "bar"  0            1  "bar"
1  "foo"  0            1  "foo"
In general, the size of the resulting file will only be a fraction of the original input file: 1 any unwanted fields (here: C) have been thrown away, and 2 cutting off record fields may lead to duplicate records which have to be eliminated⁵ to produce the final result. ⁵ Remember that we are bound to implement set semantics. 199

200 While step 1 calls for a rather straightforward file scan (indexes won't help much here), it is step 2 which makes projection costly. To implement duplicate elimination we have two principal alternatives: 1 sorting, or 2 hashing. 7.4.1 Projection based on sorting Sorting is one obvious preparatory step to facilitate duplicate elimination: records with all fields equal will be adjacent to each other after the sorting step. One benefit of a sort-based projection is that the operator π_l will write a sorted output file. (See algorithm on the next slide.) [Plan sketch: r_in feeds a sort operator whose sorted output (annotation s) feeds π_l.] 200

201 Algorithm: Input: Output: (l, r in, r out) attribute list l, input file r in output file r out written (side-effect) out createfile(r tmp); in openscan(r in ); while (r nextrecord(in)) EOF do r r with any field cut off not listed in l; appendrecord(out, r ); closefile(out); external-merge-sort(r tmp, r tmp, θ); out createfile(r out); in openscan( run * 0 ); lastr ; while (r nextrecord(in)) EOF do if r lastr then appendrecord(out, r); lastr r; closefile(out); Sort ordering θ? How do we have to specify the ordering θ to make sure the above algorithm works correctly? 201

202 In this algorithm, sorting and duplicate elimination are two separate steps executed in sequence. Marriage of sorting and duplicate elimination? Can you imagine how a DBMS could fold the formerly separate phases ( 1 external merge sort, 2 duplicate elimination) to avoid the two-stage approach? The outline of the external merge sort algorithm is reproduced below. Pass 0: 1 Read B pages at a time, 2 use in-memory sort to sort the records on these B pages, 3 write the sorted run to disk. (N.B.: Pass 0 writes N/B runs to disk, each run contains B pages except the last run which may contain less.) Passes 1,... (until only a single run is left): 1 Select B 1 runs from previous pass, read a page from each run, 2 perform a (B 1)-way merge and use the B-th page as temporary output buffer. 202

203 7.4.2 Projection based on hashing If the DBMS has a fairly large number of buffer pages (B, say) to spare for the π_l(r_in) operation, a hash-based projection may be an efficient alternative to sorting: Partitioning phase: 1 Allocate all B buffer pages. One page will be the input buffer, the remaining B − 1 pages will be used as hash buckets. 2 Read the file r_in page-by-page; for each record r, cut off the fields not listed in l. 3 For each such record, apply a hash function h₁(r) = h(r) mod (B − 1), which depends on all remaining fields of r, and store r in hash bucket h₁(r). (Write the bucket to disk if full.⁶) [Figure: the input file is read through one input buffer; the B − 1 bucket pages are flushed to B − 1 partitions on disk.] ⁶ You may read this as: a bucket's overflow chain resides on disk. 203

204 After partitioning, we are ensured that duplicate elimination is an intra-partition problem only: two identical records r, r′ have been mapped to the same partition, since r = r′ ⇒ h₁(r) = h₁(r′). We are not done yet, though. Due to hash collisions, the records in a partition are not guaranteed to be all equal: h₁(r) = h₁(r′) does not imply r = r′. We thus need a second, per-partition duplicate elimination phase: 1 Read each partition page-by-page. (Buffer page layout as before.) 2 To each record, apply a hash function h₂ ≠ h₁. Why? 3 If two records r, r′ collide w.r.t. h₂, check whether r = r′. If so, discard r′. 4 After the entire partition has been read in, append all hash buckets to the result file (which will be free of duplicates). N.B.: The hash-based approach is efficient only if the duplicate elimination phase can be performed in-memory (i.e., no partition may exceed the buffer size). 204
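The two phases are easy to mirror in a small in-memory Python sketch (a Python set plays the role of the second hash table h₂; real partitions would live on disk and the partition count would be B − 1):

def project_hash(records, attrs, n_partitions=8):
    # Phase 1: cut off unwanted fields and partition the projected records with h1.
    partitions = [[] for _ in range(n_partitions)]
    for r in records:
        projected = tuple(r[a] for a in attrs)
        partitions[hash(projected) % n_partitions].append(projected)
    # Phase 2: eliminate duplicates inside each partition (intra-partition problem only).
    result = []
    for part in partitions:
        seen = set()
        for rec in part:
            if rec not in seen:
                seen.add(rec)
                result.append(rec)
    return result

rows = [{"A": 1, "B": "foo", "C": 3}, {"A": 1, "B": "bar", "C": 2}, {"A": 1, "B": "foo", "C": 0}]
print(project_hash(rows, ("A", "B")))   # [(1, 'foo'), (1, 'bar')] in some partition order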

205 7.4.3 Use of indexes for projection If the index key contains all attributes of the projection, we can use an indexonly plan to retrieve all values from the index pages without accessing the actual data records. Next we apply hashing or sorting to eliminate duplicates from this (much smaller) set of pages. If the index key includes the projected attributes as a prefix, and the index is a sorted index (e.g., a B + tree), we can use an index-only plan, both to retrieve the projected attribute values and to eliminate the duplicates as well. 205

206 7.5 The join operation The semantics of the join operation r₁ ⋈_p r₂ is most easily described in terms of two other relational operators: r₁ ⋈_p r₂ ≡ σ_p(r₁ × r₂) (× denotes the cross product operator; predicate p may refer to record fields in files r₁ and r₂). There are several alternative algorithms that implement r₁ ⋈_p r₂, and some of them actually implement the above relational equivalence literally: 1 enumerate all records in the cross product of r₁ and r₂, 2 then pick those record pairs satisfying predicate p. More advanced algorithms try to avoid the obvious inefficiency in step 1 (the size of the intermediate result is |r₁| · |r₂|) and instead try to select early. 206

207 7.5.1 Nested loops join The nested loops join (NL-) is the basic join algorithm variant. Its I/O cost is forbidding, though. Algorithm: (p, r 1, r 2, r out) Input: predicate p, input files r 1,2 Output: output file r out written (side-effect) out createfile(r out); in 1 openscan(r 1); while (r nextrecord(in 1)) EOF do in 2 openscan(r 2); while (r nextrecord(in 2)) EOF do if p(r, r ) then appendrecord(out, r, r ); closefile(out); For obvious reasons, file r 1 is referred to as the outer (relation), while r 2 is commonly called the inner (relation). 207

208 Cost of NL-⋈ We can easily modify the algorithm such that for each page of the outer relation (instead of for each record), one scan of the inner relation is initiated. (If we ignored this simple modification, the I/O cost of the inner loop would be a prohibitive |r₁| · ‖r₂‖!)
input access: file scan (openscan) of r₁, r₂
prerequisites: none (p arbitrary, r₁, r₂ may be heap files)⁷
I/O cost: ‖r₁‖ + ‖r₁‖ · ‖r₂‖ (outer loop + inner loop)
⁷ Ignoring the cost to write the result file r_out. 208

209 The I/O cost for the simple NL-⋈ is staggering since NL-⋈ effectively enumerates all records in the cross product of r₁ and r₂. Example: Assume ‖r₁‖ = 1000 and ‖r₂‖ = 500, and that on current hardware a single I/O operation takes about 10 msec (see Section 2.1.1). The resulting processing time for the NL-⋈ of r₁ and r₂ thus amounts to (1000 + 1000 · 500) · 10 msec = 5,010,000 msec ≈ 83 mins. Remark: Swapping the roles of r₁ and r₂ (outer ↔ inner) does not buy us much here. This will, however, be different for advanced join algorithms.⁸ ⁸ If the DBMS's record field accesses are designed with care, we can assume that r₁ ⋈_p r₂ = r₂ ⋈_p r₁. 209

210 7.5.2 Block nested loops join Observe that plain NL-⋈ utilizes only 3 buffer pages at a time and otherwise effectively ignores the presence of spare buffer space. Given B pages of buffer space, we can easily refine NL-⋈ to use the entire available space. The buffer setup is as follows: [Figure: of the B main memory buffers, B − 2 pages hold a hash table over the current block of r₁, one page serves as input buffer for the page-wise scan of r₂, and one page as output buffer for the join result.] The main idea is to read the outer file r₁ in chunks of B − 2 pages (instead of page-by-page as in NL-⋈). Hash table? Which role does the in-buffer hash table over file r₁ play here? 210

211 Algorithm: (p, r 1, r 2, r out) Input: equality predicate p (r 1.A = r 2.B), input files r 1,2 Output: output file r out written (side-effect) out createfile(r out); in 1 openscan(r 1); repeat // try to read a chunk of maximum size (but don t read beyond EOF of r 1) B min(b 2, #remaining blocks in r 1); if B > 0 then read B blocks of r 1 into buffer, hash record r of r 1 to buffer page h(r.a) mod B ; in 2 openscan(r 2); while (r nextrecord(in 2)) EOF do compare record r with records r stored in buffer page h(r.b) mod B ; if r.a = r.b then appendrecord(out, r, r ); until B < B 2; closefile(out); If predicate p is a general predicate, block NL- is still applicable (at the cost of more CPU cycles, since all B 2 in-buffer blocks of r 1 have to be scanned to find a join partner for record r of r 2 ). 211

212 Cost of block NL-⋈:
input access: chunk-wise file scan of r₁, page-wise file scan of r₂
prerequisites: p equality predicate (or arbitrary), r₁, r₂ may be heap files
I/O cost: ‖r₁‖ + ⌈‖r₁‖ / (B − 2)⌉ · ‖r₂‖ (outer loop + inner loop)
Block NL-⋈ beats plain NL-⋈ in terms of I/O cost by far. To return to our running example: Assume, as before, ‖r₁‖ = 1000 and ‖r₂‖ = 500, a single I/O operation takes about 10 msec (see Section 2.1.1), and assume B = 100. Resulting processing time for the block NL-⋈ of r₁ and r₂: (1000 + ⌈1000/98⌉ · 500) · 10 msec = 65,000 msec = 65 secs (... as opposed to 83 mins before!) 212
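A small Python sketch of the two cost formulas, reproducing the running example (1000 and 500 pages, 10 msec per page I/O, B = 100):

import math

def nl_join_io(pages_outer, pages_inner):
    return pages_outer + pages_outer * pages_inner

def block_nl_join_io(pages_outer, pages_inner, B):
    return pages_outer + math.ceil(pages_outer / (B - 2)) * pages_inner

for ios in (nl_join_io(1000, 500), block_nl_join_io(1000, 500, 100)):
    print(ios, "page I/Os =", ios * 10 / 1000, "seconds")
# 501000 page I/Os = 5010 s (about 83 mins)  vs.  6500 page I/Os = 65 s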

213 Which relation is outer? Always use the smaller relation as outer. In the extreme this gives optimal performance. Give details...! Further optimization potential: If not just one page is left for scan of the inner relation, but the buffer pool is split evenly between the two relations, more passes over the inner relation are made, but I/O for inner page reads can be blocked, reducing seek times dramatically. 213

214 7.5.3 Index nested loops join Whenever there is an index on (at least) one of the join relations that matches the join predicate, we can take advantage by making the indexed relation the inner relation of the join algorithm. We do not need to compare the tuples of the outer relation with those of the inner, but rather use the index to retrieve the matches efficiently. Algorithm: (p, r 1, r 2, r out) Input: predicate p, input files r 1,2, index on r 2 Output: output file r out written (side-effect) out createfile(r out); in 1 openscan(r 1); while (r nextrecord(in 1)) EOF do use index on r 2 to find all matches for r appending them to output out; closefile(out); Index nested loops avoids enumeration of the cross-product. 214

215 Cost of index nested loops join depends on the available index.
input access: file scan (openscan) of r₁, index access to r₂
prerequisites: index on r₂ matching join predicate p⁹
I/O cost: ‖r₁‖ + |r₁| · (cost of one index access to r₂) (outer loop + inner loop)
This algorithm is especially useful if the index is a clustered index; furthermore, even with unclustered indexes and few matches per outer tuple, index nested loops outperforms simple nested loops. ⁹ Ignoring the cost to write the result file r_out. 215
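A minimal Python sketch of the algorithm shape; the index on the inner relation is modelled by a hypothetical in-memory dict that maps join-key values to lists of inner records:

def index_nl_join(outer, inner_index, key):
    result = []
    for r in outer:                               # one scan of the outer relation
        for s in inner_index.get(r[key], []):     # index lookup replaces the inner scan
            result.append((r, s))
    return result

flights = [{"to": "ZRH"}, {"to": "MUC"}]
airport_index = {"ZRH": [{"code": "ZRH", "country": "CH"}],
                 "MUC": [{"code": "MUC", "country": "DE"}]}
print(len(index_nl_join(flights, airport_index, "to")))   # 2 joined pairs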

216 7.5.4 Sort-merge join In a situation like the one depicted below, sort-merge join might be an attractive alternative to block NL-: r 1 s A=B s r 2 1 Both inputs to the join are sorted (annotation s on the incoming edges), and 2 the join predicate (here: A = B) is an equality predicate. Note that this effectively matches the situation just before the merge step of the two-way merge sort algorithm (see Chapter 6): simply consider join inputs r 1 and r 2 as runs that have to be merged. The merge phase has to be slightly adapted to ensure correct results are produced in a situation like this (with duplicates on both sides): 0 1 A C 1 "foo" 2 "foo" B 2 "bar" 2 "baz" A 4 "foo" R.A=S.B B D 1 true 2 false C 2 true A 3 false 216

217 Notes on the algorithm shown below: The code assumes that any comparison with EOF (besides itself) fails. Function tell(f ) yields the current file pointer of file f. The companion function seek(f, l) moves f s file pointer to position l. Unix: see man ftell and man fseek. Algorithm: (p, r 1, r 2, r out ) Input: equality predicate p (r 1.A = r 2.B), input files r 1,2 Output: output file r out written (side-effect) out createfile(r out ); in 1 openscan(r 1 ); in 2 openscan(r 2 ); r nextrecord(in 1 ); r nextrecord(in 2 ); // continued on next slide

218 //... continued from previous slide; while r EOF r EOF do while r.a < r.b do r nextrecord(in 1 ); while r.a > r.b do r nextrecord(in 2 ); l tell(in 2 ); while r.a = r.b do // repeat the scan of r 2 (implements the from previous slide) seek(in 2, l); r getrecord(in 2 ); // while we find matching records in r 2... while r.a = r.b do appendrecord(out, r, r ); r nextrecord(in 2 ); r nextrecord(in 1 ); r r ; closefile(out); 218

219 Summary and analysis of sort-merge join:
input access: sorted file scan of both r₁, r₂
prerequisites: p equality predicate r₁.A = r₂.B, r₁ sorted on A, r₂ sorted on B
I/O cost: best case ‖r₁‖ + ‖r₂‖ (if A is a key in r₁); worst case ‖r₁‖ · ‖r₂‖ (if all (r₁, r₂)-pairs match)
I/O performance figures. Example: Just like before, ‖r₁‖ = 1000 and ‖r₂‖ = 500, and a single I/O operation takes about 10 msec. Resulting processing time for the sort-merge join of r₁ and r₂: best case (1000 + 500) · 10 msec = 15,000 msec = 15 sec; worst case (1000 · 500) · 10 msec = 5,000,000 msec ≈ 83 mins. 219

220 Final remarks on sort-merge join: If either (or both) of R, S are not available in sorted order according to the join attribute(s), we can obtain the sort order by introducing an explicit sort step into the execution plan before the join operator. If we need to do explicit sorting before the join, we can combine the last merge phase of the (merge) sorting with the join (at the expense of slightly higher memory requirements). 220

221 7.5.5 Hash joins Hash join algorithms (there are quite a few!) follow a simple idea of partitioning: Instead of one big join compute many small joins: use the same hash function h to split r 1 and r 2 into k partitions, join each of the k pairs of partitions of r 1,2 separately. Due to hash partitioning, join partners from r 1 and r 2 can only be found in matching partitions i (hash joins only work for equi-joins!) Since the k small joins are independent of each other, this provides good parallelization potentials! The principal idea behind hash joins is the algorithmic divide-and-conquer paradigm. 221

222 Conceptually, a hash join is divided into a partitioning phase (or building phase) and a probing phase (or matching phase). The building phase scans each input relation in turn, filling k buckets. The probing phase scans each of the k buckets once and computes a small join (hopefully in memory), e.g., using another hash function h₂. [Figure: for each pair of partitions Rᵢ, Sᵢ on disk, an in-memory hash table built with h₂ (k < B − 1 pages) holds Rᵢ, one input buffer scans Sᵢ, and one output buffer collects the join result; B main memory buffers in total.] 222

223 Algorithm: (p, r 1, r 2, r out ) Input: equality-predicate p, input files r 1,2 Output: output file r out written (side-effect) // building phase: in 1 openscan(r 1 ); while (r nextrecord(in 1 )) EOF do add r to buffer page h(r) // flushing buffer pages as they fill closefile(r 1 ); in 2 openscan(r 2 ); while (s nextrecord(in 2 )) EOF do add s to buffer page h(s) // flushing buffer pages as they fill closefile(r 2 ); // continued on next slide

224 //... continued from previous slide // probing phase: out createfile(r out ); for l = 1,..., k do // build in-memory hash table for r l 1, using h 2 for each tuple r in r l 1 do read r and insert it into hash table position h 2 (r); // scan r l 2 and probe for matching r l 1 tuples for each tuple s in r l 2 do read s and probe hash table using h 2 (s); for matching r 1 tuples r, appendrecord(out, r, s ); clear hash table for next partition; closefile(out); 224

225 Cost of this hash join Ignoring memory bottlenecks, this ("Grace Hash Join") algorithm reads each page of r₁, r₂ exactly once in the building phase and writes about the same number of pages out for the partitions. The probing phase reads each partition once.
input access: file scan (openscan) of r₁, r₂
prerequisites: equi-join, r₁, r₂ may be heap files
I/O cost: (‖r₁‖ + ‖r₂‖) + (‖r₁‖ + ‖r₂‖) [read + write, building phase] + (‖r₁‖ + ‖r₂‖) [probing phase] = 3 · (‖r₁‖ + ‖r₂‖), ignoring the cost to write the result file r_out. 225

226 I/O performance figures. Example: Just like before, ‖r₁‖ = 1000 and ‖r₂‖ = 500, and a single I/O operation takes about 10 msec. Resulting processing time for the hash join of r₁ and r₂: 3 · (1000 + 500) · 10 msec = 45,000 msec = 45 sec. More elaborate hash join algorithms deal, e.g., with the case that partitions do not fit into memory during the probing phase. 226

227 Memory Requirements for Grace Hash Join We have to try to fit each hash partition into memory for the probing phase. Hence, to minimize partition size, we have to maximize the number of partitions. While partitioning, we need 1 buffer page per partition and 1 input buffer. With B buffers, we can thus generate B − 1 partitions. This gives partitions of size ‖R‖ / (B − 1) (for equal distribution). The size of an (in-memory) hash table for the probing phase needs to be f · ‖R‖ / (B − 1), for some fudge factor f a little larger than 1. During the probing phase, we need to keep one such in-memory hash table, one input buffer plus one output buffer in memory, which results in B > f · ‖R‖ / (B − 1) + 2. In summary, we thus need approximately B > √(f · ‖R‖) pages of buffer space for the Grace Hash Join to perform well. If one or more partitions do not fit into main memory during the probing phase, this degrades performance significantly. 227
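The buffer requirement is easy to evaluate numerically; a small sketch (the fudge factor 1.2 is an assumed value, not from the slides) searches for the smallest B that satisfies the condition above:

def min_buffers_grace(pages_R, fudge=1.2):
    # Smallest B with B > f * ||R|| / (B - 1) + 2, i.e. each of the B - 1
    # partitions of R (inflated by the fudge factor) fits into memory
    # together with one input and one output buffer.
    B = 3
    while B <= fudge * pages_R / (B - 1) + 2:
        B += 1
    return B

print(min_buffers_grace(1000))   # 37, roughly sqrt(1.2 * 1000) + 2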

228 Utilizing Extra Memory Suppose we are partitioning R (and S) into k partitions where B > f · ‖R‖ / k, i.e., we can build an in-memory hash table for each partition. The partitioning phase needs k + 1 buffers, which leaves us with some extra buffer space of B − (k + 1) pages. If this extra space is large enough to hold one partition, i.e., B − (k + 1) ≥ f · ‖R‖ / k, we can collect the entire first partition of R in memory during the partitioning phase and need not write it to disk. Similarly, during the partitioning of S, we can avoid storing its first partition on disk and rather immediately probe the tuples in S's first partition against the in-memory first partition of R and write out results. At the end of the partitioning phase for S, we are already done with joining the first partitions. The savings obtained result from not having to write out and read back in the first partitions of R and S. This version of hash join is called Hybrid Hash Join. 228

229 7.5.6 Semijoins Origin: Distributed DBMSs (here: transport cost dominates I/O cost). Remember: the semijoin R ⋉ S := π_{sch(R)}(R ⋈ S). Idea: to compute the distributed join between two relations R, S stored on different nodes N_R, N_S (assuming we want the result on N_R; let the common attributes be J): 1 Compute π_J(R) on N_R. 2 Send the result to N_S. 3 Compute π_J(R) ⋈ S on N_S. 4 Send the result to N_R. 5 Compute R ⋈ (π_J(R) ⋈ S) on N_R. N.B. step 3 computes the semijoin S ⋉ R. This algorithm is preferable over sending all of S to N_R if (C_tr denotes transport cost, depending on the size of the transferred data): C_tr(π_J(R)) + C_tr(S ⋉ R) < C_tr(S). 229

230 Example: Semijoin Let relations R and S be given as R A B S B C D This yields π B (R) B S R B C D Cost of Semijoin: C tr = = 15 whereas sending all of S has C tr =
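A small Python sketch of the five-step strategy; the data is hypothetical and simply counting shipped tuples stands in for the real transport cost C_tr, which would depend on tuple widths:

def distributed_join_with_semijoin(R, S, J):
    proj = {tuple(r[a] for a in J) for r in R}                      # 1: pi_J(R) on N_R
    # 2: ship proj to N_S; 3: compute S semijoin R on N_S
    s_reduced = [s for s in S if tuple(s[a] for a in J) in proj]
    # 4: ship s_reduced back to N_R; 5: final join on N_R
    result = [{**r, **s} for r in R for s in s_reduced
              if all(r[a] == s[a] for a in J)]
    shipped = len(proj) + len(s_reduced)       # transfer volume of the semijoin plan
    return result, shipped, len(S)             # compare against shipping all of S

R = [{"A": 1, "B": 10}, {"A": 2, "B": 10}]
S = [{"B": 10, "C": "x"}, {"B": 30, "C": "y"}, {"B": 40, "C": "z"}, {"B": 50, "C": "w"}]
result, shipped, naive = distributed_join_with_semijoin(R, S, ["B"])
print(result)                                   # two joined tuples
print(shipped, "tuples shipped vs.", naive)     # 2 vs. 4: the semijoin plan wins here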

231 7.5.7 Summary of join algorithms No single join algorithm performs best under all circumstances. Choice of algorithm affected by sizes of relations joined, size of available buffer space, availability of indexes, form of join condition, selectivity of join predicate, available physical properties of inputs (e.g., sort orders), desirable physical properties of output (e.g., sort orders),... Performance differences between good and bad algorithm for any given join can be enormous. Join algorithms have been subject to intensive research efforts, particularly also in the context of parallel DBMSs. 231

232 7.6 A Three-Way Join Operator Within the INGRES project at UC Berkeley, a three-way join operator has been developed. Observations: Suppose we want to compute the join R A S B T, where A is an attribute common to R and S, B is common to S and T. This is an instance of a (three-way) star join with S as the center relation. Using only traditional (two-way) join algorithms, choices will include left-deep NL--plans (with or without index) iterating over, say, S as outer, using either of R or T as first inner and the other of two as second inner relation. When thinking of simple NL--algorithms, this means that for each combination of matching SR- (or ST -) tuple, we have to iterate over all of T (or S), resulting in a complexity on the order of O(n m k), for n, m, k the size of the involved relations (either in terms of number of tuples or number of pages). This roughly corresponds to three levels of nested loops. 232

233 The INGRES Three-Way Join Algorithm Idea: Scan the center relation, S in our example. For each tuple s ∈ S do: Find all matching R-tuples r and collect them in a temporary space R′ (e.g., using a nested loop or an index). Find all matching T-tuples t and collect them in a temporary space T′ (e.g., using a nested loop or an index). Append to the output the product (i.e., all combinations) of the one s tuple with the r and t tuples from the two temporary spaces R′ and T′. N.B.: this corresponds to only two levels of nested loops, one outer loop (over S) with two loops inside, but one after the other, hence a complexity of only O(n · (m + k)). Disadvantage: This three-way join algorithm makes optimization even more complex, since a sequence of two binary (logical) operators needs to be mapped to a single ternary (physical) operator. 233
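A minimal Python sketch of the idea; the indexes on R and T are hypothetical in-memory dicts that deliver the matching tuples for the two temporary spaces:

def three_way_star_join(S, R_index, T_index, a, b):
    # One pass over the center relation S; per s, collect the matching
    # R- and T-tuples and emit their product with s.
    out = []
    for s in S:                                   # single outer loop over S
        r_matches = R_index.get(s[a], [])         # temporary space of matching R-tuples
        t_matches = T_index.get(s[b], [])         # temporary space of matching T-tuples
        for r in r_matches:
            for t in t_matches:
                out.append((r, s, t))
    return out

S = [{"A": 1, "B": "x"}]
R_index = {1: [{"A": 1, "r": "r1"}, {"A": 1, "r": "r2"}]}
T_index = {"x": [{"B": "x", "t": "t1"}]}
print(len(three_way_star_join(S, R_index, T_index, "A", "B")))   # 2 result triples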

234 7.7 Other Operators 7.7.1 Set Operations Intersection and Cross Product ... are implemented as special "joins": for intersection, use equality on all attributes as the join condition; for the product, use true; hence, there is no need to further consider those. With Union and Difference, ... the challenge lies in duplicate identification. There are two approaches, one based on sorting and one based on hashing. Work out the details on your own!

235 7.7.2 Aggregates The language SQL supports a number of aggregation operators (such as sum, avg, count, min, max). Basic algorithm: scan the whole relation and maintain some running information during that scan. Compute the aggregate value from the running information upon completion of the scan:
Aggregate   Running Information
sum         Total of values read
avg         Total, Count of values read
count       Count of values read
min         Smallest value read
max         Largest value read
Grouping: if aggregation is combined with grouping, we first have to do the grouping, using hashing or sorting (or an appropriate index). Then, use the running information on a per-group basis. Index-only: sometimes, aggregate values can be computed without accessing the data records at all, by just using an available index.

236 7.8 The impact of buffering Effective use of the buffer pool is crucial for efficient implementations of a relational query engine. Several operators use the size of available buffer space as a parameter. Keep the following in mind: 1 When several operators execute concurrently, they share the buffer pool. 2 Using an unclustered index for accessing records makes finding a page in the buffer rather unlikely and (rather unpredictably!) dependent on the size of the buffer. 3 Furthermore, each page access is likely to refer to a new page; therefore, the buffer pool fills quickly and we obtain a high level of I/O activity. 4 If an operation has a repeated pattern of page accesses, a clever replacement policy and/or a sufficient number of buffers can speed up the operation significantly. Examples of such patterns are: 236

237 Simple nested loops join: for each outer tuple, scan all pages of the inner relation. If there is enough buffer space to hold entire inner relation, the replacement policy is irrelevant. Otherwise it is critical: LRU will never find a needed page in the buffer ( Sequential Flooding problem, see Section 2.3) MRU gives best buffer utilization, the first B 2 pages of the inner will always stay in the buffer. Nested block join: for each block of the outer, scan all pages of the inner relation. Since only one unpinned page is available for the scan of the inner, the replacement policy makes no difference. Index nested loop join: for each tuple in the outer, use the index to find matching tuples in the inner relation. For duplicate values in the join attributes of the outer relation, we obtain repeated access patterns for the inner tuples and the index. The effect can be maximized by sorting the outer tuples on the join attributes. 237

238 7.9 Managing long pipelines of relational operators Note that any relational operator that we have been discussing takes a parameter r out, i.e., a file (name) to be written to hold the operator s output. In some sense, we are using secondary storage as a one-way communication channel between operators in a plan. Consequences of this approach: 1 We pay for the (substantial) I/O effort to feed into and read from this communication channel. 2 The operators in a plan are executed in sequence, the first result record is produced not before the last relational operator in the pipeline executes: r 1 r 2 p tmp 1 l tmp 2 q tmp 3... tmp n k N.B.: No more than three temporary files tmp i need to exist at any point in time during execution. 238

239 Architecting the query processor in this fashion bears much resemblance with using the Unix shell like this:
# report all large MP3 audio files
# ... below the current working directory
$ find . -size +1MB > tmp1
$ xargs file < tmp1 > tmp2
$ grep -i MP3 < tmp2 > tmp3
$ cut -d: -f1 < tmp3          (output)
$ rm tmp[0-9]
Unix supports another type of communication channel, the pipe, which lets the participating commands exchange data character-by-character:
# report all large MP3 audio files
# ... below the current working directory
$ find . -size +1MB | xargs file | grep -i MP3 | cut -d: -f1          (output)
239

240 The execution of the pipe is driven by the rightmost command: 1 To produce a line of output, cut only needs to see the next line of its input: grep is requested to produce this input. 2 To produce this line of output, grep only needs to see the next line of its input: xargs is requested to produce this input. As soon as find has produced a line of output, it is passed through the pipe, transformed by xargs, grep, and cut, and then echoed to the terminal. In the database world, this mode of executing a pipeline (a query plan) is called streaming: A streaming query processor avoids writing temporary files (the tmp_i) whenever possible, operators communicate their output record-by-record (or block-by-block), and a result record appears as soon as it is available (as opposed to when the complete result has been computed¹¹). ¹¹ This is of major importance in interactive DBMS environments (ad-hoc query interfaces). 240

241 Example: 1 $ grep foo 2 XML 3 foobar 4 foobar 5 What does foo mean anyway? 6 What does foo mean anyway? 7 Enough already 8 ^D 9 $ Note, however, that we have to modify the implementations of our relational operators to support streaming. Currently, all operators consume their input as a whole, then write their output file as a whole, and only then return control to the query processor. 241

242 7.9.1 Streaming Interface To support streaming we need a record-by-record calling convention. New operator interface (let denote a relational operator):.reset() Operator is requested to reset so that a call to.next() will produce the first result record..next() The operator is requested to produce the next record of its result. Returns EOF if all result records have been requested already. 242

243 Example (implementation of p (r in )): Algorithm: Input: in.reset(); (p, in).reset() predicate p, in-bound stream in Algorithm: Input: Output:.(p, in).next() predicate p, in-bound stream in next record of selection result (or EOF ) while (r in.next()) EOF do if p(r) then // immediately return if next result record found return r; return EOF ; 243
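Python's generators give the same record-by-record calling convention almost for free; the following sketch mirrors the reset()/next() interface with a small demand-driven pipeline (the operators and data are illustrative, not the lecture's pseudocode):

def scan(records):
    # Streaming scan: yields one record at a time; a reset() simply means
    # creating a fresh generator.
    for r in records:
        yield r

def select(pred, child):
    # Streaming selection: pulls from its child only on demand and passes
    # on the records satisfying the predicate.
    for r in child:
        if pred(r):
            yield r

def project(attrs, child):
    # Streaming projection (without duplicate elimination).
    for r in child:
        yield tuple(r[a] for a in attrs)

people = [{"name": "a", "age": 55}, {"name": "b", "age": 23}, {"name": "c", "age": 29}]
plan = project(("name",), select(lambda r: r["age"] > 30, scan(people)))
for row in plan:          # the query processor's eval() loop: one next() at a time
    print(row)            # ('a',)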

244 Given a query plan like the one shown below, query evaluation is driven by the query processor like this (just like in the Unix shell): 1 The whole plan is initially reset by calling reset() on the root operator, i.e., σ_q.reset(). 2 The reset() call is forwarded through the plan by the operators themselves (see σ.reset() on the previous slide). 3 Control returns to the query processor. 4 The root is requested to produce its next result record, i.e., the call σ_q.next() is made. 5 Operators forward the next() request as needed. As soon as the next result record is produced, control returns to the query processor again. [Plan: scan(r₁) and scan(r₂) feed a join ⋈_p, whose output passes through π_l and finally the root operator σ_q.] 244

245 In short, the query processor uses the following routine to evaluate a query plan: Algorithm: Input: Output: eval (q) root operator of query plan q query result sent to terminal q.reset(); while (r q.next()) EOF do print(r); print("done."); 245

246 A streaming scan operator. Complete the implementation below to provide a streaming file scan operator: Algorithm: Input:... scan(f ).reset() filename f Algorithm: Input: Output:... scan(f ).next() filename f next record in file f or EOF 246

247 A streaming NL- operator. Complete the implementation below to provide a streaming NL- operator (see 7.5.1): Algorithm: (p, in 1, in 2).reset() Input: predicate p, in-bound streams in 1,2... Algorithm: (p, in 1, in 2).next() Input: predicate p, in-bound streams in 1,2 Output: next record in join result or EOF

248 Below is a code snippet used in a real DBMS product. The overall structure of this code almost perfectly matches the recent discussion: 1 /* efltr -- apply filter predicate pred to stream 3 Filter the in-bound stream, only stream elements that fulfill e->pred 4 contribute to the result. No index support whatsoever. 5 */ 6 erc eop FLTR(eOp *ip) 7 { 8 eobj FLTR *e = (eobj FLTR *)eobj(ip); 9 10 /* Challenge the in-bound stream until it is exhausted... */ 11 while (eintp(e->in)!= eeos) { 12 eintp(e->pred); 13 /*... or a stream element fulfills predicate e->pred */ 14 if (et as bool(eval(e->pred))) { 15 eval(ip) = eval(e->in); 16 return eok; 17 } 18 } 19 return eeos; 20 } erc eop FLTR RST(eOp *ip) 23 { 24 eobj FLTR *e = (eobj FLTR *)eobj(ip); ereset(e->in); 27 ereset(e->pred); return eok; 30 } 248

249 7.9.2 Demand-Driven vs. Data-Driven Streaming The iterator interface as shown above implements a demand-driven query processing infrastructure: consumers (later operators) request more input (by calling next()) from their producers (earlier operators) whenever they are ready to process the input. Demand-driven streaming minimizes resource requirements and wasted effort in case a user/client does not want to see the whole result. In contrast, data-driven streaming requires more resources, uses a different query processing infrastructure, and can exploit more parallelism. Each operator starts (asynchronously) to work on its input as soon and as fast as possible. Output is enqueued into a pipeline to the consumers as it occurs. The pipelines need to do buffering and/or to suspend producers. An operator only needs to wait if there is no more input yet, or if the output pipeline is full. 249

250 Bibliography Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2): Kemper, A., Moerkotte, G., Peithner, K., and Steinbrunn, M. (1994). Optimizing disjunctive queries with expensive predicates. In Snodgrass, R. T. and Winslett, M., editors, Proc. ACM SIGMOD Conference on Management of Data, pages , Minneapolis, MS. ACM Press. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. Steinbrunn, M., Peithner, K., Moerkotte, G., and Kemper, A. (1995). Bypassing joins in disjunctive queries. In Dayal, U., Gray, P., and Nishio, S., editors, Proc. Intl. Conf. on Very Large Databases, pages , Zurich, Switzerland. Morgan Kaufmann. Wong, E. and Youssefi, K. (1976). Decompostion A strategy for query processing. ACM Transactions on Database Systems, 1(3):

251 Module 8: Selectivity Estimation Module Outline Web Forms Applications SQL Interface 8.1 Query Cost and Selectivity Estimation 8.2 Database profiles 8.3 Sampling 8.4 Statistics maintained by commercial DBMS Transaction Manager Lock Manager Plan Executor Operator Evaluator Concurrency Control SQL Commands Files and Index Structures Buffer Manager Disk Space Manager Parser Optimizer You are here! Query Processor Recovery Manager DBMS Index Files Data Files System Catalog Database 251

252 8.1 Query Cost and Selectivity Estimation The DBMS has a number of alternative implementations available for each (logical) algebra operator. Selecting an implementation for each of the operators in a plan is one of the tasks of the query optimizer. The applicability of implementations may depend on various physical properties of the argument files (or streams), such as sort orders, availability of indexes,... Among the applicable plans, the optimizer selects the ones with the least expected cost. The cost of the implementation algorithms is determined by the size of the input arguments. Therefore, it is crucial to know, compute, or estimate the sizes of arguments. This chapter presents some techniques to measure the quantitative characteristics of database files or input/output streams. 252

253 Principal Approaches There are two principal approaches to estimate sizes of relations: 1. Maintain a database profile, i.e., statistical information about numbers and sizes of tuples, distribution of attribute values and the like, as part of the database catalog (meta information) during database updates. Calculate similar parameters for (intermediate) query results based upon a (simple) statistical model during query optimization. Typically, the statistical model is based upon the uniformity and independence assumptions. Both are typically not valid, but they allow for very simple calculations. In order to provide more accuracy, the system can record histograms to more closely approximate value distributions. 2. Use sampling techniques, i.e., gather necessary characteristics of a query plan (input relations and intermediate results) by running the query on a small sample and by extrapolating to the full input size at query execution time. Crucial decision here is to find the right balance between size of the sample and resulting accuracy of estimation. 253

254 8.2 Database profiles 8.2.1 Simple profiles Keep statistical information in the database catalogs. For each relation R, record:
|R| ... the number of records in relation R,
‖R‖ ... the number of disk pages allocated by those records,
s(R) ... the record size (s(R) and block size b as an alternative: ‖R‖ ≈ |R| / ⌊b / s(R)⌋),
V(A, R) ... the number of distinct values of attribute A in relation R,
... and possibly more.
Based on these, develop a statistical model, typically a very simple one, using primitive assumptions. 254

255 Typical assumptions In order to obtain simple formulae, assume one of the following: Uniformity & independence assumption: all values of an attribute appear with the same probability; values of different attributes are distributed independent of each other. Simple, yet rarely realistic assumption. Worst case assumption: no knowledge available at all, in case of selections assume all records match the condition. Unrealistic assumption, can only be used for computing upper bounds. Perfect knowledge assumption: exact distribution of values is known. Unrealistic assumption, can only be used for computing lower bounds. Typically use uniformity assumption. 255

256 Selectivity estimation under uniformity assumption Using the parameters mentioned above, and assuming uniform and independent value distributions, we can compute the characteristics of intermediate results for the (logical) operators:
1. Selections Q := σ_{A=c}(R)
Selectivity: sel(A = c) = 1 / V(A, R) (uniformity!)
Number of records: |Q| = sel(A = c) · |R|
Record size: s(Q) = s(R)
Number of attribute values: V(A′, Q) = 1 for A′ = A, and c(|R|, V(A′, R), |Q|) otherwise. 256

257 The number c(|R|, V(A, R), |Q|) of distinct values of an attribute A after the selection is estimated using a well-known formula from statistics. The expected number c of distinct colors obtained by drawing r balls from a bag of n balls in m different colors is c(n, m, r) = r for r < m/2, (r + m)/3 for m/2 ≤ r < 2m, and m for r ≥ 2m. 257
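A small Python sketch of these uniformity-based estimates (the catalog numbers used in the example are made up for illustration):

def c(n, m, r):
    # Expected number of distinct "colors" when drawing r of n balls
    # that come in m colors (the formula above).
    if r < m / 2:
        return r
    if r < 2 * m:
        return (r + m) / 3
    return m

def estimate_equality_selection(card_R, v_A):
    # |sigma_{A=c}(R)| under the uniformity assumption: sel(A=c) = 1/V(A,R).
    return card_R / v_A

card_Q = estimate_equality_selection(10_000, 50)
print(card_Q)                       # 200.0 expected result tuples
print(c(10_000, 40, int(card_Q)))   # 40: an attribute with 40 distinct values is
                                    # expected to keep all of them (200 >= 2 * 40)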

258 2. Other Selection Conditions
Equality between attributes, e.g., Q := σ_{A=B}(R): An approximation for the selectivity could be sel(A = B) = 1 / max(V(A, R), V(B, R)). This assumes that each value of the smaller attribute (i.e., the one with fewer distinct values) has a corresponding match in the other attribute.
Range selections, e.g., Q := σ_{A>c}(R): If the system also keeps track of the minimum and maximum value of each attribute (denoted Low(A, R) and High(A, R) hereinafter), we could approximate the selectivity by sel(A > c) = (High(A, R) − c) / (High(A, R) − Low(A, R)).
Element tests, e.g., Q := σ_{A IN (...)}(R): An approximation for the selectivity could be obtained by multiplying the selectivity of an equality selection sel(A = c) by the number of elements in the list of values. 258

259 3. Projections Q := π_L(R) Estimating the number of result tuples is difficult. Typically use:
|Q| = V(A, R) for L = {A}; |Q| = |R| if key(R) ⊆ L; |Q| = |R| without duplicate elimination; |Q| = min(|R|, ∏_{A_i ∈ L} V(A_i, R)) otherwise.
s(Q) = Σ_{A_i ∈ L} s(A_i)
V(A_i, Q) = V(A_i, R) for A_i ∈ L. 259

260 4. Unions Q := R ∪ S: |Q| ≤ |R| + |S|; s(Q) = s(R) = s(S) (same schema!); V(A, Q) ≤ V(A, R) + V(A, S).
5. Differences Q := R − S: max(0, |R| − |S|) ≤ |Q| ≤ |R|; s(Q) = s(R) = s(S) (same schema!); V(A, Q) ≤ V(A, R). 260

261 6. Products Q := R × S: |Q| = |R| · |S|; s(Q) = s(R) + s(S); V(A, Q) = V(A, R) for A ∈ sch(R), and V(A, Q) = V(A, S) for A ∈ sch(S).
7. Joins Q := R ⋈_F S This is the most challenging operator for selectivity estimation! A few simple cases are: No common attributes (sch(R) ∩ sch(S) = ∅), or join predicate F = true: R ⋈_F S = R × S. Join attribute, say A, is a key in one of the relations, e.g., in R, and assuming the inclusion dependency π_A(S) ⊆ π_A(R): |Q| = |S|. 261

262 In the more general case, again assuming inclusion dependencies π_A(S) ⊆ π_A(R) or π_A(R) ⊆ π_A(S), we can use two estimates: |Q| = |R| · |S| / V(A, R) or |Q| = |R| · |S| / V(A, S); typically use the smaller of those two estimates, i.e., |Q| = |R| · |S| / max(V(A, R), V(A, S)).
s(Q) = s(R) + s(S) − Σ_{A ∈ sch(R) ∩ sch(S)} s(A) for the natural join.
V(A, Q) = min(V(A, R), V(A, S)) for A ∈ sch(R) ∩ sch(S), and V(A, Q) = V(A, X) for the remaining attributes A ∈ sch(X), X ∈ {R, S}. 262

263 Selectivity estimation for composite predicates For selections with composite predicates, we compute the selectivities of the individual parts of the condition and combine them appropriately. Combining estimates...? Here we need our second (unrealistic but simplifying) assumption: we assume independence of attribute value distributions, for under this assumption we can easily compute:
Conjunctive predicates, e.g., Q := σ_{A=c₁ ∧ B=c₂}(R): sel(A = c₁ ∧ B = c₂) = sel(A = c₁) · sel(B = c₂), which gives |Q| = |R| / (V(A, R) · V(B, R)).
Disjunctive predicates, e.g., Q := σ_{A=c₁ ∨ B=c₂}(R): sel(A = c₁ ∨ B = c₂) = sel(A = c₁) + sel(B = c₂) − sel(A = c₁) · sel(B = c₂), which gives |Q| = |R| · (1/V(A, R) + 1/V(B, R) − 1/(V(A, R) · V(B, R))). 263

264 8.2.2 Histograms Observation: in most cases, attribute values are not uniformly distributed across the domain of an attribute. to keep track of non-uniform value distribution of an attribute A, maintain a histogram to approximate the actual distribution: 1 Divide attribute domain into adjacent intervals by selecting boundaries b i dom(a). 2 Collect statistical parameters for each such interval, e.g., number of tuples with b i 1 < t(a) b i, number of distinct A-values in that interval. Two types of histograms can be used: equi-width histograms : intervals all have the same length, equi-depth histograms : intervals all contain the same number of tuples Histograms allow for more exact estimates of (equality and range) selections. I Example of a product using histograms The commercial DBMS Ingres used to maintain histograms. 264

265 Example of Approximations Using Histograms For skewed, i.e., non-uniform, data distributions, working without histograms gives bad estimates. For example, we would compute a selectivity sel(a > 13) = = 3, which is far from exact, since we see that, in fact, 9 tuples qualify Actual value distribution D Uniform distribution approximating D The error in estimation using the uniformity assumption is especially large for those values, which occur very often in the database. This is bad, since for these, we would actually need the best estimates! Histograms partition the range of (actual) values into smaller pieces ( buckets ) and keep the number of tuples and/or distinct values for each of those intervals. 265

266 Example of Approximations Using Histograms (cont'd) With the equi-width histogram given below, we would estimate 5 result tuples, since the selection covers a third of the last bucket of the histogram (assuming uniform distribution within each bucket of the histogram). Using the equi-depth histogram also given below, we would, in this case, estimate the exact result (9 tuples), since the corresponding bucket contains only a single attribute value. [Figure: two bar charts of the same value distribution, one approximated by an equi-width histogram, one by an equi-depth histogram.] Typically, equi-depth histograms provide better estimates than equi-width histograms. Compressed histograms keep separate counts for the most frequent values (say, 7 and 14 in our example) and maintain an equi-depth (or whatever) histogram of the remaining values. 266
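A minimal Python sketch of an equi-depth histogram and a range estimate over it (the toy data set is made up; it is not the distribution from the slides):

import math

def equidepth_histogram(values, n_buckets):
    # Bucket boundaries are chosen so that every bucket holds roughly the
    # same number of tuples.
    vals = sorted(values)
    per_bucket = math.ceil(len(vals) / n_buckets)
    return [(chunk[0], chunk[-1], len(chunk))
            for chunk in (vals[i:i + per_bucket]
                          for i in range(0, len(vals), per_bucket))]

def estimate_greater_than(hist, c):
    # Estimate |sigma_{A>c}| assuming a uniform distribution inside each bucket.
    est = 0.0
    for lo, hi, cnt in hist:
        if c < lo:
            est += cnt                    # bucket lies entirely above c
        elif c < hi:
            width = hi - lo
            est += cnt if width == 0 else cnt * (hi - c) / width
    return est

data = [7] * 10 + [14] * 9 + list(range(1, 21))        # skewed toy distribution
hist = equidepth_histogram(data, 4)
print(hist)
print(estimate_greater_than(hist, 13))                 # compare with sum(v > 13 for v in data)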

267 8.3 Sampling Idea: if maintaining a database profile is costly and error-prone (i.e., may provide far-off estimates), may be it is better to dispense with this idea at all. Rather, if a complex query is to be optimized, execute the query on a small sample to collect accurate statistics. Problem: How small shall/can the sample be? Small enough to be executed efficiently, but large enough to obtain useful characteristics. How precisely can we extrapolate? Good prediction requires large (enough) samples.... select a value for either one of those parameters and derive the other one from that. 267

268 Three approaches found in the literature 1 adaptive sampling : try to achieve a given precision with minimal sample size. 2 double (two-phase) sampling : first, obtain a coarse picture from a very small sample, just good enough to calculate necessary sample size for a useful precision in second step. 3 sequential sampling : sliding calculation of characteristics, stops when estimated precision is good enough. 268

269 8.4 Statistics maintained by commercial DBMS The research prototype System/R introduced the concept of a database profile recording simple statistical parameters about each stored relation, each defined index, each segment ( table space ). In addition to the parameters mentioned above (tuple and page cardinalities, number of distinct values), the system stores the current minimum and maximum attribute values (to keep track of the active domain and to provide input for the estimation of range queries). For each index (only B + tree-indexes are supported by System/R), store height of the index tree and numer of leaf nodes. Many commercial DBMSs have adopted the same or similar catalog information for their optimizers. 269

270 Parameters of a DB profile in the System/R catalogs
per segment: NP ... number of used pages
per relation: Card(R) = |R| ... number of tuples; PCard(R) = ‖R‖ ... number of blocks
per index: ICard(I) = V(A, R) ... number of distinct values of the indexed attribute; MinKey(I) ... minimum attribute value of the indexed attribute; MaxKey(I) ... maximum attribute value of the indexed attribute; NIndx(I) ... number of leaf nodes; NLevels(I) ... height of the index tree; Clustered? ... is this a clustered index (yes/no)
All values are approximations only! They are not maintained during database updates (to avoid "hot spots"); rather, they can be updated explicitly via SQL's update statistics command.
N.B.: the per-segment information is used for estimating the cost of a segment scan. System/R stores more than one relation in a segment and uses the TID addressing scheme. Therefore, there is no way of doing a relation scan. If no other plan is available, the system will have to scan all pages in a segment! 270

271 System/R estimation of selectivities Obviously, System/R can use ICard(I) values for estimating the selectivity sel(A = c) of simple attribute-equals-constant selections in those cases where an index on attribute A is available. But what can we do for the other cases? If there is no index on A, System/R arbitrarily assumes sel(A = c) = 1/10. For selection conditions A = B, the system uses sel(A = B) = 1 / max(ICard(I₁), ICard(I₂)) if indexes I₁ and I₂ are available for attributes A and B, respectively. This estimation assumes an inclusion dependency, i.e., each value from the smaller index, say I₁, has a matching value in the other index. Then, given a value a for A, assume that each of the ICard(I₂) values for B is equally likely. Hence, the fraction of tuples that have a given A-value a as their B-value is 1/ICard(I₂). If only one attribute has an index, assume selectivity sel(A = B) = 1/ICard(I); if neither attribute has an index, assume the ubiquitous 1/10. 271

272 These formulae are used whether or not A and B are from the same relation. Notice the correspondence with our estimation of join selectivity above! ... System/R estimation of selectivities (cont'd) For range selections A > c, exploit the MinKey and MaxKey parameters, if those are present (i.e., if an index is available): sel(A > c) = (MaxKey(I) − c) / (MaxKey(I) − MinKey(I)). If A is not an arithmetic type or there is no index, a fraction less than half is arbitrarily chosen. Similar estimates can be derived for other forms of range selections. For selections of the form A IN (list of values), compute the selectivity of A = c and multiply by the number of values in the list. Note that this number can be the result of a complex selectivity estimation in the case of SQL's nested subqueries. However, never use a resulting value greater than 1/2, since we believe that each selection eliminates at least half of the input tuples. 272

273 Other systems I Estimating query characteristics in commercial DBMSs DB2, Informix, and Oracle use one-dimensional equal height histograms. Oracle switches to duplicate counts for each value, whenever there are only few distinct values. MS SQL Server uses one-dimensional equal area histograms with some optimization (compression of adjacent ranges with similar distributions). SQL Server creates and maintains histograms automatically, without user interaction. Sampling is typically not used directly in commercial systems. Sometimes, utilities use sampling for estimating statistics or for building histograms. Sometimes sampling is used for load balancing during parallelization. 273

274 Bibliography Ceri, S. and Pelagatti, G. (1984). Distributed Databases, Principles and Systems. McGraw- Hill. Ling, Y. and Sun, W. (1992). A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Record, 21(4): Ling, Y. and Sun, W. (1995). An evaluation of sampling-based size estimation methods for selections in database systems. In Yu, P. S. and Chen, A. L. P., editors, Proc. 11th IEEE Int l Conf. on Data Engineering, pages , Taipei, Taiwan. Mannino, M. V., Chu, P., and Sager, T. (1988). Statistical profile estimation in database systems. ACM Computing Surveys, 20(3): Mullin, J. (1993). Estimating the size of a relational join. Information Systems, 18(3): Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. 274

275 Module 9: Query Optimization Module Outline Web Forms Applications SQL Interface 9.1 Outline of Query Optimization 9.2 Motivating Example 9.3 Equivalences in the relational algebra 9.4 Heuristic optimization 9.5 Explosion of search space 9.6 Dynamic programming strategy (System R) Transaction Manager Lock Manager Plan Executor Operator Evaluator Concurrency Control SQL Commands Files and Index Structures Buffer Manager Disk Space Manager Parser Optimizer You are here! Query Processor Recovery Manager DBMS Index Files Data Files System Catalog Database 275

276 9.1 Outline of Query Optimization The success of relational database technology is largely due to the systems ability to automatically find evaluation plans for declaratively specified queries. Given some (SQL) query Q, the system 1 parses and analyzes Q, 2 derives a relational algebra expression E that computes Q, 3 transforms and simplifies E, and 4 annotates the operators in E with access methods and operator algorithms to obtain an evaluation plan P. Discussed here: Task 3 is often called algebraic (or re-write) query optimization, while task 4 is also called non-algebraic (or cost-based) query optimization. 276

277 9.2 Motivating Example From query to plan Example: List the airports from which flights operated by Swiss (airline code LX) fly to any German (DE) airport.
Airport (code, country, name): (FRA, DE, Frankfurt), (ZRH, CH, Zurich), (MUC, DE, Munich), ...
Flight (from, to, airline): (FRA, ZRH, LX), (ZRH, MUC, LX), (FRA, MUC, US), ...
SQL query Q:
SELECT f.from
FROM   Flight f, Airport a
WHERE  f.to = a.code AND f.airline = 'LX' AND a.country = 'DE'
277

278 From query to plan... SQL query Q: SELECT FROM WHERE f.from Flight f, Airport a f.to = a.code AND f.airline = LX AND a.country = DE Relational algebra expression E that computes Q: from airline= LX country= DE Airport to=code Flight 278

279 From query to plan... Relational algebra expression E that computes Q: from airline= LX country= DE Airport to=code Flight One (of many) plan(s) P to evaluate Q: from scan airline= LX country= DE scan Airport heap scan to=code NL- Flight index scan on to 279

280 9.3 Equivalences in the relational algebra Two relational algebra expressions E 1, E 2 are equivalent if on every legal database instance the two expressions generate the same set of tuples. Note: the order of tuples is irrelevant Such equivalences are denoted by equivalence rules of the form E 1 E 2 (such a rule may be applied by the system in both directions, ). We know those equivalence rules from the course Information Systems. 280

281 Some equivalence rules 1 Conjunctive selections can be deconstructed into a sequence of individual selections: σ_{p₁ ∧ p₂}(E) ≡ σ_{p₁}(σ_{p₂}(E)). 2 Selection operations are commutative: σ_{p₁}(σ_{p₂}(E)) ≡ σ_{p₂}(σ_{p₁}(E)). 3 Only the last projection in a sequence of projections is needed, the others can be omitted: π_{L₁}(π_{L₂}(... π_{L_n}(E) ...)) ≡ π_{L₁}(E). 4 Selections can be combined with Cartesian products and joins: i) σ_p(E₁ × E₂) ≡ E₁ ⋈_p E₂, ii) σ_p(E₁ ⋈_q E₂) ≡ E₁ ⋈_{p∧q} E₂. 281

282 Pictorial description of 4 i): p E 1 E 2 p E 1 E 2 282

283 5 Join operations are commutative: E₁ ⋈_p E₂ ≡ E₂ ⋈_p E₁. 6 i) Natural joins (equality of common attributes) are associative: (E₁ ⋈ E₂) ⋈ E₃ ≡ E₁ ⋈ (E₂ ⋈ E₃). ii) General joins are associative in the following sense: (E₁ ⋈_p E₂) ⋈_{q∧r} E₃ ≡ E₁ ⋈_{p∧q} (E₂ ⋈_r E₃), where predicate r involves attributes of E₂, E₃ only. 7 Selection distributes over joins in the following ways: i) If predicate p involves attributes of E₁ only: σ_p(E₁ ⋈_q E₂) ≡ σ_p(E₁) ⋈_q E₂. ii) If predicate p involves only attributes of E₁ and q involves only attributes of E₂: σ_{p∧q}(E₁ ⋈_r E₂) ≡ σ_p(E₁) ⋈_r σ_q(E₂) (this is a consequence of rules 7 i) and 1). 283

284
8. Projection distributes over join as follows:
   π_{L1 ∪ L2}(E1 ⋈_p E2) ≡ π_{L1}(E1) ⋈_p π_{L2}(E2)
   if p involves attributes in L1 ∪ L2 only and Li contains attributes of Ei only.
9. The set operations union and intersection are commutative:
   E1 ∪ E2 ≡ E2 ∪ E1        E1 ∩ E2 ≡ E2 ∩ E1
10. The set operations union and intersection are associative:
   (E1 ∪ E2) ∪ E3 ≡ E1 ∪ (E2 ∪ E3)        (E1 ∩ E2) ∩ E3 ≡ E1 ∩ (E2 ∩ E3)

285
11. The selection operation distributes over ∪, ∩, and \:
   σ_p(E1 ∪ E2) ≡ σ_p(E1) ∪ σ_p(E2)
   σ_p(E1 ∩ E2) ≡ σ_p(E1) ∩ σ_p(E2)
   σ_p(E1 \ E2) ≡ σ_p(E1) \ σ_p(E2)
   Also:
   σ_p(E1 ∩ E2) ≡ σ_p(E1) ∩ E2
   σ_p(E1 \ E2) ≡ σ_p(E1) \ E2
   (this does not apply for ∪)
12. The projection operation distributes over ∪:
   π_L(E1 ∪ E2) ≡ π_L(E1) ∪ π_L(E2)

286 9.4 Heuristic optimization
Query optimizers use the equivalence rules of relational algebra to improve the expected performance of a given query in most cases. The optimization is guided by the following heuristics (sketched in code after the list):
(a) Break apart conjunctive selections into a sequence of simpler selections (rule 1; preparatory step for (b)).
(b) Move selections down the query tree for the earliest possible execution (rules 2, 7, 11; reduces the number of tuples processed).
(c) Replace pairs of σ and × by ⋈ (rule 4 i); avoids large intermediate results).
(d) Break apart lists of projection attributes and move them as far down the tree as possible, creating new projections where needed (rules 3, 8, 12; reduces tuple widths early).
(e) Perform the joins with the smallest expected result first.
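Below is a minimal, hypothetical sketch of heuristic (b): algebra expressions are nested tuples, predicates are represented only by the set of attributes they reference, selections are assumed to be already broken apart (heuristic (a)), and a selection is pushed below a Cartesian product when its predicate touches only one input (rule 7 i)). All constructors and the attrs helper are illustrative assumptions, not code from any real optimizer.

    # Push selections below Cartesian products where legal (heuristic (b)).
    # Expressions: ("rel", name, {attrs}), ("select", pred_attrs, child), ("product", left, right)

    def attrs(e):
        """Set of attributes produced by expression e."""
        kind = e[0]
        if kind == "rel":
            return e[2]
        if kind == "select":
            return attrs(e[2])
        if kind == "product":
            return attrs(e[1]) | attrs(e[2])
        raise ValueError(kind)

    def push_selections(e):
        """Recursively move a selection below a product if its predicate fits one side."""
        kind = e[0]
        if kind == "select":
            pred, child = e[1], push_selections(e[2])
            if child[0] == "product":
                left, right = child[1], child[2]
                if pred <= attrs(left):                    # rule 7 i)
                    return ("product", push_selections(("select", pred, left)), right)
                if pred <= attrs(right):
                    return ("product", left, push_selections(("select", pred, right)))
            return ("select", pred, child)
        if kind == "product":
            return ("product", push_selections(e[1]), push_selections(e[2]))
        return e

    flight  = ("rel", "Flight",  {"from", "to", "airline"})
    airport = ("rel", "Airport", {"code", "country", "name"})
    q = ("select", {"airline"}, ("select", {"country"}, ("product", flight, airport)))
    print(push_selections(q))
    # -> ("product", ("select", {"airline"}, Flight), ("select", {"country"}, Airport))

Representing a predicate by its attribute set is enough to decide where it may move; a real optimizer would of course carry the full predicate along.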

287 Heuristic optimization: example
SQL query Q:
SELECT p.ticketno
FROM   Flight f, Passenger p, Crew c
WHERE  f.flightno = p.flightno AND f.flightno = c.flightno
AND    f.date = … AND f.to = 'FRA'
AND    p.name = c.name AND c.job = 'Pilot'
(What would be a natural language formulation of Q?)

288
SELECT p.ticketno
FROM   Flight f, Passenger p, Crew c
WHERE  f.flightno = p.flightno AND f.flightno = c.flightno
AND    f.date = … AND f.to = 'FRA'
AND    p.name = c.name AND c.job = 'Pilot'

Canonical relational algebra expression (reflects the semantics of the SQL SELECT-FROM-WHERE block directly):
π_{p.ticketno}( σ_{f.flightno=p.flightno ∧ f.flightno=c.flightno ∧ f.date=… ∧ f.to='FRA' ∧ p.name=c.name ∧ c.job='Pilot'}( (Flight f × Crew c) × Passenger p ) )

289 Heuristic optimization: example, step 1
Break apart the conjunctive selection to prepare push-down of selections: the single selection becomes a cascade of selections σ_{f.flightno=p.flightno}, σ_{f.flightno=c.flightno}, σ_{f.date=…}, σ_{f.to='FRA'}, σ_{p.name=c.name}, σ_{c.job='Pilot'} on top of ((Flight f × Crew c) × Passenger p), below the projection π_{p.ticketno}.

290 Heuristic optimization: example, step 2
Push down selections as far as possible (but no further!): the single-relation selections move onto their base tables (σ_{c.job='Pilot'} onto Crew c, σ_{f.to='FRA'} and σ_{f.date=…} onto Flight f), while the join predicates f.flightno=p.flightno, f.flightno=c.flightno and p.name=c.name remain as selections above the Cartesian products, below π_{p.ticketno}.

291 Heuristic optimization: example, step 3
Re-unite sequences of selections into single conjunctive selections: adjacent selections over the same input are merged back into one conjunctive selection (e.g. σ_{f.to='FRA' ∧ f.date=…} directly above Flight f).

292 Heuristic optimization: example, step 4
Introduce projections to reduce tuple widths: π_{c.flightno, c.name} above σ_{c.job='Pilot'}(Crew c), π_{f.flightno} above the selected Flight f, and π_{p.ticketno, p.flightno, p.name} above Passenger p.

293 Heuristic optimization: example, step 5
Combine Cartesian products and selections into joins: the products and the join predicates f.flightno=c.flightno, f.flightno=p.flightno and p.name=c.name become explicit joins (rule 4 i)).

294 Heuristic optimization: example, step 6
Relation Passenger presumably is the largest relation, so the joins are re-ordered (associativity of general joins, rule 6 ii)) such that Passenger is joined last:
π_{p.ticketno}(
  ( π_{f.flightno}(σ_{f.to='FRA' ∧ f.date=…}(Flight f))
    ⋈_{f.flightno=c.flightno}
    π_{c.flightno, c.name}(σ_{c.job='Pilot'}(Crew c)) )
  ⋈_{f.flightno=p.flightno ∧ p.name=c.name}
  π_{p.ticketno, p.flightno, p.name}(Passenger p) )

295 Choosing an evaluation plan
When the optimizer annotates the resulting algebra expression E, it needs to consider the interaction of the chosen operator algorithms/access methods. Choosing the cheapest (in terms of I/O) algorithm for each operation independently may not yield the overall cheapest plan P.
Example: a merge join may be costlier than a nested loops join (the operands need to be sorted first), but it yields its output in sorted order (good for subsequent duplicate elimination, selection, grouping, ...).
We need to consider all possible plans and then choose the best one in a cost-based fashion.

296 9.5 Explosion of search space
Consider finding the best join order for the query R1 ⋈ R2 ⋈ R3 ⋈ R4.
Several join tree shapes are possible (due to associativity and commutativity of ⋈): bushy trees, left-deep trees, right-deep trees, ...
Number of different join orders for an n-way join:
(2n - 2)! / (n - 1)!
(n = 7: 665 280, n = 10: 17 643 225 600)
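The quoted counts follow directly from the formula; a few lines of Python reproduce them (the last column anticipates the left-deep restriction discussed on the next slide):

    from math import factorial

    def join_orders(n):
        """(2n-2)! / (n-1)!  --  number of join orders over all tree shapes."""
        return factorial(2 * n - 2) // factorial(n - 1)

    for n in (4, 7, 10):
        # second value: all tree shapes; third value: left-deep orders only (n!)
        print(n, join_orders(n), factorial(n))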

297 Restricting the search space
Fact: query optimization will not be able to find the overall best plan. Instead, the optimizer tries to avoid the really bad plans (the I/O cost of different plans may differ substantially!).
Restrict the search space: consider left-deep join orders only (the left input is the outer relation, the right input is the inner).
Left-deep trees may be evaluated in a fully pipelined fashion (the inner input is a base relation): intermediate results need not be written to temporary files, and the (block) NL-⋈ may profit from available indexes on the inner relation.
The number of possible left-deep join orders for an n-way join is n!.

298 Single relation plans
The optimizer enumerates (generates) all possible plans to assess their cost. If the query involves a single relation R only:
Single-relation plans: Consider each available method (e.g., heap scan, (un)clustered index scan) to access the tuples of R. Keep the access method involving the least estimated cost.

299 Cost estimates for single relation plans (System R style)
IBM System R (1970s): the first successful relational database system; it introduced most of the query optimization techniques still in use today.
Pragmatic yet successful cost model for access methods on relation R (|R| = number of pages of R, ||R|| = number of tuples of R, |I| = number of index pages):

Access method                                    Cost
primary key index I                              Height(I) + 1   if I is a B+ tree
                                                 2.2             if I is a hash index
clustered index I matching predicate p           (|I| + |R|) · sel(p) ¹
unclustered index I matching predicate p         (|I| + ||R||) · sel(p) ¹
sequential scan                                  |R|

¹ If sel(p) is unknown, assume 1/10.

300 Cost estimates for a single relation plan
Query Q:
SELECT A
FROM   R
WHERE  B = c

Database profile: |R| = 500, ||R|| = …, V(B, R) = 10
Q retrieves about 1/V(B, R) · ||R|| = … tuples  (sel(B = c) ≈ 1/V(B, R))

1. The database maintains a clustered index I_B (|I_B| = 50) on attribute B:
   cost = (|I_B| + |R|) · 1/V(B, R) = (50 + 500) · 1/10 = 55 pages

301 Cost estimates for a single relation plan
2. The database maintains an unclustered index I_B (|I_B| = 50) on attribute B:
   cost = (|I_B| + ||R||) · 1/V(B, R) = (50 + ||R||) · 1/10 = … pages
3. No index support, use a sequential file scan to access R:
   cost = |R| = 500 pages
To evaluate query Q, use the clustered index I_B.
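The estimates above can be replayed with a few lines of Python. The tuple count ||R|| is not legible in the transcription, so the value used below (40 000) is purely an assumed placeholder; the clustered-index and sequential-scan numbers do not depend on it.

    # Sketch of the System R style access-method cost estimates used above.
    # TUPLES_R (||R||) is an assumed placeholder value, not taken from the slides.

    PAGES_R  = 500       # |R|
    TUPLES_R = 40_000    # ||R||  (assumption)
    PAGES_IB = 50        # |I_B|
    V_B      = 10        # V(B, R): number of distinct B values
    sel      = 1 / V_B   # selectivity of the predicate B = c

    cost_clustered   = (PAGES_IB + PAGES_R) * sel    # clustered index I_B
    cost_unclustered = (PAGES_IB + TUPLES_R) * sel   # unclustered index I_B
    cost_scan        = PAGES_R                       # sequential scan

    print(cost_clustered)     # 55.0 pages
    print(cost_unclustered)   # 4005.0 pages (depends on the assumed ||R||)
    print(cost_scan)          # 500 pages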

302 Plans for multiple relation (join) queries
We need to make sure not to miss the best left-deep join plan. Degrees of freedom left:
1. For each base relation in the query, consider all access methods.
2. For each join operation, select a join algorithm.
How many possible query plans are left now?
Back-of-envelope calculation (query with n relations): assume j join algorithms are available and i indexes per relation:
#plans ≈ n! · j^(n-1) · (i+1)^n
Example: with n = 3 relations, j = 3, and i = 2:
#plans ≈ 3! · 3² · 3³ = 1 458
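The back-of-envelope bound is easy to evaluate for other query sizes as well; the short sketch below reproduces the 1 458 plans of the example (the formula is the one from the slide, everything else is illustrative):

    from math import factorial

    def plan_count(n, j, i):
        """n! join orders x j^(n-1) join-algorithm choices x (i+1)^n access methods."""
        return factorial(n) * j ** (n - 1) * (i + 1) ** n

    print(plan_count(3, 3, 2))   # 1458
    print(plan_count(4, 3, 2))   # a 4-relation query already yields 52488 plans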

303 Plan enumeration 1: example setup
Example query (n = 3):
SELECT a.name, f.airline, c.name
FROM   Airport a, Flight f, Crew c
WHERE  f.to = a.code AND f.flightno = c.flightno
(Airport = A, Flight = F, Crew = C)

Assumptions:
Available join algorithms: hash join, block NL-⋈, block INL-⋈
Available indexes: clustered B+ tree index I on attribute Flight.to, |I| = 50
|A| = 500, 80 tuples/page; |F| = 1000, 100 tuples/page; |C| = 10; 100 F ⋈ A tuples fit on a page

304 Plan enumeration 2: candidate plans
Enumerate the n! left-deep join trees (3! = 6):
(C ⋈ A) ⋈ F,  (F ⋈ A) ⋈ C,  (C ⋈ F) ⋈ A,  (A ⋈ C) ⋈ F,  (F ⋈ C) ⋈ A,  (A ⋈ F) ⋈ C
Prune plans containing a Cartesian product × (note: there is no join predicate between A and C) immediately!
4 candidate plans remain.
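A tiny enumeration sketch (the relation names and the joins set mirror the example; everything else is an illustrative assumption) reproduces the count of four surviving candidates:

    from itertools import permutations

    # Join predicates of the example query: F joins with A (to = code) and with C (flightno).
    joins = {frozenset({"F", "A"}), frozenset({"F", "C"})}

    def has_cross_product(order):
        """A left-deep plan needs a join predicate linking each newly added
        relation to at least one relation already joined."""
        joined = {order[0]}
        for rel in order[1:]:
            if not any(frozenset({rel, r}) in joins for r in joined):
                return True
            joined.add(rel)
        return False

    orders = list(permutations(["A", "F", "C"]))
    candidates = [o for o in orders if not has_cross_product(o)]
    print(len(orders), len(candidates))   # 6 4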

305 Plan enumeration 3: join algorithm choices
Candidate plan: (A ⋈ F) ⋈ C
Possible join algorithm choices for the two joins (lower join A ⋈ F, upper join with C):
NL-⋈ / NL-⋈,  NL-⋈ / H-⋈,  H-⋈ / NL-⋈,  H-⋈ / H-⋈
Repeat for the remaining 3 candidate plans.

306 Plan enumeration 4: access method choices
Candidate plan: NL-⋈( NL-⋈(A, F), C )
Possible access method choices for the base relations: heap scan for each of A, F, C, or an index scan on F.to for F; using the index turns the lower join into an INL-⋈ (index nested loops).
Repeat for the remaining candidate plans.

307 Plan enumeration 5: cost estimation
Estimate the cost for the candidate plan NL-⋈( INL-⋈(A heap scan, F index scan on F.to), C heap scan ):
Cost of the heap scan of A: 500 pages.
Cost of A ⋈ F (INL-⋈): ||A|| · sel(a.code = F.to) · (|F| + |I|) = …   (F.to references the key a.code)
Size of the join result: ||A ⋈ F|| = ||F||, hence |A ⋈ F| = ||F|| / 100 = 100 000 / 100 = 1 000 pages.
Cost of (A ⋈ F) ⋈ C (NL-⋈): |A ⋈ F| · |C| = …
Total estimated cost: …

308 Plan enumeration 5: cost estimation
Current candidate plan: NL-⋈( NL-⋈(A heap scan, F heap scan), C heap scan )
Remember: |A| = 500, |F| = 1 000, |C| = 10, |A ⋈ F| = 1 000
NL-⋈: scan the left input + scan the right input once for each page in the left input.
H-⋈ (assume 2 passes): 2 · (scan both inputs + hash both inputs into buckets) + read the hash buckets with join partners.
Total estimated cost: |A| + |A| · |F| + |A ⋈ F| · |C| = …

309 Plan enumeration 5: cost estimation
Current candidate plan: H-⋈( NL-⋈(A heap scan, F heap scan), C heap scan )
Remember: |A| = 500, |F| = 1 000, |C| = 10, |A ⋈ F| = 1 000 (NL-⋈ and H-⋈ cost rules as above)
Total estimated cost: |A| + |A| · |F| + 2 · |A ⋈ F| + 2 · |C| + (|A ⋈ F| + |C|) = …

310 Plan enumeration 5: cost estimation
Current candidate plan: NL-⋈( H-⋈(A heap scan, F heap scan), C heap scan )
Remember: |A| = 500, |F| = 1 000, |C| = 10, |A ⋈ F| = 1 000 (NL-⋈ and H-⋈ cost rules as above)
Total estimated cost: 2 · (|A| + |F|) + |A ⋈ F| + |A ⋈ F| · |C| = …

311 Plan enumeration 5: cost estimation
Current candidate plan: H-⋈( H-⋈(A heap scan, F heap scan), C heap scan )
Remember: |A| = 500, |F| = 1 000, |C| = 10, |A ⋈ F| = 1 000 (NL-⋈ and H-⋈ cost rules as above)
Total estimated cost: 2 · (|A| + |F|) + |A ⋈ F| + 2 · (|A ⋈ F| + |C|) + |C| = …

312 Repeated enumeration of identical sub-plans
The plan enumeration reconsiders the same sub-plans over and over again, although the cost and result size of a sub-plan are independent of the larger embedding plan: the same A ⋈ F sub-plan, for example, shows up inside many different complete plans (with NL-⋈, H-⋈, or INL-⋈ on top and various access methods).
Idea: Remember already considered sub-plans in a memoization data structure. The resulting approach is known as dynamic programming.

313 9.6 Dynamic programming strategy (System R)
Divide plan enumeration into n passes (for a query with n joined relations):
1. Pass 1 (all 1-relation plans): Find the best 1-relation plan for each relation (i.e., select an access method).
2. Pass 2 (all 2-relation plans): Find the best way to join the plans of Pass 1 to another relation (generate left-deep trees: sub-plans of Pass 1 appear as the outer in the join).
...
n. Pass n (all n-relation plans): Find the best way to join the plans of Pass n-1 to the nth relation (sub-plans of Pass n-1 appear as the outer in the join).
A (k-1)-relation sub-plan P is not combined with a kth relation R unless there is a join condition between the relations in P and R, or all join conditions are already present in P (avoid Cartesian products × if possible).

314 Plan enumeration: pruning, interesting orders
For each sub-plan obtained this way, remember its cost and result size estimates!
Pruning: For each subset of relations joined, keep only the cheapest sub-plan overall plus the cheapest sub-plans that generate an intermediate result with an interesting order of tuples.
An interesting order is determined by the
presence of an SQL ORDER BY clause in the query,
presence of an SQL GROUP BY clause in the query,
join attributes of subsequent equi-joins (to prepare for a merge-⋈).
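A compact sketch of the dynamic-programming enumeration follows. It keeps only the cheapest plan per relation set (interesting orders are ignored here) and uses a deliberately crude cost/size model, so the statistics and the join_cost function are illustrative assumptions rather than the slides' cost formulas.

    from itertools import combinations

    # Toy statistics and join graph; the numbers are made up for illustration.
    pages = {"A": 500, "F": 1000, "C": 10}
    joins = {frozenset({"A", "F"}), frozenset({"F", "C"})}

    def join_cost(left_cost, left_size, right):
        """Crude cost model: previous cost + a nested-loops style term.
        The result size is (unrealistically) kept constant to keep the sketch short."""
        return left_cost + left_size * pages[right], left_size

    # Pass 1: best 1-relation plans (here: a single scan per relation).
    best = {frozenset({r}): (pages[r], pages[r], r) for r in pages}

    n = len(pages)
    for k in range(2, n + 1):                  # pass k: all k-relation plans
        for subset in combinations(pages, k):
            s = frozenset(subset)
            for right in s:                    # `right` is the newly added (inner) relation
                rest = s - {right}
                if rest not in best:
                    continue                   # sub-plan was pruned (cross product only)
                cost, size, plan = best[rest]
                # avoid cross products: `right` must join with some relation in `rest`
                if not any(frozenset({right, r}) in joins for r in rest):
                    continue
                c, sz = join_cost(cost, size, right)
                if s not in best or c < best[s][0]:
                    best[s] = (c, sz, (plan, right))

    print(best[frozenset(pages)])   # cheapest memoized plan for {A, F, C}

Keeping one entry per relation subset is exactly the memoization step motivated on the previous slide; a fuller version would keep additional entries per interesting order.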

315 System R style plan enumeration
Example query:
SELECT a.name, f.airline, c.name
FROM   Airport a, Flight f, Crew c
WHERE  f.to = a.code AND f.flightno = c.flightno

Now assume:
Available join algorithms: merge-⋈, block NL-⋈, block INL-⋈
Available indexes: clustered B+ tree index I on A.code, height(I) = 3, |I_leaf| = 500
|A| = …, 5 tuples/page; |F| = 10, 10 tuples/page; |C| = 10, 20 tuples/page
10 F ⋈ A tuples fit on a page, 10 F ⋈ C tuples fit on a page

316 System R: Pass 1 (1-relation plans)
Access methods for A:
1. heap scan: cost = |A| = …
2. index scan on A.code (index I): cost = |I| + |A| = …
Keep 1 and 2, since 2 produces an interesting order on A.code (the join attribute f.to = a.code).
Access method for F:
1. heap scan: cost = |F| = 10
Access method for C:
1. heap scan: cost = |C| = 10

317 System R: Pass 2 (2-relation plans)
Start with a 1-relation plan to access A as the outer (A ⋈ F):
Heap scan of A as outer:
1. NL-⋈: cost = …
2. M-⋈ (assume 2-way sort/merge): cost = …
Index scan of A as outer:
3. NL-⋈: cost = …
4. M-⋈ (assume 2-way sort/merge): cost = …
Keep 4 only (N.B.: it uses an interesting order in a non-optimal sub-plan!)

318 System R: Pass 2 (cont'd)
Start with F as the outer:
A as inner (F ⋈ A):
1. NL-⋈, heap scan of A: cost = |F| + |F| · |A| = …
2. INL-⋈, index scan of A: cost = |F| + ||F|| · (height(I) + 1) = 10 + 100 · (3 + 1) = 410
3. M-⋈, heap scan of A: cost = |F| + |A| + 2 · (|F| + |A|) = …
4. M-⋈, index scan of A: cost = …   Keep!
C as inner (F ⋈ C):
5. NL-⋈: cost = |F| + |F| · |C| = 10 + 10 · 10 = 110
6. M-⋈: cost = |F| + |C| + 2 · (|F| + |C|) = 10 + 10 + 2 · (10 + 10) = 60   Keep!

319 System R: Pass 2 (cont'd)
Start with C as the outer (C ⋈ F):
1. NL-⋈: cost = |C| + |C| · |F| = 10 + 10 · 10 = 110
2. M-⋈: cost = |C| + |F| + 2 · (|C| + |F|) = 10 + 10 + 2 · (10 + 10) = 60   Keep!
N.B.: C ⋈ A is not enumerated because of cross product (×) avoidance.

320 System R: further pruning of 2-relation plans
A ⋈ F:
1. M-⋈ (A via index scan, F via heap scan): cost = …, order on to
2. INL-⋈ (F heap scan as outer, A via index as inner): cost = 410, no order
C ⋈ F:
3. M-⋈ (C scan, F scan): cost = 60, order on flightno
4. M-⋈ (F scan, C scan): cost = 60, order on flightno
Keep 2 and 3 or 4 (the order in 1 is not interesting for the subsequent join(s)).

321 System R: Pass 3 (3-relation plans)
Best (A ⋈ F) sub-plan: cost = 410, no order, |A ⋈ F| = 10
1. NL-⋈ with C (heap scan) on top of the INL-⋈ sub-plan: cost = |A ⋈ F| · |C| = …
2. M-⋈ with C (heap scan) on top of the INL-⋈ sub-plan: cost = |C| + 2 · (|A ⋈ F| + |C|) = …

322 System R: Pass 3 (cont'd)
Best (C ⋈ F) sub-plan: cost = 60, order on flightno, |C ⋈ F| = 10, ||C ⋈ F|| = 100
1. NL-⋈ with A (heap scan) on top of the M-⋈ sub-plan: cost = …
2. M-⋈ with A (heap scan) on top of the M-⋈ sub-plan: cost = …
3. INL-⋈ with A (index scan on A.code) on top of the M-⋈ sub-plan: cost = 60 + 100 · (3 + 1) = 460
4. M-⋈ with A (index scan) on top of the M-⋈ sub-plan: cost = …

323 System R: And the winner is...
INL-⋈ (A via index) on top of M-⋈ (F scan, C scan), total cost = 460.
Observations:
The best plan mixes join algorithms and exploits indexes.
The worst plan had cost > … (its exact cost is unknown due to pruning).
Optimization yielded a 1000-fold improvement over the worst plan!

324 Bibliography
Astrahan, M. M., Schkolnick, M., and Kim, W. (1980). Performance of the System R access path selection mechanism. In IFIP Congress.
Chamberlin, D., Astrahan, M., Blasgen, M., Gray, J., King, W., Lindsay, B., Lorie, R., Mehl, J., Price, T., Putzolu, F., Selinger, P., Schkolnick, M., Slutz, D., Traiger, I., Wade, B., and Yost, R. (1981). A history and evaluation of System R. Communications of the ACM, 24(10).
Jarke, M. and Koch, J. (1984). Query optimization in database systems. ACM Computing Surveys, 16(2).
Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.
Kim, W., Reiner, D. S., and Batory, D. S., editors (1985). Query Processing in Database Systems. Springer-Verlag.

325 Module 10: Parallel Query Processing. Module Outline ("We are here" in the DBMS architecture overview):
10.1 Objectives in Parallelizing DBMSs
10.2 Speed-Up and Scale-Up
10.3 Opportunities for Parallelization in RDBMSs
10.4 Examples for parallel query execution plans

326 10.1 Objectives in Parallelizing DBMSs
Thus far, we have (implicitly or explicitly) been considering a DBMS in a client-server architecture: one DBMS server operates on the data stored on its local disks, on behalf of any of the numerous clients issuing requests to the server over a (local or global) network. All the data-intensive work, as well as, e.g., transaction management, is done on the (single) server.
The following considerations may lead us to parallel or distributed architectures:
High performance. A single server, implemented on a sequential single-processor machine, may not be able to provide the necessary performance (response time, throughput).
High availability. A single server represents a single point of failure. Hardware or software problems as well as network disconnection bring all database operations down.
Extensibility. Accommodating increasing demands in terms of database size and/or performance will definitely hit hard limits in a single-server architecture.

327 Architectures for Parallel Databases
Typical parallel architectures include:
Shared Memory: all processors access one global shared memory (and the disks).
Shared Disk: each processor has its own local memory, but all processors share the disks.
Shared Nothing: each node has its own processor, local memory, and local disks; nodes communicate over a network.
Each of these has its own advantages and potential problems. For example, shared memory is easiest to program, while shared nothing scales best.

328 10.2 Speed-Up and Scale-Up
Speed-Up. Given a constant problem size (e.g., database size and transaction load), how does the performance (e.g., response time) increase with increased hardware resources (e.g., number of processors and/or disks)?
Scale-Up. When the problem size increases, can we achieve the same performance with hardware resources increased correspondingly?
Overview of metrics for parallel processing:

                      Problem/Database Size
Resources             constant       variable
constant              Utilization    Size-Up
variable              Speed-Up       Scale-Up

329 Problems with speed-up
Considering the response-time performance indicator, speed-up (w.r.t. an increased number of processors used) is defined as
rt-speed-up(n) = (response time with 1 processor) / (response time with n processors)
Similarly, we can use the throughput (number of transactions per second) indicator and define
tput-speed-up(n) = (throughput with n processors) / (throughput with 1 processor)
Problem: we cannot achieve linear speed-up in practice; the real curve stays below the ideal/optimal one.
Amdahl's Law:
Speed-Up ≤ 1 / (seq. part + par. part / n)
(causes: e.g., start-up and synchronization overhead, sub-optimal load balancing, ...)
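For a feel of how quickly the sequential part dominates, here is a tiny illustration (the 5% sequential fraction is an arbitrary assumed value):

    def speed_up(n, seq_fraction):
        """Amdahl-style bound: 1 / (seq + (1 - seq) / n)."""
        return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n)

    for n in (1, 2, 4, 8, 16, 1000):
        print(n, round(speed_up(n, 0.05), 2))   # a 5% sequential part caps speed-up near 20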

330 10.3 Opportunities for Parallelization in RDBMSs Relational DBMSs offer a large potential to exploit parallelism: Data parallelism. Queries operate on large data sets. Data sets can be partitioned and each partition can be handled by a separate, parallel thread. Challenge: avoid skew in partitioning the data. Pipelined parallelism. Queries consist of (pipelined) sequences of operators. Each operator can be executed by a separate, pipelined thread. See our earlier discussion on pipelining. Operator parallelism. For many operators, their internal execution algorithms can be parallelized into several threads. For example, parallel join-algorithms. 330

331 Different kinds of parallelism
Depending on what is performed in parallel, the following systematics have been developed:
Inter-transaction parallelism: several transactions are run in parallel. (This is the standard in all DBMSs.)
Intra-transaction parallelism:
  Inter-query parallelism: several queries within a transaction are run in parallel (needs an asynchronous SQL interface).
  Intra-query parallelism: within one SQL call, multiple tasks are run in parallel.
    Inter-operator parallelism: operators constituting a query are run in parallel.
    Intra-operator parallelism: a single operator is implemented via a parallel algorithm.
[Figure "Forms of parallelism: what is executed in parallel?", showing transactions (BOT ... Select ... Insert ... EOT) and illustrating inter-query vs. intra-query/intra-operator parallelism]

332 10.4 Examples for parallel query execution plans
Parallel join algorithms
There are a number of parallel join algorithms. The simplest one is the parallel nested loops join (a.k.a. broadcast join):
1. Partitioning phase: broadcast the records of the outer relation to the nodes holding the inner.
2. Join phase: locally compute the (partial) joins on the nodes holding the inner.
[figure: scan nodes for R broadcast their tuples to the nodes storing partitions of S; each node computes its partial R ⋈ S locally]
This algorithm can be used for non-equi joins, too.

333 Parallel associative join
If the inner relation (S) is stored in partitions (according to the join attributes) and the join is an equi-join, then we can
1. distribute the outer tuples to the matching partition of the inner,
2. compute the (partial) joins locally on the nodes storing the inner partitions.
[figure: R tuples are routed to the matching S partition; each node computes its partial R ⋈ S locally]

334 Parallel (simple) hash join
1. Partition the outer (R) using some hash function h; send each record to the join node indicated by its hash value.
2. Partition the inner (S) using the same hash function h; send each record to the join node indicated by its hash value.
3. Locally compute the (partial) joins on all join nodes.
[figure: scan nodes for R and S ship tuples, hashed on the join attribute, to the join nodes]
N.B.: a node can be a scan node and a join node at the same time.
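The partitioning idea can be simulated in a few lines of single-process Python; the node count, the hash function, and the toy data are all assumptions made for the example:

    # Single-process simulation of the parallel (simple) hash join: both inputs are
    # partitioned with the same hash function h, and each "join node" then joins
    # only its own partitions. Node count and data are made up.

    NODES = 3
    h = lambda key: hash(key) % NODES

    R = [("FRA", "ZRH"), ("ZRH", "MUC"), ("FRA", "MUC")]   # Flight(from, to)
    S = [("ZRH", "CH"), ("MUC", "DE"), ("FRA", "DE")]      # Airport(code, country)

    r_parts = [[] for _ in range(NODES)]
    s_parts = [[] for _ in range(NODES)]
    for rec in R:
        r_parts[h(rec[1])].append(rec)     # partition R on the join attribute `to`
    for rec in S:
        s_parts[h(rec[0])].append(rec)     # partition S on the join attribute `code`

    result = []
    for node in range(NODES):              # each node computes a partial join locally
        local = dict(s_parts[node])        # code -> country (codes are a key of S)
        for frm, to in r_parts[node]:
            if to in local:
                result.append((frm, to, local[to]))

    print(sorted(result))   # all (from, to, country) matches, found node-locally

Because R and S tuples with equal join values hash to the same node, no tuple ever has to be compared across nodes.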

335 Parallel asymmetric hash join (see the earlier discussion of hash joins)
1. Building phase: scan and distribute the outer according to some hash function h; build hash tables at the join nodes.
2. Probing phase: combine the scan/distribution of the inner with locally computing the join.
[figure: as before, scan nodes ship R (build) and then S (probe) tuples to the join nodes]
N.B.: again, a node can be a scan node and a join node at the same time.

336 Parallel hybrid hash join (see the earlier discussion of hybrid hash joins)
1. Building phase: scan and distribute the outer according to some hash function h; keep the first bucket in memory.
2. Probing phase: combine the scan/distribution of the inner with locally computing the join for the first bucket.
[figure: as for the asymmetric hash join, with the first bucket kept in memory]

337 Parallelizing join trees
Left-deep join trees can be fully pipelined. As such, they offer good potential for (pipelining) inter-operator parallelism. When we consider parallel hash joins, though, we observe that each building phase falls within its own, sequential execution phase.
Example: consider the join of four relations R1, R2, R3, R4 and the left-deep join tree shown below; each join Ji is implemented as an asymmetric hash join (S = Scan, J = Join, B = Build, P = Probe).
[figure: left-deep tree J3(J2(J1(S1, S2), S3), S4); each Ji is realized by a build Bi on its left input and a probe Pi on its right input]
The execution of the query proceeds in 4 sequential steps (the tasks within each step are executed in parallel):
1. {S1, B1}
2. {S2, P1, B2}
3. {S3, P2, B3}
4. {S4, P3}

338 Analysis of parallelizing left-deep join trees
PROs:
no more than 2 hash tables have to be kept in memory at the same time
the probing relation is always a base table
CONs:
rather limited degree of parallelism
the size of the hash tables (build phase) depends on the join selectivity (difficult to estimate accurately)

339 Parallelizing right-deep join trees
Example: consider the same join of four relations R1, R2, R3, R4 as before, but now look at the right-deep join tree shown below; each join is again implemented as an asymmetric hash join.
[figure: right-deep tree J3(S4, J2(S3, J1(S2, S1))); each Ji builds (Bi) on its base-table input and probes (Pi) with the other input]
Now, the execution of the query can be split into only 2 sequential steps (parallelizing the tasks within each step):
1. {S2, B1, S3, B2, S4, B3}
2. {S1, P1, P2, P3}
+ more parallelism (parallel scans, all probing phases in a single pipeline)
+ all build relations are base tables, hence better size estimates
- much higher memory requirements (all build tables)
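The 4-step vs. 2-step decomposition can also be derived mechanically. The sketch below (task names and the edge encoding are illustrative assumptions) groups operators connected by pipelining into one phase and orders the phases by the blocking build-before-probe dependency:

    from collections import defaultdict

    def phases(pipeline_edges, blocking_edges, tasks):
        # union-find: operators connected by pipelining end up in the same group
        parent = {t: t for t in tasks}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in pipeline_edges:
            parent[find(a)] = find(b)
        groups = defaultdict(set)
        for t in tasks:
            groups[find(t)].add(t)
        # longest path over blocking edges determines the phase of each group
        level = {g: 1 for g in groups}
        changed = True
        while changed:
            changed = False
            for a, b in blocking_edges:
                ga, gb = find(a), find(b)
                if ga != gb and level[gb] < level[ga] + 1:
                    level[gb] = level[ga] + 1
                    changed = True
        by_level = defaultdict(set)
        for g, members in groups.items():
            by_level[level[g]] |= members
        return [sorted(by_level[l]) for l in sorted(by_level)]

    TASKS = ["S1", "S2", "S3", "S4", "B1", "P1", "B2", "P2", "B3", "P3"]
    BLOCK = [("B1", "P1"), ("B2", "P2"), ("B3", "P3")]   # a probe waits for its build

    left_deep = phases(
        [("S1", "B1"), ("S2", "P1"), ("P1", "B2"), ("S3", "P2"), ("P2", "B3"), ("S4", "P3")],
        BLOCK, TASKS)
    right_deep = phases(
        [("S2", "B1"), ("S3", "B2"), ("S4", "B3"), ("S1", "P1"), ("P1", "P2"), ("P2", "P3")],
        BLOCK, TASKS)

    print(len(left_deep), left_deep)     # 4 phases (left-deep, cf. slide 337)
    print(len(right_deep), right_deep)   # 2 phases (right-deep, as above)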

340 Bibliography Özsu, M. and Valduriez, P. (1991). Principles of Distributed Database Systems. Prentice Hall. Rahm, E. (1994). Mehrrechner-Datenbanksysteme Grundlagen der verteilten und parallelen Datenbankverarbeitung. Addison-Wesley, Bonn. Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3 edition. 340

Contents
1 Introduction & Overview: What it's all about; Overall System Architecture; Layered DBMS Architecture; Storage Structures; Access Paths; Query Execution; Implementing a Lock Manager; Outline of the Course; Organizational Matters
2 Storing Data: Disks and Files: Memory hierarchy; Magnetic disks; Accelerating Disk-I/O; RAID levels; Disk space management; Keeping track of free blocks; Buffer manager (allocation policies, replacement policies, buffer management in DBMSs vs. OSs); File and record organization (heap files: linked list of pages, directory of pages); Page formats (fixed-length records, variable-length records); Record formats (fixed-length fields, variable-length fields); Addressing schemes
3 File Organizations and Indexes: Comparison of file organizations; Cost model; Scan; Search with equality test (A = const); Search with range selection (A ≥ lower AND A ≤ upper); Insert; Delete (record specified by its rid); Overview of indexes; Properties of indexes (clustered vs. unclustered, dense vs. sparse, primary vs. secondary, multi-attribute indexes); Indexes and SQL
4 Tree-Structured Indexing: B+ trees (structure, operations); Extensions (key compression, bulk loading a B+ tree); A note on order; A note on clustered indexes; Generalized Access Path; ORACLE Clusters
5 Hash-Based Indexing: General remarks on hashing; Static Hashing; Extendible Hashing; Linear Hashing
6 External Sorting: Sorting as a building block of query execution; Two-Way Merge Sort; External Merge Sort; Minimizing the number of initial runs; Using B+ trees for sorting
7 Evaluation of Relational Operators: The DBMS's runtime system; General remarks; The system catalog; Principal approaches to operator evaluation; The selection operation (no index/unsorted data, no index/sorted data, B+ tree index, hash index, general selection conditions, conjunctive predicates and index matching, intersecting rid sets, disjunctive predicates, bypass selections); The projection operation (based on sorting, based on hashing, use of indexes); The join operation (nested loops, block nested loops, index nested loops, sort-merge, hash joins, semijoins, summary of join algorithms, a three-way join operator); Other operators (set operations, aggregates); The impact of buffering; Managing long pipelines of relational operators (streaming interface, demand-driven vs. data-driven streaming)
8 Selectivity Estimation: Query cost and selectivity estimation; Database profiles; Simple profiles; Selectivity estimation under the uniformity assumption; Selectivity estimation for composite predicates; Histograms; Sampling; Statistics maintained by commercial DBMSs
9 Query Optimization: Outline of query optimization; Motivating example; Equivalences in the relational algebra; Heuristic optimization; Explosion of search space; Dynamic programming strategy (System R)
10 Parallel Query Processing: Objectives in parallelizing DBMSs; Speed-Up and Scale-Up; Problems with speed-up; Opportunities for parallelization in RDBMSs; Examples for parallel query execution plans (parallel join algorithms, parallelizing join trees)


More information

Operating System Concepts. Operating System 資 訊 工 程 學 系 袁 賢 銘 老 師

Operating System Concepts. Operating System 資 訊 工 程 學 系 袁 賢 銘 老 師 Lecture 6: Secondary Storage Systems Moving-head Disk Mechanism 6.2 Overview of Mass-Storage Structure Magnetic disks provide bulk of secondary storage of modern computers Drives rotate at 60 to 200 times

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections Chapter 6 Storage and Other I/O Topics 6.1 Introduction I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

OPERATING SYSTEM - VIRTUAL MEMORY

OPERATING SYSTEM - VIRTUAL MEMORY OPERATING SYSTEM - VIRTUAL MEMORY http://www.tutorialspoint.com/operating_system/os_virtual_memory.htm Copyright tutorialspoint.com A computer can address more memory than the amount physically installed

More information

CS 6290 I/O and Storage. Milos Prvulovic

CS 6290 I/O and Storage. Milos Prvulovic CS 6290 I/O and Storage Milos Prvulovic Storage Systems I/O performance (bandwidth, latency) Bandwidth improving, but not as fast as CPU Latency improving very slowly Consequently, by Amdahl s Law: fraction

More information

Outline. Failure Types

Outline. Failure Types Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 11 1 2 Conclusion Acknowledgements: The slides are provided by Nikolaus Augsten

More information

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in

More information

Memory Allocation. Static Allocation. Dynamic Allocation. Memory Management. Dynamic Allocation. Dynamic Storage Allocation

Memory Allocation. Static Allocation. Dynamic Allocation. Memory Management. Dynamic Allocation. Dynamic Storage Allocation Dynamic Storage Allocation CS 44 Operating Systems Fall 5 Presented By Vibha Prasad Memory Allocation Static Allocation (fixed in size) Sometimes we create data structures that are fixed and don t need

More information

File Systems Management and Examples

File Systems Management and Examples File Systems Management and Examples Today! Efficiency, performance, recovery! Examples Next! Distributed systems Disk space management! Once decided to store a file as sequence of blocks What s the size

More information

Lecture 17: Virtual Memory II. Goals of virtual memory

Lecture 17: Virtual Memory II. Goals of virtual memory Lecture 17: Virtual Memory II Last Lecture: Introduction to virtual memory Today Review and continue virtual memory discussion Lecture 17 1 Goals of virtual memory Make it appear as if each process has:

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture Review from last time CS 537 Lecture 3 OS Structure What HW structures are used by the OS? What is a system call? Michael Swift Remzi Arpaci-Dussea, Michael Swift 1 Remzi Arpaci-Dussea, Michael Swift 2

More information

COS 318: Operating Systems. Virtual Memory and Address Translation

COS 318: Operating Systems. Virtual Memory and Address Translation COS 318: Operating Systems Virtual Memory and Address Translation Today s Topics Midterm Results Virtual Memory Virtualization Protection Address Translation Base and bound Segmentation Paging Translation

More information

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems Lecture Outline Operating Systems Objectives Describe the functions and layers of an operating system List the resources allocated by the operating system and describe the allocation process Explain how

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Fall 2004 Lecture 13: FFS, LFS, RAID Geoffrey M. Voelker Overview We ve looked at disks and file systems generically Now we re going to look at some example file

More information

How To Write A Disk Array

How To Write A Disk Array 200 Chapter 7 (This observation is reinforced and elaborated in Exercises 7.5 and 7.6, and the reader is urged to work through them.) 7.2 RAID Disks are potential bottlenecks for system performance and

More information

Outline. CS 245: Database System Principles. Notes 02: Hardware. Hardware DBMS ... ... Data Storage

Outline. CS 245: Database System Principles. Notes 02: Hardware. Hardware DBMS ... ... Data Storage CS 245: Database System Principles Notes 02: Hardware Hector Garcia-Molina Outline Hardware: Disks Access Times Solid State Drives Optimizations Other Topics: Storage costs Using secondary storage Disk

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information