Topic for today Principles of Database Management Systems Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) How to represent data on disk and in memory "Executive Summary": Principles are rather simple, but there are lots of variations in the details DBMS 2001 Notes 3: Representing Data 1 DBMS 2001 Notes 3: Representing Data 2 Principles of Data Layout Overview Data Items here Attributes of relational tuples (or objects) represented by sequences of bytes called fields Fields grouped together into records representation of tuples or objects Records stored in blocks File: collection of blocks that forms a relation (or the extent of an object class) Records Blocks Files Memory DBMS 2001 Notes 3: Representing Data 3 DBMS 2001 Notes 3: Representing Data 4 What are the data items we want to store? a salary, a name a date, a picture What we have available: Bytes 8 bits DBMS 2001 Notes 3: Representing Data 5 To represent: Integer (short): 2 bytes ( -32000 +32000) e.g., 35 is 00000000 00100011 Integer (long): 4 bytes ( -2x10 9 + 2x10 9 ) Real, floating point (SQL FLOAT) 4 or 8 bytes arithmetic interpretation by hardware DBMS 2001 Notes 3: Representing Data 6 1
To represent: Characters various coding schemes suggested, most popular is ASCII Example (8 bit ASCII): A: 01000001 a: 01100001 5: 00110101 LF: 00001010 To represent: Boolean e.g., TRUE FALSE 1111 1111 0000 0000 Application specific e.g., RED 1 GREEN 2 BLUE 3 YELLOW 4 Can we use less than 1 byte/code? Yes, but only if desperate... DBMS 2001 Notes 3: Representing Data 7 DBMS 2001 Notes 3: Representing Data 8 To represent: To represent: Dates, e.g.: Integer: # days since Jan 1, 1900 8 chars: YYYYMMDD 7 chars: YYYYDDD 10 chars: YYYY-MM-DD (SQL2) (not YYMMDD! Why?) Time, e.g. Integer: seconds since midnight chars: HH:MM:SS[.FF ] (SQL2) String of characters Null terminated e.g., c a t Length given e.g., - Fixed length 3 c a t DBMS 2001 Notes 3: Representing Data 9 DBMS 2001 Notes 3: Representing Data 10 To represent: Bag of bits Key Point Length Bits Fixed length items Variable length items - usually length given at beginning DBMS 2001 Notes 3: Representing Data 11 DBMS 2001 Notes 3: Representing Data 12 2
Also Type of an item: Tells us how to interpret (plus size if fixed) normally indicated in the record schema (see in a moment) Record - Collection of related fields E.g.: Employee record: name field, salary field, date-of-hire field,... DBMS 2001 Notes 3: Representing Data 13 DBMS 2001 Notes 3: Representing Data 14 Types of records: Main choices: FIXED vs VARIABLE FORMAT FIXED vs VARIABLE LENGTH Fixed format A SCHEMA (not record) contains following information - number of fields - type of each field - order in record - meaning of each field DBMS 2001 Notes 3: Representing Data 15 DBMS 2001 Notes 3: Representing Data 16 Example: fixed format and length Employee record (1) E#, 2 byte integer (2) E.name, 10 char. Schema (3) Dept, 2 byte code Variable format Record itself contains format; Self Describing 46 F o r d 02 83 J o n e s 01 Records DBMS 2001 Notes 3: Representing Data 17 DBMS 2001 Notes 3: Representing Data 18 3
Example: variable format and length # Fields 2 5 I 46 4 S 4 Code identifying field as E# Integer type Code for Ename String type Length of str. F o r d Field name codes could also be strings, i.e. tags ( XML as a data interchange format) Variable format useful for: sparse records e.g, patient records with thousands of possible tests repeating fields information integration from heterogeneous sources DBMS 2001 Notes 3: Representing Data 19 DBMS 2001 Notes 3: Representing Data 20 EXAMPLE: var format record with repeating fields Employee one or more children 3 E_name: Fred Child: Sally Child: Tom Note: Repeating fields does not imply - variable format, nor - variable size Fred Sally Tom -- Key is to allocate maximum number of repeating fields (if not used NULL) DBMS 2001 Notes 3: Representing Data 21 DBMS 2001 Notes 3: Representing Data 22 Many variants between fixed and variable format: Ex. #1: Include record type in record 5 27.... record type record length tells me what to expect (i.e. points to schema) Record header - data at beginning that describes record May contain: - indication of record type - record length - time stamp - other stuff... DBMS 2001 Notes 3: Representing Data 23 DBMS 2001 Notes 3: Representing Data 24 4
Ex #2 of variant between FIXED/VAR format Hybrid format one part is fixed, other variable E.g.: All employees have E#, name, dept; Other fields vary. 25 Smith Toy 2 Hobby:chess retired # of var fields Also, many variations in internal organization of record Just to show one: length of field * * * 3 10 F1 5 F2 12 F3 total size 3 32 5 15 20 F1 F2 F3 0 1 2 3 4 5 15 20 32 offsets DBMS 2001 Notes 3: Representing Data 25 DBMS 2001 Notes 3: Representing Data 26 Next: placing records into blocks blocks... a file assume fixed length blocks assume a single relation (for now) DBMS 2001 Notes 3: Representing Data 27 Issues in storing records in blocks: (1) separating records (2) spanned vs. unspanned (3) mixed record types clustering (4) split records (5) sequencing (6) addressing records DBMS 2001 Notes 3: Representing Data 28 (1) Separating records Block R1 R2 (a) fixed size recs. -> no need to separate (b) special marker (c) give record lengths (or offsets) - within each record - in block header (see later) (2) Spanned vs. Unspanned Unspanned: records are within one block block 1 block 2 R1 R2 R4 R5 Spanned: records span block boundaries R1 block 1 block 2 R2 (a) (b) R4 R5 R6 R7 (a)...... DBMS 2001 Notes 3: Representing Data 29 DBMS 2001 Notes 3: Representing Data 30 5
With spanned records: Spanned vs. unspanned: R1 R2 need indication of partial record pointer to rest (a) (b) R4 R5 R6 need indication of continuation (+ from where?) R7 (a) Unspanned is much simpler, but may waste space Spanned necessary if record size > block size (e.g., fields containing large "BLOB"s for, say, MPEG video clips) DBMS 2001 Notes 3: Representing Data 31 DBMS 2001 Notes 3: Representing Data 32 Example (ofunspanned records) 10 6 records each of size 2,050 bytes (fixed) block size = 4096 bytes R1 block 1 block 2 R2 2050 bytes wasted 2046 2050 bytes wasted 2046 Space used about 4 x 10 9 B, about half wasted (3) Mixed record types Mixed - records of different types (e.g. DEPT, EMPLOYEE) allowed in same block e.g., a block: Dep d1 Emp e1 Emp e2 DBMS 2001 Notes 3: Representing Data 33 DBMS 2001 Notes 3: Representing Data 34 Why do we want to mix? Answer: CLUSTERING Records that are frequently accessed together should be in the same block Example Q1: select DEPT.Name, EMP.Name, from DEPT, EMP where DEPT. Name = a block EMP.DeptName DEPT,Name=Toy,... EMP,DeptName=Toy,... EMP,DeptName=Toy,... DBMS 2001 Notes 3: Representing Data 35 DBMS 2001 Notes 3: Representing Data 36 6
If Q1 frequent, clustering is good But consider Q2: SELECT * FROM DEPT If Q2 is frequent, clustering is counterproductive (4) Split records Typically for hybrid format Fixed part in one block Variable part in another block DBMS 2001 Notes 3: Representing Data 37 DBMS 2001 Notes 3: Representing Data 38 Block with fixed recs. R1 (a) R2 (a) Blocks with variable recs. R1 (b) (5) Sequencing Ordering records in file (and block) by some key value R2 (b) R2 (c) Sequential file ( sequenced) DBMS 2001 Notes 3: Representing Data 39 DBMS 2001 Notes 3: Representing Data 40 Why sequencing? Typically to make it possible to efficiently read records in order (e.g., to do a merge-join discussed later) Sequencing Options (a) Next record physically contiguous R1 Next (R1)... (b) Linked R1 Next (R1) DBMS 2001 Notes 3: Representing Data 41 DBMS 2001 Notes 3: Representing Data 42 7
Sequencing Options (c) Overflow area Records in sequence header R1 R2 R4 R5 R2.1 R1.3 R4.7 (6) Addressing records How does one refer to records? Many options: Physical Rx Indirect DBMS 2001 Notes 3: Representing Data 43 DBMS 2001 Notes 3: Representing Data 44 Purely Physical Device ID E.g., Record Cylinder # Address = Track # or ID Block # Offset in block Block ID Fully Indirect E.g., Record ID is arbitrary byte string (say, an object ID) rec ID r map table Rec ID Physical addr. address a DBMS 2001 Notes 3: Representing Data 45 DBMS 2001 Notes 3: Representing Data 46 Tradeoff Flexibility Cost to move records of indirection (for deletions, insertions) Physical Indirect Many options in between DBMS 2001 Notes 3: Representing Data 47 DBMS 2001 Notes 3: Representing Data 48 8
Ex #1 Indirection in block Header A block: R4 R2 R1 Free space Block header - data at beginning that describes block May contain: - File ID (or RELATION or DB ID); the ID of this block - Record directory; Pointer to free space - Type of block (e.g. contains records of type 4; is overflow, ) - Pointer to other blocks like it (say, if part of an index structure) - Timestamp... DBMS 2001 Notes 3: Representing Data 49 DBMS 2001 Notes 3: Representing Data 50 Issues in storing records in blocks (1) Separating records (2) Spanned vs. Unspanned (3) Mixed record types - Clustering (4) Split records (5) Sequencing (6) Addressing records here Other Topics (1) Insertion/Deletion (2) Pointer Swizzling (3) Comparison of Schemes DBMS 2001 Notes 3: Representing Data 51 DBMS 2001 Notes 3: Representing Data 52 (1) Deletion Block Rx Options: (a) Immediately reclaim space (b) Mark deleted May need chain of deleted records (for re-use) If there are pointers to the record, either need to set them to NULL or indicate the record space reclaimed DBMS 2001 Notes 3: Representing Data 53 DBMS 2001 Notes 3: Representing Data 54 9
As usual, many tradeoffs... How expensive is to move valid record to free space for immediate reclaim? How much space is wasted? e.g., deleted records, delete fields, free space chains,... Concern with deletions Dangling pointers R1? DBMS 2001 Notes 3: Representing Data 55 DBMS 2001 Notes 3: Representing Data 56 A solution: Tombstones (a) Leave MARK in old location with physical IDs: A block A Solution : Tombstones (b) Leave MARK in map table with logical Ids: Map table ID LOC This space never re-used This space can be re-used 7788 Never reuse ID 7788 nor space in map... DBMS 2001 Notes 3: Representing Data 57 DBMS 2001 Notes 3: Representing Data 58 (1) Insert Easy case: records not in sequence Insert new record at end of file, or in any block with space for it Insert Hard case: records in sequence If free space close by, not too bad - can "slide" records to preceeding/suceeding blocks - pointers to moved records require forwarding addresses Or use additional blocks as an overflow area DBMS 2001 Notes 3: Representing Data 59 DBMS 2001 Notes 3: Representing Data 60 10
Interesting problems: How much free space to leave in each block, track, cylinder? How often to reorganize file + overflow? (2) Pointer Swizzling Two kinds of record addresses: DB address physical location on secondary storage Memory address record location when loaded into (main or virtual) memory Which one to use, and when? DBMS 2001 Notes 3: Representing Data 61 DBMS 2001 Notes 3: Representing Data 62 Pointer Swizzling One Option: block 1 Memory Disk block 1 Translation DB Addr Mem Addr Table Rec-A Rec-A-inMem block 2 Rec A Rec A block 2 For all records copied to memory DBMS 2001 Notes 3: Representing Data 63 DBMS 2001 Notes 3: Representing Data 64 Another Option: In memory pointers - need type bit to disk Swizzling Automatic ("eager") On-demand ("lazy") No swizzling / program control M to memory DBMS 2001 Notes 3: Representing Data 65 DBMS 2001 Notes 3: Representing Data 66 11
Comparison There are about 10,000,000 ways to organize my data on disk Issues: Flexibility Space Utilization Which one is right for me? Complexity Performance DBMS 2001 Notes 3: Representing Data 67 DBMS 2001 Notes 3: Representing Data 68 To evaluate a given strategy, estimate following parameters: -> space used for expected data -> expected time to - fetch record given its key - fetch record with next key - insert/delete/update record - scan all records in the file - reorganize file Example How would you design Megatron 3000 storage system? (for a relational DB, low end) Variable length records? Spanned? What data types? Fixed format? Record IDs? Sequencing? How to handle deletions? DBMS 2001 Notes 3: Representing Data 69 DBMS 2001 Notes 3: Representing Data 70 Summary How to lay out data on disk Data Items Records Blocks Files Memory DBMS DBMS 2001 Notes 3: Representing Data 71 Next How to find a record quickly, given a key DBMS 2001 Notes 3: Representing Data 72 12