Operating System Support for Database Management
Notes from Stonebraker's paper that appeared in Computing Practices, 1981
Anastassia Ailamaki
http://www.cs.cmu.edu/~natassa

The World According to the OS
[Diagram: App1, App2, App3; Operating System; processor, memory, disks, network]

Today's talk
- What are these DBMSs anyway?
- OS Issues for DB Systems
- Conclusions/Discussion

What we see
[Diagram: Client apps; Queries/Answers; Database Management System; Data Storage]
DBMS: the software that reads data & answers questions

Banking DB Application
- Design (schema):
    CUSTOMER(NAME str, AGE int, ACCNT int)
    ACCOUNT(ACCT_ID int, BALANCE real)
    EMPLOYEE(NAME str, ADDRESS str, SALARY int)
- Query: What is the average balance of customers under 30?
- Transaction: Transfer $1,000 from checking C01 to savings S02

Components of a DBMS
[Diagram: inputs: transaction, Data Definition, query. Modules: Query Compiler, Execution Engine, Transaction Manager, Logging/Recovery, Schema Manager, Concurrency Control, Buffer Manager, Storage Manager. Data structures: LOCK TABLE, BUFFERS, BUFFER POOL]
DBMS: a set of cooperating software modules
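The Banking DB Application slide can be tried out with a small sketch. This is only an illustration using Python's built-in sqlite3 module, not the system the slides describe. Several things here are assumptions: the join condition CUSTOMER.ACCNT = ACCOUNT.ACCT_ID, the sample rows, the helper names avg_balance_under and transfer, and the use of integer account ids 1 and 2 in place of the slide's C01 and S02 (the schema declares ACCT_ID as int).

```python
import sqlite3

# In-memory database holding two of the tables from the slide's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER (NAME TEXT, AGE INTEGER, ACCNT INTEGER);
    CREATE TABLE ACCOUNT  (ACCT_ID INTEGER PRIMARY KEY, BALANCE REAL);
""")

# Made-up sample rows, purely for illustration.
conn.executemany("INSERT INTO ACCOUNT VALUES (?, ?)",
                 [(1, 5000.0), (2, 1500.0), (3, 300.0)])
conn.executemany("INSERT INTO CUSTOMER VALUES (?, ?, ?)",
                 [("Ann", 25, 1), ("Bob", 41, 2), ("Cy", 29, 3)])
conn.commit()


def avg_balance_under(conn, age):
    """Query: average balance of customers younger than `age`.

    Assumes CUSTOMER.ACCNT references ACCOUNT.ACCT_ID.
    """
    row = conn.execute(
        "SELECT AVG(a.BALANCE) "
        "FROM CUSTOMER c JOIN ACCOUNT a ON c.ACCNT = a.ACCT_ID "
        "WHERE c.AGE < ?",
        (age,),
    ).fetchone()
    return row[0]


def transfer(conn, src, dst, amount):
    """Transaction: move `amount` from account `src` to account `dst`.

    The `with conn` block commits if both updates succeed and rolls
    back if either raises, so the two updates apply together or not at all.
    """
    with conn:
        conn.execute(
            "UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ACCT_ID = ?",
            (amount, src))
        conn.execute(
            "UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ACCT_ID = ?",
            (amount, dst))


print(avg_balance_under(conn, 30))
```

Behind these two calls sit the modules named on the Components slide: the query goes through a compiler and execution engine, while the transfer relies on transaction management and logging to stay atomic.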
Interface with the OS
[Diagram: the DBMS components (transaction, Data Definition, query; Query Compiler, Execution Engine, Buffer Manager, Storage Manager, Transaction Manager, Schema Manager, Logging/Recovery, Concurrency Control; LOCK TABLE, BUFFERS, BUFFER POOL) with OS INTERFACE boundaries marked]
Crucial DBMS modules use OS services

Similarities with the OS
[Diagram: the same DBMS components, annotated with OS counterparts: USERS, FILES, PROCESSES, MAIN MEMORY]
Almost every DBMS part has an OS counterpart

Past and Present Situation
[Diagram: Database Management System; Operating System; processor, memory, disks, network]

OS Issues for DB Systems
- Buffer pool management
- File System
- Scheduling, processes, IPC
- Concurrency/Recovery
- Virtual Memory

Buffer Management
Typical Unix provisions:
- All file I/O goes through main memory
- LRU (or approximation) stack for replacement
- Prefetch on sequential access
- Transparent to clients (except for "force all")

1. Performance
Overhead: can be terrible for each page read
- System call
- Core-to-core data move
[Diagram: read; main memory cache; DB cache; Main memory]
2. Replacement policy
Typical DBMS access patterns:
- Sequential scan
    What is the avg. balance of customers < 30?
- Cyclic (looping) sequential scan
    Which employees are also our customers?
- Random accesses (once)
- Random accesses (many times)
    Same as above, with index
Which is the best replacement for each?

Replacement policy (cont.)
- Sequential scan: MRU (one page)
- Cyclic (looping) sequential scan: MRU (one page) or fix n+1 pages
    LRU is the worst here!!!
- Random accesses (once): MRU
- Random accesses (many times): LRU
Need provision for DB hints (or manage own BP)

3. Prefetch
- DBMS knows what it wants next
- It is not always sequential
- More hints needed for good performance
- Further issue: Prefetched pages might replace needed ones (why???)

4. Crash Recovery: Example
Transfer $1,000 from checking A to savings B
    begin transaction            (write begin record)
    read balance A into X
    X = X - 1000
    write X into balance A       (write update record)
    read balance B into Y
    Y = Y + 1000
    write Y into balance B       (write update record)
    commit transaction           (write commit record)
    print receipt

Crash Recovery
Deferred Updates:
- Force intentions list to disk
- Force commit flags (after intentions list!!!)
- Do updates from intentions list
WAL (Write Ahead Logging):
- Force undo/redo records
Need facilities for:
- Selected force out
- Ordering of physical writes

Buffer Management Summary
- Performance
- Replacement policy
- Prefetching
- Crash Recovery
Desired services are done not quite right, and therefore remain unused
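The replacement-policy slides can be illustrated with a small simulation. This is a minimal sketch, not code from the paper: the function name count_misses, the 5-page working set, and the 4-frame pool are all made-up parameters. It counts buffer misses for a cyclic (looping) sequential scan that is one page larger than the pool, which is the case where LRU evicts each page just before it is needed again.

```python
from collections import OrderedDict


def count_misses(refs, frames, policy):
    """Count buffer misses for a page reference string.

    policy is "LRU" (evict least recently used) or
    "MRU" (evict most recently used).
    """
    pool = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for page in refs:
        if page in pool:
            pool.move_to_end(page)  # hit: mark as most recently used
            continue
        misses += 1
        if len(pool) >= frames:
            # popitem(last=False) removes the least recently used page,
            # popitem(last=True) removes the most recently used page.
            pool.popitem(last=(policy == "MRU"))
        pool[page] = True
    return misses


# Cyclic scan: pages 0..4 read four times over, with only 4 buffer frames.
refs = list(range(5)) * 4
lru_misses = count_misses(refs, 4, "LRU")
mru_misses = count_misses(refs, 4, "MRU")
print("LRU misses:", lru_misses, "MRU misses:", mru_misses)
```

With these parameters every one of the 20 references misses under LRU, while MRU misses only 8 times. An OS buffer cache that hard-wires LRU cannot be told about this access pattern, which is the motivation for the slide's "DB hints (or manage own BP)".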
File System Issues
In current UNIX file systems:
- File = byte stream
- Logical order has little relation to physical order
- Indirect blocks (trees) + directory trees
Consequences:
+ Small files cheap
+ Large files possible
+ Byte model for programmers
- Large files costly
- Many physical reads per logical read
- Loss of sequentiality
- Byte model for DBMS
- Too many trees!

Preferred approach
- Physical contiguity
- OS-level B+ trees, hashing
- Let DBMS know about blocks of file
- Implement records at the low level
- Provide higher-level services on top of this
What really happens today:
- Extents
- File I/O vs. raw I/O

Scheduling, Processing, IPC
DBMS needs:
- Shared buffer pool
- Shared lock table
- Critical sections
Q: Does UNIX now have shared data segments? Shared memory?

Process Structure Alternatives
[Diagram: Process-per-user (user 1 ... user k) and Server (user 1 ... user k)]

Evaluation
Process-per-user structure:
- Expensive context-switching
- Preemption at bad places
- DBMS critical sections: convoy
Server:
- Duplication of OS services
- DBMS must do own multi-tasking
- Messages cost several thousand instructions

Process Structure Alternatives (cont.)
[Diagram: Server Pool and Disk Server structures, each with user 1 ... user k, and disks]
Evaluation (cont.)
Server pool:
- Internal parallelism
- Avoid multitasking
- Similar to process-per-user
Disk Server:
- Trades messages for task switches
- May be more expensive
- Still has queued-up requests to locked items
Multi-agent, multi-device mgr is used today

Yet another wish list
- Reduced message/task overheads
- Sockets
- No-preemption scheduling (can the OS do this?)
- Fast-path for context-switching among procs
- Threads, threads, threads! (first appeared on IBM MVS in the 70's - sigh)

Recovery/CC issues
OS provides:
- File-level locks: too coarse
- Page-level 2PL: no special index CC possible
- OS-supported transactions
Problems:
- Transaction commit point
- Duplicate functions due to buffer manager
- Ordering dependencies
- Transaction result independent from execution order

Virtual Memory
Why not map into virtual memory?
VM approach requires (warning: old data):
- 4 bytes overhead/VM page
- 100 MB file means 100 KB page table
- If page table not resident, two-touch page access
Extent-based file system approach:
- 1000 consecutive blocks represented in <addr, len> (versus 100 KB above to store all addresses)
- 4 bytes overhead/file
- ctl blocks can stay in memory
- Super-pages

Virtual Memory (cont.)
Bind chunks of file: must keep track of binding
- Bind/unbind in transaction very expensive
- Overhead comparable to file open
- Plus, all the problems from buffering!

Conclusions
OSs have problems with DBMS wish lists:
- Buffer management (policies, ordering, overhead)
- File systems (abstraction, sequentiality, overhead)
- Process issues (structure, task/msg overhead, scheduling)
- CC/Recovery (buffer pool problems)
- Virtual memory (space, efficiency, etc.)
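The page-table arithmetic on the Virtual Memory slide can be checked with a short calculation. The slide gives 4 bytes of overhead per VM page and a 100 KB table for a 100 MB file; the 4 KB page size below is an assumption inferred from those two figures, not stated in the source. The extent entry layout (a 4-byte address plus a 4-byte length) and the helper names are likewise hypothetical, used only to show the order-of-magnitude gap.

```python
PAGE_SIZE = 4 * 1024            # assumption: 4 KB pages (not on the slide)
ENTRY_BYTES = 4                 # from the slide: 4 bytes overhead per VM page
FILE_SIZE = 100 * 1024 * 1024   # from the slide: a 100 MB file


def page_table_bytes(file_size, page_size=PAGE_SIZE, entry_bytes=ENTRY_BYTES):
    """Bytes of page-table entries needed to map the whole file."""
    pages = -(-file_size // page_size)  # ceiling division
    return pages * entry_bytes


def extent_map_bytes(file_size, blocks_per_extent=1000,
                     block_size=PAGE_SIZE, addr_bytes=4, len_bytes=4):
    """Bytes needed if each run of consecutive blocks is one <addr, len> pair.

    addr_bytes and len_bytes are hypothetical field widths.
    """
    blocks = -(-file_size // block_size)
    extents = -(-blocks // blocks_per_extent)
    return extents * (addr_bytes + len_bytes)


table = page_table_bytes(FILE_SIZE)
extents = extent_map_bytes(FILE_SIZE)
print(table // 1024, "KB of page table vs", extents, "bytes of extent map")
```

Under these assumptions the file has 25,600 pages, so the page table is 102,400 bytes (100 KB), while 26 extents of up to 1000 blocks each need only a couple of hundred bytes. That gap is why the slide notes that the control blocks of an extent-based file system can stay in memory.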
What the DBMS wants
[Diagram: Database Management System; Operating System; processor, memory, disks, network]

What about modern DBMS/OSs?
- No-cache file system option in DB2
- NT:
    VirtualLock API (override some buffer policies)
    FlushViewOfFile API (flush portions of file)
- Physical contiguity: Unix FFS tries to place a file's data blocks in the same cylinder group
- 64-bit systems will allow the DBMS to map some files in VM

References
Reading:
- Stonebraker, M., Operating System Support for Database Management, Communications of the ACM, 24(7), 1981 (and all its references section!)
- Gray, J., and Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.
CMU Courses:
- 15-415, Database Applications
- 15-721, Database Management Systems
- 15-823, Advanced Topics in DB System Performance
- 15-826, Multimedia Databases and Data Mining