DBOS: Revisiting Operating System Support for Data Management Systems

Yinan Li  yinan@cs.wisc.edu
Wenbin Fang  wenbin@cs.wisc.edu

Abstract

Three decades ago, Michael Stonebraker observed that the operating system (OS) provided not-quite-right services to the database management system (DBMS). DBMSs therefore had to work around the OS and implement their own services, for example, buffer pool management. Modern operating systems provide system calls with much richer semantics than 30 years ago. However, DBMSs seem conservative about adopting new OS features, largely for portability reasons. In this paper, we revisit Stonebraker's observation. We conduct an in-depth investigation and analysis of PostgreSQL, a real-world open source DBMS, on Linux, Solaris, and Mac OS X. Unlike the 30-year-old conclusion that the OS provides the wrong services, our observations show that PostgreSQL misuses or underutilizes some services provided by the OS. For example, PostgreSQL on Mac OS X by default uses POSIX semaphore system calls that heavily consume file descriptors. In addition, we conduct a case study on write-ahead logging to explore how to improve logging performance with existing or new OS services. In particular, we propose a lazy sync technique that uses advanced OS services to provide higher logging performance than the conventional fsync system call, without violating the durability guarantee. Our experimental results show that lazy sync yields a 1.16x speedup in overall DBMS throughput over the fsync method when running an online transaction processing benchmark.

1 Introduction

The database management system, or DBMS, runs on top of conventional operating systems. However, three decades ago Stonebraker [11] pointed out that the OS provided not-quite-right services to the DBMS. Some key OS services (e.g., buffer pool management, the file system, and process scheduling) were either too slow or inappropriate. Therefore, the DBMS had to implement user-space services in parallel to the kernel-space services. For example, the page replacement policy in the OS buffer manager was not suitable for database workloads, so a DBMS typically constructed and managed a user-level buffer pool by itself. Oftentimes, the DBMS ended up running duplicate services in both user and kernel space. Stonebraker [11] therefore made a wish list of OS support for DBMSs, including the demand for direct I/O, a way to provide hints for page replacement, and the like.

Some related work [3, 18, 7, 16] revisited Stonebraker's observation [11]. Basically, these studies ran micro-benchmarks on modern OSs and evaluated whether the desirable services in Stonebraker's wish list were present on modern OSs, and how good they were. Their results show that modern OSs provide services with much richer semantics than 30 years ago, many of which are desirable in Stonebraker's wish list.

In this paper, we investigate OS support for DBMSs from a different angle. In the presence of rich OS services, we suspect that a DBMS may misuse or underutilize them. To confirm this hypothesis, we conduct an in-depth study of an open source DBMS, PostgreSQL [9]. We trace the system calls used by PostgreSQL on Linux, Solaris, and Mac OS X. We find that PostgreSQL makes conservative use of OS services. For example, some services, including threads and direct I/O, are unused, and some, such as POSIX semaphores in the Mac OS X implementation, are misused. This is likely due to PostgreSQL's portability requirement.
Furthermore, we conduct a case study on write-ahead logging in PostgreSQL, because write-ahead logging has an interesting data access pattern and is critical to overall DBMS performance and the durability guarantee. In this case study we demonstrate how to improve DBMS performance by using appropriate system services. Moreover, we use the asynchronous I/O interface to develop a Lazy Sync technique for synchronizing log writes, which improves overall DBMS throughput by a factor of 1.16 while preserving the durability guarantee.
This paper is organized as follows. Section 2 surveys related research on OS support for DBMSs. We investigate which system calls are used by PostgreSQL in Section 3, and conduct an in-depth analysis of how PostgreSQL utilizes OS services in Section 4. In Section 5, we present a case study on write-ahead logging and propose a Lazy Sync technique to improve it. Finally, we conclude in Section 6.

2 Related Work

We start from the observation made by Stonebraker [11] three decades ago. He drew examples from UNIX and from INGRES, which was developed by his team; he thus investigated OS support from a DBMS developer's point of view. Stonebraker examined several OS services, including buffer pool management, the file system, scheduling and process management, consistency control, and virtual memory. He observed that operating systems provided not-quite-right services to database management systems. Finally, he made a wish list of OS support for DBMSs, for example, direct I/O and a way to provide hints for page replacement.

Fellig and Tikhonova [3] considered operating system support for one specific component: buffer pool management. They examined the memory management programming interface provided by Windows NT and evaluated the possibility of implementing a better application-level page replacement policy. In addition, they described how buffer pool management worked in SQL Server 7.0, possibly inferred from technical documents. Their conclusion was that "database systems have ample justification in managing their buffers on top of operating system's management policy" [3].

Yang and Li [18] examined Solaris and Windows NT and revisited Stonebraker's [11] arguments. They checked whether Solaris and Windows NT provide desirable OS services for DBMSs, and evaluated how good those services were using microbenchmarks. They did not investigate any real-world database system. They concluded that operating systems provide finer-granularity synchronization support, while database systems generally implement most services, e.g., buffering and locking, in user space; they attributed this to necessity and to historical reasons.

Zhang and Ni [7] investigated PostgreSQL on Solaris. They first examined whether desirable services for a general DBMS existed on Solaris, and used microbenchmarks to evaluate the performance of those services. Next, they described how PostgreSQL utilized OS services, possibly based on the source code and technical manuals. Their conclusion was that the OS provides good services and the DBMS utilizes them well.

Vasil [16] investigated SQL Server 2000 on Windows NT. He also reexamined Stonebraker's arguments one by one, using a comprehensive micro-benchmark. He concluded that buffer management and consistency control support provided by the operating system had not improved. In addition, he identified some new areas for efficient human resource utilization; for example, the operating system, rather than the DBMS, should provide performance monitoring tools.

The fundamental difference between our project and the others [3, 18, 7] is that we investigate whether the DBMS misuses or underutilizes OS-provided services, instead of examining which desirable services a modern OS has (Stonebraker's [11] wish list) and how good those services are.
Our system-call-oriented approach is applicable to commercial DBMSs (e.g., Oracle) without access to source code. In addition, we study OS services beyond Stonebraker's wish list (in Section 5), while the other studies only examine the services mentioned in the list. Furthermore, through an in-depth analysis of a key component, we identify a few examples where the DBMS does not use the right service, and evaluate the improvement with our implementation in a real-world DBMS. In contrast, all the previous projects assumed that DBMSs always use OS services correctly.

3 DBMS-OS Interaction

In this section, we investigate how database systems interact with operating systems. The DBMS runs on top of the operating system, from which it requests services by invoking system calls. In our investigation, we study PostgreSQL [9], a real-world DBMS, and trace the system calls it uses with strace [12], truss [14], and dtruss [2] on Linux, Solaris, and Mac OS X, respectively. We use PostgreSQL 8.2 with the default configuration, Linux with kernel 2.6.35, Solaris 10, and Mac OS X 10.6.5. First, we show the tracing results on the three operating systems in Section 3.1. Next, we present how PostgreSQL works in terms of system calls in Section 3.2. Finally, in Section 3.3, we identify the top time-consuming system calls when running the TPC-C [13] benchmark, an on-line transaction processing benchmark, with dbt2 0.40 (http://sourceforge.net/projects/osdldbt/files/dbt2/0.40/).

3.1 System Calls on Different Operating Systems

Table 1 shows the major system calls used in PostgreSQL across Linux, Solaris, and Mac OS X.
Service           Linux                     Solaris                    Mac OS X
-----------------------------------------------------------------------------------------
Networking        recv, send                recv, send                 recvfrom, sendto
Timing            time, gettimeofday,       time, setitimer            gettimeofday,
                  setitimer                                            setitimer
Disk I/O          write, read, open,        write, read, open64,       write, read, open,
                  _llseek, select, poll,    llseek, select, pollsys,   lseek, select, poll,
                  fsync, fcntl64            fdsync, fcntl              fsync, fcntl
Semaphore         semget, semctl, semop     semget, semctl, semop      sem_open, sem_close,
                                                                       sem_post, sem_wait,
                                                                       sem_trywait
Shared Memory     shmctl, shmat, shmdt      shmctl, shmat, shmdt       shmctl, shmat, shmdt
Private Memory    brk                       brk                        brk
Process Control   clone, waitpid, kill      fork1, waitid, kill        fork, wait4, kill

Table 1: Main system calls (not all) used in PostgreSQL on Linux, Solaris, and Mac OS X.

We categorize these system calls into seven groups of services provided by operating systems: Networking, Timing, Disk I/O, Semaphore, Shared Memory, Private Memory, and Process Control. PostgreSQL relies on these seven groups of system calls to implement the necessary functionality, for example, implementing various locks using semaphores. The purpose of this categorization is to provide a high-level answer to the question: which services or system calls does PostgreSQL use on different operating systems?

Looking at the system calls in Table 1, we can see that the three operating systems generally export similar system calls to PostgreSQL. This is because Linux, Solaris, and Mac OS X all evolved from Unix, and because of standardization efforts, e.g., POSIX [8]. It also implies that standardization enables, or at least eases, the portability of database management systems.

3.2 PostgreSQL Internals

In this subsection, we investigate how PostgreSQL works internally in terms of utilizing the services provided by operating systems. We monitor the runtime activity of a PostgreSQL instance with five client connections on Linux, using watch -d "ps aux | grep postgres". The output, transformed for ease of display, is as follows:

PID   Description
----  -----------
5637  postgres -D database
5643  postgres: writer process
5662  postgres: ... FETCH
5664  postgres: ... SELECT
5669  postgres: ... idle in transaction
5671  postgres: ... SELECT
5676  postgres: ... FETCH waiting

We find that there are mainly three kinds of processes in PostgreSQL. First, the last five processes correspond to the five client connections: one process handles the transactions of one client. We call such a process a Worker Process. Second, the first process is the one we executed from the shell; it creates the other processes. We call it the Main Process. Third, the second process is the Writer Process. It is unclear so far what this process is used for, so we trace the system calls used in each type of process and obtain the results in Table 2.

Process          System Call                Note
---------------------------------------------------------------------------------
Master Process   clone                      Spawn processes
                 waitpid                    Control processes
                 time                       Implement protocol
                 select                     Wait to serve connections
                 semctl                     Manage shared memory area
Worker Process   read, write, _llseek       I/O on data files and log files
                 open, close                On log files
                 time                       Delay flushing to disk
                 send, recv, gettimeofday   Implement protocol to transfer data
                 brk                        Per-process private memory
                 semop                      Implement locks
Writer Process   semop                      Implement locks
                 fsync                      Flush to disk
                 select                     Wait for disk I/O to become available
                 open, close                On data files and log files
                 time                       Delay flushing to disk

Table 2: System calls used by the three types of processes in PostgreSQL, on Linux.

Based on the categorization in Section 3.1 and the results in Table 2, we come up with Figure 1, which shows how the different system components and processes interact with each other in terms of OS-provided services.

Figure 1: PostgreSQL internals. Labels on edges are the services provided by the operating system. (The figure depicts, on the server side, the master process, worker processes, and writer process; the DBMS shared buffer pool and per-process private memory (heap); and the OS file system with its kernel cache, connected by the Networking, Timing, Disk I/O, Semaphore, Shared Memory, Private Memory, and Process Control services.)

PostgreSQL utilizes the Networking and Timing services to implement the client/server communication protocol. The Timing service is also used for statistics and to delay flushing dirty buffers to disk at a configurable interval. The master process utilizes the Process Control service to create and destroy other processes on the server side. PostgreSQL requests a shared memory area from the operating system and manages it as a buffer pool to cache shared disk blocks and other shared data structures, e.g., the lock table and the transaction table.
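As an aside, the System V pattern behind this shared area looks roughly like the sketch below; the key, segment size, and error handling are illustrative rather than PostgreSQL's actual values.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create one shared segment for the buffer pool (size illustrative). */
    int shmid = shmget(IPC_PRIVATE, 16 * 1024 * 1024, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* Attach it; children created later with fork() inherit the mapping. */
    void *pool = shmat(shmid, NULL, 0);
    if (pool == (void *) -1) { perror("shmat"); return 1; }

    /* ... lay out buffer pool, lock table, transaction table here ... */

    shmdt(pool);                      /* detach */
    shmctl(shmid, IPC_RMID, NULL);    /* mark segment for removal */
    return 0;
}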
Of course, each process also needs some private memory that is not shared with other processes; PostgreSQL allocates such memory with malloc, which is implemented on top of the brk system call. PostgreSQL relies on the operating system's file system for disk I/O, so accessing disk data is, in essence, a matter of dealing with files. Finally, PostgreSQL uses semaphores to implement locks, which in turn implement the locking protocol for consistency.
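As an illustration of this semaphore-based locking pattern, the sketch below builds a binary lock from the System V calls in Table 1; the single-semaphore setup and the names are ours, not PostgreSQL's actual lock implementation.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Linux requires the caller to define this union for semctl. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);          /* initial value 1 = unlocked */

    struct sembuf acquire = { 0, -1, 0 };   /* P: wait and decrement */
    struct sembuf release = { 0, +1, 0 };   /* V: increment */

    semop(semid, &acquire, 1);              /* lock */
    /* ... critical section ... */
    semop(semid, &release, 1);              /* unlock */

    semctl(semid, 0, IPC_RMID);
    return 0;
}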
3.3 Top Time-Consuming System Calls

We examine the time spent in each system call and identify the heavily used ones. By doing so, we can focus our investigation on the time-consuming OS services. We run the TPC-C benchmark for 30 minutes and use strace [12] to obtain each system call's CPU time on Linux, together with its share of the total kernel CPU time. The results for Solaris and Mac OS X are similar to Linux's, so we only present the Linux results.

Figure 2: Top time-consuming system calls, ranked by their share of total kernel CPU time (the top 15 include write, fsync, _llseek, read, semop, recv, send, brk, select, time, open, waitpid, clone, setitimer, and gettimeofday).

Figure 2 illustrates the sorted ratios for the top 15 time-consuming system calls. Disk I/O takes up about 90% of kernel CPU time, including the time for write, fsync, _llseek, and read. Note that kernel CPU time does not include actual I/O time or idle time. For the TPC-C workload, intensive disk I/O is the largest bottleneck to overall performance. The second most time-consuming service is Semaphore, which is used to implement locks; locks are used intensively for consistency control across transactions. Based on the results in Figure 2, we focus on the Disk I/O and Semaphore services, investigating how the DBMS utilizes these two services given the data access patterns of database workloads.

3.4 Summary

The results presented in this section provide a roadmap for revisiting OS support for DBMSs (Section 4). Moreover, they motivate our investigation of OS support for write-ahead logging (Section 5), which has a unique I/O pattern and heavily utilizes the time-consuming Disk I/O and Semaphore services.

4 Revisiting OS Support

In this section, we revisit the OS services examined by Stonebraker [11] three decades ago, and assess whether PostgreSQL utilizes them well.

4.1 Buffer Pool Management and File System

PostgreSQL requests a shared memory area from the OS using the System V shared memory interfaces, i.e., the system calls with the shm prefix shown in Table 1, e.g., shmctl. The shared memory area is used as a buffer pool holding shared data structures, e.g., the lock table, and data blocks for disk I/O. Shared memory in the OS is swappable to disk. Therefore, it is possible for a data block read from disk to be swapped back out before it is actually used. We can pass SHM_LOCK to shmctl to pin the shared memory area in physical memory. According to our tracing results, PostgreSQL makes no effort to prevent unnecessary page replacement in the shared memory area. A database administrator can, however, use the sysctl utility to set kern.ipc.shm_use_phys, so that the kernel locks shared memory in RAM and prevents it from being paged out. The inconvenience is that this is a global setting for all shared memory areas, rather than fine-grained control over a particular area.
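A sketch of the finer-grained alternative just mentioned: pinning one particular segment with shmctl. SHM_LOCK normally requires privilege (e.g., CAP_IPC_LOCK on Linux), and the error handling here is illustrative.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Pin a single shared memory segment in RAM so the buffer pool
 * cannot be paged out. */
int pin_segment(int shmid)
{
    if (shmctl(shmid, SHM_LOCK, NULL) != 0) {
        perror("shmctl(SHM_LOCK)");
        return -1;
    }
    return 0;
}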
PostgreSQL relies entirely on the file system for disk I/O and manipulates files. PostgreSQL can therefore get a free lunch in performance as file system research advances. For example, the adaptive readahead technique in the Linux kernel [17] can improve the overall throughput of PostgreSQL by a factor of 1.25 on some workloads [10]. However, there is a drawback: before reaching disk, PostgreSQL's I/O usually has to go through the kernel cache (Figure 1). Some OSs now allow applications to bypass the kernel cache when accessing disk, e.g., by passing O_DIRECT to the open system call on Linux. However, PostgreSQL generally does not use direct I/O, except for writing log files on Linux. There are two possible reasons. First, different OSs do not support direct I/O in a uniform way, which makes the DBMS source code harder to maintain: Linux passes the O_DIRECT flag to open, Solaris uses directio, and Mac OS X passes the F_NOCACHE flag to fcntl. Second, it is common practice to store the log and data on separate disks and to mount the entire logging file system for direct I/O, e.g., with the forcedirectio mount option for UFS on Solaris. However, this requires the database administrator to have the permission to do so, which is less likely when the DBMS runs in a cloud based on virtual machines.
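The portability pain is visible in code. The sketch below is our illustration, not PostgreSQL's source, of what opening a file for direct I/O looks like across the three systems; note that O_DIRECT on Linux additionally requires aligned buffers.

#define _GNU_SOURCE              /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <unistd.h>

/* Open a file for direct I/O, bypassing the kernel cache.
 * Each OS exposes the feature through a different interface. */
int open_direct(const char *path)
{
#if defined(__linux__)
    return open(path, O_RDWR | O_DIRECT);        /* flag to open() */
#elif defined(__APPLE__)
    int fd = open(path, O_RDWR);
    if (fd >= 0)
        fcntl(fd, F_NOCACHE, 1);                 /* flag to fcntl() */
    return fd;
#elif defined(__sun)
    int fd = open(path, O_RDWR);
    if (fd >= 0)
        directio(fd, DIRECTIO_ON);               /* separate call */
    return fd;
#else
    return open(path, O_RDWR);                   /* no direct I/O */
#endif
}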
PostgreSQL manages its own sophisticated page replacement policy for the buffer pool, while relying on the OS for page replacement in the kernel cache. It is possible to use the fadvise system call to give the OS hints so that it can do a better job of page replacement in the kernel cache. According to our tracing results, PostgreSQL 8.2 does not call fadvise. However, more recent versions of PostgreSQL (e.g., version 9.0) use fadvise to advise the kernel to release cached pages of files that will not be re-read.

4.2 Process Scheduling

PostgreSQL spawns a process for each client connection. Some may argue that this one-process-per-client model is too heavyweight, and that one-thread-per-client is better for performance. However, there are reasons to favor one-process-per-client. First, PostgreSQL allows users to plug in user-defined functions that may contain malicious code; under a one-thread-per-client model, a problematic thread may crash the others. Second, it is easy to kill a particular process and let the OS free all its resources cleanly and quickly, while it is hard to do the same to a particular thread from the outside. Third, threads are not well supported on all the operating systems on which PostgreSQL runs. Still, threads would be useful in some cases; for example, we envision extending a worker process with threads to parallelize independent queries in a transaction, or independent query operators in a query.

In the presence of multi-core processors, the OS provides finer-grained process control interfaces. For example, on Linux and Solaris, sched_setscheduler selects a built-in scheduling policy, and sched_setaffinity determines the set of cores on which a process is eligible to run. We examined the system calls used in PostgreSQL: none of these process control calls is used. It seems that PostgreSQL hesitates to enter the multi-core era!
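For example, a DBMS could pin a worker process to one core roughly as follows (Linux interfaces; the policy choice and error handling are illustrative):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to one core and select a kernel
 * scheduling policy (pid 0 means "this process"). */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    struct sched_param sp = { .sched_priority = 0 };
    return sched_setscheduler(0, SCHED_BATCH, &sp);  /* batch policy */
}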
4.3 Consistency

PostgreSQL uses OS-provided semaphores to implement various kinds of locks for consistency control. Examining the system calls in Table 1, we find that Mac OS X uses POSIX semaphores (e.g., sem_open, sem_post, sem_wait), while Linux and Solaris use System V semaphores (e.g., semop and semctl). PostgreSQL on Mac OS X can be switched to System V semaphores, but by default it uses POSIX semaphores. We compare the two kinds of semaphores.

First, we compare their performance by running PostgreSQL on the TPC-C benchmark. The result is shown in Figure 3: System V semaphores yield better throughput.

Figure 3: TPC-C throughput comparison between System V semaphores and POSIX semaphores, with the number of clients varied from 30 to 150.

Second, we compare the scalability of the two kinds of semaphores. Every potential worker process takes up a semaphore in PostgreSQL. Thus, if we configure PostgreSQL to support up to N client connections, there will be N semaphores. On Mac OS X, when we set max connections in the configuration file to 1000, the PostgreSQL server using POSIX semaphores fails to start, reporting that there are insufficient file descriptors available to start a server process. There is no such problem when switching to System V semaphores. We find that each named POSIX semaphore consumes a file descriptor, so it is easy to exceed the operating system's file descriptor limit, although that limit is configurable.

PostgreSQL also utilizes OS-provided synchronization methods for forcing write-ahead log updates out to disk. These methods provide a write barrier that blocks execution until data is flushed to disk:

O_SYNC in open (open_sync): flush all metadata and all data out of the kernel cache, for each write.
O_DSYNC in open (open_dsync): flush some metadata and all data out of the kernel cache, for each write.
fsync: flush all metadata and all data out of the kernel cache, for a batch of writes.
fdatasync: flush some metadata and all data out of the kernel cache, for a batch of writes.
F_FULLFSYNC in fcntl (fullsync): flush all metadata and all data out of the kernel cache and the disk cache, for a batch of writes. Only supported by Mac OS X.

PostgreSQL allows users to select one of these five methods for write-ahead logging (see Section 5.4). As the list shows, only Mac OS X provides a method (denoted fullsync) that guarantees data is persistent on disk, so only the Mac OS X implementation allows users to use fullsync. This is because Apple ships the hardware and the operating system together: on ATA drives, Apple implements F_FULLFSYNC with the FLUSH TRACK CACHE command, and all drives sold by Apple must honor this command. PostgreSQL on Linux and Solaris can, at most, flush data out of the kernel cache; it cannot survive a crash while data sits in the device cache and is not yet persistent on disk.

We run the TPC-C benchmark on Linux while varying the synchronization method (Linux did not implement open_sync and open_dsync until kernel 2.6.32). The throughput comparison in Figure 4 shows that fdatasync has the best performance. For database workloads, it is unnecessary to flush a file's metadata eagerly; fdatasync is therefore good for performance and still sufficient to guarantee durability.

Figure 4: TPC-C throughput comparison of the synchronization methods open_sync, open_dsync, fsync, and fdatasync, on Linux.

We further compare PostgreSQL throughput on Mac OS X between fullsync and fsync on the TPC-C benchmark. The result shows that fullsync is 3 times slower in overall throughput. Since fullsync performs so poorly, it is generally considered acceptable to sacrifice the durability of recently written data for better performance. One solution that maintains both durability and performance is to use fsync together with non-volatile random access memory (NVRAM) for the device cache. However, fsync is still slow [15]. We therefore propose a lazy sync technique in Section 5.4 that retains the same durability guarantee as fsync while achieving better performance.
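One way to paper over this platform difference in a flush routine (our sketch, not PostgreSQL's source): prefer the drive-cache flush where the platform offers it, otherwise fall back to fsync.

#include <fcntl.h>
#include <unistd.h>

/* Flush a log file as durably as the platform allows:
 * F_FULLFSYNC (Mac OS X) also flushes the drive's write cache;
 * elsewhere, fsync only drains the kernel cache. */
int flush_log(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* fall through: some filesystems reject F_FULLFSYNC */
#endif
    return fsync(fd);
}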
4.4 Summary

Although Stonebraker's [11] wish list of desirable OS services is mostly fulfilled by modern operating systems [18, 16, 3, 7], PostgreSQL uses those services
conservatively. PostgreSQL has to put portability at a high priority, since it supports more than 15 operating systems.

5 Case Study: Write-Ahead Logging

In this section, we select write-ahead logging as a case study to investigate in depth how logging performance can be improved with new or existing OS services. Write-ahead logging is a fundamental component of ARIES-style concurrency and recovery, and one of the most important yet-to-be-addressed potential bottlenecks [5]. As memory sizes increase, future databases, except for the very largest, will fit entirely in memory. As a result, all expensive I/O operations occur in write-ahead logging, which becomes the performance bottleneck, especially for OLTP workloads that make frequent small changes to data. Due to its increasing impact on overall DBMS performance, we choose write-ahead logging as our case study.

We first briefly introduce the background of write-ahead logging in Section 5.1. Then we identify three potential issues in current implementations of write-ahead logging, and explore OS solutions for these issues in Sections 5.2, 5.3, and 5.4, respectively.

5.1 Preliminaries

Write-ahead logging (WAL) is a fundamental technique for providing atomicity and durability (two of the ACID properties) in database systems, and has been adopted by almost all DBMSs since System R. A log of all modifications is saved on stable storage, which is guaranteed to survive crashes and media failures. The log is maintained as a sequential file, and writes to the log are typically sequential. Following the WAL protocol, before a page is written to disk, every update log record that describes a change to this page must be forced to stable storage. When a transaction commits, the log records at the tail of the log must be forced to stable storage. When the system recovers after a crash, the restart process loads the log records and redoes all modifications made before the crash. Please refer to the ARIES paper [6] for a more detailed description of the recovery process.

Formally, every log record is given a unique ID called the Log Sequence Number (LSN). LSNs are assigned in monotonically increasing order, even though multiple clients write log records concurrently. When a transaction makes a change and commits, all log records whose LSN is less than the LSN of the transaction's commit record are flushed to the stable device, no matter whether those log records were inserted by the committing transaction or not.
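In code, the commit-time rule reduces to "flush the log up to a target LSN." The sketch below is a minimal illustration of that rule; the types, the stub writer, and the file descriptor are ours, not PostgreSQL's XLogFlush.

#include <unistd.h>

typedef unsigned long long LSN;

static int log_fd;             /* WAL file (opened elsewhere) */
static LSN flushed_lsn = 0;    /* log is durable up to this LSN */

/* Stub: write all buffered log records with LSN <= target. */
static void write_log_up_to(LSN target) { (void) target; }

/* WAL rule: a commit returns, or a data page stamped with `target`
 * reaches disk, only after the log is durable up to `target`. */
static void flush_log_up_to(LSN target)
{
    if (target <= flushed_lsn)
        return;                /* already durable; no I/O needed */
    write_log_up_to(target);
    fsync(log_fd);             /* force to stable storage */
    flushed_lsn = target;
}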
5.2 Overlapped Sequential Writes

Although writing write-ahead log records is logically sequential, the actual access pattern is not exactly sequential once synchronization is taken into account. Since disks are accessed at page granularity, modern DBMSs implement a page-based WAL subsystem. Log records are first inserted into an in-memory buffer, maintained as multiple separate pages. Once a transaction commits, all buffer pages containing unwritten log records are immediately flushed to the stable device through synchronized writes. As a result, the page containing the last log record of the previous commit typically needs to be written twice: it is filled with new log records and must be written again at the next commit.

We call this access pattern overlapped sequential writes: writes are sequentially appended to the end of the file, but the first page of a write may overlap with the last page of the previous write.

Figure 5: The WAL writing pattern across three commits, showing the written records and written pages at each commit relative to page boundaries.

Figure 5 demonstrates an example of writing the WAL. At the 1st commit, the page containing all log records so far (the first page) is written to the stable device. At the 2nd commit, the log records span three pages, including the first page already written at the 1st commit; hence we have to rewrite the first page and then sequentially write the following two pages. Similarly, at the 3rd commit, we rewrite the third page and then write the fourth page.

We use a microbenchmark to study the performance impact of overlapped sequential writes. Intuitively, overlapped sequential writes should be much slower than purely sequential writes, because the disk suffers a full rotational delay when rewriting the overlapped page (which has just passed under the disk head). Our experiment is conducted on the Ext2 file system to avoid interference from journalling. We open the test file with the O_DSYNC flag to make writes synchronous and reduce the impact of metadata writes. Our results show that overlapped sequential writes and sequential writes have very similar performance in our experimental setting, probably because we had no root access and could not disable the device cache.
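Our microbenchmark follows roughly the shape below (a reconstruction; the file name, page size, and commit count are illustrative): each simulated commit rewrites the page that overlaps the previous commit, then appends a new page.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define PAGE 8192

int main(void)
{
    /* O_DSYNC makes every write synchronous, as in the experiment. */
    int fd = open("waltest.dat", O_CREAT | O_WRONLY | O_DSYNC, 0600);
    if (fd < 0)
        return 1;

    char page[PAGE];
    memset(page, 'x', sizeof(page));

    off_t end = 0;
    for (int commit = 0; commit < 1000; commit++) {
        if (end > 0)                             /* overlapped write: */
            pwrite(fd, page, PAGE, end - PAGE);  /* rewrite last page */
        pwrite(fd, page, PAGE, end);             /* append a new page */
        end += PAGE;
    }
    close(fd);
    return 0;
}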
5.3 Ordering Writes

The write-ahead logging protocol guarantees that every update log record describing a change to a page is forced to stable storage before that page is written to disk. As a result, when a dirty page in the buffer pool is evicted and about to be written to disk, its associated log records must be forced to the stable device first. PostgreSQL implements this by directly flushing all log records up to the LSN of the page's associated log record. Since flushing the log significantly hurts overall performance, we consider a new OS service to benefit WAL writing. We propose the following system call:

WriteOrder(FILE *fd1, int offset1, FILE *fd2, int offset2)

This system call gives the OS cache a hint that the page at offset1 of file fd1 must be written to disk before the page at offset2 of file fd2. The OS kernel maintains this ordering information and chooses pages to evict based on both its replacement policy and the ordering constraints. With this new service, we can indicate that a data page must be written after its associated log page when a dirty page is evicted from the DB buffer pool, thereby eliminating the expensive synchronization; a usage sketch appears after Table 3.

To evaluate the benefit of the proposed system call, we instrument the PostgreSQL code to count the WAL synchronizations caused by writing dirty pages. We run a TPC-C benchmark with 10 warehouses and 10 clients; the total database size is around 1GB. We vary the database buffer pool size from 8MB to 2GB. Table 3 shows the results: the right column is the percentage of log flushes caused by writing dirty data pages. When the buffer pool is 8MB, the pages currently accessed by the 10 clients do not fit in the buffer pool; pages are evicted before their associated transactions commit, which leads to the high ratio of log flushes caused by writing dirty data. At 16MB the ratio drops sharply to 5.9%, mainly because the working set now fits in the buffer pool. At 32MB and beyond, the ratio never exceeds 2%.

Buffer pool size    % of log flushes caused by writing data
8MB                 48.6%
16MB                5.9%
32MB                1.9%
128MB               0.65%
512MB               0.51%
2048MB              0.37%

Table 3: The percentage of log flushes caused by writing data, with the buffer pool size varied.

The results indicate that the proposed system call provides only a marginal speedup in log synchronization in most cases. The exceptions are: 1) the buffer pool is very small compared with the data set; or 2) many transactions execute concurrently. In both cases, the working set does not fit in the database buffer pool, giving the proposed system call an opportunity.
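To make the intended use concrete, the sketch below shows how a buffer manager's eviction path would invoke the proposed call. WriteOrder is hypothetical (it is the interface proposed above, not an existing kernel API), and the helper names are ours.

#include <stdio.h>

/* Hypothetical: the ordering hint proposed above, not an existing API. */
void WriteOrder(FILE *fd1, int offset1, FILE *fd2, int offset2);

/* Evict a dirty data page without synchronously flushing the log:
 * hint that its log page must reach disk first, then hand the data
 * page to the OS cache with an ordinary (unsynchronized) write. */
void evict_dirty_page(FILE *log, int log_off, FILE *data, int data_off,
                      const char *page, size_t page_size)
{
    WriteOrder(log, log_off, data, data_off);  /* log page before data page */
    fseek(data, data_off, SEEK_SET);
    fwrite(page, page_size, 1, data);          /* no fsync needed here */
}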
5.4 Lazy Synchronous Commit

Synchronizing the write-ahead log at commit points is the key technique for surviving crashes and media failures. However, this synchronization significantly hurts the overall performance of database systems and is a potential bottleneck. Modern database systems support several commit methods that trade off synchronization performance against the durability guarantee. We survey the commit methods used in PostgreSQL as follows; Figure 6 compares them.

Figure 6: Comparison of four commit methods (Sync, Async, Group Commit, and Lazy Sync), showing for each terminal when the log is written to the OS cache, when it is written to disk, when the transaction commits, and what is lost at a crash.

Synchronous Commit. When a transaction commits, the database server tries to make sure that the log records are physically written to disk by issuing fsync() or an equivalent method. This ensures that the database survives crashes and media failures, but it comes with a performance penalty: at each commit, the database must wait for the OS to flush the write-ahead log to disk. As shown in the leftmost column of Figure 6, all terminals commit transactions one by one: the next terminal starts writing its commit record only after the previous terminal has written and synchronized its log records.

Asynchronous Commit. An alternative is to write the write-ahead log asynchronously, allowing the OS to do its best at buffering, ordering, and delaying writes. This can significantly improve performance. However, if the system crashes, the results of the last few committed transactions may be lost in part or in whole; in the worst case, unrecoverable data corruption can occur. As shown in the second column of Figure 6, log records are still written to the OS cache serially, but there is no explicit synchronization; they are physically written to disk according to the OS's policies. Asynchronous writes achieve the best performance, but if a crash occurs (as shown in the figure), a committed transaction may not be recoverable.

Group Commit. Group commit [1, 4] reduces pressure on disks by aggregating multiple log-flush requests into a single I/O operation. Small disk accesses are combined into larger ones, achieving significantly better disk performance by avoiding unnecessary head seeks and waits. Like the synchronous method, group commit guarantees that the database survives crashes and media failures. However, as shown in Figure 6, even though the synchronization cost is amortized over multiple transactions, each group of transactions still has to invoke a synchronization call and wait for the OS to flush all its log records to disk.
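A minimal leader-based group commit might look like the sketch below (our illustration; PostgreSQL's actual mechanism differs): committers that arrive while a flush is in progress simply ride on the next one, so one fsync covers the whole group.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int log_fd;                      /* WAL file (opened elsewhere) */
static unsigned long long requested, flushed;
static int flushing;

/* Called after a backend has copied its commit record into the OS
 * cache; returns once that record is durable. One committer becomes
 * the leader and issues a single fsync on behalf of the group. */
void group_commit(void)
{
    pthread_mutex_lock(&lock);
    unsigned long long my_seq = ++requested;
    while (flushed < my_seq) {
        if (!flushing) {
            flushing = 1;               /* become the leader */
            unsigned long long upto = requested;
            pthread_mutex_unlock(&lock);
            fsync(log_fd);              /* one I/O for the whole group */
            pthread_mutex_lock(&lock);
            flushed = upto;
            flushing = 0;
            pthread_cond_broadcast(&done);
        } else {
            pthread_cond_wait(&done, &lock);  /* ride the leader's flush */
        }
    }
    pthread_mutex_unlock(&lock);
}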
We observe that the main issue with asynchronous commit is that it commits the transaction too early, before the log records are physically written to disk. Modern OSs already provide a service that tells the application when an asynchronous I/O has completed. Based on this service, we propose a new commit method called Lazy Synchronous Commit, which achieves performance similar to asynchronous commit without sacrificing durability. The method is based on the following OS interfaces:

int aio_write(struct aiocb *aiocbp);
int aio_suspend(const struct aiocb *cblist[], int n, const struct timespec *timeout);

aio_write() issues an asynchronous write to a file; the write parameters (file descriptor, offset, etc.) are specified in the aiocb structure. aio_suspend() suspends the calling process until the asynchronous I/O requests in the list cblist of length n have completed, a signal is delivered, or the timeout expires. This service was added to Linux in 2001 and is supported by the other mainstream OSs, such as Solaris and Mac OS X.

When a transaction commits with the lazy synchronous commit method, the database first acquires the WAL lock and asynchronously writes the commit record by calling aio_write. It then releases the WAL lock so that other transactions can start to commit, and calls aio_suspend to wait for the OS to write the log records to disk. The log records are not flushed immediately, but are scheduled by the OS cache policy. After aio_suspend returns, the database changes the status of the transaction to committed. Lazy synchronous commit does not sacrifice durability, because it commits a transaction only after its log records have been physically written to disk. On the other hand, since it never explicitly calls fsync() or an equivalent, the OS remains free to do its best at buffering, ordering, and delaying writes.

Implementation. To evaluate the performance of lazy synchronous commit, we implemented it in PostgreSQL 8.4.3. In particular, we add a new synchronization method, SYNC_METHOD_LAZYSYNC, which can be selected in the postgresql.conf file or on the server command line. The routines XLogWrite() and issue_xlog_fsync() in xlog.c are modified to issue asynchronous writes through aio_write. We add a new routine XLogSuspend() that waits for the physical write by calling aio_suspend(). The routine RecordTransactionCommit() in xact.c is modified to suspend the process without holding WALWriteLock or other critical sections.

Experiment. We ran our experiments on a machine with dual 2.67GHz Intel Xeon Nehalem CPUs and 24GB of DDR3 main memory, running Linux (kernel 2.6.18). We ran the TPC-C benchmark with 3 warehouses, with the buffer pool size set to 2GB. We disabled the background writer process and the checkpoint process to avoid interference from writing data pages. For the Sync method, we use fsync to synchronize the log file, since fdatasync is not implemented in Linux kernel 2.6.18.
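The commit path described above reduces to something like the following sketch (our simplification of the patch, not the actual PostgreSQL source; for the durability claim it assumes the log file is opened with O_DSYNC, so that completion of the asynchronous write implies the record is on stable storage):

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Lazy synchronous commit: queue the commit record asynchronously,
 * let other backends proceed, then sleep until the OS reports the
 * write has completed. Assumes log_fd was opened with O_DSYNC. */
int lazy_sync_commit(int log_fd, const void *rec, size_t len, off_t off)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = log_fd;
    cb.aio_buf    = (void *) rec;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_write(&cb) != 0)            /* queue the asynchronous write */
        return -1;

    /* ... release WALWriteLock here; other commits can proceed ... */

    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);     /* wait for completion */

    return aio_return(&cb) == (ssize_t) len ? 0 : -1;
}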
Figure 7 shows the TPC-C throughput comparison of the various commit methods. Group (small) and Group (large) are group commit with group sizes of 10 and 1000, respectively; a group size of 1000 transactions is the maximum that PostgreSQL supports. As shown in the figure, Sync is the slowest method, whereas Async is the fastest, with the Group methods in between: the larger group size achieves better performance than the smaller one. The performance of the Lazy method is very close to that of Async. The results confirm the analysis above.

Figure 7: TPC-C throughput comparison of the commit methods Sync, Async, Group (small), Group (large), and Lazy.
5.5 Summary

Our case study of write-ahead logging shows that modern OSs already provide sufficient services to achieve high performance, but DBMSs are conservative in using them. In particular, the OS services proposed for the issues we identified, e.g., overlapped sequential writes and write ordering, show only marginal room for improvement in most cases, whereas an existing OS service allowed us to implement a new commit method that performs about as well as the asynchronous method without sacrificing durability.

6 Conclusion

In this paper, we try to answer two questions about OS support for DBMSs. First, does the DBMS use the right existing OS services? We study a real-world DBMS, PostgreSQL, on Linux, Solaris, and Mac OS X. By examining the traced system calls, we conclude that PostgreSQL makes conservative use of OS services. The DBMS benefits from advances in OS research, e.g., the adaptive readahead framework, but many new OS features are underutilized, e.g., threads, asynchronous I/O, and direct I/O, or even misused, e.g., POSIX semaphores in the PostgreSQL implementation on Mac OS X. The portability requirement of the DBMS implementation prevents an aggressive use of OS services.

Second, can we improve DBMS performance by providing good services? We conduct a case study on the write-ahead logging of PostgreSQL. We conclude that it is possible to provide new OS services to improve DBMS performance, e.g., WriteOrder for guaranteeing the write order of two blocks. In addition, we propose a lazy sync technique using the asynchronous I/O interfaces provided by mainstream operating systems. Our empirical results show that using the right OS service can improve DBMS performance.

References

[1] DEWITT, D. J., KATZ, R. H., OLKEN, F., SHAPIRO, L. D., STONEBRAKER, M., AND WOOD, D. A. Implementation techniques for main memory database systems. In SIGMOD Conference (1984), pp. 1-8.
[2] DTRUSS. http://www.brendangregg.com/dtrace/dtruss
[3] FELLIG, D., AND TIKHONOVA, O. Operating system support for database systems revisited. cs736 course project report, University of Wisconsin-Madison, 2000.
[4] HELLAND, P., SAMMER, H., LYON, J., CARR, R., GARRETT, P., AND REUTER, A. Group commit timers and high volume transaction systems. In HPTS (1987), pp. 301-329.
[5] JOHNSON, R., PANDIS, I., STOICA, R., ATHANASSOULIS, M., AND AILAMAKI, A. Aether: A scalable approach to logging. PVLDB 3, 1 (2010), 681-692.
[6] MOHAN, C., HADERLE, D. J., LINDSAY, B. G., PIRAHESH, H., AND SCHWARZ, P. M. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst. 17, 1 (1992), 94-162.
[7] ZHANG, D., AND NI, J. Operating system support for database systems revisited. cs736 course project report, University of Wisconsin-Madison, 2000.
[8] POSIX. http://standards.ieee.org/develop/wg/posix.html
[9] POSTGRESQL. http://www.postgresql.org/
[10] LINUX ADAPTIVE READAHEAD. http://kerneltrap.org/node/6642
[11] STONEBRAKER, M. Operating system support for database management. Communications of the ACM 24, 7 (1981), 412-418.
[12] STRACE. http://linux.die.net/man/1/strace
[13] TPC-C. http://www.tpc.org/tpcc/
[14] TRUSS. http://docs.sun.com/app/docs/doc/816-0210/6m6nb7mnl?l=en&a=view
[15] TS'O, T. Don't fear the fsync. http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
[16] VASIL, T. Reexamining operating system support for database management. Tech. rep., Harvard, 2003.
[17] WU, F., XI, H., AND XU, C. On the design of a new Linux readahead framework. ACM SIGOPS Operating Systems Review 42 (2008).
[18] YANG, L., AND LI, J. Operating system support for databases revisited. cs736 course project report, University of Wisconsin-Madison, 2000.