hybridfs: Integrating NAND Flash-Based SSD and HDD for a Hybrid File System

Jinsun Suk and Jaechun No
College of Electronics and Information Engineering, Sejong University, 98 Gunja-dong, Gwangjin-gu, Seoul, Korea

Abstract: - In this paper, we present a hybrid file system, called hybridfs, whose primary objective is to combine the attractive features of both HDD and SSD to construct a large-scale, virtualized address space in a cost-effective way. HybridFS was designed to take advantage of the SSD's high I/O performance, while providing a flexible internal structure that utilizes the excellent sequential performance of existing file systems. To optimize the limited space capacity of the SSD partition, hybridfs supports defining several data sections whose extent sizes can differ from each other. Such a pre-determined extent size allows a data section to be defined so that the wasted hole of an extent is minimized, by closely matching a multiple of the extent size to the size of the files to be allocated in that data section. Several experiments were conducted to verify the performance of hybridfs.

Key-Words: - SSD, hybrid file system, storage integration, data section, extent size, multi-level queue

1 Introduction
In this paper, we present a hybrid file system, called hybridfs, whose primary objective is to combine the attractive features of both HDD and SSD (solid-state drive) to construct a large-scale, virtualized address space in a cost-effective way. The remarkable development of NAND flash technology has made possible the advent of the SSD, which is considered a candidate to replace the traditional HDD due to its potential benefits such as low power usage, reliability, and random I/O performance[9]. These promising storage characteristics have become the driving force of numerous studies related to SSD, with the expectation of achieving high I/O performance in various environments. For instance, multimedia and web applications can obtain performance benefits from employing SSD storage subsystems to serve a large number of end-user I/O requests. However, there are several serious constraints in building storage subsystems composed solely of SSDs. One such constraint is that, because of the SSD's internal flash memory components, developing a file system for an SSD storage subsystem raises issues that never occurred in the HDD storage subsystem, such as erasure behavior[10,12,13] and wear-leveling[8,9]. The second constraint is the SSD's high cost per unit capacity compared to HDD. Although the first constraint can be addressed by using an FTL[5], the high SSD cost still makes it impractical to build large-scale storage subsystems using SSD devices alone. An alternative is to build a hybrid storage subsystem in which both HDD and SSD devices are incorporated in an economical manner, while utilizing the strengths of both devices to the maximum extent possible. Building such a hybrid storage subsystem requires answering several interesting questions. The first question is how to define the work role of each storage medium in a hybrid system. One answer is to use either the HDD or the SSD as a write cache. Using the HDD as a write cache is possible because its sequential I/O performance is comparable to the SSD's. However, such a design might lose the opportunity to utilize the SSD's excellent random I/O performance. On the contrary, using the SSD as a write cache must take into account its inherent characteristics, such as the limited number of erase cycles. The second question is what kind of data should be stored on each storage medium.
For example, storing metadata on the SSD might cause a considerable number of small writes, resulting in intensive erase overhead. Furthermore, most file systems attempt to coalesce as many small metadata writes as possible, thereby achieving good I/O performance on HDD devices. As a consequence, utilizing the HDD's comparable sequential performance can simplify the implementation of a file system for the hybrid storage subsystem.

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0017181).
In designing hybridfs, we tried to find suitable answers to those questions. The objective of hybridfs is to take advantage of the SSD's high I/O performance while providing a flexible internal structure that integrates the excellent sequential performance of existing file systems. This flexibility is achieved by making a direct path between the HDD and SSD device partitions to provide an integrated address space. We chose to use the SSD partition as a write-through cache. Such a cache significantly simplifies the design of hybridfs because it only requires a simple synchronized write path between the two partitions and a logging mechanism in case of an SSD crash. To optimize the limited space capacity of the SSD partition, hybridfs supports defining several data sections whose extent size, which is the allocation unit of the SSD partition, can differ from each other. The number of data sections and the extent size of each data section are determined at file system creation time. One way of making use of such a pre-determined extent size is to define a data section in which a multiple of the extent size closely matches the size of the files to be mapped to that data section. Furthermore, the extent size of each data section can be chosen by taking into account the flash block size of the SSD device, to minimize erasure overhead.

This paper is organized as follows. In the next section, we present the implementation details of hybridfs. In Section 3, the performance of hybridfs is discussed using two I/O benchmarks. We conclude the paper in Section 4.

2 Implementation Detail

2.1 Disk Layout
Fig.1 illustrates the disk layout of hybridfs. Hybridfs uses different block allocation policies for the HDD and SSD partitions. The efficiency of the SSD cache depends on how effectively the SSD's limited space capacity can be managed. This involves both the allocation cost of new files and the hit ratio of the most frequently demanded files. To alleviate the allocation cost, hybridfs eliminates indirection in SSD block allocation. Also, to allocate SSD blocks as sequentially as possible, file creation is performed on a per-extent basis. File allocation in the HDD partition is performed by the ext2/ext3 allocation modules. In Fig.1, the SSD partition is divided into multiple data sections, with each section defining its own extent size. The first data section, D0, uses extents composed of four blocks, whereas each extent of the second and third data sections, D1 and D2, consists of u and v blocks, respectively, where 4 < u < v. Providing multiple extent sizes gives an opportunity for space optimization, such as allocating large files in a data section with a large extent size and small files in a data section with a small extent size.

Fig.1 Disk layout of hybridfs. The shadowed box indicates partially occupied data.

The beginning of the SSD partition contains the information about data sections, such as the number of data sections, the section size, and the pre-determined extent size of each data section. These configuration parameters are defined at file system creation. The default number of data sections is one, and in that case an extent is composed of four blocks.

Fig.2 Extent header of data section D1, where the extent size is u.
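To make the on-disk configuration concrete, the following C sketch shows one possible layout of the data-section information kept at the beginning of the SSD partition. The structure and field names (hfs_ssd_super, hfs_section_info, extent_blocks, and so on) are illustrative assumptions, not the actual hybridfs on-disk format:

    /* Hypothetical sketch of the per-section configuration written at
     * file system creation time; not the actual hybridfs disk format. */
    #include <stdint.h>

    #define HFS_MAX_SECTIONS 8    /* assumed upper bound on data sections */

    struct hfs_section_info {
        uint64_t start_block;     /* first SSD block of this data section */
        uint64_t section_blocks;  /* total blocks in the section */
        uint32_t extent_blocks;   /* extent size: 4 for D0, u for D1, v for D2 */
    };

    struct hfs_ssd_super {
        uint32_t num_sections;                         /* default is 1 */
        struct hfs_section_info sec[HFS_MAX_SECTIONS]; /* one entry per section */
    };

With such a layout, a file of size f is mapped to the section whose extent size wastes the least space, as described in Section 2.2.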
The configuration parameters are immediately followed by the extent bitmaps and headers. Each data section maintains its own bitmap and header. Each bit of the extent bitmap indicates the allocation status of the associated extent; a bit is set to one only when all blocks of the extent are free. The allocation status of partially occupied extents (partial extents) is indicated by the extent header. Fig.2 shows the memory image of the extent header of data section D1, with extent size u. Hybridfs reuses only those partial extents whose remaining number of free blocks is more than half of the total blocks in an extent, to prevent files from being spread widely across partial extents. The extent header is organized with u/2 header entries such that the entry at position i points to the linked list of partial extents with u-2^i free blocks, where 0 <= i <= log(u/2). Each header entry consists of three attributes: the number of partial extents linked at that header position comes first, followed by the starting and current extent addresses of the associated linked list. In the first header position, insertions to the list occur at the current extent address, denoted (e1_cur of u-1, b1) in Fig.2, and deletions due to extent reuse occur at the starting extent address, denoted (e1_st of u-1, b1).

2.2 Algorithm for SSD Block Allocation

b = block size
f = size of a new file to be allocated
w = number of unused blocks in the last extent

1. Choose a data section Di such that the number of extents for the new file is min{f/(b*4), f/(b*u), f/(b*v)}. Assume that D1 is chosen.
2. Check the extent bitmap of D1 to find clean extents. After file allocation, if w < u/2, exit the algorithm.
3. Calculate i such that u-2^i <= w < u-2^(i-1). Insert the associated extent descriptor into the i-th header position.

If there are not sufficient clean extents to hold the new file, use the following steps:

1. If f <= (u-1)*b:
   Find the header position i whose linked list contains a partial extent whose remaining free blocks are large enough to hold the new file. Remove the extent descriptor from the i-th position to reuse the corresponding extent and adjust the current extent address of that position.
2. If f > (u-1)*b:
   repeat
      if f > (u-1)*b
         scan the header from the 0-th position until a partial extent is found; assume that the extent is found at the i-th position
      else
         scan the header from the i-th position such that (u-2^i)*b <= f < (u-2^(i-1))*b until a partial extent is found
      f = f - (u-2^i)*b
   until enough partial extents are allocated for the file

Fig.3 Algorithm for SSD block allocation.

Fig.3 describes the steps to allocate a new file in the SSD partition. When a new file of size f is created in the SSD partition, hybridfs first chooses a data section Di such that the minimum number of extents is needed to allocate the file. After determining the appropriate data section, hybridfs checks the extent bitmap of that section to allocate clean extents. If not all blocks of the last extent are used and the remaining number of free blocks, w, is greater than or equal to u/2, then the extent descriptor of the last extent is inserted into the i-th header position such that u-2^i <= w < u-2^(i-1). If there are not enough clean extents available, hybridfs looks for partial extents by checking the extent header. For a file of size f that needs at most u-1 free blocks, hybridfs reuses the partial extent whose unused free space is closest to f.
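To make the calculations in Fig.3 concrete, the C sketch below shows one way that step 1 (choosing the data section needing the fewest extents) and step 3 (finding the header position i with u-2^i <= w < u-2^(i-1)) could be coded. The function names, the ceiling division, and the loop used to find i are our illustrative assumptions based on the description above, not code taken from hybridfs:

    #include <assert.h>

    /* Return the header position i such that u - 2^i <= w < u - 2^(i-1),
     * i.e. the smallest i with 2^i >= u - w. Only extents that are at
     * least half free (w >= u/2) are linked into the header. */
    static int hfs_header_pos(int u, int w)
    {
        assert(w >= u / 2 && w < u);
        int used = u - w;          /* blocks already occupied in the extent */
        int i = 0;
        while ((1 << i) < used)
            i++;
        return i;
    }

    /* Step 1 of Fig.3: pick the data section whose extent size needs the
     * fewest extents for a file of f bytes. b is the block size in bytes,
     * ext[] holds the per-section extent sizes in blocks (4, u, v, ...). */
    static int hfs_pick_section(long long f, long long b, const int *ext, int nsec)
    {
        int best = 0;
        long long best_cnt = (f + b * ext[0] - 1) / (b * ext[0]);  /* ceiling */
        for (int s = 1; s < nsec; s++) {
            long long cnt = (f + b * ext[s] - 1) / (b * ext[s]);
            if (cnt < best_cnt) {
                best_cnt = cnt;
                best = s;
            }
        }
        return best;
    }

For example, with u = 16 and a last extent holding w = 13 free blocks, hfs_header_pos returns i = 2, since 16 - 2^2 = 12 <= 13 < 14 = 16 - 2^1.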
For a file larger than (u-1)*b, hybridfs checks from the first header position until enough partial extents are obtained for the file.

The file metadata and data in the SSD and HDD partitions are connected by SSD-related attributes in the inode. SSD_active indicates whether the corresponding data are available in the SSD partition; this flag is turned off when the cached data are evicted from the SSD address space by the extent replacement algorithm. SSD_wr_done indicates the atomic write between the HDD and SSD address spaces; this flag is enabled only when the file is safely stored in both address spaces. SSD_address is a sequence of extent addresses, each consisting of an extent number, a starting block address, and a block count.

Hybridfs provides a circular, multilevel LRU queue per data section for SSD extent replacement. Fig.4 shows an example of the circular LRU queue of data section D1. In the circular LRU queue, the hotness of a file is determined by its access time, meaning that file inodes are linked to the queues in increasing order of access time. The queue pointed to by current_level contains hot files, and the queue pointed to by flush_level contains cold files to be flushed out when there are insufficient free extents available. In the current implementation, each queue links 1K accessed inodes.
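The following C fragment sketches how the SSD-related inode attributes described above, and the flush step detailed in the next paragraph, could be represented. The structure layout, field widths, and the hfs_flush_inode helper are illustrative assumptions rather than the actual hybridfs implementation:

    #include <stdint.h>

    /* One entry of SSD_address: an extent plus the block range used in it. */
    struct hfs_extent_addr {
        uint32_t extent_no;     /* extent number within the data section */
        uint32_t start_block;   /* starting block address inside the extent */
        uint32_t block_count;   /* number of blocks used */
    };

    /* SSD-related attributes kept in the hybridfs inode. */
    struct hfs_inode_ssd {
        unsigned int ssd_active  : 1;  /* data currently cached in the SSD partition */
        unsigned int ssd_wr_done : 1;  /* file safely stored in both partitions */
        uint16_t nr_extents;           /* length of the SSD_address sequence */
        struct hfs_extent_addr *ssd_address;
    };

    /* Flushing a cold file from the SSD cache: because the HDD copy already
     * exists, clearing the two flags is sufficient to release the extents. */
    static void hfs_flush_inode(struct hfs_inode_ssd *si)
    {
        si->ssd_active  = 0;
        si->ssd_wr_done = 0;
    }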
Once the i-th queue is full, the most recently accessed inode is inserted at the beginning of the ((i+1) mod n)-th queue. Re-referenced inodes, such as ino1,1 and inoi,0 in Fig.4, are moved to the queue pointed to by current_level, with their indices modified to reflect the movement between queues.

Fig.4 Circular, multilevel LRU queue.

When the available SSD space in a data section drops below a threshold, the files starting from the queue pointed to by flush_level are flushed out of the SSD partition until sufficient free SSD extents are obtained. Because their copies are available in the HDD partition, the only work required for an SSD flush is to turn off SSD_active and SSD_wr_done in the corresponding inodes.

3 Performance
We evaluated the performance of hybridfs by using the IOzone benchmark[11] and our own I/O template. We compared the performance of hybridfs to that of both ext2 [4,7] and xfs [1,2,3] installed on HDD and SSD devices, to analyze the strengths and weaknesses of hybridfs. The experimental platform has an Intel Xeon 3GHz CPU, 16GB of RAM, a 750GB SATA HDD, and an 80GB Fusion-io SSD.

3.1 IOzone Benchmark
Figs.5 to 8 show the performance comparison of various file operations on top of hybridfs, ext2, and xfs. In the IOzone read measurements, we saw that the memory cache greatly contributes to the high I/O bandwidth. We varied the file size from 64KB to 512MB.

Fig.5 IOzone write performance
Fig.6 IOzone rewrite performance
Fig.7 IOzone read performance
Fig.8 IOzone strided read performance

Fig.5 shows the write bandwidth on HDD and SSD devices. As can be seen, the performance of hybridfs is almost three times higher than that of both ext2 and xfs installed on HDD devices. In contrast, hybridfs shows a write throughput similar to ext2 installed on the SSD device and is 12% faster than xfs on the SSD device. Even though the performance of ext2 and xfs on SSD devices is much higher than on HDD devices, it is not reasonable to build a large-scale storage subsystem composed solely of SSD devices, due to limited storage capacity and high cost. The experiment shows that in such a case hybridfs offers an alternative way of utilizing the SSD's high performance and the HDD's vast, low-cost storage capacity.

The evaluation of write operations on the two devices points out several interesting issues. First, the difference between the worst and best performance of ext2 and xfs on HDD devices is much larger than the corresponding difference on SSD devices. For example, on the HDD device, when the performance with a 512MB file is compared to that with a 256KB file, ext2 shows almost an eight-fold speed improvement. On the other hand, on the SSD device, the performance difference between the 64KB and 512MB file sizes is 55%, which is much less than on the HDD device. Xfs shows a behavior similar to ext2, with about a 59% performance difference between the 64KB and 512MB file sizes on the SSD device, but a much larger difference on the HDD device. This indicates that the overhead of the HDD's moving parts degrades write performance more than the overhead of the SSD's semiconductor properties. Due to the absence of mechanical overhead, both ext2 and xfs on SSD devices generate almost three times higher bandwidth than on HDD devices. Second, although hybridfs uses the HDD portion as the metadata store, this internal structure does not greatly degrade write performance: except for the 64KB file size, the performance difference between small and large files is very small, even though more metadata accesses occur for small files than for large files.

In Fig.6, we compared the rewrite performance of hybridfs to that of both ext2 and xfs. Similar to the write evaluation on the HDD device, the rewrite throughput of hybridfs outperforms that of both ext2 and xfs, mostly due to the advantage of the SSD device. Also, when compared to both file systems on the SSD device, hybridfs generates almost the same bandwidth as ext2, but is marginally faster than xfs.

In the file read operation (Fig.7), unlike in the write experiments on the HDD device, hybridfs does not produce a noticeable performance difference. In this experiment, hybridfs shows slightly higher bandwidth than xfs. For ext2 and xfs, the performance difference between the worst and best cases on both devices is much smaller than in the write experiment. This is because, in the IOzone benchmark, most read requests are served from the memory cache. Similar I/O behavior can be observed in the strided read operation (Fig.8), where the access pattern of reading 4KB and seeking 200KB is repeated until the end of the file is reached.

3.2 Evaluation with Randomized File Accesses
The objective of this experiment is to observe the performance behavior of the three file systems with little impact from the memory cache and data sequentiality. Our template was designed to measure the performance of random access patterns.
The following code segment shows how a file is randomly chosen in our template:

    while (number_of_files_to_be_tested) {
        /* Choose random numbers for a, b, c, srv, m */
        trail_uid++;
        sprintf(path,
            "/mnt/dev_mount/disk%d/test/%c/%c/%c/%c%c-%d-%c@paran.com/mbox%d.dat",
            srv, a, b, c, a, b, trail_uid, c, m);
        /* Perform I/O operations using the file named path */
    }

We conducted randomized small-file experiments, varying the file size from 4KB to 512KB. For each file size, we created 1,000,000 random files. Fig.9 shows the random write performance. On the HDD device, hybridfs generates exceptionally high bandwidth compared to both ext2 and xfs; for example, with a 512KB file size, hybridfs is three times faster than ext2. The drawback of hybridfs in this experiment is the performance decrement as the file size increases.
For instance, there is a 47% performance decrement between the 4KB and 512KB file sizes. We think that the overhead of SSD extent replacement, together with the SSD's semiconductor properties, might cause this performance degradation. Fig.9 also shows that, on the HDD device, ext2 generates gradually improving performance in the random environment, whereas xfs does not. On the SSD device, hybridfs outperforms the write performance of the other two file systems in most cases.

Fig.9 Small random write performance
Fig.10 Small random rewrite performance
Fig.11 Small random read performance

Fig.10 shows the bandwidth of random rewrite operations. In this evaluation, hybridfs achieves higher bandwidth than in the random write experiment. With a 512KB file size, hybridfs shows a 65% speed improvement compared to its write performance with the same file size. This is because, in the random rewrite experiment, fewer metadata operations requiring access to the HDD partition of hybridfs occur than in the random write operations. On the HDD device, the performance difference between hybridfs and xfs is slightly smaller than in the random write operation. However, in the case of ext2, the performance difference with hybridfs becomes larger than in the random write operation. This observation tells us that xfs offers better throughput in random rewrite operations, where fewer metadata operations are generated. Fig.10 also demonstrates that the overall rewrite performance of xfs on the SSD device is higher than that of both hybridfs and ext2. This advantage is most obvious between the 16KB and 64KB file sizes. However, for file sizes larger than 256KB, the three file systems achieve comparable performance.

Fig.11 presents the random read bandwidth of the three file systems on HDD and SSD devices. For random read operations, the strength of hybridfs over both ext2 and xfs installed on HDD devices is obvious, which differs from the read behavior observed in the IOzone benchmark. However, this strength of hybridfs is not present on the SSD device, where the performance of the three file systems is comparable.

The performance measurement for random I/O operations with large files was conducted with file sizes varying from 1MB to 256MB. In this experiment, for each file size, we randomly created 1000 files, which is fewer files than in the small-file experiment, because of the SSD storage size. In Fig.12, hybridfs shows a steady throughput up to the 16MB file size, but from the 64MB file size it demonstrates a large performance speedup over ext2 and xfs installed on both devices. Up to the 16MB file size, the performance of hybridfs was double that of ext2 installed on the HDD device. From the 64MB file size onward, the performance difference between hybridfs and ext2 on the HDD device becomes even larger because most file accesses occur in the SSD partition of hybridfs.
When compared to ext2 and xfs installed on SSD devices (Fig.12), in the steady region from 1MB to 16MB file sizes, the performance of hybridfs is worse than that of the other file systems. However, this changes from the 64MB file size, at which the performance of hybridfs and ext2 becomes comparable while xfs is 70% slower than hybridfs. As the file size exceeds 64MB, the strength of hybridfs becomes more apparent, achieving a 61% speedup over ext2 with a 256MB file size. In Fig.13, similar performance behavior can be observed in random rewrite operations on both HDD and SSD devices. Fig.14 presents the random read bandwidth. As can be seen, with file sizes larger than 64MB, hybridfs achieves a large performance speedup over ext2 and xfs, which was not observed in the previous read experiments. The existence of such a speedup in various random I/O operations indicates that hybridfs can work effectively in an environment where a large number of random files are generated.

Fig.12 Large random write performance
Fig.13 Large random rewrite performance
Fig.14 Large random read performance

4 Conclusion
We designed and implemented hybridfs, whose objective is to combine the potential benefits of both SSD and HDD devices to construct a large-scale, virtual storage capacity in a cost-effective way. HybridFS uses the SSD partition as a write-through data cache to utilize the SSD's high write performance, while maintaining the HDD's advantages of fast sequential read performance and low-cost storage capacity. The performance evaluation shows that hybridfs frequently demonstrates performance comparable to ext2 and xfs installed on SSD devices. However, it is not reasonable to build a large-scale storage subsystem composed solely of SSD devices because of the high cost per unit capacity. The experiments show that in such a case hybridfs gives an opportunity to make use of the advantages of both HDD and SSD devices. As future work, we plan to evaluate hybridfs with real applications to verify its performance benefits.

References:
[1] The SGI XFS project, http://oss.sgi.com/projects/xfs
[2] J. Mostek, W. Earl and D. Koren, Porting the SGI XFS File System to Linux, 6th Linux Kongress and the Linux Storage Management Workshop, 1999
[3] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck, Scalability in the XFS File System, Proc. of the USENIX 1996 Technical Conference, San Diego, USA, 1996
[4] R. Card, T. Ts'o and S. Tweedie, Design and Implementation of the Second Extended Filesystem, Proc. of the First Dutch International Symposium on Linux, 1995
[5] Understanding the Flash Translation Layer (FTL) Specification, Technical Report, Intel Corporation, December 1998
[6] Aleph One Company, YAFFS: Yet Another Flash File System, http://www.yaffs.net/, 2008
[7] T. Ts'o and S. Tweedie, Planned Extensions to the Linux Ext2/Ext3 Filesystem, 2002 USENIX Annual Technical Conference, pp.235-244, Monterey, CA, June 2002
[8] G. Soundararajan, V. Prabhakaran, M. Balakrishnan and T. Wobber, Extending SSD Lifetimes with Disk-Based Write Caches, 8th USENIX Conference on File and Storage Technologies, San Jose, USA, February 2010
[9] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse and R. Panigrahy, Design Tradeoffs for SSD Performance, 2008 USENIX Annual Technical Conference, 2008
[10] A. Rajimwale, V. Prabhakaran and J. D. Davis, Block Management in Solid-State Devices, 2009 USENIX Annual Technical Conference, June 2009
[11] IOzone, http://www.iozone.org
[12] J.-W. Hsieh, L.-P. Chang and T.-W. Kuo, Efficient Identification of Hot Data for Flash-Memory Storage Systems, ACM Transactions on Storage, Vol. 2, No. 1, 2006
[13] L.-P. Chang and T.-W. Kuo, An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems, ACM Symposium on Applied Computing, 2004