NANDFS: A Flexible Flash File System for RAM-Constrained Systems
NANDFS: A Flexible Flash File System for RAM-Constrained Systems

Aviad Zuck, Ohad Barzilay, and Sivan Toledo
The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel

ABSTRACT

NANDFS is a flash file system that exposes a memory-performance tradeoff to system integrators. The file system can be configured to use a large amount of RAM, in which case it delivers excellent performance. In particular, when NANDFS is configured with the same amount of RAM that YAFFS2 uses, the performance of the two file systems is comparable (YAFFS2 is a file system that is widely used in embedded Linux and other embedded environments). But YAFFS2 and other state-of-the-art flash file systems allocate RAM dynamically and do not provide the system builder with a way to limit the amount of memory that they allocate. NANDFS, on the other hand, allows the system builder to configure it to use a specific amount of RAM. The performance of NANDFS degrades when the amount of RAM it uses shrinks, but the degradation is graceful, not catastrophic. NANDFS is able to provide this flexibility thanks to a novel data structure that combines a coarse-grained logical-to-physical mapping with a log-structured file system.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management (Allocation/deallocation strategies, Garbage collection)

General Terms: Design, Algorithms, Performance, Experimentation

Keywords: Flash, NAND flash, File System, RAM constrained, Page Mapping

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EMSOFT '09, October 12-16, 2009, Grenoble, France. Copyright 2009 ACM.

1. INTRODUCTION

As the use of NAND flash increases and becomes more varied, the need for flexible and efficient NAND-flash storage management systems grows.

This paper describes NANDFS (available under the GNU Public License), a novel flash file system with a POSIX-like interface. The main innovation in NANDFS is a flash-specific log-structured storage subsystem that uses indirection to reclaim memory more efficiently than previous log-structured designs, but requires much less RAM than earlier systems that use similar indirection. Given enough RAM and a flash that is not too full, this log-structured design allows NANDFS to achieve high performance on virtually any access pattern, including random writes, which are slow on many simpler (but widely-used) flash storage systems. As the amount of RAM available to NANDFS shrinks, its performance degrades, but gracefully.

Achieving this memory-performance tradeoff has been our main goal in the design of NANDFS. We aimed to achieve near-peak performance at the large-RAM end and good performance even at the small-RAM extreme, under any access pattern. We did not have any performance goals for the case when the flash is nearly full, because as we explain later, random writes are always slow on a nearly-full flash.
This flexibility is critical for embedded file systems, since embedded systems vary widely in the amount of RAM that they have. Squeezing a file system into a small RAM is not hard, if one is willing to tolerate poor performance. For example, a FAT file system mounted on top of an SD card or on top of a flash-translation layer can operate with very little RAM, but under some access patterns (e.g., many random writes), performance is likely to be terrible. NANDFS provides RAM-usage flexibility, while delivering near-peak performance when a large RAM is available and good performance even with small RAMs.

Because NANDFS does not require a large RAM, it can scale to larger and larger flash chips without requiring a larger and larger RAM. In contrast, the amount of RAM that existing flash file systems like YAFFS2 and JFFS2 require is proportional to the amount of data stored on flash. Given that flash storage density continues to grow rapidly, this is an important advantage of NANDFS.

Log-structured file systems are well suited for flash, because they use copy-on-write rather than in-place re-writing. In a traditional log-structured design, random writes are not terribly slow, but they do increase the cost of garbage collection, because when a randomly-written still-valid block is copied during garbage collection, the indirect block pointing to it must also be re-written, and the block pointing to the indirect block, and so on. (Blocks that are written sequentially are cheaper to garbage collect because a few blocks of pointers point to many blocks of data in the garbage-collected chunk.) NANDFS uses a level of indirection to eliminate that cost; pointers stored in the file system are logical, not physical, so pointers stored on flash remain valid when blocks are copied during garbage collection.

The use of logical addresses in storage systems is not new, of course, but in earlier systems that used this idea the granularity of the mapping was individual blocks.
This requires a large logical-to-physical mapping table in RAM for fast lookups; without a large mapping table, random accesses are slow, requiring two or more medium accesses per random read, even when the logical address of the block to be read is known. Widely-used flash file systems like YAFFS2 and JFFS2 use large mapping tables in RAM to avoid this cost. This leaves less RAM for buffering and caching, thereby hurting performance.

NANDFS uses an innovative logical-to-physical mapping that is coarse grained. This mechanism allows NANDFS to achieve high performance on virtually any access pattern, including random writes, and to maintain acceptable performance even on systems with small amounts of RAM (tens of kilobytes, or even just a few kilobytes). The coarse-grained mapping forces NANDFS to keep blocks at the same relative offset within large fixed-sized chunks when they are garbage collected. This, in turn, implies that garbage collection and block allocation are interleaved, not separated. This turns out to work well in flash, in which spatial locality brings essentially no performance benefit. It would not work well on disks, where spatial locality is essential for performance. In this sense, NANDFS exploits the unique characteristics of NAND flash.

NANDFS is transactional. It supports multiple concurrent transactions. The transactional guarantees make it easier to develop robust applications than with the weaker POSIX or Win32 semantics. Transactional file systems are not new, but they are particularly easy and natural to support on flash; the paper shows one concrete way to do so.

The rest of this paper is organized as follows. Section 2 provides an overview of flash memories, and Section 3 describes other background and related work. The design of NANDFS is described in Section 4. The implementation and testing strategy of NANDFS are described in Section 5. Section 6 presents and explains the results of extensive experiments that we carried out. We present our conclusions in Section 7.

2. FLASH MEMORIES

Virtually all the high-capacity flash chips today use a technology called NAND flash. NAND flash chips are fairly standardized in terms of their physical packaging, their electrical interconnects, and their programming interfaces. (An industry standard called the Open NAND Flash Interface aims to make the chips completely standardized; see www.onfi.org.) The electrical interconnect consists of an 8-bit address/data/command bus, along with a few control signals. This interconnect allows NAND flash chips to be attached to the external memory bus of a microprocessor using a small amount of glue logic, but it can also be easily driven by a general-purpose I/O interface of a microcontroller. Connecting a NAND flash chip to a memory bus leads to higher bandwidth and less overhead than driving the chip using general-purpose I/O pins.

The memory of a NAND flash chip is partitioned into fixed-size erase units, which are further partitioned into pages (erase units are also referred to as erase blocks, but in this paper we reserve the term block for another entity). Typical combinations are 128 KB/2112 B and 16 KB/528 B. These combinations are called large-page and small-page flash. (There are now also chips with 4 KB pages, but in this paper we use the term large-page flash to refer to 2 KB pages.) Each page has a data area (2048 B or 512 B) and a spare or metadata area of 64 B in large-page flash and 16 B in small-page flash. The spare area is used for storing an error-correction code and metadata associated with the page.
The programming interface of NAND flash chips provides three main operations: read, erase, and program (also called write). The read operation transfers a page from the flash array to an on-chip volatile buffer (this takes tens of µs), from which data can be streamed to the processor at high speed (tens of nanoseconds per byte). The erase operation sets all the bits in a given erase unit to one. There is no other way to set a flash bit. Erasures are slow. The page-program operation clears some of the bits in a page. It starts with a high-speed data transfer from the processor to the flash's on-chip volatile buffer. Once the data is in the buffer, it is copied into the flash array, which takes hundreds of µs. The data-sheet specifications of two typical chips are shown in Table 1.

                         K9F2G08UXA   K9W8G08U1M
total capacity           256 MB       1 GB
size of erase units      128 KB       128 KB
number of erase units    2048         8192
page-read latency        25 µs        20 µs
page-program latency     200 µs       200 µs
erase latency            1.5 ms       1.5 ms
per-byte transfer time   25 ns        25 ns

Table 1: Characteristics of two Samsung NAND flash chips.

Once a byte in a NAND flash is programmed, it cannot be programmed again until the erase unit containing it is erased. Some NAND flash chips allow partial-page programming, where only some bytes in a page are programmed, leaving the rest unmodified; the unmodified bytes can be programmed later. All NAND flash chips place severe restrictions on partial-page programming, typically only allowing 1-4 writes to the data area of a page and 2-4 writes to the spare area. Some chips also restrict the byte ranges in partial writes, for example to aligned multiples of 512 bytes in the data area.

The endurance of all flash chips is limited. Each erasure causes some damage to the flash memory array. After a certain number of erasures, a unit may cease to function properly. Typical endurance limits for NAND flash are 10,000 to 100,000 erasures. Also, some erase units may be marked in the metadata area as defective, by the manufacturer. A defective erase unit (also called a bad unit) must not be used at all.

NAND flash software systems include disk-emulation layers (also called flash translation layers or FTLs) and file systems. Disk-emulation layers, which make the flash look like a read/write block device, are built into many flash devices such as USB memory sticks, memory cards (e.g., SD cards), and IDE/SATA disk replacement units. They hide the coarse erasure granularity, the bad erase units, and the endurance issues from the rest of the system, and in particular, from the file system. Flash-specific file systems use the flash chip through a thin driver layer that just provides communication with the flash chip(s).

All flash-based storage systems must perform garbage collection. When the flash is nearly full and is subjected to extensive random writes, garbage collection becomes inefficient. This occurs because the random writes force the storage system to garbage collect erase units with very few invalid pages. This phenomenon occurs with any flash storage system.
The only way to guarantee high performance for all access patterns is to utilize only a portion of the physical flash size; this ensures that when garbage collection is necessary, at least one erase unit contains a significant number of invalid pages.
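To make the discussion concrete, the following C sketch shows the kind of low-level driver interface that a flash file system such as NANDFS relies on. The names, signatures, and geometry constants are illustrative assumptions, not the actual NANDFS driver API; they simply mirror the three chip operations (read, program, erase) and the data/spare page layout described above.

```c
#include <stdint.h>

/* Illustrative large-page geometry: 2048 B data + 64 B spare, 128 KB erase units. */
#define PAGE_DATA_SIZE   2048
#define PAGE_SPARE_SIZE  64
#define PAGES_PER_UNIT   64           /* 128 KB / 2 KB */

typedef int flash_status;             /* 0 on success, nonzero on error */

/* Read one page (data and spare area) from the flash array into RAM buffers. */
flash_status nand_read_page(uint32_t page, uint8_t *data, uint8_t *spare);

/* Program a page that is currently in the erased state.  Bits can only be
 * cleared; the page cannot be reprogrammed until its erase unit is erased.  */
flash_status nand_program_page(uint32_t page, const uint8_t *data,
                               const uint8_t *spare);

/* Partial-page program of the spare area only, e.g., to set a flag later.
 * Chips allow only a small number of such partial writes per page.          */
flash_status nand_program_spare(uint32_t page, const uint8_t *spare);

/* Erase a whole erase unit, setting all of its bits to one. */
flash_status nand_erase_unit(uint32_t unit);

/* Check the manufacturer's bad-unit marker in the unit's metadata area. */
int nand_unit_is_bad(uint32_t unit);
```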
3. RELATED WORK

The growing importance of flash has led to an extensive research literature in the last few years. We review the most relevant work in this section.

Most of the flash-based storage devices today use translation layers with proprietary data structures and algorithms. Some of these devices perform well enough to outperform disks under commercial database workloads [14]. However, many of these devices, even high-end ones, perform poorly on many access patterns, including random writes and repeated writes to the same logical block [4]. Detailed measurements by Birrell et al. [4] and by Ajwani et al. [1] suggest that this is caused by using a coarse-grained logical-to-physical mapping table, probably due to the small amount of RAM in the flash controller that performs the mapping. This strategy does achieve excellent read performance for any access pattern and good sequential-write performance. Birrell et al. have shown that with enough RAM for a page-granularity mapping table, flash devices can achieve high performance on any access pattern.

Researchers have proposed several other ways to cope with this problem. Some researchers have proposed translation layers with a small on-flash staging area [12, 15] to accelerate random writes. Birrell et al. observed behaviours that are consistent with such designs [4]; the system can perform short bursts of random writes quickly, but performance degrades considerably under sustained random writes. A RAM buffer large enough to store several erase units or more can also improve the performance of short bursts of random writes [11]. Buffering/caching policies can be specialized to flash by evicting clean pages when possible, to reduce the number of writes as much as possible [24].

When the flash chips cannot be removed from the system, flash-specific file systems are often used rather than generic file systems mounted on a translation layer. It was recognized early [10] that log-structured file systems [27] are appropriate for flash, because they do not overwrite in place (which is physically impossible in flash). Most of the flash-specific file systems today, like YAFFS/YAFFS2 [23] and JFFS/JFFS2 [29], are essentially log structured. YAFFS and JFFS are Linux file systems, but YAFFS can also be used without an operating system kernel. ELF, MicroHash, Capsule and FlashDB are flash-storage subsystems for wireless sensor nodes [19, 22, 17, 6]. Both YAFFS and JFFS maintain data structures in RAM that map every block of every file to a physical flash address; this requires a lot of RAM and may lead to slow startup times (to read the mapping from flash). Lim and Park proposed a technique to accelerate the startup time of YAFFS-like file systems [16]. The log-structured idea is also appropriate for disk-emulation layers [2]. TFFS [8] is a log-structured file system for NOR flash; it exploits a property of NOR flash that NAND flash lacks (namely, that any bit still in the erased state can be cleared), so it is not suitable for NAND flash and cannot be ported to NAND.

Much work has been done in recent years on flash-specific indexing data structures. Wu et al. have proposed BFTL, a flash-specific B-tree; it is an intermediate layer between the application and the underlying FTL [30]. BFTL reduces the number of page writes by minimizing frequent node updates. Instead, tree-node index changes are logged to flash. This trades off reads for writes, much like flash-specific cache eviction policies.
Lee and Moon proposed another cache-specific logging scheme for indexes [13]. Kang et al. tackled the indexing problem by introducing µ-tree, a new type of balanced tree [9], in which the size of nodes changes dynamically during operation. This is achieved by allowing all nodes residing in the path of a leaf to be compacted to a single page. This reduces the cost of record insertions/deletions almost to a single page write, without increasing the cost of searching. However, the system is currently not transactional.

The idea of a transactional file system is not new [18, 28], but it is particularly attractive for flash, because of the prevalence of copy-on-write in flash storage systems. TFFS [8] is a transactional file system for NOR flash; more recently, Prabhakaran et al. [25] showed how to exploit the properties of flash to make an existing file system transactional.

One of the key ideas in NANDFS is the use of logical pointers in inodes and indirect blocks. This idea is also not new, but it is used in NANDFS in a novel way. de Jonge et al. [7] used a similar idea for disk block mapping, but their system kept a block-level mapping in RAM (Bonwick et al. [5] also use a similar mechanism); our system maps blocks at a much coarser granularity that is not appropriate for disks but is appropriate for NAND flash.

4. THE DESIGN OF NANDFS

4.1 High-Level Design

NANDFS uses a two-layer structure, but the two layers have been co-designed. The main benefit of designing both layers as parts of one whole is that a single mechanism ensures atomicity in both. When a conventional file system is mounted on top of a flash disk-emulation layer, both layers must ensure that their actions are atomic, and both layers must implement crash-recovery mechanisms. By co-designing the two layers, we both simplify this aspect of the overall system and make it more efficient.

The top layer is a file system with a Unix-like file structure [26] that represents a file with an inode, possibly indirect blocks, and data blocks. All the operations of the file system are transactional, and it supports multiple concurrent transactions; in that sense, it is not a traditional Unix-like file system.

The lower layer in NANDFS is a store for page-size blocks of data. For reasons that will become clear later, we call it the sequencing layer. The blocks that the sequencing layer manages are immutable: once written to, a block cannot be modified, only read or deleted. This abstraction is easy to map to NAND flash, whereas a store of mutable blocks is hard to map to flash.

The file-system layer in NANDFS uses a unique combination of ideas from both log-structured and overwrite-in-place file systems. The file system is log-structured in the sense that it never overwrites a block. When an already-written part of a file is modified, the file system asks the sequencing layer to allocate a new block into which the modified contents are written. Pointer structures are updated to reflect the new location of the data. The log-structured design maps nicely onto the lower sequencing layer that does not support block mutations. But the file-system layer differs from earlier log-structured file systems in that it does not perform garbage collection. Instead, it merely marks blocks that are no longer part of the file-system data structure as obsolete. Obsolete blocks are garbage-collected by the sequencing layer.
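The immutable block store that the sequencing layer presents to the file-system layer can be summarized by a small interface. The following C sketch uses assumed names and types; the actual NANDFS interface differs in detail, but the three operations (write-and-return-address, read, mark-obsolete) correspond to the ones described in Section 4.2 below.

```c
#include <stdint.h>

/* A logical block address: segment index plus page offset within the segment. */
typedef struct {
    uint16_t segment;
    uint16_t offset;
} logical_addr;

/* Write a page-sized block and return its logical address.  'prev_in_txn'
 * links the block into its transaction's on-flash list, and 'is_vots' says
 * whether the block holds only valid-to-obsolete (VOT) records.             */
int seq_alloc_and_write(const uint8_t *data, logical_addr prev_in_txn,
                        int is_vots, logical_addr *out_addr);

/* Read a previously written block through its logical address. */
int seq_read(logical_addr addr, uint8_t *data);

/* Mark a block as obsolete, typically when the transaction that caused the
 * valid-to-obsolete transition commits (or aborts, for its own new blocks).  */
int seq_mark_obsolete(logical_addr addr);
```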
The key idea in NANDFS is the use of indirection to allow the sequencing layer to collect garbage in a way that preserves the validity of pointers to non-obsolete blocks. Thanks to this technique, pointers in inodes and indirect blocks remain valid even when the physical location of the pointed-to block changes. This significantly reduces the cost of garbage collection relative to existing log-structured file systems. This indirection (along with other details that we explain later) is illustrated in Figure 1.

The sequencing layer partitions the space of blocks that it manages into contiguous logical groups that we call segments. All segments have the same size, which is a multiple of the erase-unit size. Each segment is mapped to one slot, which is a contiguous group of erase units; but one segment, called the active segment, is mapped simultaneously to two slots. The sequencing layer keeps a table that maps segments to slots, both in RAM and on flash.

Figure 1: Slots and Segments. The first good page in each segment contains a slot/segment header. Erase units in the reserved slots (slot 3 in the figure) replace bad erase units elsewhere. A bad erase unit, the first in slot 3, is marked by blue pages; the unit is also marked as bad on flash. The address of a logical block consists of a segment number and a page offset within the segment. A mapping table in RAM allows the sequencing layer to quickly find the physical address of a block (blocks in bad units may take longer to locate). Slots 1 and 2 illustrate allocation mode. The next action of the sequencing layer in this state will be to allocate the page at offset 5 in slot 2; then it will erase slot 1.

When the sequencing layer allocates a new block, it returns a logical pointer to it. The logical pointer consists of the index of the segment in which the new block resides and the offset within that segment. The logical address is translated to a physical one by replacing the segment number by a slot number. (Translating an address in the segment that is mapped to two slots is only slightly more complex and will be explained later.)

The sequencing layer collects garbage by copying only the non-obsolete pages in the active segment to another slot, keeping their offsets, which ensures that logical pointers that are stored elsewhere on flash remain valid. The garbage collection is incremental and is interleaved with block allocation.
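A minimal sketch of this address translation is shown below, with assumed table and constant names; the segment-to-slot table is the only per-segment RAM state that a lookup needs. How the active segment decides between its two slots is only hinted at here through a placeholder predicate; a real implementation tracks the progress of the incremental copy, and bad-unit replacement and the slot-header page are ignored entirely.

```c
#include <stdint.h>

#define PAGES_PER_SLOT 4096u             /* illustrative: slot size / page size */

/* RAM state of the sequencing layer; all names here are assumptions. */
extern uint16_t segment_to_slot[];       /* one entry per segment               */
extern uint16_t active_segment;          /* segment currently mapped to 2 slots */
extern uint16_t active_new_slot;         /* the newer slot of the active segment */

/* Placeholder: has this offset already been copied to, or freshly allocated
 * in, the new slot of the active segment?                                     */
extern int offset_in_new_slot(uint16_t offset);

/* Translate a logical address (segment, offset) to a physical page number. */
uint32_t seq_translate(uint16_t segment, uint16_t offset)
{
    uint16_t slot = segment_to_slot[segment];

    /* The active segment lives in two slots at once; pages already present
     * in the newer slot are read from there, others from the old slot.      */
    if (segment == active_segment && offset_in_new_slot(offset))
        slot = active_new_slot;

    return (uint32_t)slot * PAGES_PER_SLOT + offset;
}
```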
4.2 The Sequencing-Layer/File-System Interface

The basic operations that the sequencing layer supports are simple. Viewed by the file-system layer as a store of immutable blocks, it writes on demand a block of data to flash and returns its logical address. This block can later be read by using this address. Marking a block as obsolete is more complex, due to atomicity concerns.

From the file system's point of view, the transition of a block from valid state to obsolete state always occurs as part of a transaction. Consider a simple transaction that changes one block of a file. If the transaction commits, the block that contained the old data should be marked as obsolete. But if the transaction aborts, say because of a crash, the block that contained the old data should remain valid. The valid-to-obsolete transition (VOT) is always part of a transaction. Indirect blocks and inodes undergo the same transitions.

We denote by t_c the time at which the system crashed and by t_r the time of the last commit operation before the crash. After the system crashes and reboots, the system should recover to its persistent state at time t_r. If there were pending transactions at time t_r (except for the one that committed), they need to be undone. In particular, if these transactions wrote blocks before t_r, all of these blocks should be marked as obsolete. Also, any blocks that were written to flash between t_r and t_c are not valid in the state at time t_r, so they too may be erased. Some of these blocks may be part of transactions that started before t_r and the rest are parts of later transactions.

Therefore, blocks become invalid and should be marked as obsolete in one of four situations:

1. The transaction that caused the VOT of the block commits; the block must be marked as obsolete whether or not a crash follows.

2. The transaction that wrote the block aborts (not due to a crash).

3. The system crashed and the transaction that wrote the block started before t_r but did not commit.

4. The block was written after t_r but before a crash.
Blocks cannot be physically marked as obsolete when a transaction causes a VOT; otherwise the mark could not be erased if the transaction aborts, due to the limitations of NAND flash. Therefore, transactions must mark VOTs somewhere else. On the other hand, every transaction must mark its VOTs on flash before it commits. Otherwise, if a crash occurs just after a transaction commits but before it records its VOTs on flash, the old blocks are not marked as obsolete, so they will never be erased.

Therefore, a transaction writes two kinds of data to flash: file-system data (user data, directories, indirect blocks, inodes, etc.), and records of VOTs. For simplicity, we chose to store VOT records in dedicated pages. At commit time, the VOT records are used to mark blocks as obsolete. The marking process can repeat if the system crashes right after the transaction is committed. If a transaction is aborted rather than committed, all the blocks that it wrote, both file-system data and VOT records, are marked as obsolete.

To support atomicity, the file system and the sequencing layer use checkpoints, on-flash objects that satisfy the following requirements:

1. The last-written checkpoint p is written to flash at time t_r or later.

2. At boot time, the sequencing layer must be able to find all the blocks that were written to flash after p so it can physically erase them.

3. The checkpoint data must allow the overall system to find all the blocks that were written to flash as part of transactions that did not yet commit when p was written to flash.

4. If p was written at time t_r, then the system must also be able to find all the VOTs that are part of the transaction that committed at t_r.

5. The checkpoint data must allow the file system to find the root inode.

NANDFS satisfies these constraints using the data structures shown in Figure 2. The sequencing layer maintains on flash a linked list of all the blocks that were written within each transaction. This requires the file system to tell the sequencing layer which write operation belongs to which transaction; it does this implicitly, by providing a pointer to the last-written block of the transaction. The file system also tells the sequencing layer whether a newly written block contains data (including inodes and indirect blocks) or only VOT records. Figure 2 shows these data structures.

Figure 2: Left: a pointer in RAM points to a linked list of the new blocks of an on-going transaction. The VOTs-only block points to the on-flash blocks that the transaction invalidated. The linked-list pointers and the VOTs-only flag are stored in the spare area of the blocks. Right: the on-flash representation of a checkpoint includes pointers to the transaction that was committed and to all the on-going transactions.

A checkpoint contains information that is prepared and parsed by both the file system and the sequencing layer. It contains a pointer to the last-written block of the transaction that is being committed (if any), to the last-written blocks of all the on-going transactions, and to the root inode. All of these pointers are logical, and all of them are managed by the file-system layer. The checkpoint also contains counters that indicate the number of obsolete pages in each segment; this information is managed by the sequencing layer. All checkpoints must have the same on-flash size.
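A checkpoint might therefore be laid out along the following lines. This is a sketch with assumed types, field names, and bounds, not the actual on-flash encoding; the fields simply follow the description above (pointer to the committing transaction, pointers to ongoing transactions, the root inode address, and the per-segment obsolete counters), and the structure has a fixed size as required.

```c
#include <stdint.h>

#define MAX_TRANSACTIONS 8     /* illustrative bound on concurrent transactions   */
#define NUM_SEGMENTS     507   /* illustrative; depends on the slot configuration */

typedef struct {
    uint16_t segment;
    uint16_t offset;
} logical_addr;

/* Fixed-size checkpoint record (illustrative layout). */
typedef struct {
    logical_addr committed_last_block;                 /* txn being committed, if any       */
    logical_addr ongoing_last_block[MAX_TRANSACTIONS]; /* last-written block per ongoing txn */
    logical_addr root_inode;                           /* address of the root file's inode   */
    uint32_t     obsolete_count[NUM_SEGMENTS];         /* per-segment obsolete-page counters */
} checkpoint;
```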
4.3 The Sequencing Layer

The sequencing layer partitions the physical address space into N fixed-size slots whose size is a multiple of the erase-unit size. The logical address space is partitioned into n = N - 1 - k segments of the same size, where k ≥ 1 is the number of slots reserved for replacing bad erase units. Initially, no segments are mapped to slots and all the slots are in the erased state. As the file system allocates blocks, the sequencing layer creates more segments and maps them to slots. Until n slots have been used, the sequencing layer does not bother to erase anything. Eventually, all the slots are used and the sequencing layer needs to erase occasionally.

The first page of every slot contains a slot/segment header structure, as demonstrated in Figure 1. The header indicates whether the slot is erased, or contains a segment, or is a reserve slot that is used for replacing bad erase units. The encoding of this field is such that a physically erased slot is recognized as such. The header also contains a 64-bit sequence number that is incremented any time a slot header is written. If the slot contains a segment, the header stores the segment number. The first pages in a segment always contain a checkpoint; this ensures at least one valid checkpoint exists in every segment.

Erasures and Block Allocation. At any given point in time (except for the initial ramp-up, which we already explained), the sequencing layer is in one of three possible modes: allocation, erasure, and wear leveling. In erasure mode, it erases all the units in a given slot, from the last to the first unit (to erase the header last). When erasure mode ends, the sequencing layer decides whether to go into wear-leveling mode, which addresses the flash's limited endurance problem. To decide, it tosses a biased coin, with a low probability for entering wear-leveling mode. If it loses the coin toss, it enters wear-leveling mode: it selects a random slot, copies it to the just-erased slot (including the header), and erases it. Copying starts from the header. This wear-leveling technique incurs a low overhead and guarantees near-optimal endurance with high probability [3].

Now let's consider allocation mode. Suppose that erasure or wear-leveling mode just ended. In this state, we have one erased slot and n segments that are mapped to n slots. We now select the segment s_e to erase next, but we do not erase it just yet. The selection tries to find a segment with as many obsolete blocks as possible. We use counters to maintain the number of obsolete blocks in a segment. To save space, NANDFS can use randomized counters [20] if necessary.

Let t_o be the slot that the selected segment s_e is currently mapped to and let t_n be the erased slot. Once s_e is selected, we write a header on t_n to indicate that it too contains s_e (but with a higher sequence number than t_o). Next, the sequencing layer forces the file system to write a checkpoint. The checkpoint blocks are written to the beginning of t_n. The checkpoint that occupies the first pages in t_o is ignored; it is no longer needed.

We now copy to t_n the first contiguous sequence of non-obsolete blocks from t_o (following the initial checkpoint on t_o). When we reach an obsolete block or a checkpoint block in t_o, we stop copying. Each block that is copied is marked as a copied block in the spare area of the page into which it is copied. The next block in t_n will be allocated by the sequencing layer when the file system calls it to write a block; it replaces an obsolete or checkpoint block that is no longer needed, and it will be marked as fresh (not copied) in the spare area. After allocating a block, the sequencing layer examines the next block in t_o. If it must be copied, it is, and we examine the block that follows in t_o. If not, we are ready to allocate another block. That is, copying of valid blocks and allocation of new blocks are interleaved; garbage collection is incremental. This works well on flash because there is no advantage in large contiguous writes (which are advantageous on disks).
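The interleaving of allocation with garbage collection can be sketched as follows, with assumed names and simplified control flow. Each allocation hands out the offset of a page that is obsolete (or holds a stale checkpoint) in t_o, and then copies the run of still-valid pages that follows, at the same offsets, so that logical addresses stored on flash stay valid. The initial checkpoint write and the first copy run, which happen when the slot pair is set up, are not shown.

```c
#include <stdint.h>

#define PAGES_PER_SLOT 4096u                 /* illustrative */

/* Assumed global state of allocation mode. */
extern uint16_t old_slot, new_slot;          /* t_o and t_n                       */
extern uint16_t next_offset;                 /* next page offset to handle in t_n */

extern int  page_is_valid(uint16_t slot, uint16_t offset); /* not obsolete, not a checkpoint */
extern void copy_page(uint16_t from_slot, uint16_t to_slot, uint16_t offset);
extern void enter_erasure_mode(uint16_t slot_to_erase);

/* Allocate one fresh block in the active segment, interleaved with copying. */
uint16_t seq_allocate_block(void)
{
    uint16_t allocated = next_offset++;      /* reuses an obsolete/checkpoint offset */

    /* Copy the following run of still-valid pages from t_o to t_n, keeping
     * their offsets; they are flagged as "copied" in the spare area.         */
    while (next_offset < PAGES_PER_SLOT &&
           page_is_valid(old_slot, next_offset)) {
        copy_page(old_slot, new_slot, next_offset);
        next_offset++;
    }

    /* When the whole slot has been processed, t_o can be erased. */
    if (next_offset >= PAGES_PER_SLOT)
        enter_erasure_mode(old_slot);

    return allocated;
}
```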
Apart from the copied-or-fresh flag, the spare area of every page contains other attributes, such as a flag indicating whether the block contains only VOTs, a logical pointer to the previous block in the transaction's linked list, an error-correction code, and a flag marking the page as in-use or obsolete. The obsolete flag, which is used to indicate the validity of the page data as described previously, is initially left in the erased state and is later modified in a separate partial write to indicate that the page is obsolete; the other fields are written with the data. As a result, each flash page undergoes at most two partial writes: the initial data write, and a second write that marks the page as obsolete (this second write only modifies the metadata). Almost all flash chips today allow at least two partial writes per page.

Finding the Last Checkpoint. When the system boots, the sequencing layer reads all the slot headers to find the last one that was written to flash. It locates the last written page in this slot (the pages that follow are in the erased state) and then searches backwards to find the last complete checkpoint. Once we find the last checkpoint, we erase all the potentially non-erased units in the slot that follow the last checkpoint. The unit that contains the checkpoint is copied to a temporary erase unit and copied back after the erasure, discarding any blocks that originally followed the checkpoint. If the system reboots during recovery, it is still safe and able to restore itself to a stable state.

Data Structures in RAM. The sequencing layer uses three tables and a few scalar variables in RAM:

- A table that maps the n segments to slots. It is initialized at boot time from the slot headers.

- An array of n obsolete counters, one per segment. The array of counters is stored in every checkpoint and read at boot time.

- An array of k slot numbers, pointing to the reserved slots.

Management of Bad Erase Units. The sequencing layer does not read from or write to bad erase units, except for reading their metadata to determine that they are defective. Because we copy the blocks of a segment to the same offsets in the new slot, we cannot simply skip bad units. We therefore reserve k slots for blocks that should reside in bad units. When we find a bad unit in a slot, we replace it with a good unit in a reserved slot. A unit in a reserved slot is marked, in the spare area of its pages, with the slot and offset that it replaces. Replacement units are written when the bad unit they replace should be written, and they are erased when the bad unit should be erased. We find the replacement for a given bad erase unit by searching the reserved slots linearly. We keep in RAM one bad-to-replacement mapping, to eliminate this search in many cases. We also keep in RAM the addresses of the last few non-bad erase units that we accessed, to avoid checking for a bad unit in every read. We note that the wear leveling of reserve slots is poorer than that of slots that store segments. However, they are included in the overall wear-leveling scheme, so a single slot is not used as a reserve slot during the entire life span of the file system.
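The per-page metadata fields described earlier in this subsection could be gathered in a spare-area layout along the following lines. This is a sketch with assumed field names and sizes, not the actual NANDFS spare-area format; the obsolete flag is kept as a separate field because it is the only one written in the second partial write.

```c
#include <stdint.h>

/* Illustrative spare-area layout for a large page (64 B spare area). */
typedef struct {
    uint8_t  copied;        /* block was copied during garbage collection, vs. fresh */
    uint8_t  vots_only;     /* block contains only valid-to-obsolete (VOT) records   */
    uint16_t prev_segment;  /* logical pointer to the previous block ...             */
    uint16_t prev_offset;   /* ... in the transaction's on-flash linked list         */
    uint8_t  ecc[24];       /* error-correction code for the data area (size assumed) */
    uint8_t  obsolete;      /* left erased at first; cleared later, in a second
                               partial write, when the page's data becomes obsolete   */
} page_spare;
```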
Deleting Files when the Flash is Full. When a file is deleted in a log-structured file system, the file system adds information to the storage medium before reclaiming the blocks that were used by the file. These writes modify the directory that contained the link to the deleted file and they modify the data structure representing the set of existing inodes. NANDFS also writes VOT records for the deleted file and the obsolete directory blocks. If we allow the flash to become full or almost full, there may be too few free pages to even delete a file. NANDFS avoids this danger by keeping track of the number of free pages and obsolete pages on flash. If the next write of a transaction lowers the number of pages in these categories below a threshold, we abort one of the on-going transactions. The threshold is set such that a transaction that deletes a file can always complete. We omit the calculation of this threshold from this paper (it will be available in the full technical report).
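The kind of check this implies can be sketched as follows, with assumed names and without the threshold derivation that the paper defers to the technical report: before a transaction writes another block, the system verifies that enough free or reclaimable pages remain for a file deletion to still complete, and otherwise aborts an ongoing transaction.

```c
#include <stdint.h>

extern uint32_t free_pages;       /* pages still in the erased state           */
extern uint32_t obsolete_pages;   /* pages that garbage collection can reclaim */
extern uint32_t delete_reserve;   /* threshold: pages needed to delete a file  */

extern void abort_one_transaction(void);

/* Called before a transaction writes a block.  Returns 0 if the write may
 * proceed, nonzero if an ongoing transaction had to be aborted first.        */
int ensure_space_for_delete(void)
{
    if (free_pages + obsolete_pages <= delete_reserve) {
        abort_one_transaction();
        return 1;
    }
    return 0;
}
```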
4.4 The File-System Layer

The file system uses a Unix-like file structure and a log-structured mutation mechanism. Files are represented by inodes, which contain the metadata of the file, pointers to data blocks and single/double/triple indirect blocks, and also optionally the first few bytes of the data, to pad the inode to page length. (This is not a new idea; it was proposed by Mullender and Tanenbaum [21] and it is used in NTFS for small files.) The inodes are stored as an array in the root file, whose address is included in every checkpoint. The root file is a sparse file; inodes that are not in use are considered holes in the file. A mechanism described below allows us to find holes in sparse files, and hence to reuse inode indices. If there are no holes in the root file when a new file is created, the root file is extended.

Pointers in inodes and indirect blocks are logical and are represented by signed numbers. The root file and directories are allowed to be sparse (can contain holes). A flag in an address p indicates that the part of the file pointed to resides at logical address p and that it contains at least one large-enough hole (one inode in the root file and 256 bytes in directories). This mechanism allows the file system to easily find a hole in a sparse file by following a path of addresses with this flag, starting from the file's inode. A non-existing block is represented by the minimal negative number.

Directory entries are stored as file name and inode number pairs, and are packed into page-size chunks. File names are restricted so that a directory entry always fits into 256 bytes. Therefore, a chunk with a hole can always contain another directory entry. Directory entries are not sorted.

NANDFS maintains an LRU block cache for all block access requests (except those containing VOT records and checkpoints) from/to the sequencing layer. This dramatically reduces the number of actual read/write requests to the sequencing layer. The size of the cache is configurable.

The file system converts sequences of system calls into implicit transactions. System calls involving file and directory management (creat, unlink, mkdir, and so on) are treated as complete transactions. System calls that modify an open file are bundled into a single transaction that is committed when the file is closed or fsync-ed.

5. IMPLEMENTATION AND TESTING

NANDFS is implemented in ANSI C. The implementation currently assumes that NANDFS manages only one flash chip (the implementation issues requests to the low-level flash driver one at a time). This does not limit NANDFS's ability to handle concurrent transactions.

We have tested NANDFS in several different environments: with a physical flash chip and with a flash simulator, with and without an operating system. Most of the testing and most of the performance evaluation were carried out on a PC using a flash simulator. The interface of the simulator is identical to the interface of the flash device driver that NANDFS relies on, but it stores the contents of the flash in a file rather than on a physical flash chip.

The physical-flash tests were performed on a Samsung K9F5608U0D 32 MB NAND flash chip connected to an NXP LPC2119 microcontroller. The flash chip stores 512 bytes in the data area of each page and 16 bytes of metadata, and has 32 pages in each erase unit. The microcontroller has only 16 KB of RAM. The LPC2119 has an ARM7 core and no external memory bus. The flash chip was connected to the microcontroller's general-purpose I/O pins, which results in low data-transfer rates. This setup did not allow us to perform meaningful performance evaluations, since in performance-sensitive systems the NAND flash will be connected to a memory bus (allowing data to be transferred at tens of nanoseconds per byte). But this setup did allow us to test NANDFS for correctness on real hardware. We did measure on this setup the latency of reads, writes, and erasures. The results were consistent with the data sheet of the chip.

We tested NANDFS both under an embedded operating system, eCos, and without any operating system. The tests under eCos verified that NANDFS can easily be integrated into an operating system.
The integration involved writing a thin glue layer that allows NANDFS to be mounted in eCos. This allows eCos programs to access NANDFS files through the standard eCos system calls. We were able to run eCos programs with multiple threads that access NANDFS on our 16 KB-RAM hardware prototype. Tests without an operating system were carried out by single-threaded programs in which the test comprised the program's main loop. The same test programs run under our flash simulator and on the prototype hardware.

Our test suite is thorough. NANDFS was developed using a test-driven development process and it includes a large set of unit tests, as well as many integration tests and performance tests. The test suite also includes extensive crash tests, in which the test harness simulates a crash at every relevant point in a program. The simulated crash is followed by a simulated reboot, after which the test checks that the system recovered correctly.

6. EXPERIMENTAL EVALUATION

This section describes the results of experiments that explore the performance of NANDFS. We also compare the performance of NANDFS to the performance of YAFFS2 [23], a widely-used open-source NAND-flash file system. YAFFS2 is widely used in Linux, but it is also available in a stand-alone version called YAFFS Direct, which we used in our experiments. In this mode, the program calls YAFFS2 directly and no operating-system support is required. YAFFS2 requires a fairly large amount of RAM; data structures in RAM map every block of every file to physical flash addresses. The number of pages in use determines the amount of memory YAFFS2 requires. YAFFS Direct includes a buffer cache that is configured by default to contain 10 pages.

We measured the performance of NANDFS and YAFFS2 using simulations of the two flash memory chips whose characteristics are described in Table 1. Our hardware prototype used GPIO operations to communicate with the flash chip. The slow GPIO operations dominate the running time, making it hard to draw conclusions from actual timing measurements. The timing data that we provide is based on counting the number of reads, writes, and erasures in each simulation. We multiplied these numbers by the latencies of the operations from the data sheet (as mentioned earlier, we also verified these latencies experimentally). The timing estimates that we report do not account for compute time, but it is insignificant compared to the flash access times. We used the same methodology for NANDFS and YAFFS2.
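This cost model is easy to reproduce: count the simulated operations and multiply by the datasheet latencies of Table 1. The sketch below uses the figures for the 1 GB chip and also charges the per-byte transfer time listed in Table 1; the function and variable names are illustrative, not taken from the simulator.

```c
#include <stdint.h>

/* Datasheet latencies of the 1 GB chip (Table 1), in nanoseconds. */
#define READ_LATENCY_NS     20000.0     /* 20 µs page read            */
#define PROGRAM_LATENCY_NS  200000.0    /* 200 µs page program        */
#define ERASE_LATENCY_NS    1500000.0   /* 1.5 ms erase               */
#define BYTE_XFER_NS        25.0        /* 25 ns per transferred byte */

/* Estimated running time, in seconds, from simulation counters. */
double estimate_seconds(uint64_t page_reads, uint64_t page_programs,
                        uint64_t erases, uint64_t bytes_transferred)
{
    double ns = page_reads * READ_LATENCY_NS
              + page_programs * PROGRAM_LATENCY_NS
              + erases * ERASE_LATENCY_NS
              + bytes_transferred * BYTE_XFER_NS;
    return ns / 1e9;
}
```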
6.1 Overall Performance

We measured the performance of NANDFS and compared it to the performance of YAFFS2 by simulating a typical workload on the 1 GB chip whose characteristics are given in Table 1. We configured NANDFS on this chip to use 512 slots, 4 of which are reserved for bad erase units. We used exact counters to count the number of obsolete pages in every slot. The size of NANDFS's buffer cache varied from 10 to 500 pages, as described below. We configured YAFFS2 to use a 10-page buffer cache, the default in its test programs (it still used more RAM than NANDFS, as detailed below).

The workload consists of (1) creation and sequential writing of many 4 MB files, (2) sequential reading of these files, (3) deletions of these files, (4) creation and sequential initialization of one 100 MB file, and (5) random reading and writing of the 100 MB file. The 4 MB files simulate media files like images, music, or videos, as well as activity logging. All accesses to these files are sequential and all use 16 KB data transfers at the application layer. The 100 MB file represents a database file containing some kind of metadata that is not stored within the media files. It is written sequentially initially, but then read and written randomly using application-layer transfers of 200 B.
We simulated this workload when the 4 MB files occupied either 23%, 56%, or 80% of the raw capacity of the chip. The exact size of the 100 MB file was chosen to fill 10% of the capacity, so the file system was either one third full, two thirds full, or 90% full (relative to the raw capacity). In each simulation, we first wrote all the files sequentially. We then performed 100 iterations of mixed-access-pattern I/Os. In each iteration, we deleted 15 random 4 MB files, re-wrote them, and then read each one of them three times (round robin, not repeated reads of the same file). After a fixed number of sequential 16 KB read (or write) accesses, we performed two 200 B random reads (or writes) to the 100 MB file. Therefore, random reads/writes comprise about 16% of the total number of system calls, and sequential accesses comprise about 84%. In this workload, a large buffer cache is not beneficial unless it stores at least 60 MB.

Figure 3: The performance of YAFFS2 and of NANDFS under a mixed-access-pattern workload.

The results of the simulations when the flash was 66% full are shown in Figure 3. We indeed see that the size of the buffer cache does not have a significant impact on the performance of NANDFS. With a 10-page cache it performs a little worse than YAFFS2, but not by much. With a 50-page cache it outperforms YAFFS2. At that point, growing the cache even to 500 pages does not improve performance much. The most important aspect of these results is that NANDFS performs well (i.e., outperforms YAFFS2) with about 10% of the memory that YAFFS2 needs (131 KB versus 1364 KB). With less than 5% of the memory that YAFFS2 needs (50 KB) it is only 5% slower than YAFFS2.

Similar experiments when the flash was only 33% full resulted in slightly higher performance, but not by a large factor. When the file system was 90% full, NANDFS slowed down to 3129 s when it used 1 MB of RAM and to 3348 s when it used only 50 KB of RAM. YAFFS2 slowed down to 3001 s. The running time is not particularly sensitive to how full the flash is, because most of the garbage (obsolete pages) in our experiments was generated by the deletions of files that were written sequentially. Only a small fraction of the garbage is generated by random overwriting of data. Therefore, both NANDFS and YAFFS2 manage to find erase units with a significant fraction of obsolete pages even when the flash is fairly full. The most significant difference between the 33%-full, the 66%-full, and the 90%-full cases is the amount of memory that YAFFS2 used: 887 KB, 1364 KB, and 1690 KB, respectively.

Figure 4: The performance of the different access patterns that make up the workload in Figure 3. In the experiments described in Figure 3, different access patterns are mixed in every iteration, whereas in the experiments shown here each access pattern was exercised separately to eliminate cross-access-pattern influences.

NANDFS is not very sensitive to the slot size. We repeated the workload in the 90%-full case with 500 buffer-cache pages and with a range of slot sizes. The results indicate that the total running time converged from 4678 s to about 2850 s when using a slot size of 1 MB or smaller. The interleaving of garbage collection with page allocation affects the latency of write operations.
The 95th percentile of page-allocation call latency grew from 65 µs when the system was 33% full to 369 µs when it was 90% full.

6.2 Microbenchmarks

Figure 3 may lead the reader to think that NANDFS and YAFFS2 operate in similar ways, since their performance is quite similar. This is not the case. Following each workload-simulation run, we repeated the 100 iterations four more times, but each time performing only one kind of access: sequential writes, sequential reads, random writes, and random reads. The running times of the single-access-pattern I/Os are shown in Figure 4. Sequential writes are more expensive in NANDFS than in YAFFS2, but sequential reads are cheaper. The differences are large. Also, given that the benchmarks read 3 times more than they write, the data shows that NANDFS reads faster than it writes (this is natural on flash) whereas YAFFS2 writes faster than it reads. Random writes clearly incur higher overheads, but not nearly as high as those of most commercial NAND flash devices [1].

We ran another experiment to evaluate the performance of sustained random writes. In this experiment, we repeated the random-writes-only phase of the previous experiment many times. We measured the running time of each such random-write phase. That is, the system first performs 100 iterations of a mixed access pattern, then 100 iterations of only sequential writes, and then many phases of 100 iterations of 200 B random writes to the 100 MB file. The file system was 90% full, which makes garbage collection difficult. The results, shown in Figure 5, show that NANDFS gradually slows down under a random-write load, but after a while its performance stabilizes and does not degrade any more. The slowdown is caused by less effective garbage collections.
Figure 5: The performance of sustained random writes. The horizontal axis represents the phase in the experiment, so time flows from left to right; each marker reports the time of one phase of random writes. The data shows that NANDFS gradually slows down under a sustained random-write load.

6.3 Secondary Metrics

In addition to read and write performance, we also measured a number of other aspects of NANDFS.

Bad erase units. We ran the main workload experiment (with a 66%-full flash) when the flash contained 15 or 45 randomly-placed bad erase units. The buffer cache stored 500 pages. When the flash contained 15 bad erase units, performance degraded by only one percent. With 45 bad erase units, the overall performance degraded by 4%. Random writes were worst hit, slowing down by 13%.

Boot and format times. NANDFS takes less than 50 ms to format the 256 MB flash chip described in Table 1. Post-crash boot time is usually between 50 ms and 100 ms, depending on whether the system crashed shortly after writing a checkpoint or a long time after writing one. Crashing during a wear-leveling operation also slows down the subsequent boot time, but not by a lot (still less than 100 ms with a recent checkpoint). However, a crash while committing a transaction with a lot of VOTs causes recovery to be slow. When we crashed NANDFS in the middle of committing a transaction that invalidated 2000 pages, recovery took about 500 ms. In most cases, the boot and recovery time of NANDFS does not depend on how much data is stored in the file system. The recovery time is proportional to the sum of the number of slots on flash and the number of pages in a slot. In contrast, the boot times of YAFFS2 vary widely depending on the amount of data stored in the file system. In our experiments the mount times varied from 0.5 ms for an empty file system to 2666 ms when the file system was full. In particular, on fairly full file systems YAFFS2 boots much more slowly than even the worst-case boot times of NANDFS.

Wear leveling. We measured the endurance of NANDFS by writing to the file system until one erase unit reached 100,000 erasures. At that time, all the other units had been erased at least 95,000 times, indicating that the wear-leveling scheme is effective.

7. CONCLUSIONS

NANDFS is a POSIX-like NAND flash file system that requires little RAM to achieve high performance. It outperforms a widely-used flash file system when both use the same amount of RAM, and its performance degrades gracefully when less RAM is available. NANDFS is functional and performs reasonably even with only 4 KB of RAM on small-page flash and with 12 KB on large-page flash. Performance is good even on random-write workloads, on which many flash storage systems perform poorly.

The most important conclusion from our research on NANDFS is that the large amounts of RAM that flash file systems like YAFFS2 use are not required to achieve high performance. We did not test JFFS2, but its mapping mechanism is similar to that of YAFFS2, so its RAM usage is similar. NANDFS achieves performance similar to that of YAFFS2 with as little as 5-10% of the RAM that YAFFS2 needs. Furthermore, NANDFS does not rely on dynamic memory allocation.
This makes it much easier to integrate NANDFS into small embedded systems than file systems that do require dynamic memory allocation (even into systems that do have a dynamic memory allocator), because NANDFS never fails at runtime due to lack of memory.

NANDFS is resilient to crashes. It is transactional and can safely recover from any crash while guaranteeing durability for any transaction that has committed. For example, if a close system call returns, the durability of all prior writes to the file is guaranteed.

We have tested NANDFS both on a physical NAND chip attached to an ARM7 microcontroller with 16 KB of RAM and in extensive simulations. We have integrated NANDFS into eCos, an embedded operating system, and tested the combination on the ARM7/flash hardware; the 16 KB of RAM were enough for NANDFS and for configuring eCos with a preemptive multitasking kernel and running multiple threads.

When more RAM is available, it can be used for a buffer cache. This can significantly reduce flash accesses when the application accesses the same blocks repeatedly (especially when the application reads the same blocks repeatedly). The buffer cache also reduces reads of file-system metadata pages.

The main technical innovation in NANDFS is the use of a coarse-grained logical-to-physical block-address mapping in a log-structured file system. The extra level of translation makes segment cleaning more efficient, because we can move a data block without changing pointers to the block. Because the mapping is coarse, only a small amount of RAM is required to maintain it. The coarse-grained mapping requires a unified cleaning (garbage collection) and allocation procedure, which leaves valid blocks at the same segment offset when they are copied. This technique would not work well in disk-based file systems.

In a conventional log-structured file system, pointers to a block are updated every time it is moved to a new segment. This increases the cost of space reclamation, but it allows the file system to compact segments and create large contiguous chunks of free storage. The contiguous chunks allow files to be written or re-written contiguously, which leads to good read performance later. In our design, free space tends to be fragmented, which would degrade performance on a magnetic disk. But on flash this fragmentation has no effect on performance.

Innovations in log-structured storage systems, like the innovation in this paper, are important as NAND flash is becoming more and more common in computer systems.
Acknowledgements

Thanks to Sergei Gavrikov for helping us configure eCos on the LPC2119. Thanks to Danny Kastenberg for providing us with the flash chips. Ezra Shaked helped us add the NAND chip to the LPC2119 system. Gil Shklarski helped with the implementation of randomized counters and randomized decisions. Part of this research was conducted while the third author was on sabbatical at MIT's Computer Science and Artificial Intelligence Lab; MIT's support is gratefully acknowledged.

8. REFERENCES

[1] D. Ajwani, I. Malinger, U. Meyer, and S. Toledo. Characterizing the performance of flash memory storage devices and its impact on algorithm design. In Proc. of the 7th Intl. Workshop on Experimental Algorithms (WEA), volume 5038 of LNCS.

[2] G. Bartels and T. Mann. Cloudburst: A Compressing, Log-Structured Virtual Disk for Flash Memory. Technical report, Compaq Systems Research Center.

[3] A. Ben-Aroya and S. Toledo. Competitive analysis of flash-memory algorithms. In Proceedings of the 14th European Symposium on Algorithms, volume 4168 of Lecture Notes in Computer Science.

[4] A. Birrell, M. Isard, C. Thacker, and T. Wobber. A design for high-performance flash disks. ACM SIGOPS Operating Systems Review, 41:88-93.

[5] J. Bonwick, M. Ahrens, V. Henson, M. Maybee, and M. Shellenbaum. The Zettabyte File System. Technical report, Sun Microsystems.

[6] H. Dai, M. Neufeld, and R. Han. ELF: an efficient log-structured flash file system for micro sensor nodes. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys '04), Baltimore, MD, USA.

[7] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh. The logical disk: a new approach to improving file systems. In SOSP, pages 15-28.

[8] E. Gal and S. Toledo. A transactional flash file system for microcontrollers. In USENIX Annual Technical Conference, General Track. USENIX.

[9] D. Kang, D. Jung, J. Kang, and J. Kim. µ-tree: an ordered index structure for NAND flash memory. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software.

[10] A. Kawaguchi, S. Nishioka, and H. Motoda. A flash-memory based file system. In Proceedings of the USENIX 1995 Technical Conference.

[11] H. Kim and S. Ahn. BPLRU: a buffer management scheme for improving random writes in flash storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies.

[12] J. Kim, J. Kim, S. Noh, S. Min, and Y. Cho. A space-efficient flash translation layer for CompactFlash systems. IEEE Transactions on Consumer Electronics, 48.

[13] S. Lee and B. Moon. Design of flash-based DBMS: an in-page logging approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 55-66.

[14] S.-W. Lee, B. Moon, C. Park, J.-M. Kim, and S.-W. Kim. A case for flash memory SSD in enterprise database applications. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada.

[15] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song. A log buffer-based flash translation layer using fully-associative sector translation.