Architectures and Optimization Methods of Flash Memory Based Storage Systems


Yuhui Deng, Jipeng Zhou
Department of Computer Science, Jinan University, Guangzhou, P. R. China

Abstract: Flash memory is a non-volatile memory which can be electrically erased and reprogrammed. Its major advantages, such as small physical size, no mechanical components, low power consumption, and high performance, have made it likely to replace magnetic disk drives in more and more systems. However, flash memory has four specific features that differ from magnetic disk drives and pose challenges for developing practical techniques: (1) Flash memory is erased in blocks but written in pages. (2) A block has to be erased before data can be written to it. (3) A block of flash memory can only be written a specified number of times. (4) Pages within a block should be written sequentially. This survey presents the architectures, technologies, and optimization methods employed by existing flash memory based storage systems to tackle these challenges. I hope that this paper will encourage researchers to analyze, optimize, and develop practical techniques to improve the performance and reduce the energy consumption of flash memory based storage systems by leveraging the existing methods and solutions.

Keywords: Flash memory, Energy efficient, Solid State Disk, Disk drive, Storage system

1. Introduction

Flash memory is a non-volatile memory which can be electrically erased and reprogrammed. Its major advantages, such as small physical size, no mechanical components, low power consumption, and high performance, have made it likely to replace magnetic disk drives in more and more mobile and embedded systems (e.g. digital cameras, MP3 players, mobile phones), where size, power, or performance are important [10, 58, 68, 71]. NAND flash densities have been almost doubling each year since 1996 [9, 33]. Samsung has delivered NAND flash memories with capacities ranging from 64 MByte to 4 GByte [85, 86]. A 32 GByte flash drive, which integrates sixteen 2 GByte flash memory chips, is also available on the market [86]. Flash memory is also much cheaper than volatile memories such as SDRAM. For example, 1 Gbit of NAND flash memory costs $3.75, while the same amount of low power SRAM and fast SRAM costs $320 and $614, respectively [71, 86]. Due to the increased capacity and decreased price, flash memory is expected to be widely used in high-end computer systems as an important and promising storage medium.

Flash memory can play two roles in the existing computer system architecture: (1) as an extension to RAM, forming a layer between RAM and the magnetic disk drives; (2) as a replacement for magnetic disk drives as block-level storage media. The memory hierarchy in current computer architectures is designed to take advantage of data access locality to improve overall performance. Each level of the hierarchy has higher speed, lower latency, and smaller size than the levels below it. Magnetic disk drives are millisecond devices, DRAM chips are nanosecond devices, and flash memory chips are microsecond devices.

It seems that flash memory can serve as an intermediate layer (e.g. a non-volatile cache) between the DRAM and the magnetic disk drives in the memory hierarchy.

Magnetic disk drives have been the preferred media for data storage for several decades. However, the architecture of disk drives is currently facing two challenges. The first one is performance. Disk drives are highly complex systems consisting of electronic and mechanical components. Due to the slow mechanical latency, the disk I/O subsystem has been repeatedly identified as a major bottleneck to system performance in many computing systems. Although the performance of disk drives has been growing by 40% per year, the performance gap between RAM and disk drives had widened to 6 orders of magnitude by 2000 and will continue to widen by about 50% per year [80]. The second one is energy consumption. Fan et al. [27] investigated the power consumption of the major components within a typical server. They reported that the peak power of one x86 CPU, one motherboard, one PCI expansion slot, one IDE disk drive, one fan, and one DDR memory module is 40W, 25W, 25W, 12W, 10W, and 9W, respectively. From a power standpoint, one disk drive does not seem to be a problem. Even the addition of several dozen disk drives would hardly be a concern. However, when hundreds or thousands of disk drives are put together, power quickly becomes a serious problem. One example shows the storage subsystem accounting for 27% of the energy consumed in a data centre [26]. To worsen the situation, this fraction is swiftly increasing as storage requirements are rising by 60% annually [29]. The characteristics of flash memory, including high performance, low energy consumption, and small size, make it a potential storage medium in comparison to traditional disk drives. Therefore, this paper attempts to explore the opportunities and challenges of employing flash memory as a block-level storage medium.

Flash memory has four specific features which differ from magnetic disk drives. The first one is that flash memory is erased in blocks but written in pages. Each block consists of a number of pages. The second one is erasing before writing: a block has to be erased before data can be written to it. The third one is that a block of flash memory can only be written a specified number of times. The fourth one is that writing pages within a block should be done sequentially or incrementally [41, 87]. This is called Sequentiality of Programming (SOP). In a Multi Level Cell (MLC) flash memory, two or more bits are programmed in one cell. Among the bits, a lower bit is denoted as the Least Significant Bit (LSB) and an upper bit is denoted as the Most Significant Bit (MSB). The LSB pages within a block should be programmed before the MSB pages. Random page address programming is prohibited. In this case, the definition of the LSB page is the LSB among the pages to be programmed. Therefore, the LSB does not need to be page 0. Please note that sequential page programming does not have to be consecutive page programming. Therefore, a sequence of page 1, page 5, and page 8 is acceptable, but a sequence of page 1, page 8, and page 5 will incur many bit failures.

Flash memory is usually accessed by embedded systems as a raw medium or indirectly through a block-oriented device. In other words, the management of flash memory is carried out either by software on a host system (as a raw medium) or by hardware/firmware inside the device [38].
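The SOP rule described above only requires that page indices within a block never decrease between successive programs; pages may be skipped but never revisited out of order. A minimal sketch of such a check follows, with hypothetical structure and field names (not taken from any cited data sheet):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block state: index of the last programmed page, or -1 if none. */
typedef struct {
    int last_programmed_page;   /* -1 means the block is freshly erased */
    int pages_per_block;
} block_state_t;

/* Returns true if programming 'page' keeps the Sequentiality of Programming rule:
 * pages may be skipped (1, 5, 8 is fine), but never revisited out of order (1, 8, 5 is not). */
bool sop_allows_program(const block_state_t *blk, int page)
{
    if (page < 0 || page >= blk->pages_per_block)
        return false;
    return page > blk->last_programmed_page;
}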
Therefore, there are two kinds of approaches to overcome the hardware limitations of flash memory [34]. One is designing a new flash file system [3, 4, 21, 46, 49, 63, 66, 73, 89]. The other one is using traditional file systems (e.g. FAT and Ext2) and wrapping the flash memory to mimic a block-level storage device. Consequently, the algorithms and functionalities used to handle the characteristics of flash memory (e.g. erasing before writing and the limited life span) can be integrated into either a file system [49, 77] or the firmware inside a flash memory device [51, 59]. Gal and Toledo [34] wrote a comprehensive survey about the algorithms and data structures for flash memory.

They reviewed flash-specific file systems including log-structured file systems [49], the research-in-motion file system [73], the Journaling Flash File System (JFFS) [3], Yet Another Flash File System (YAFFS) [4], the Trimble file system [66], the Microsoft flash file system [89], the Norris flash file system [21], and some other commercial embedded file systems. However, due to the compatibility with traditional file systems, the second solution is more popular and widely used in products. In contrast to their work, this paper attempts to review the architectures, technologies, and optimization methods involved in flash memory devices. The major objectives of this survey are: (1) To understand flash memory technology and the basic design concepts of flash memory based storage systems. (2) To create awareness among researchers about the state-of-the-art optimization methods developed for flash memory based storage systems in the community. (3) To help discover more effective ways to improve performance and reduce energy consumption of flash memory based storage systems. (4) To create a vision about future directions and challenges for flash memory based storage systems.

The remainder of the paper is organized as follows. An overview of flash memory is presented in Section 2. Section 3 describes flash memory based storage systems, including the system architecture, logical-to-physical address mapping, wear-leveling, garbage collection, and power-off recovery. Section 4 discusses the architecture, performance patterns and the corresponding optimization methods, and the energy consumption and conservation of the Solid State Disk (SSD). Section 5 concludes the paper with remarks on the contributions of the paper.

2. Overview of flash memory

Fig. 1 Cell architecture of flash memory: (a) NOR flash, (b) NAND flash [25, 56]

There are two major types of flash memory available on the market, following different logic schemes: NOR and NAND. The cell architectures of NOR flash and NAND flash are different. NOR flash arranges cells in parallel between two bit lines, while NAND flash connects the cells serially along the bit line. Fig. 1 illustrates the cell architectures of the two flash memories [25, 56]. Each cell of a NOR flash is a MOSFET transistor which has two gates. There is a control gate (CG) on the top. A floating gate (FG), insulated from its surroundings by an insulating oxide material, is below the CG. The FG resides between the CG and the MOSFET channel. Electrons injected onto the FG are trapped because the FG is isolated by the insulating material. The basic cell of a NAND flash is also a MOSFET transistor with a floating gate. Charge is injected into this gate during writing and released during erasing. The organization of NAND flash cells reduces much of the decoding overhead found in other memory technologies. However, accessing one individual cell has to go through the other cells in its bit line, as shown in Fig. 1(b). This adds significant noise to reads. It also brings challenges to writes, since the adjacent cells in the line should not be disturbed. For erasing, all cells on the same bit line have to be erased.

The NOR flash memory, which employs a standard memory interface, is byte accessible and can be adopted as execute-in-place memory. It is designed for efficient random access. It is mainly used as a replacement for programmable read-only memory (PROM) and erasable PROM (EPROM). It has separate address and data buses like EPROM and static random access memory (SRAM). Compared with NOR flash memory, NAND flash memory has faster erase and write times and a simpler interface, along with higher data density. Therefore, NOR flash memory is well suited for code storage and execute-in-place (XIP) applications, while NAND flash memory is a better candidate for data storage [45]. As explained in the Introduction, traditional disk drives are facing both performance and energy consumption challenges, and NAND flash could replace magnetic disk drives as the major storage media. Therefore, this paper focuses on NAND flash; flash memory in the following sections denotes NAND flash memory. Chang and Kuo [10] summarized the characteristics of NAND and NOR flash memory as listed in Table 1.

Table 1. Characteristics of NAND flash memory and NOR flash memory
                    NAND                  NOR
Density             High                  Low
Read/write          Page-oriented         Bitwise
XIP                 No                    Yes
Read/write/erase    Moderate/fast/fast    Very fast/slow/very slow
Cost per bit        Low                   High

NAND flash memory can use the Single-Level-Cell (SLC) technique or the Multi-Level-Cell (MLC) technique. For SLC flash memory, one cell represents one bit (two states), whereas MLC doubles or even triples the memory density of SLC. This is possible because of the charge storage in the floating gate, which allows the amount of stored charge to be subdivided into small increments. When this is coupled with the superior retention characteristics of the floating gate, it is possible to accurately determine the charge state after a long period of time. Unfortunately, the decreased separation between charge states incurs a higher sensitivity to cell degradation in comparison to SLC [56]. Therefore, SLC flash memory is faster, more reliable, and has a longer life span than MLC. However, for the same-sized die, MLC is cheaper and can provide larger storage capacity than SLC flash. Thus, MLC NAND flash memory is suitable for low-bit-cost and high-density applications, while SLC NAND flash memory is a good candidate for high-performance applications [13, 41]. However, MLC has a shorter lifespan, which degrades every time data is written to the cell. It is estimated that the native lifespan of an SLC cell is around 100,000 cycles, but it drops to around 10,000 cycles with two-bit MLC cells and as low as 1,000 cycles on a three-bit cell. There are algorithms to extend the lifespan, such as MLC charge-placement algorithms [56], but the bottom line is that lower-capacity SLC has a much longer life span [76]. Table 2 summarizes the characteristics of SLC and MLC NAND flash memory.

Table 2. Characteristics comparison of SLC and MLC NAND flash memory
              SLC     MLC
Density       Low     High
Performance   High    Low
Reliability   High    Low
Life span     Long    Short
Cost per bit  High    Low

Fig. 2 Architecture of a typical flash chip (flash planes of blocks and pages with data and spare areas, row/column decode and address logic, data register, 8/16-bit I/O bus)

A NAND flash memory is composed of flash planes. The planes can operate independently, each with its own buffer to hold data, thus allowing simultaneous operations for higher performance, although they compete for the package pins. Each plane consists of a fixed number of blocks, where each block is further divided into a number of pages, and each page has a fixed-size main data area and a spare data area. The data area is for the storage of data, and the spare area stores the corresponding LBA, ECC, and other information. Data on NAND flash memory is read or written at the page level, and erasing is performed at the block level. A static RAM buffer holds data before writing or after reading, and data is transferred to and from this buffer via an 8-bit or 16-bit wide bus. Fig. 2 shows the architecture of a typical flash chip [25]. Data transfer consists of two parts. The first part is transferring data over the external bus to or from the data register. The second part is between the data register and the flash arrays. Table 3 summarizes the parameters of five different NAND flash memories from four manufacturers [2, 30, 42, 85, 88]. Table 3 indicates a non-uniform access latency for reads. It shows that a sequential read in flash memory is three orders of magnitude faster than a random read. This is consistent with what Chen et al. [12] have observed. Table 3 also illustrates that the endurance cycle of flash memory could be as low as 10K. This may not be good news for using flash memory in enterprise storage systems.

Table 3. Characteristics of typical NAND Flash Memories
Manufacturer        Samsung           Intel              AMD            FUJITSU
Type                K9NBG08U5A        JS29F16G08FANB1    Am30LV0064D    MBM30LV0128
Capacity            4G x 8 Bit        2G x 8 Bit         8M x 8 Bit     16M x 8 Bit
Page Size (Byte)    (2K + 64)         (2K + 64)
Block Size (Byte)   (128K + 4K)       (128K + 4K)        (8K + 256)     (16K + 512)
Random Read         25 µs (Max)       25 µs (Max)        N/A            10 µs (Max)
Serial Read         50 ns (Min)       25 ns (Min)        <50 ns         35 ns (Min)
Program time        200 µs (Typ.)     220 µs (Typ.)      200 µs         200 µs (Typ.)
Erase Time          1.5 ms (Typ.)     1.5 ms (Typ.)      2 ms           2 ms
Endurance           100K              100K               10K            1 Million
Voltage
Power (Active)      75 mW             75 mW              30 mW          72 mW
Power (Standby)     60 mW             3 mW               0.03 mW        3.6 mW

3. Flash memory based storage system

3.1 Architecture

Modern operating systems tend to be structured in distinct, self-contained layers, with little inter-layer state or data sharing. Flash memory based storage systems also employ a layered architecture. Fig. 3 shows the major layers involved in a typical flash memory based storage system. In comparison with traditional storage systems [23], the Flash Translation Layer (FTL) driver [14, 51, 54, 59] and the Memory Technology Device (MTD) driver [67] are two important layers for flash memory based storage systems. The MTD driver mainly provides three low-level operations: read, write, and erase. Based on these low-level functionalities, the FTL is in charge of handling the specific characteristics of flash memory. The objective of the FTL driver is to provide transparent services for file systems to access flash memory as a block device, to optimize performance using the least amount of RAM, and to extend durability by using wear-leveling techniques. The functionalities of the FTL and MTD can be integrated into either the firmware inside a flash memory device (e.g. Disk On Module) or the file system (e.g. Memory Stick) [38].

Fig. 3 A typical flash memory storage system (applications -> file system API -> file system -> block-level commands -> flash translation layer (FTL) -> flash commands -> memory technology device (MTD) -> control signals -> flash memory media)

Fig. 3 describes the data flow in a typical flash memory based storage system. Applications issue I/O requests by calling file system APIs (e.g. fread/fwrite). File systems normally manage the storage capacity as a linear array of fixed-size blocks. Therefore, a file access is converted into many block-level I/O requests by the file system. Each block-level I/O request contains a specific Logical Block Address (LBA) and a data block length. The block-level I/O requests go through the FTL and are translated into specific commands provided by the flash memory. The commands are converted into control signals to access the raw flash memory media.
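To make that translation step concrete, the following sketch splits one block-level request (an LBA plus a length) into per-page flash commands. The page size, the 512-byte sector assumption, and all names are illustrative, not values from a specific device or FTL:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 2048u   /* assumed page size; real devices vary (see Table 3) */

/* A block-level I/O request as issued by the file system. */
typedef struct {
    uint32_t lba;      /* logical block address, in 512-byte sectors (assumed) */
    uint32_t length;   /* request length in bytes */
} blk_request_t;

/* Illustrative translation step: split one block-level request into per-page flash
 * commands. The mapping from logical page to physical page is left to the FTL. */
static void ftl_dispatch(const blk_request_t *req)
{
    uint64_t start = (uint64_t)req->lba * 512;
    for (uint64_t off = 0; off < req->length; off += PAGE_SIZE) {
        uint32_t logical_page = (uint32_t)((start + off) / PAGE_SIZE);
        printf("flash command: access logical page %u\n", logical_page);
    }
}

int main(void)
{
    blk_request_t req = { .lba = 100, .length = 8192 };
    ftl_dispatch(&req);   /* issues four page commands for this 8 KB request */
    return 0;
}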

Fig. 4 Control flow of read, write, and erase (each operation consists of a command/setup phase and a busy phase, with data transfer before the busy phase for writes and after it for reads)

A typical flash memory medium supports page read, page write, and block erase. Generally, reading an entire page in parallel from the plane into the internal buffer takes on the order of tens of microseconds. Writing a page includes moving the data over the package pins and onto the device. Once the data is in an internal buffer, writing typically takes between 200 µs (SLC) and 800 µs (MLC). Writing can only change bits in the page from 1 to 0. Therefore, writing is implemented by selectively changing bits from 1 to 0 to match the supplied contents, assuming all the bits in the target page are 1 before the write. The only way to change a bit in a page from 0 to 1 is to erase the block that contains the page. Thus, an erase is required to arbitrarily modify a page's contents [9]. Erasing sets all the bits in the block to 1. It normally takes 1.5-2 ms to erase a whole block. Table 3 lists the read, write, and erase performance of five flash chips produced by four manufacturers. Grupp et al. [35] questioned the performance metrics provided by manufacturers. They performed a comprehensive evaluation using flash memory devices from five manufacturers and reported very interesting observations. According to Table 3, the program time of a flash memory is uniform across pages. However, Grupp et al. observed a significant variation in program time within each MLC block. All of the MLC devices they measured demonstrated a regular and predictable variation in program latency between pages within a block. For example, for one flash device they used, the first four pages and every other pair of pages in each block are 5.8 times faster on average than the other pages. They summarized that the fast MLC programs are nearly as fast as SLC programs, while the slow pages are very slow indeed.

Accessing the flash memory medium normally consists of a setup phase and a busy phase. For instance, the first phase of writing is for command setup and data transfer. The command, the data address, and the data are written to the proper registers of the flash memory in order. The second phase is busy-waiting for the data to be flushed into the flash memory. Reading is similar to writing, except that the sequence of data transfer and busy-waiting is switched. The phases of erasing are the same as those of writing, except that there is no data transfer in the setup phase. The control flow of read, write, and erase is illustrated in Fig. 4 [11].
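As a concrete illustration of this setup/busy flow, the following sketch programs one page in a PIO style: write the command, address, and data registers, then poll a status register until the busy bit clears. The register addresses, bit masks, and names are invented for illustration and do not correspond to any particular chip:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory-mapped flash registers (illustrative addresses). */
#define FLASH_CMD_REG    (*(volatile uint8_t  *)0x40000000)
#define FLASH_ADDR_REG   (*(volatile uint32_t *)0x40000004)
#define FLASH_DATA_REG   (*(volatile uint8_t  *)0x40000008)
#define FLASH_STATUS_REG (*(volatile uint8_t  *)0x4000000C)

#define CMD_PROGRAM_PAGE 0x80
#define STATUS_BUSY      0x01

/* Setup phase: issue command, address, and data; busy phase: poll until done.
 * The processor is occupied for the whole transfer, which is the PIO cost
 * discussed later in Section 4.2.4. */
void flash_program_page(uint32_t page, const uint8_t *buf, size_t len)
{
    FLASH_CMD_REG  = CMD_PROGRAM_PAGE;
    FLASH_ADDR_REG = page;
    for (size_t i = 0; i < len; i++)
        FLASH_DATA_REG = buf[i];          /* data transfer over the register interface */

    while (FLASH_STATUS_REG & STATUS_BUSY)
        ;                                 /* busy-wait until the program completes */
}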

3.2 Flash translation layer

In order to handle the specific characteristics of flash memory and offer transparent accesses to the file systems, an FTL should contain at least four functions: logical-to-physical address mapping, wear-leveling, garbage collection, and power-off recovery [14].

3.2.1 Logical-to-physical address mapping

NAND flash memory is accessed much like block devices (e.g. disk drives), which require data to be read or written in larger units. The reader is referred to [10] for a comprehensive understanding of NAND flash memory. Since file systems manage the underlying storage media as a logical linear space, each unit is identified by an LBA. Therefore, when a request issued by the file system arrives at the FTL, the LBA has to be converted to a physical address to locate the data stored in the flash memory. This means that the flash storage system has to maintain a mapping table to perform the logical-to-physical address mapping.

Fig. 5 A simple logical-to-physical address mapping (twelve LBAs mapped to the pages of three blocks, four pages each)

Designing a highly effective mapping algorithm has long been a challenge in the community. Generally, there are two criteria to measure a mapping algorithm. The first one is the size of the mapping table. Because the mapping table should be stored in persistent storage and cached in memory so that it can be accessed with little overhead, the storage capacity occupied by the table is important and should be kept as small as possible. The second one is query efficiency. The query efficiency of a well-designed algorithm should not degrade significantly with the growth of the system scale. A simple mapping algorithm is one-to-one mapping. A newly incoming request is compared against the mapping table and directed to the physical location. In this algorithm, each logical address is mapped to a corresponding physical address. Therefore, if the length of the logical address space is N, the mapping table also has N entries. Fig. 5 illustrates a simple logical-to-physical address mapping. The flash memory is composed of three blocks. Each block consists of four pages. Therefore, the maximal LBA is twelve, and each LBA is mapped to a physical location. For example, LBA 5 is mapped to physical address (2, 1). It is easy to see that this algorithm is not effective in terms of the above two criteria.
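A minimal sketch of the one-to-one (page-level) table of Fig. 5, assuming 3 blocks of 4 pages; the structure and names are illustrative only:

#include <stdint.h>
#include <stdio.h>

#define BLOCKS          3
#define PAGES_PER_BLOCK 4
#define NUM_LBAS        (BLOCKS * PAGES_PER_BLOCK)

/* One mapping entry per LBA: the table grows linearly with the logical space,
 * which is exactly why a pure page-level table is costly for large devices. */
typedef struct { uint8_t block; uint8_t page; } phys_addr_t;

static phys_addr_t map[NUM_LBAS];

int main(void)
{
    /* Initialize the mapping of Fig. 5: LBA i -> (block, page). */
    for (int i = 0; i < NUM_LBAS; i++) {
        map[i].block = (uint8_t)(i / PAGES_PER_BLOCK + 1);
        map[i].page  = (uint8_t)(i % PAGES_PER_BLOCK + 1);
    }

    int lba = 5;
    printf("LBA %d -> (%u, %u)\n", lba, map[lba - 1].block, map[lba - 1].page);
    return 0;
}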

Over the last decade, many research efforts have been invested in designing effective mapping algorithms [14]. Generally, the mapping algorithms can be divided into three categories: page mapping, block mapping, and hybrid mapping.

(1) Page mapping. Page mapping indicates that a page can be mapped to any page within a block. However, the pages to be written should still be programmed sequentially in terms of SOP. When a physical page is required to be modified, the new data has to be written to another clean physical page, and the old page is invalidated. This is an out-of-place scheme. Page mapping has the best performance because it can delay erase operations as long as there are free pages in flash memory. However, with this scheme, the physical locations where valid data reside may change from time to time due to updates. Therefore, page mapping needs a large mapping table, which is not practical in flash memory based storage systems.

(2) Block mapping. Block mapping means that each logical block is mapped to a physical block. The logical block and the physical block employ the same page offset within a block. This is called an in-place scheme. It indicates that a page is written to a fixed location in a block in terms of the page offset. Block mapping requires a relatively small mapping table. However, when a specific page is required to be modified, the corresponding block has to be erased, and both the valid and invalid pages have to be copied to a new block. This incurs a significant overhead.

(3) Hybrid mapping. Since both page mapping and block mapping have disadvantages, hybrid mapping is a compromise between them. It first uses a block mapping to locate the corresponding physical block. It then employs a page mapping to obtain an available page within the physical block.

Optimizing the FTL has long been a hot research topic. Generally, the methods can be classified into four categories: (1) Designing the FTL for each target NAND application in terms of its performance, endurance, and memory requirements. For example, a reconfigurable FTL architecture [72] employs a flexible mapping structure to leverage the spatial and temporal locality of each target application. (2) Combining a coarse granularity mapping mechanism and a fine granularity mapping mechanism to achieve an effective mapping policy. This is because coarse granularity mapping reduces the resources required to maintain the mapping information, while fine granularity mapping is efficient in handling small writes [51]. For example, μ-FTL [61] is designed to dynamically adjust mapping granularities according to the size of write requests by maintaining mapping information in an extent-based μ-tree structure. (3) Designing effective searching and mapping algorithms. Searching and mapping algorithms are crucial for large-scale flash memory based storage systems since the overhead increases with the growth of the system scale [10]. (4) Designing specific FTLs. For instance, STAFF [15] assigns states to flash blocks and uses the states to control address mapping, thus reducing erasing. The Demand-based Flash Translation Layer (DFTL) [36] is a page-level address mapping based on selective caching. By leveraging the significant temporal locality contained in most enterprise-scale workloads, DFTL uses the limited SRAM integrated with flash memory to store the most popular (specifically, most recently used) mappings while the rest are maintained on the flash device itself. Unlike the currently predominant hybrid mappings, DFTL is a pure page mapping.

There are some other optimization approaches that adopt logging. For example, some algorithms classify all physical blocks into log blocks and data blocks. The log blocks and data blocks employ page mapping and block mapping, respectively. When a physical page is required to be updated, the data is first written into a log block and the corresponding old data in the data block is invalidated. When the number of full log blocks reaches a specific threshold, one of the log blocks is selected as a victim block, all the valid pages in the log block are moved to data blocks, and the log block is erased.
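The following sketch illustrates the generic log-block idea just described, and is not tied to any specific published FTL: updates go out-of-place into a log block, and the old page in the data block is marked invalid. Structures and sizes are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 64   /* assumed geometry */

typedef struct {
    bool valid[PAGES_PER_BLOCK];            /* validity of each page in the data block */
} data_block_t;

typedef struct {
    int      next_free;                     /* next free page (kept sequential for SOP) */
    uint32_t logical_page[PAGES_PER_BLOCK]; /* which logical page each log page holds */
} log_block_t;

/* Out-of-place update: append the new version to the log block and invalidate the
 * old copy in the data block. Returns the log page used, or -1 if the log block is
 * full and must be merged (victim selection and merge are omitted here). */
int log_block_update(data_block_t *db, log_block_t *lb, uint32_t page_offset)
{
    if (lb->next_free >= PAGES_PER_BLOCK)
        return -1;                          /* trigger merge + erase in a real FTL */

    int slot = lb->next_free++;
    lb->logical_page[slot] = page_offset;   /* page-level mapping inside the log block */
    db->valid[page_offset] = false;         /* old in-place copy becomes dead */
    return slot;
}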
EAST [54] is an efficient and advanced space management technique to improve the performance of flash memory. EAST reduces the number of operations in flash memory by optimizing the number of log blocks, by applying state transitions, and by using reallocation blocks. The major idea is to limit the number of log blocks, which decreases the size of the mapping table and thus reduces the RAM requirements. Another idea is to leverage the state transition to use both the in-place and the out-of-place technique within a block. The in-place technique writes data at the page offset of the resulting PBN, while the out-of-place technique writes data regardless of the offset. The in-place technique allows fast read accesses, and the out-of-place technique has the advantage of fully utilizing the block bandwidth. This guarantees that all the pages in data blocks are utilized before a log block is allocated.

Traditionally, the data consistency of a journaling file system is guaranteed by duplicating the same file system changes in both the journal regions and the home locations of the changes.

However, these duplications degrade the performance of the file system. JFTL [17] is an efficient FTL that handles the data consistency of journaling file systems based on a journal remapping technique. An FTL uses an address mapping method to write all data to a new region, in a process known as an out-of-place update, to avoid overwriting the existing data in flash memory. Based on this characteristic, the JFTL remaps the addresses of the logged file system changes to the addresses of the home locations of the changes, instead of writing the changes to flash memory again, thus eliminating the redundant data in flash memory as well as preserving the consistency of the journaling file system.

3.2.2 Garbage collection

A page of flash memory can be either writable or un-writable, and every page is initially writable. The writable pages are called free pages. A writable page becomes un-writable once it is written. Therefore, a very important feature of NAND flash is that pages cannot be rewritten before erasing. When a portion of the data on a page is modified, the new version of the data must be written to an available page somewhere else, and the old version is invalidated. The page which stores the old version of the data is considered dead, while the page which stores the newest version of the data is considered live. When the free storage capacity becomes low, garbage collection is triggered to recycle the dead pages. Because erasing is performed in blocks, the valid pages in the recycled blocks have to be copied elsewhere before erasing the blocks. The erasing involves significant valid data copying, and could be initiated when the flash memory storage system has a large amount of valid and invalid data mixed together. Garbage collection normally involves a series of reads, writes, and erases. Therefore, performance is normally very low while the system is performing garbage collection. A poor garbage collection policy could quickly wear out a block and a flash memory chip. An optimized algorithm can reduce the impact to a certain degree.

Chang and Kuo [10] defined the garbage collection problem as follows. Assume a set of contiguous pages on flash memory is a physical cluster (PC). The status of a PC is a combination of (free/live) and (clean/dirty), namely, live-clean PC (LCPC), live-dirty PC (LDPC), free-clean PC (FCPC), and free-dirty PC (FDPC). A free PC indicates that the PC is available for allocation, and a live PC means that the PC is occupied by valid data. A dirty PC is a PC that might be involved in garbage collection for block recycling, whereas a clean PC should not be copied to any other space, even for garbage collection, unless wear-leveling is considered. An LCPC, an FCPC, and an FDPC are a set of contiguous live pages, a set of contiguous free pages, and a set of contiguous dead pages, respectively. An LDPC is a set of contiguous live pages, but it could be involved in garbage collection. Suppose that there is a set of n available FDPCs {fd_1, fd_2, fd_3, ..., fd_n}. Let D_tree = {dt_1, dt_2, dt_3, ..., dt_m} be the set of the proper dirty sub-trees of all FDPCs. Let f_size(dt_i) be a function on the number of dead pages of dt_i, and f_cost(dt_i) be a function on the cost of recycling dt_i. Given two constants S and R, the problem is to find a subset D'_tree of D_tree such that the number of pages reclaimed in a garbage collection is no less than S and the cost is no more than R. They also showed that this garbage collection problem is NP-complete.
In order to tackle this challenge, various techniques have been proposed to improve the performance of garbage collection for flash memory. Kawaguchi et al. [49] designed a cost-benefit policy which employs a value-driven heuristic function based on the cost and the benefit of recycling a specific block.
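A minimal sketch of such value-driven victim selection follows. The score used here, age x (1 - u) / 2u (with u the fraction of still-valid pages), is the classic cost-benefit form from log-structured cleaning; it is in the spirit of [49] but not necessarily the exact function used there, and all constants are assumptions:

#include <stdint.h>

#define NUM_BLOCKS      1024
#define PAGES_PER_BLOCK 64

typedef struct {
    int      valid_pages;    /* live pages that must be copied before erase */
    uint32_t last_write_age; /* time since the block was last written (arbitrary ticks) */
} block_info_t;

/* Pick the block with the highest benefit/cost ratio: prefer blocks that are mostly
 * dead (small u) and have not been written recently (large age), since their remaining
 * live data is less likely to be invalidated again soon. */
int select_victim(const block_info_t blocks[NUM_BLOCKS])
{
    int best = -1;
    double best_score = -1.0;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        double u = (double)blocks[i].valid_pages / PAGES_PER_BLOCK;
        if (u >= 1.0)
            continue;                    /* nothing to reclaim in a fully live block */
        double score = blocks[i].last_write_age * (1.0 - u) / (2.0 * u + 1e-9);
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;                         /* -1 if no candidate was found */
}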

The cost-benefit policy avoids recycling a block that contains recently invalidated data, because the policy assumes that more live data on such a block might be invalidated soon. Chiang et al. [18] proposed to track the heat of each LBA, where the heat of an LBA denotes how often the LBA is written. This method attempts to avoid recycling a block that contains much live and hot data, because any copying of such data is usually considered inefficient. Because garbage collection might impose unbounded blocking time on tasks, Chang et al. [19] designed a real-time garbage-collection mechanism which provides guaranteed performance for hard real-time systems and supports non-real-time tasks so that the potential bandwidth of the storage system can be fully utilized. A wear-leveling process is executed as a non-real-time service to resolve the endurance problem of flash memory. No matter what garbage collection policies are employed, the goal of garbage collection is to choose blocks to recycle so as to minimize the number of copy operations needed, and flash memory based storage systems have to consider the limit on the number of erases of each erasable unit of flash memory (the endurance cycle).

3.2.3 Wear-leveling

Another important feature of NAND flash memory is its limited endurance. A block will wear out after a specified number of program/erase cycles, ranging from 10,000 to 100,000. A wear-leveling process attempts to evenly distribute the data between memory cells to guarantee that no one cell is overly burdened. Because the whole flash memory chip starts to malfunction once some of its blocks are worn out, a good wear-leveling scheme should be able to keep an even distribution of erase cycle counts across all the blocks. Traditionally, wear-leveling is combined with the garbage collection policy. A typical scenario goes like this [19]: a wear leveller sequentially scans the blocks to check the erase count, which indicates the number of erases that have been performed on the block so far. When the wear leveller finds a block with a relatively small erase count, it first copies live pages from the block to some other blocks, and then sleeps for a while. The reason why a block has a relatively small erase count is that the block contains much live-cold data (data which has not been invalidated or updated for a long period of time). By moving the cold data away and moving hot data in, the erase count of the block will grow gradually. As the wear leveller repeats the scanning and live-page copying, dead pages on the blocks with relatively small erase counts will gradually increase. As a result, those blocks will be selected by the garbage collection policy sooner or later, and their erase counts will grow.

Chang and Kuo [10] proposed a dual-pool approach to tackle wear-leveling in a large-scale flash memory based storage system. The flash memory is partitioned into many virtual groups. Each virtual group has its own local wear-leveling algorithm. Because different virtual groups may have different life spans, a global algorithm is required to balance the life span across all the virtual groups. They assign an erase cycle count to every virtual group. Whenever the maximum difference between the erase cycle counts of any two virtual groups is larger than a threshold, the two virtual groups are swapped in terms of their mapped physical space. By doing so, all virtual groups can have approximately equal erase cycles. They further enhanced the algorithm by dividing virtual groups into a hot pool and a cold pool.
Hot data and cold data are stored in the hot pool and the cold pool, respectively.

Garbage collection and wear-leveling have two different objectives which can conflict with each other. A garbage collection scheme prefers to recycle blocks which have a small number of valid pages. On the contrary, a wear-leveling policy normally recycles blocks which have not been erased for a certain amount of time, in order to eliminate excessive writes to the same physical flash memory location. These blocks usually store much valid and read-only data.
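A minimal sketch of the erase-count-difference trigger behind the dual-pool idea described above; the threshold, the group bookkeeping, and the swap stub are illustrative simplifications, not the published algorithm:

#include <stdint.h>

#define NUM_GROUPS     16
#define SWAP_THRESHOLD 100   /* assumed erase-count difference that triggers a swap */

static uint32_t erase_count[NUM_GROUPS];

/* Stub: remap the physical space of two virtual groups; a real design also migrates
 * the live data so that cold data lands on worn blocks and hot data on fresh ones. */
static void swap_groups(int a, int b) { (void)a; (void)b; }

/* Whenever the gap between the most- and least-erased groups exceeds the threshold,
 * swap them to even out wear across the whole device. */
void wear_level_check(void)
{
    int min = 0, max = 0;
    for (int i = 1; i < NUM_GROUPS; i++) {
        if (erase_count[i] < erase_count[min]) min = i;
        if (erase_count[i] > erase_count[max]) max = i;
    }
    if (erase_count[max] - erase_count[min] > SWAP_THRESHOLD)
        swap_groups(max, min);
}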

Because it is not necessary to perform garbage collection if there are already sufficient free pages in the system, garbage collection can be activated only when the number of free pages falls below a threshold. A typical wear-leveling aware garbage collection policy might sometimes recycle the block that has the smallest number of erases, regardless of how many free pages can be reclaimed. Therefore, it is always a challenge to strike a balance between garbage collection and wear-leveling.

3.2.4 Power-off recovery

Since flash memory is usually employed in portable devices which are powered by battery, power can fail at any time when a system runs out of battery. For flash disks, power off also occurs when users unplug the flash disks from a computer. Therefore, power-off recovery in flash memory systems is crucial and is a basic requirement for system safety. There are several challenges in handling power-off recovery: (1) The data structures of the FTL should be recovered and their consistency has to be maintained [14]. (2) The file system should be able to recover from the power failure and still preserve data integrity. For example, when a system crashes, the last few transactions performed on the flash disk may be in an inconsistent state. When the system is next powered on, the operating system must review these operations in order to correct any inconsistencies, and the controller scans the flash to reconstruct the volatile data structures. (3) It is essential that device power-up be fast, since users are unlikely to tolerate a perceptible slowdown [6].

Birrell et al. [6] assumed that after a power failure, capacitors can provide sufficient energy to complete any write that was in progress, but no new operation can be started. This assumption does not include erase, since it could take more than ten minutes to complete once it is initiated. Based on this assumption, they organized the volatile data structures in such a way that they can be regenerated rapidly when power is applied, and managed the data structures in a way that allows power to fail at any time. PORCE [16] is an efficient power-off recovery scheme for flash memory. It is designed to guarantee system safety without much performance degradation by storing as little recovery information as possible. PORCE divides write operations into simple writes and writes incurring a reclamation operation. For a simple write, the number of write operations for the valid mark is minimized, which reduces the performance penalty. For a reclamation write, PORCE writes a small number of log records during the reclamation. By leveraging the log information, PORCE guarantees system safety even if there is a failure during the reclamation process. Additionally, the erase count is used in the cost function of the reclamation, which increases the life span of the flash memory. Wu et al. [92] proposed a method for efficient initialization and crash recovery for flash memory file systems. The method implements a log management method with a Log Record Manager (LRM) and a logger. The LRM collects log records in main memory and merges/deletes them whenever necessary. The logger commits log records (processed by the LRM) onto flash memory with a data structure called check regions. The check regions provide fast references to log records stored on flash memory. During initialization or crash recovery, the housekeeping data structures of the flash memory file system can be properly and efficiently constructed by scanning the check regions.
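As a rough illustration of commit-record style recovery, loosely inspired by the log-record and check-region ideas above but not the actual PORCE [16] or LRM [92] formats, a reclamation can be bracketed by a small record so that an interrupted copy is detected and redone on power-up. All structures and helpers are hypothetical:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical on-flash record written around a reclamation (illustrative layout). */
typedef struct {
    uint32_t victim_block;   /* block being reclaimed */
    uint32_t target_block;   /* block receiving the valid pages */
    uint8_t  completed;      /* 0 = reclamation started, 1 = reclamation finished */
} reclaim_log_t;

/* Stub helpers standing in for real flash accesses. */
static void redo_copy(uint32_t v, uint32_t t) { printf("redo copy %u -> %u\n", v, t); }
static void erase_block(uint32_t b)           { printf("erase block %u\n", b); }

/* Power-up recovery: if the last reclamation never completed, redo the (idempotent)
 * copy and only then erase the victim block. */
void recover(const reclaim_log_t *last, bool log_found)
{
    if (log_found && last->completed == 0) {
        redo_copy(last->victim_block, last->target_block);
        erase_block(last->victim_block);
    }
}

int main(void)
{
    reclaim_log_t log = { .victim_block = 7, .target_block = 12, .completed = 0 };
    recover(&log, true);
    return 0;
}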
4. Solid State Disk (SSD)

As discussed in Section 1, the functionalities of the FTL and MTD can be integrated into the firmware inside a flash memory device. In this section, we discuss the architecture and the optimization methods used to improve the performance and reduce the energy consumption of flash memory based storage systems.

4.1 SSD architecture

Fig. 6 SSD logic components (host interface logic, processor, RAM, buffer manager, and a multiplexer connecting multiple channels of flash chips)

The traditional term SSD refers to semiconductor devices. Therefore, an SSD uses semiconductors to emulate the external I/O interfaces (e.g. Fibre Channel, SCSI, SATA, PATA, USB, etc.) of conventional magnetic disk drives. An SSD can thus be used in existing computing systems without any modifications. An SSD commonly consists of either DRAM volatile memory or NAND flash non-volatile memory. A DRAM based SSD requires an internal battery and a backup disk drive to guarantee data persistence. This is why most current SSDs employ non-volatile flash memory as the storage media (e.g. USB memory sticks can be regarded as naive SSDs). Agrawal et al. [1] describe a general block diagram for an SSD. An SSD basically consists of host interface logic, a logical disk emulation, an internal buffer manager, a multiplexer, a processing engine, flash chips, etc. Fig. 6 shows the logic components of a typical SSD. Please refer to [1] for more details. The SSD eliminates the electromechanical noise and delay inherent in magnetic disk drives.

4.2 SSD performance

Table 4. Characteristics of typical SSDs
Type                       ZEUS 2GB         RamSan-20      MCCOE64G5MPP   SP064GBSSD750S25
Manufacturer               STEC             Texas Memory   Samsung        Silicon Power
Capacity                                    450 GByte      64 GByte       64 GByte
Interface                  FC               PCI-e          SATA/300       SATA/300
Average access time                         50 µs          0.12 ms        0.20 ms
Sustained read bandwidth   200 MB/sec       700 MB/sec     90.6 MB/sec
Sustained write bandwidth  100 MB/sec       500 MB/sec     83.7 MB/sec    33.5 MB/sec
Sustained IOPS             50,000 (random)  80,000 (R/W)
Power (Idle) (Watt)        5.4              N/A
Power (Max) (Watt)         8.4/8.1 (R/W)
Startup power (average)    14.1 Watt        N/A            N/A            N/A
Startup time (average)     30 sec           N/A            N/A            N/A

The advent of NAND-flash based SSDs represents a sea change in the architecture of computer storage subsystems. The SSD is capable of producing exceptional bandwidth and random I/O performance in comparison to magnetic disk drives. Table 4 illustrates the characteristics of four SSDs from four different manufacturers, where the ZEUS 2GB FC SSD and the RamSan-20 are high-end server-level SSDs, and the MCCOE64G5MPP and SP064GBSSD750S25 are low-end SSDs [28, 90, 95]. The table demonstrates that performance and power vary significantly between manufacturers and devices. Because an SSD has no moving read/write heads and rotating platters like magnetic disk drives, there is no seek time or rotational latency. Average access times range from 20 µs to 120 µs. The ZEUS SSD, for example, can work at a sustained data transfer rate of up to 200 MB/sec, is capable of performing 50,000 random IOPS, and can be powered from a single 12 V source.

To better understand the behavior of flash memory based SSDs, Polte et al. [78] investigated the performance of several high-end consumer and enterprise flash memory based SSDs, and compared their performance to a few magnetic disk drives. They reported that for sequential access patterns, the SSDs are up to 10 times faster for reads and up to 5 times faster for writes than the magnetic disk drives. For random reads, the SSDs provide up to a 200x performance advantage. For random writes, the SSDs provide up to a 135x performance advantage. They concluded that SSDs are approaching the price per performance of magnetic disk drives for sequential access patterns and are a superior technology to magnetic disk drives for random access patterns. In contrast to the sophisticated controllers of the enterprise SSDs, they also discovered that the consumer SSDs perform even worse than the magnetic disk drives for small random writes. Their test results showed that the consumer SSDs achieve between 100 and 200 IOPS, while the disk drives offer between 300 and 500 IOPS. Therefore, we believe that intelligent algorithms (e.g. an inherent log-structured pattern of writing, maintaining a pool of pre-erased blocks, coalescing writes to minimize rewriting data without changing it, servicing write requests in parallel, etc.), which can be integrated into the controllers of SSDs, have significant impacts on the performance of SSDs. Chen et al. [12] reported that the integrated cache of an SSD has a significant impact on write performance. Their experiments show that SSDs experience a significant increase in latency when the cache is disabled. Actually, some low-end SSDs do not employ a cache in order to reduce cost. This is also one of the reasons why we can observe such large performance gaps between different SSDs, as indicated in Table 4.

4.2.1 Data access pattern

Many common access patterns approximate a Zipf-like distribution, indicating that a few blocks are frequently accessed and others much less often [91]. Staelin and Garcia-Molina [84] observed that there was a very high locality of reference on very large file systems (on the order of one terabyte). Some files in the file system have a much higher skew of access than others. The skew of disk I/O accesses is often referred to as the 80/20 rule of thumb, or in more extreme cases, the 90/10 rule. The 80/20 rule indicates that twenty percent of storage resources receive eighty percent of I/O accesses, while the other eighty percent of resources serve the remaining twenty percent of I/O accesses. Furthermore, the percentages can be applied recursively.
For example, twenty percent of the twenty percent of storage resources serve eighty percent of the eighty percent of I/O accesses; that is, 4% of the resources serve 64% of the accesses [32, 84]. Data access patterns have a significant impact on flash based storage systems. For example, frequently accessed data greatly impacts garbage collection, performance, and lifespan due to wear-leveling. Hsieh et al. [38] proposed an efficient method of on-line hot data identification to reduce those impacts. As introduced in Section 3.1, the LBA is used to identify the data access address.

The approach employs K independent hash functions to hash a given LBA into multiple entries of an M-entry hash table in order to track the write count of the LBA, where each entry is associated with a counter of C bits. Whenever a write request is issued to the FTL, the corresponding LBA is hashed simultaneously by the K hash functions. Each counter corresponding to the K hashed values is incremented by one to record the fact that the LBA has been written again. After every given number of page writes, the values of all counters are divided by two by right-shifting their bits. This is an aging mechanism that exponentially decays the recorded write counts as time goes on. Whenever an LBA needs to be checked for hot data, the LBA is hashed simultaneously by the K hash functions. The data addressed by the given LBA is considered hot if the H most significant bits of every counter of the K hashed values contain a nonzero bit value. The reader is referred to [38] for details.
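A minimal sketch of this multi-hash counting scheme, assuming K = 2 hash functions, an M-entry table of 4-bit counters, and H = 2 checked bits; the hash functions and constants are illustrative assumptions, not those of [38]:

#include <stdint.h>
#include <stdbool.h>

#define M 1024u           /* hash table entries (assumed) */
#define K 2               /* independent hash functions (assumed) */
#define H 2               /* most significant counter bits checked for hotness (assumed) */
#define COUNTER_MAX 15u   /* 4-bit counters */

static uint8_t table[M];

/* Two simple, illustrative hash functions over the LBA. */
static uint32_t hash_lba(uint32_t lba, int k)
{
    return (k == 0) ? (lba * 2654435761u) % M : ((lba ^ (lba >> 7)) * 40503u) % M;
}

/* Called on every write: bump the K counters for this LBA (saturating). */
void record_write(uint32_t lba)
{
    for (int k = 0; k < K; k++) {
        uint32_t idx = hash_lba(lba, k);
        if (table[idx] < COUNTER_MAX)
            table[idx]++;
    }
}

/* Aging: periodically halve every counter by a right shift. */
void decay(void)
{
    for (uint32_t i = 0; i < M; i++)
        table[i] >>= 1;
}

/* Hot if, for every hashed counter, the H most significant bits contain a nonzero bit. */
bool is_hot(uint32_t lba)
{
    for (int k = 0; k < K; k++) {
        uint8_t c = table[hash_lba(lba, k)];
        if ((c >> (4 - H)) == 0)     /* top H bits of the 4-bit counter are all zero */
            return false;
    }
    return true;
}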
Compared with magnetic disk drives, NAND flash memory based SSDs exhibit much better performance for random reads, and similar or better performance for sequential reads and sequential writes. However, SSDs exhibit worse performance for random writes due to their unique physical characteristics [52]. Seo et al. [81] analyzed the performance patterns of SSDs in comparison to magnetic disk drives by using benchmarks. Their results show that the throughput of magnetic disk drives differs little between read and write, and that the request size and the randomness have significant impacts on throughput. However, an SSD shows similar performance for all operations except random writes. In contrast to this evaluation, Chen et al. [12] reported a strong correlation between performance and the randomness of data accesses, for both reads and writes. They conducted an intensive evaluation on different types of SSDs ranging from low-end to high-end products. Generally, significant advances have been made in SSD hardware design, and the performance can satisfy the requirements of many regular workloads. However, in the worst case of their stress test, the average latency increases from 1.69 ms to 151.2 ms (an 89-fold increase) and the bandwidth drops to only 0.025 MB/sec. Such a low bandwidth is even 34 times lower than that of a 7200 RPM Western Digital hard disk (0.85 MB/sec), and the whole flash based storage system is nearly unusable. Therefore, as the authors suggested, from the performance perspective it is still too early to draw a firm conclusion that SSDs will soon replace magnetic disk drives.

4.2.2 Cache

A disk cache, which is a small amount of on-board RAM, is widely employed in disk drives to avoid physical I/O accesses by leveraging data locality, thus improving performance. Today's SDRAM has an access time ranging from 7 to 10 nanoseconds. Assume that 512 bytes of data (one sector) need to be accessed in an SDRAM that has a 64-bit chip configuration and a 10 nanosecond access time. The data access overhead is about 0.00064 milliseconds (64 transfers of 8 bytes at 10 ns each). This overhead is only 0.032% of the average access time of the latest Hitachi Ultrastar 15K, which has an average access time of 2 milliseconds and an RPM of 15,000. Because accessing data from the cache is much faster than accessing the disk media, the disk cache can significantly improve performance by avoiding slow mechanical latencies whenever a data access is satisfied from the disk cache (a cache hit). The disk cache works on the premise that the data in the cache will be reused often, temporarily holding data and thus reducing the number of physical accesses to the disk media [24]. To achieve this goal, caches exploit the principles of spatial and temporal locality of reference. Spatial locality implies that if a block is referenced, then nearby blocks will also be accessed soon. Temporal locality implies that a referenced block will tend to be referenced again in the near future. As the data locality improves, the hit ratio of the disk cache grows. Of all the I/O optimizations that increase the efficiency of I/Os, reducing the number of physical disk I/Os by increasing the hit ratio of the disk cache is the most effective method to improve disk I/O performance.

According to Table 3, NAND flash memory has read performance comparable to that of DRAM. However, the write performance is comparatively slow and lags far behind that of DRAM. A distinct physical restriction of flash memory is that it must be erased before writing. This means that any existing data on flash memory cannot be updated directly unless it is first erased. This is completely different from DRAM, and results in relatively poor random write performance. Like the disk cache, combining a small cache with the flash memory also has a significant impact on its performance.

In order to improve the random write performance of flash disks that are used to support commonly-used file systems like FAT32 and NTFS, Birrell et al. [6] proposed to integrate sufficient RAM into flash disks to hold data structures describing a fine-grained mapping between disk logical blocks and physical flash addresses. The method can significantly improve the performance of the flash disk. A remaining issue is how to reduce the size of the volatile data structures. A write buffer management scheme called Block Padding Least Recently Used (BPLRU) is designed to fully exploit the characteristics of flash storage to improve its random write performance [52]. BPLRU attempts to establish a desirable write pattern with RAM buffering. It updates the LRU list considering the size of the erasable block to minimize the number of merge operations in the FTL, changes fragmented write patterns into sequential ones to reduce the buffer flushing cost, and adjusts the LRU list to use RAM for random writes more effectively. A smart buffer cache has been designed and integrated into a NAND flash memory package to enhance spatial and temporal locality [60]. The buffer cache consists of a fully-associative victim buffer with a small page size, a fully-associative spatial buffer with a large page size, and a dynamic fetching unit. The victim buffer can reduce conflict misses and write operations effectively. The spatial buffer is designed to improve spatial locality. The dynamic fetching unit is employed to handle the different spatial locality exhibited by different applications. This new flash memory package can achieve higher performance and lower power consumption compared with a conventional NAND-type flash memory module. The traditional buffer replacement policies do not consider the asymmetric I/O latencies of flash memory. LRU-WSR [47] is a buffer replacement algorithm that enhances LRU by reordering writes of not-cold dirty pages from the buffer cache to flash memory. The LRU-WSR algorithm focuses on reducing the number of write/erase operations as well as preventing serious degradation of the buffer hit ratio. FAB [44] is a flash-aware buffer management scheme that reduces the number of erase operations by selecting a victim based on its page utilization rather than on the traditional LRU policy. Chameleon [94] is a high performance Flash/FRAM hybrid SSD architecture designed to tackle the challenge of small random writes on flash memory. The key point is maintaining the metadata used by the FTL in a small FRAM to handle intensive small random writes. Using write buffering can improve performance for short bursts of writes, but sustained random writes remain a challenge. The above discussions indicate that the integrated SRAM cache has a significant impact on the performance of SSDs. Chen et al. [12] reported a significant degradation of write performance when the integrated cache was disabled.
However, in order to maintain performance, one of the main difficulties in designing an FTL is making full use of the constrained size of the integrated SRAM-based cache where it stores its mapping table. For example, a 16 GB flash device requires at least 32 MB of SRAM to be able to map all its pages [36]. With the capacity growth of SSDs, this SRAM size is unlikely to scale proportionally due to the cost.
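A rough back-of-the-envelope check of that figure, assuming 2 KB pages and 4-byte mapping entries (both parameters are assumptions, not values stated in [36]): a 16 GB device has 16 GB / 2 KB = 8,388,608 pages, and 8,388,608 entries x 4 bytes = 32 MB of SRAM for a full page-level mapping table.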

4.2.3 Parallelism

Parallel processing, as in Redundant Arrays of Inexpensive Disks (RAID) [75], is a milestone in the evolution of storage systems. RAID uses multiple disk drives to store and distribute data. The basic idea of RAID is to divide data into strips and spread the strips across multiple disk drives, thus aggregating storage capacity, performance, and reliability. Different distribution policies endow the RAID subsystem with different features [23]. RAID0 is not widely used because it does not provide fault tolerance, but it offers the best performance. RAID1 offers the best fault tolerance but wastes storage capacity. RAID5 tolerates the failure of a single disk drive by sacrificing some performance. RAID6 is normally employed to tolerate the failure of two disk drives.

An SSD employs multiple flash memory chips to aggregate storage capacity. This also creates opportunities for improving performance by leveraging parallelism. A high performance NAND flash memory based storage system is proposed in [50]. The system consists of multiple independent channels, where each channel has multiple NAND flash memory chips, similar to Fig. 6. By leveraging the multi-channel architecture, the authors exploited intra-request parallelism and inter-request parallelism to improve performance. The intra-request parallelism is achieved by using a striping technique like RAID. This approach divides one request into multiple sub-requests and distributes the sub-requests across multiple channels, thus obtaining parallel execution of a single request. The inter-request parallelism refers to the parallel execution of many different requests to improve throughput. Interleaving and pipelining are used to exploit the inter-request parallelism. With the interleaving method, several requests are handled in parallel by several channel managers. The pipelining solution overlaps the processing of two requests within one single channel.

Striping takes advantage of I/O parallelism to boost the performance of flash memory systems. However, it cannot succeed without considering the characteristics of flash memory. Due to the limited endurance cycles, a RAID solution could wear out the redundant flash memory chips at similar rates. For instance, RAID5 causes all SSDs to wear out at approximately the same rate by balancing the write load across chips. This also applies to RAID1: two mirrored chips could reach their endurance cycles at almost the same time. Therefore, using RAID technology to organize the NAND flash memory chips incurs correlated failures. Diff-RAID [48] is proposed to alleviate this challenge in two ways. First, the parity of data blocks is distributed unequally across flash devices. Because each write has to update the parity block, this unequal distribution forces some flash devices to wear out faster than others. Second, Diff-RAID redistributes parity during device replacements to ensure that the oldest device always holds most of the parity and ages at the highest rate. When this oldest device is replaced at some threshold age, its parity blocks are redistributed across all the devices in the array and not just to its replacement.

In contrast to the above work, Chang and Kuo [11] proposed a striping approach that leverages the physical features of flash memory to achieve parallelism. As introduced in Section 3.1, accessing the flash memory media normally consists of a setup phase and a busy phase.
In contrast to the above work, Chang and Kuo [11] proposed a striping approach that leverages a physical feature of flash memory to achieve parallelism. As introduced in Section 3.1, accessing the flash memory media normally consists of a setup phase and a busy phase. Suppose a NAND flash memory device is composed of multiple NAND banks, where each bank is a flash unit that can operate independently. When one bank is in its busy phase, the system can immediately switch to another bank, so that multiple banks operate simultaneously (see the sketch below). Based on this parallelism, a joint striping and garbage-collection mechanism is designed to boost performance while reducing the performance degradation incurred by garbage collection.
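The following toy model shows, under assumed timing values, why overlapping the busy phases of independent banks behind a shared bus shortens the total service time; it is not the mechanism of [11], only a sketch of bank interleaving:

```python
# A minimal sketch of bank-level interleaving: while one bank is in its busy
# phase, the controller immediately starts the next queued operation on an
# idle bank. The bank count and setup/busy durations are assumed values.

NUM_BANKS = 4
SETUP_US = 25    # time the shared bus is held to transfer command and data
BUSY_US = 200    # time the bank is internally busy programming a page

def makespans(num_ops):
    """Compare serial execution with interleaved execution across banks."""
    serial = num_ops * (SETUP_US + BUSY_US)
    bank_free = [0] * NUM_BANKS   # when each bank becomes idle again
    bus_free = 0                  # when the shared bus becomes idle again
    for _ in range(num_ops):
        bank = min(range(NUM_BANKS), key=lambda b: bank_free[b])  # earliest idle bank
        start = max(bank_free[bank], bus_free)
        bus_free = start + SETUP_US            # only the setup phase occupies the bus
        bank_free[bank] = bus_free + BUSY_US   # the busy phase proceeds inside the bank
    return serial, max(bank_free)

print(makespans(16))   # (3600, 975): the busy phases of different banks overlap
```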

Agrawal et al. [1] discussed the potential issues that can significantly impact SSD performance. They reported that the bandwidth and operation rate of any single flash chip are not sufficient to achieve optimal performance; hence, memory components must be coordinated so as to operate in parallel. They also suggested carefully placing data across the multiple flash chips of an SSD to achieve load balance and effective wear-leveling, and proposed write ordering to handle random writes. Gray and Fitzgerald [33] also discussed opportunities for enhancing SSDs, such as using a non-volatile address map, employing a block buffer, adding logic for copy-on-write snapshots, writing data stripes across the chip array, and so on.

4.2.4 Programmed Input/Output

Generally, there are two kinds of I/O techniques for storage devices: (1) Programmed Input/Output (PIO). PIO is a technique in which the system processor and the hardware directly control the transfer of data between the system and the corresponding device (e.g. a disk drive). In this scenario, software uses instructions that access the I/O address space to perform data transfers to or from an I/O device. The processor is fully occupied for the entire duration of a transfer, since it has to monitor and poll the status registers of the flash memory until the command is completed, and is therefore unavailable for other work. This means the PIO model is suited to lower-performance, single-task applications. (2) Direct Memory Access (DMA). DMA allows certain hardware to access system memory for reading and/or writing independently of the system processor. In this scenario, the processor initiates a data transfer and switches to other tasks while the transfer is in progress. Once the transfer is done, the processor receives an interrupt from the DMA controller and completes the final processing of the transfer. Therefore, DMA incurs much less processor overhead than PIO.

Flash memory is a PIO device. Consequently, data accesses to flash memory are very processor consuming. For example, to handle a page write request, the processor first downloads the data to the flash memory, issues the address and the command, and then monitors the status of the flash memory until the command is completed. Page reads and block erases are handled in a similar way [19]. An SSD integrates a processor into the flash device to handle the data transfer, thus freeing the system processor from polling the status registers. For flash memory that is managed directly by the host, Wu [93] proposed an I/O request mechanism integrated into the MTD layer to manage PIO-based data transfers for flash memory based storage systems. The method revises the waiting function in the MTD layer to relieve the microprocessor from busy-waiting, thus making more processor cycles available for other tasks. With this method, each I/O request is inserted into a queue for scheduling, and an isolated process is responsible for the scheduling, dispatch, and finalization of each request (a simplified sketch is given below). However, we believe that an interrupt-driven flash controller is required to completely relieve the processor embedded in SSDs.
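The sketch below captures the general idea in a few lines of Python: requests are placed on a queue and a single dispatcher yields the processor while the chip is busy, instead of every caller spinning on the status register. It is a toy model (the FakeChip class and its timings are assumptions), not Wu's MTD implementation [93]:

```python
# A minimal sketch of queued, yield-based handling of PIO requests. Callers
# enqueue I/O requests instead of busy-waiting on the flash status register;
# one dispatcher services the queue and yields the CPU while the chip is busy.
import queue, threading, time

class FakeChip:
    BUSY_US = 200e-6                       # assumed page-program busy time (seconds)
    def start_program(self, page, data):
        self._done_at = time.time() + self.BUSY_US
    def is_busy(self):
        return time.time() < self._done_at

requests = queue.Queue()

def dispatcher(chip):
    while True:
        page, data, done = requests.get()  # blocks without consuming CPU
        chip.start_program(page, data)
        while chip.is_busy():
            time.sleep(0)                  # yield instead of spinning on the register
        done.set()                         # finalize the request

threading.Thread(target=dispatcher, args=(FakeChip(),), daemon=True).start()

done = threading.Event()
requests.put((42, b"\xff" * 2048, done))   # submit a page write and wait for completion
done.wait()
```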

4.3 Energy consumption of SSD

Fig. 7 Power state transition of a typical flash memory chip (states: operation, standby, deep standby, power off)

Magnetic disk drives involve both mechanical components (e.g. the spindle motor used to spin the platters and the head actuators) and electronic components (e.g. analog-to-digital converters, interface logic, etc.) [22]. Energy is consumed by both kinds of components. However, because an SSD has no mechanical components, and the flash memory chips residing in it have low power consumption, most of its energy is consumed by the electronic components [81]. Therefore, SSDs are expected to consume much less energy than magnetic disk drives.

Table 5 Current required for different power states of a typical flash memory chip
Supply   Read     Erase/Write   Standby   Deep Standby
3.3V     100mA    80mA          100µA     10µA
5V       140mA    120mA         200µA     20µA

Normally, we can achieve energy conservation by switching disk drives between different power states. The premise is that the disk has to stay in a low power state for a sufficiently long period of time, so that the saved energy outweighs the energy needed to spin the disk up again. This technique can also be applied to flash memory, since flash memory chips support multiple power states. For example, the HN29W12814A [39] flash memory chip has four power states: operation, standby, deep standby, and power off. Fig. 7 depicts the power state transitions of such a chip, labelled with the sequence numbers used in the following description. A flash memory chip performs work, including reads, writes, and erases, while in the operation state. (1) When an operation is completed and there is no succeeding request, the chip can be moved to the standby state. (2) If the chip receives a request while in the standby state, it is moved back to the operation state. (3) To conserve more energy, the chip can be further switched to the deep-standby state, or even powered off. (4) To serve requests after entering a low power state, the chip must first be brought back to the operation state. Table 5 summarizes the current required for the different power states of the HN29W12814A [39]. It shows that a chip in the standby or deep-standby state draws considerably less current than one in the operation state.

Table 6 Major energy related characteristics of different disk drives
Parameters          IBM 36Z15    IBM 73LZX    IBM 40GNX
RPM                 15,000       10,000       5,400
Power (Active)      13.5 Watt    9.5 Watt     3.0 Watt
Power (Idle)        10.2 Watt    6.0 Watt     0.82 Watt
Power (Standby)     2.5 Watt     1.4 Watt     0.25 Watt
Energy (Spin Down)  13.0 Joule   10.0 Joule   0.4 Joule
Energy (Spin Up)    -            97.9 Joule   8.7 Joule
Time (Spin Down)    1.5 Sec      1.7 Sec      0.5 Sec
Time (Spin Up)      10.9 Sec     10.1 Sec     3.5 Sec

Carrera et al. [8] summarize the major energy related characteristics of three different disk drives. To keep this paper self-contained, we list the parameters in Table 6. According to Table 3, Table 5, and Table 6, flash memory chips draw considerably less power than magnetic disk drives, since they have no mechanical components. However, as discussed in Section 4.1, an SSD comprises not only flash memory chips but also a processor, RAM, interface logic, and other circuits, which increase its energy consumption.
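The break-even reasoning behind such power-state switching can be sketched as follows, using the 3.3V currents of Table 5. The transition overhead is an assumed value for illustration, not a datasheet figure:

```python
# A minimal sketch of the break-even reasoning for power-state switching,
# using the 3.3 V currents of Table 5. The energy spent entering and leaving
# deep standby is an assumed value, not taken from the HN29W12814A datasheet.

V = 3.3
P_STANDBY      = V * 100e-6       # 100 uA  -> ~0.33 mW
P_DEEP_STANDBY = V * 10e-6        # 10 uA   -> ~0.033 mW
E_TRANSITION   = 5e-6             # assumed: 5 uJ to enter and leave deep standby

def breakeven_seconds(p_high, p_low, e_transition):
    """Idle time beyond which dropping to the lower state saves net energy."""
    return e_transition / (p_high - p_low)

t = breakeven_seconds(P_STANDBY, P_DEEP_STANDBY, E_TRANSITION)
print(f"Deep standby pays off for idle periods longer than {t * 1e3:.1f} ms")
# With the assumed 5 uJ overhead the break-even point is roughly 17 ms;
# shorter idle gaps are better served by staying in standby.
```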

Table 4 shows that the SSD draws 8.4 Watts at its maximal read bandwidth and 8.1 Watts at its maximal write bandwidth. This is comparable to the 9.5 Watts of the moderate-performance server disk drive (IBM 73LZX), and much higher than the 3.0 Watts of the laptop disk drive (IBM 40GNX). According to these statistics, energy conservation is unlikely to be the most important motivation for moving from magnetic disk drives to SSDs.

The power state transition incurs both performance and energy penalties. Table 4 shows that it takes 423 Joule to start the SSD, which is much larger than the energy needed to spin up even the high-performance server disk drive listed in Table 6 (IBM 36Z15). Therefore, switching the SSD between different power states incurs significant energy and delay penalties. This may be because the high-end SSD contains a processor, RAM, interface logic, and other circuits that all consume energy. Unfortunately, due to the lack of publicly available parameters for other SSDs, this claim requires further confirmation, but we believe it may be more practical to switch individual flash memory chips between power states rather than the whole SSD. By leveraging the different power states, Wu [93] achieved energy conservation in a multi-bank flash memory system [53]. When the system is idle, each bank can switch to a different power state independently to save energy, but a state transition causes both extra energy consumption and a transition delay. In a typical flash memory system, I/O requests are stored in an I/O queue maintained by the MTD layer. Wu [93] proposed a strategy that minimizes the number of state transitions by scheduling the I/O requests, thus reducing the energy and performance penalties of power state transitions.

Although the energy conservation of SSDs is not as large as might be expected, an SSD dissipates much less heat and is also more reliable than a magnetic disk drive, because it has no mechanical components (e.g. a moving disk head and rotating media). Seo et al. [81] analyzed the power consumption patterns of both SSDs and magnetic disk drives using the same benchmarks. For magnetic disk drives, random write is more energy-efficient than random read when the request size is small, because read-ahead for small random reads consumes extra energy without any performance gain, whereas write buffering and write reordering dramatically reduce head movement for random writes. An SSD consumes similar power for random and sequential reads, and the power consumption increases as the throughput grows; random reads on an SSD are therefore much more energy-efficient than on a magnetic disk drive. In contrast, the power consumption differs dramatically between random and sequential writes. The power demand for sequential writes increases with throughput, whereas random writes draw close to the device's maximum power all the time, regardless of the request size. Due to the write-once feature of flash memory, frequent writes incur frequent garbage collections, which introduces significant energy consumption.
Li et al. [64] revisited virtual memory system design from an energy standpoint, considering the limitations imposed by flash memory. They proposed to employ an SRAM cache to buffer frequent writes, to partition a page into sub-units and flush only the dirty sub-pages to flash memory, and to exploit the data redundancy between main memory and flash memory to reduce the writes incurred by garbage collection. Experimental results show that the combination of these three methods can save significant energy.

5. Discussions and conclusions

Magnetic disk drives offer high capacity at low price, which makes them suitable for cold data and archival storage. Flash memory disks provide non-volatile storage and high performance for hot and warm data. Nowadays, flash
