Final Report: FTL Algorithms for NAND Flash Memory Devices


Pooyan Mehrvarzi, Mujtaba Tarihi, and Arash Tavakkol

Abstract—The requirements for storage performance and capacity are ever-increasing. NAND flash-based Solid State Disks (SSDs) have been introduced as reliable and fast storage devices that respond to this market requirement. An important part of each SSD is its Flash Translation Layer (FTL), which strongly influences overall performance. The FTL manages the internal data layout of the storage while behaving like a block device towards the host. FTL design involves several tradeoffs. This report first introduces a number of existing FTL algorithms that were considered as the basis for the project; each explanation is followed by our ideas for enhancing that algorithm's performance. Finally, one of these ideas, applied to the DFTL algorithm, is selected for implementation and then evaluated using the Disksim simulator.

Index Terms—DFTL, Flash Translation Layer, NAND Flash, SSD.

1 INTRODUCTION

Flash technology has been in use for a long time. Before being used for main system storage, it was employed mostly in embedded systems and as small NVRAM. Recent advances in semiconductor technology, however, have allowed the construction of NAND flash-based main storage with much higher density at lower cost, making it feasible as a mass storage medium. Compared to mechanical disks, flash-based storage has lower power consumption, is resistant to shock due to having no moving parts, and provides much higher throughput for random access traffic, as it does not need to perform costly head-seek and spin-up operations.

A short comparison of the characteristics of an SSD and an HDD is given in Table 1. As can be seen, the random access latency of an SSD is much lower than that of an HDD. However, read and write operations are asymmetric, the number of write cycles is limited, and the erase operation takes a lot of time. This makes the use of NAND flash-based SSDs somewhat problematic.

Flash is based on trapping a charge in a floating-gate transistor, meaning that each programmable cell in the device must be erased before being written (the erase-before-write property), which takes a considerable amount of time. The erase operation also happens at a larger granularity than the basic unit of read/write, which is a page. As a result, the management of flash memory can be quite complex: not only does it require out-of-place updates, it also requires the controller to perform garbage collection and wear-leveling (uneven wear on the storage cells reduces the potential lifetime of the device). Garbage collection is necessary because changing even a single page in an erase unit forces an erase of the whole block, meaning that the valid data inside the block must first be moved to a new destination.

In order to hide the complexity of flash storage management from a system that is unfamiliar with flash storage or does not implement the required algorithms, a compatibility layer called the Flash Translation Layer (FTL) is used. Many works in the flash-related literature concentrate on FTL design and related issues such as FTL effects on performance, endurance, and energy consumption [4], [3], [5], [7], [6]. In this project, we selected a subset of the most well-known papers in this area to investigate their main ideas and possible avenues for improvement. Finally, we selected the work presented by Gupta et al., which proposes a Demand-based Flash Translation Layer (DFTL) that selectively caches address mappings [4] and avoids the need for large SRAM/DRAM cache buffers. DFTL changes the traditional FTL design and stores all the page-map data on the flash storage itself, using a caching mechanism, similar to the Translation Lookaside Buffer (TLB) in microprocessors, to minimize address translation time.

The rest of this report is organized as follows: Section 2 provides background material and related terminology of FTL design. Section 3 reports on our review process, including abstracts of the selected papers, possible ideas for improving the presented methods, and our final choice among these options. Section 4 is dedicated to a detailed description of the project goal and possible ways to improve DFTL, and Section 5 explains the details of the implementation process. Finally, the experimental results are presented in Section 6, and Section 7 concludes our work.

TABLE 1: Comparison of SSD and HDD properties [1].

Spin-up time — SSD: no need to spin up (almost instantaneous). HDD: may take several seconds, especially at start-up.

Random access time — SSD: about 0.1 ms. HDD: ranges from 5-10 ms due to seek and rotation time.

Read latency — SSD: generally low, because the data can be read directly from any location. HDD: generally high, since the mechanical components require additional time to get aligned.

Data transfer rate — SSD: typically ranging from about 100 MB/s to 500 MB/s. HDD: once the head is positioned, about 100 MB/s.

Acoustics — SSD: no moving parts, hence makes no noise. HDD: has moving parts (heads, spindle motor) and makes some sound.

Environmental factors — SSD: no moving parts, very resistant to shock and vibration. HDD: heads are susceptible to shock and vibration.

Installation and mounting — SSD: not sensitive to orientation, vibration, or shock; usually no exposed circuitry. HDD: may not be specified to operate in all orientations.

Magnetic fields — SSD: no impact on flash memory. HDD: magnets or magnetic surges could in principle damage data, though the platters are usually well screened inside the metal case.

Weight and size — SSD: small and light. HDD: relatively large and heavy.

Reliability and lifetime — SSD: no moving parts to fail mechanically, but the write cycles of each sector are limited to about 10^5. HDD: moving parts are subject to sudden catastrophic failure; over a long enough time, all drives will fail.

Cost per capacity — as of 2011, the cost per GB of NAND flash SSDs was considerably higher than that of HDDs.

Storage capacity — SSD: in 2011, sizes up to 2 TB were available, but less costly 64 to 256 GB drives were more common. HDD: in 2011, drives of up to 4 TB were available.

Read/write performance symmetry — SSD: write speeds significantly lower than read speeds. HDD: write speeds slightly lower than read speeds, but the two are generally assumed to be equal.

Power consumption — SSD: normally requires half to a third of the power of HDDs; high-performance DRAM-based versions require as much power as HDDs. HDD: the lowest-power 1.8-inch drives can use as little as 0.35 W; 2.5-inch drives typically use 2 to 5 W; the highest-performance 3.5-inch drives can use up to about 20 W.

2 BACKGROUND

As illustrated in Figure 1, an SSD is composed of different components that work together to provide a traditional block-based storage device to the host interface.

The SSD includes processing power, as well as local memory and firmware, to accomplish this purpose. One of the main limitations of flash chips originates from the asymmetric read/write mechanism: a data read can be performed quickly, but a data write not only takes more time, it also requires the target memory page to be erased first. A flash cell may be in one of two states, free or programmed, and a previously programmed cell must be erased before new data can be written. This erase-before-write property may worsen write performance by one to two orders of magnitude.

To alleviate this problem, flash designers organize data pages into groups of a certain size (16, 64, ...), called data blocks, where the erase operation is performed on all block members simultaneously. Using this structure, when a data page is updated, the new data value is written to an arbitrary empty page and the previous page is invalidated, to be reclaimed by a future erase operation. This way, the erase-before-write overhead can be hidden.

Fig. 1: SSD block diagram [8].

Roughly speaking, it is necessary to implement a mapping mechanism that translates logical page addresses (1) (LPA) to physical page addresses (PPA), as well as a garbage collection mechanism. Address mapping is a responsibility of the FTL. Based on the granularity of the mapping, FTL algorithms can be classified into three classes (a small sketch of the first two follows the list):

Block-mapping: The LPA is split into an LBN (2) and an offset. The LBN is translated to a physical block (an erase unit) on the flash storage. The diagram for this method can be seen in Figure 2a.

Page-mapping: As can be seen in Figure 2b, the LPA is used directly as an LPN, which is then translated using the mapping structures. Page-mapping has a much higher degree of freedom in the data layout: while in the block-mapping strategy a page must always have the same offset in its block, this is not the case for page-mapping. This freedom comes at a cost, however: page-mapping is much more expensive in terms of the space needed to hold the mapping structures.

Hybrid-mapping: This type of mapping provides a tradeoff between the two algorithms above. In hybrid-mapping, SSD pages are split into a log area and a data area; the log area is page-mapped and absorbs page updates, while the data blocks are block-mapped. When certain conditions are satisfied (e.g., the percentage of free blocks drops below a threshold), a merge operation is started: the valid data pages inside the log area are moved to the data area and space is freed. This algorithm reduces the amount of space needed to store the mapping information compared to the page-mapping algorithm, lowering the cost and power consumption of DRAM usage. On the other hand, it leads to a number of complicated tasks during the merge operation; for example, a block that is written to repeatedly causes repeated search and merge operations, sacrificing performance.

1. The address space as presented to the operating system is called the logical address space.
2. LBN: Logical Block Number.
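To make the first two schemes concrete, the following minimal C sketch shows how a logical page address would be translated under each of them. The table layouts and function names are our own illustration, not code from any of the surveyed FTLs.

    #include <stdint.h>

    #define PAGES_PER_BLOCK 64             /* example block size          */

    /* Block-mapping: only the logical block number is remapped; the page
     * keeps its offset inside the block (cf. Figure 2a). */
    uint32_t block_map_translate(const uint32_t *block_map, uint32_t lpa)
    {
        uint32_t lbn    = lpa / PAGES_PER_BLOCK;  /* logical block number */
        uint32_t offset = lpa % PAGES_PER_BLOCK;  /* fixed page offset    */
        return block_map[lbn] * PAGES_PER_BLOCK + offset;
    }

    /* Page-mapping: every page is remapped individually, so a page may
     * sit at any offset of any physical block (cf. Figure 2b). */
    uint32_t page_map_translate(const uint32_t *page_map, uint32_t lpa)
    {
        return page_map[lpa];                     /* LPN -> PPN directly  */
    }

The cost difference is visible in the table sizes: block-mapping needs one entry per block, while page-mapping needs one entry per page, i.e., PAGES_PER_BLOCK times as many entries.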

Fig. 2: The two main traditional mapping strategies. Both algorithms use a portion of the address to locate the physical address on the flash storage. (a) Block-mapping: the mapping is done at block (erase-unit) granularity, translating an LBN plus offset through the block map to a PBN. (b) Page-mapping: the mapping is done at page (read/write-unit) granularity, translating an LPN through the page map to a PPN.

As illustrated in Figure 3, merge operations are classified into three classes (a small classifier sketch follows the list):

1) Switch Merge: All the pages in a log block A are at their expected block offsets. In this case, block A is simply renamed to the data block B. With a proper implementation, these merges can serve sequential writes.

2) Partial Merge: The valid pages in log block A belong to data block B and are at the positions they should occupy. The remaining valid pages of B are simply copied over into A, and the old data block is erased. This is far less costly than a full merge.

3) Full Merge: The valid pages inside one or more log blocks belong to a data block and a partial merge is not possible. A new, free block must be allocated and then written to. A full merge can involve a large number of blocks and is the most expensive form of merge; it has a high chance of happening under heavy and random loads.
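The conditions above can be summarized in a small classifier. The sketch below is our own illustration of how an FTL might decide which merge a log block permits, assuming each log-block slot records the logical page it holds and a validity flag; it is not taken from any of the surveyed FTLs.

    #include <stdbool.h>

    #define PAGES_PER_BLOCK 64

    enum merge_kind { SWITCH_MERGE, PARTIAL_MERGE, FULL_MERGE };

    struct log_page { int lpn; bool valid; };    /* lpn == -1: still free */

    /* Decide the cheapest merge a log block allows for data block 'dbn'.
     * Simplifying assumptions: slots are filled in order from 0, and an
     * invalidated slot was superseded later in this same block. */
    enum merge_kind classify_merge(const struct log_page *pg, int dbn)
    {
        int used = 0;
        for (int i = 0; i < PAGES_PER_BLOCK && pg[i].lpn >= 0; i++, used++) {
            /* A valid page at the wrong offset, or belonging to another
             * data block, rules out switch and partial merges. */
            if (pg[i].valid &&
                (pg[i].lpn / PAGES_PER_BLOCK != dbn ||
                 pg[i].lpn % PAGES_PER_BLOCK != i))
                return FULL_MERGE;
        }
        /* Every slot filled in place: just rename the block. */
        return (used == PAGES_PER_BLOCK) ? SWITCH_MERGE : PARTIAL_MERGE;
    }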

Flash chips by default associate a metadata area with each page, also called the out-of-band area. Conventional mapping algorithms take advantage of this metadata to store error-correction data as well as mapping information. This form of storage for the mapping table is called inverse mapping, and it forces the FTL to scan the metadata to reconstruct the address mapping table every time the system powers up.

Fig. 3: The different forms of merge: switch, partial, and full.

3 LITERATURE REVIEW PROCESS

From the large body of papers on FTL design, we selected three major works from top computer architecture and storage systems conferences. The following subsections provide a short abstract of each paper and our suggestions and ideas to enhance these works.

3.1 DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings

Gupta et al. [4] presented DFTL at ASPLOS '09; it is based on the page-mapping algorithm.

The authors emphasize that in markets such as portable devices and enterprise applications, the performance of flash-based storage has not met expectations, so they choose to focus on random write performance, which is very important for enterprise workloads. They continue by pointing out that if the flash device is to use a page-mapped solution, a large amount of space is needed for the mapping, which would be very expensive in terms of cost and power if it were stored in SRAM (3). The authors also consider block-based mapping to be an inappropriate choice due to repeated block-level updates and heavy garbage collection overhead. In addition, the authors survey a list of state-of-the-art algorithms:

Block Associative Sector Translation (BAST): a block-mapping algorithm in which each log block is directly associated with a single data block.

Fully Associative Sector Translation (FAST): a block-mapping algorithm in which all log blocks can be associated with all data blocks.

Superblock FTL: a block-mapping algorithm that adds another level to the hierarchy by grouping consecutive blocks together. This method also tries to separate hot and cold blocks.

Locality Aware Sector Translation (LAST): a block-mapping algorithm that tries to enhance FAST by identifying sequential writes as well as hot and cold blocks. However, the authors argue that its method for detecting sequential writes is inefficient.

The authors claim that none of the mentioned algorithms provides appropriate performance; nevertheless, as a reference for the DFTL evaluation process, they selected FAST as a representative of this group of algorithms. Finally, they declare that a high-performance FTL should be completely redesigned by doing away with log blocks.

The DFTL design idea is similar to the memory-management units (MMU) inside microprocessors. The mapping structure is a two-level map that is stored in the flash data area, similar to how page tables are stored inside main memory, together with an SRAM-based cache, namely the Cached Mapping Table (CMT). The CMT is used to store recently used mapping entries and functions in a manner similar to a translation lookaside buffer (TLB).

Even though DFTL is based on page-mapping, its mapping structures are fundamentally different. In a normal page-mapping algorithm, inverse maps are stored inside the metadata of each page and record the logical addresses corresponding to the page/block; this metadata is then used to construct the direct maps at disk power-up.

3. The authors, however, do not consider the case where DRAM would be used to hold the translation entries.

DFTL, on the other hand, makes use of the main data area rather than the out-of-band area. When a certain logical page address must be translated, DFTL searches the CMT. If a cache hit occurs, the address is translated immediately. In the cache-miss case, the related mapping entry is found in the on-flash mapping table and loaded into the CMT. This way, only 2 MB of flash space is needed per 1 GB of storage (0.2% overhead), and a much smaller SRAM is required to store the CMT data.

One of the main functions of a cache is to take advantage of temporal locality. An often-used cache replacement policy is Least Recently Used (LRU), which selects the entry whose last access lies furthest in the past. To determine this entry, implementations usually use a counter: every time a cache miss occurs, an empty entry is selected and its counter is set to 0 while all the other counters are incremented; if an empty entry does not exist, the entry with the largest counter is selected as the victim. In the case of a cache hit, all counters with a value less than that of the hit entry are incremented, and the hit entry's counter is reset to 0. A variation of this method is the Segmented LRU, illustrated in Figure 4. As the name implies, Segmented LRU is composed of two segments; this design gives higher priority to entries that are accessed more than once than to entries that are not accessed again soon after their first access. The authors selected such a design for the CMT replacement policy (a small sketch follows).

Fig. 4: Diagram of the Segmented LRU.
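The following is a minimal C sketch of the Segmented LRU idea described above; the two fixed-size segments, their capacities, and the function names are our own simplification, not the actual CMT code.

    /* Entries enter the probationary ("ghost") segment and are promoted
     * to the protected ("real") segment on a second hit.  Each segment
     * is an array kept in order from MRU (index 0) to LRU. */
    #define GHOST_CAP 4                      /* probationary capacity  */
    #define REAL_CAP  4                      /* protected capacity     */

    struct slru {
        long ghost[GHOST_CAP], real[REAL_CAP];
        int  nghost, nreal;
    };

    static int find(const long *a, int n, long k)
    {
        for (int i = 0; i < n; i++)
            if (a[i] == k) return i;
        return -1;
    }

    /* Remove a[i], shifting the tail up; returns the removed key. */
    static long take(long *a, int *n, int i)
    {
        long k = a[i];
        for (; i < *n - 1; i++) a[i] = a[i + 1];
        (*n)--;
        return k;
    }

    /* Insert k at the MRU end; if the segment is full, evict its LRU
     * entry into *out and report the eviction. */
    static int put_front(long *a, int *n, int cap, long k, long *out)
    {
        int evicted = (*n == cap);
        if (evicted) *out = take(a, n, cap - 1);
        for (int i = *n; i > 0; i--) a[i] = a[i - 1];
        a[0] = k;
        (*n)++;
        return evicted;
    }

    /* Reference key k; returns 1 and sets *victim when an entry falls
     * out of the cache entirely. */
    int slru_access(struct slru *c, long k, long *victim)
    {
        long demoted;
        int i = find(c->real, c->nreal, k);
        if (i >= 0) {                        /* hit in protected segment */
            take(c->real, &c->nreal, i);
            put_front(c->real, &c->nreal, REAL_CAP, k, &demoted);
            return 0;
        }
        i = find(c->ghost, c->nghost, k);
        if (i >= 0) {                        /* second hit: promote      */
            take(c->ghost, &c->nghost, i);
            if (put_front(c->real, &c->nreal, REAL_CAP, k, &demoted))
                return put_front(c->ghost, &c->nghost, GHOST_CAP,
                                 demoted, victim);
            return 0;
        }
        /* first access: enter the probationary segment */
        return put_front(c->ghost, &c->nghost, GHOST_CAP, k, victim);
    }

An entry demoted from the protected segment re-enters the probationary segment rather than leaving the cache at once, which is what gives repeatedly accessed entries their priority.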

There are two kinds of blocks in DFTL: translation blocks, which store the logical-to-physical mappings, and data blocks, which store the actual data. The diagram of the translation steps can be seen in Figure 5.

Fig. 5: Diagram of the DFTL translation, consisting of the two-level lookup structure. On a CMT hit, the PPN is returned directly; on a CMT miss, the GTD is consulted to find the desired translation page, which is loaded into the CMT.

The logical address is the address requested by the host; the logical page number is generated by looking at the high-order address bits. At first, a lookup is made in the SRAM-based CMT cache. If there is a cache hit, the physical page number is returned to the controller. If there is a cache miss, the Global Translation Directory (GTD) is consulted to locate the translation block that contains the corresponding mapping entry. Then the entry is fetched, the mapping operation is performed, and the CMT is updated. If there is no space in the CMT, a victim entry is selected based on the Segmented LRU policy. If the victim entry has been changed (every write to a logical page modifies its cache entry, due to the out-of-place write semantics of flash), it has to be written back to flash, which leads to another traversal of the mapping hierarchy. The new entry is then saved in the CMT and the translation is returned to the controller.

To increase the performance of the DFTL algorithm, the GTD is always present in the SRAM cache. However, the authors suggest periodic flushing of the GTD to the flash storage to limit the amount of data lost in the case of a sudden power loss. To service write requests, two in-SRAM structures are actively used, namely the current data block and the current translation block. Incremental updates to these structures lead to sequential writes, which improves performance.

The worst-case performance occurs when a cache miss happens inside the CMT and the selected victim entry must be written back to flash. This leads to two reads and a write, as the sketch below illustrates.
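The translation path can be summarized in C. This is our own sketch of the control flow of Figure 5, with hypothetical helper names (cmt_lookup, gtd_lookup, and so on); the real DFTL implementation differs.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helpers standing in for the real DFTL machinery. */
    bool     cmt_lookup(uint32_t lpn, uint32_t *ppn);
    bool     cmt_insert(uint32_t lpn, uint32_t ppn,        /* true if a  */
                        uint32_t *victim_lpn, bool *dirty);/* victim out */
    uint32_t gtd_lookup(uint32_t lpn);          /* -> translation page   */
    uint32_t flash_read_translation_page(uint32_t tpage, uint32_t lpn);
    void     flash_update_translation_page(uint32_t lpn);  /* RMW write  */

    /* Translate one logical page number, as in Fig. 5. */
    uint32_t dftl_translate(uint32_t lpn)
    {
        uint32_t ppn, victim_lpn;
        bool victim_dirty;

        if (cmt_lookup(lpn, &ppn))              /* CMT hit: done         */
            return ppn;

        /* CMT miss: one flash read of the translation page found via
         * the in-SRAM GTD. */
        uint32_t tpage = gtd_lookup(lpn);
        ppn = flash_read_translation_page(tpage, lpn);

        /* Loading the entry may evict a victim; a dirty victim forces a
         * second traversal plus a translation-page write -- the worst
         * case of two reads and one write. */
        if (cmt_insert(lpn, ppn, &victim_lpn, &victim_dirty) && victim_dirty)
            flash_update_translation_page(victim_lpn);

        return ppn;
    }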

The authors claim that the overall performance of the FTL is still better than that of the competing algorithms, stating that in the read-dominated TPC-H benchmark, 97% of the evictions do not lead to this costly case. According to the authors, since temporal locality exists in real workloads, the performance hit due to this case can largely be averted.

As another idea to improve performance, the authors point out a lazy-write scheme in which multiple updates to the same translation page can be batched to avoid costly flash write time. It is also claimed that performance can be increased by postponing write requests that have no translation entry inside the CMT, adding that the tests indicate the benefits of this mechanism far outweigh the costs. However, no implementation details for these features are given in the paper.

A threshold is set, after which the garbage collection routines are activated. The garbage collection algorithm selects the block with the highest number of invalid pages; in other words, the algorithm is optimized for the highest space efficiency (a victim-selection sketch appears after the list below). During garbage collection, the valid pages of a target block are moved to a new destination, and the GTD must be updated as well.

In a nutshell, DFTL claims the following enhancements over current flash management techniques:

Full merges: As the mappings are kept at page granularity, full merge operations are never performed in DFTL.

Partial merges: DFTL stores pages accessed within a close window of time in the same physical blocks. The authors claim this leads to an implicit separation of hot and cold blocks. In addition, the benchmark results show that in most cases the number of partial merges done in DFTL is less than the sum of the numbers of full and partial merges in an algorithm like FAST.

Random write performance: As DFTL does not require full merges, random write performance is improved.

Block utilization: Page-based mapping uses the block space far more efficiently than a hybrid-mapping algorithm and requires fewer garbage collections, which leads to reduced overhead.
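The greedy victim selection described above amounts to a scan for the block with the most invalid pages. A minimal sketch follows; the per-block invalid-page counter is an assumption of ours, maintained elsewhere by the FTL on every page invalidation.

    /* Pick the garbage-collection victim greedily: the block with the
     * most invalid pages reclaims the most space per erase. */
    int gc_pick_victim(const int *invalid_count, int nblocks)
    {
        int victim = -1, best = 0;
        for (int b = 0; b < nblocks; b++) {
            if (invalid_count[b] > best) {
                best = invalid_count[b];
                victim = b;
            }
        }
        return victim;    /* -1 if nothing is reclaimable */
    }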

The paper also includes a classification of FTL algorithms, as shown in Table 2.

TABLE 2: Classification of FTL algorithms, presented in [4].

For evaluation, DFTL was implemented under Disksim 3.0 using a flash model different from Microsoft's ssdmodel patch. The source code was obtained from the Pennsylvania State University website [10]; upon inspection of the source code, a number of issues were observed:

The model used for the SSD comprises only a single plane. The authors admit this on the download page. This is in stark contrast with current SSDs, which contain multiple planes among other levels of parallelism. The authors postpone the implementation of a realistic SSD model to future work.

No support for the lazy-write idea has been added to the FTL; the implementation only counts the number of times a lazy write could have been performed.

Four workloads were used in the evaluation:

Financial: a write-dominated OLTP trace obtained from a financial institution.

Cello99: a write-dominated access trace obtained from an HP-UX server.

TPC-H: a read-dominated decision-support benchmark.

Web Search: a read-dominated trace obtained from a web server.

The evaluation shows that, due to poorer space utilization, FAST has a far higher number of erase operations compared to DFTL. Moreover, FAST performs very complex full merge operations in OLTP workloads, as depicted in Figure 6. According to the presented results, address translation incurs about 90% of DFTL's extra overhead in most workloads when compared with pure page-mapping; overall, however, DFTL still performs fewer operations than FAST, as can be seen from the benchmark results. The authors also report a 63% hit ratio for the CMT in the financial trace. In the response-time results, DFTL nearly matches the pure page-mapping algorithm and is almost always superior to FAST.

The authors of DFTL also perform a microscopic analysis of the overloaded regions of the workload, comparing the DFTL and FAST overheads in these regions. The results indicate uneven response times for FAST compared to DFTL.

Fig. 6: Number of full merges under OLTP load in FAST with 3% log provisioning [4].

It must be noted, however, that the authors of the FAST algorithm later [7] heavily criticized the choice of 3% log provisioning in the evaluation process; as they state, such a small size is completely inadequate.

The DFTL performance is also presented as hit ratio and response time over SRAM cache size, although the total size of the memory is not indicated. As expected, the hit ratio reaches 100% after a certain cache size, which is interpreted as the working-set size of the application.

Ideas on The Paper

PRAM technology could be used to hold the translation entries.

The proposed lazy-write policy is not evaluated, and different methods could be used to improve its performance.

Blocks with exceptionally high write rates could be cached in the SRAM buffer; that is, hot and cold pages could be separated explicitly by leveraging simple mechanisms inside the GTD.

Delaying writes can lead to inconsistency in the state of the storage device as a whole; a reasonable mechanism should be designed to prevent such states.

The paper claims an implicit separation of hot and cold blocks, but garbage collection can actually mix these two block types. We could evaluate the impact of GC on such block types and design a GC policy aware of them.

Could multiple erase blocks be selected as current write blocks to improve performance?

The DFTL performance under mixed application types, or under traces of an ordinary desktop system including different shapes of traffic, should be investigated.

3.2 FASTer FTL for Enterprise-Class Flash Memory SSDs

Lim et al. presented this paper in an attempt to respond to the criticism directed at the FAST FTL by the authors of other papers (DFTL in particular) [7]. They also tried to improve the performance of FAST by introducing the FASTer FTL.

FAST is an FTL based on the hybrid-mapping algorithm. The name FAST stands for Fully Associative Sector Translation, meaning that each log block can hold updates directed at different addresses (4). The log block to be merged is selected in a round-robin manner.

The authors of DFTL evaluated their solution against FAST as a representative of state-of-the-art hybrid FTL algorithms. They concluded that FAST is inappropriate for OLTP workloads and is not able to separate hot and cold blocks, which leads to an extra number of merges. The FAST authors responded that, as the log blocks are reclaimed in a round-robin manner, each log block has a window of time in which it can be overwritten: invalidating a page of the log before it is reclaimed, by adding a new version in another log block, exempts that page from being merged. The authors defined the log window size as:

log window size = (number of log blocks) x (number of pages per block)

The log window size thus increases with the size of the log area. FAST can implicitly detect hot pages and inhibit their merge: any page that is updated more often than once per log window will be updated again before being reclaimed and hence remains in the log area.

The authors go on to heavily criticize the choice of 3% log provisioning used in the evaluations of the DFTL paper [4], showing that FAST achieves much higher throughput under OLTP workloads when the log provisioning level is increased. Moreover, they state that log provisioning is a tradeoff between performance and cost. A larger log window size means that FAST has a much better ability to separate hot and cold blocks. As illustrated in Figure 7, the number of blocks involved in a merge declines as the percentage of provisioning increases; in particular, as the provisioning increases to 15%, the number of data blocks involved in a full merge decreases drastically.

4. A number of associativity formats were explored by the DFTL paper.

Fig. 7: The impact of the log area size on the number of blocks involved in a merge operation [7].

The paper then focuses on improving the OLTP write performance of the FAST algorithm by introducing the FASTer algorithm. The authors characterize the OLTP workload as including a lot of small operations over a large dataset, with an uneven access frequency across blocks. Data was obtained from the TPC-C benchmark.

The authors studied the impact of the write interval on FASTer's operation, where the write interval is defined as the number of requests made to other pages between two consecutive accesses to the same page. Pages are classified based on their average write interval length in comparison with the log window size, giving three groups of pages: hot, warm, and cold. Warm pages have a write interval near the size of the log window, while hot and cold pages have shorter and longer write intervals than the log window size, respectively (a classification sketch follows the policy list below).

FASTer is developed to perform better and to reduce the variance in response time without needing extra resources. Two policies have been added to the FAST algorithm:

Second Chance Policy: If a log block is about to be reclaimed, it is given a second chance (if it has not been given one already) and its live pages are moved to another log block. This effectively doubles the window size: warm pages that have a longer interval than the log window size, but not longer than the now-doubled window size, get a chance to be overwritten and invalidated before they are merged into the data blocks.

Isolation area: Cold pages have a longer interval than even the doubled log window size. FASTer moves the cold pages to a part of the log space, called the isolation area, where they are progressively merged, a few cold pages at a time, whenever a new request arrives.

A diagram displaying the function of these two policies can be seen in Figure 8. However, implementation details of the progressive merges are not specified in the paper.
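The hot/warm/cold classification can be stated directly in code. The sketch below is our reading of the paper's definitions; the tolerance band that decides what counts as "near" the window size is our own assumption, as the paper gives no exact margin.

    enum page_temp { PAGE_HOT, PAGE_WARM, PAGE_COLD };

    /* Classify a page by its average write interval relative to the log
     * window (= number of log blocks x pages per block). */
    enum page_temp classify(double avg_write_interval, double log_window)
    {
        const double band = 0.25;            /* assumed tolerance        */
        if (avg_write_interval < (1.0 - band) * log_window)
            return PAGE_HOT;                 /* rewritten inside window  */
        if (avg_write_interval > (1.0 + band) * log_window)
            return PAGE_COLD;                /* outlives even the window */
        return PAGE_WARM;                    /* near the window size     */
    }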

Fig. 8: The second chance policy and the isolation area in the FASTer FTL algorithm.

The benchmarks compare the baseline page-mapping scheme, FMAX, FAST, and FASTer. FMAX is a block-based algorithm in which a number of blocks are associated with a single virtual block. If a write is directed at a certain page of that virtual block, the list of associated blocks is searched: if the slot at the same offset is free in one of them, the write is done to the first such free slot; otherwise, a new block is added to the list. During garbage collection, the valid data in the list is collected into a single block and all the blocks in the list are reclaimed. As can be expected, FMAX is vastly inferior to the other algorithms. The authors claim the FASTer algorithm is able to improve performance over FAST by 30%-35%.

A major problem with the results presented by the FASTer authors is that, in the charts, a provisioning percentage is defined for the page-mapping algorithm as well as for FAST and FASTer. However, provisioning is only defined for hybrid-mapping algorithms, a confusion the paper leaves unresolved.

Ideas on The Paper

Evaluate FASTer under different workloads; moreover, compare its performance with DFTL.

The evaluation should be done against the real page-mapping algorithm. The page-mapping algorithm used in the paper's comparisons gives different results for different log-provisioning area sizes, while real page-mapping does not require a log area at all.

The distribution of the number of blocks involved in a full merge vs. the percentage of log area size does not indicate the behavior of the overall storage system; this may be important in the performance evaluation process.

Apply a scheme based on the ideas of this paper (such as the second chance policy and the isolation area) to the traditional page-mapping algorithm and investigate the possible effects.

Use a cache to keep the hot blocks.

Use PRAM technology to implement the log area.

3.3 Design of Flash-based DBMS: An In-Page Logging Approach

Lee and Moon [5] intend to modify a database management system (DBMS) in such a way that it can take advantage of NAND flash instead of magnetic disks. The authors suggest a new scheme for buffer and storage management and evaluate it, claiming that the method overcomes the shortcomings of flash storage while being able to take advantage of flash properties. For example, NAND flash-based disks show uniform response times, which is not possible with magnetic disks. Nevertheless, alleviating the erase-before-write limitation is necessary to provide better endurance and write response times.

The authors performed a comparison between an existing hard disk and a NAND flash-based SSD; the results are shown in Table 3. While the performance values of the SSD do not differ between random and sequential reads, its performance varies drastically between random and sequential write loads, surpassing even the performance variation of magnetic disks. This demonstrates the sensitivity of flash memory to write patterns, making it potentially risky to simply replace magnetic disks with flash drives.

TABLE 3: Comparison of DBMS performance: sequential vs. random [5].

As the name implies, In-Page Logging (IPL) uses a log region to absorb page updates. This way, the original page and its updated version exist simultaneously and are merged when a predefined condition is met. After covering the characteristics of flash storage, the authors highlight a number of points as their design manifesto:

The flash memory has a uniform access pattern, allowing the log and data to be allocated anywhere.

Due to the erase-before-write limitation, it is desirable to avoid write operations as much as possible, even if this leads to a higher number of reads.

Limit the modifications of the DBMS design to the buffer and storage managers.

Performing the logging in a sequential manner favors the operation of mechanical disks, whose sequential performance is much higher. Some approaches go as far as making the whole disk a log (log-structured filesystems); however, reading the contents of a file then requires scanning the content of the disk, as the content may be scattered. As flash-based storage has a uniform memory access latency, it allows the log writes to be scattered all over the flash without a performance penalty for reads.

The approach selected for IPL is to co-locate the data and the logs: the pages of a block are split between data and log, and each log page is divided into a number of sectors, where each sector can contain changes to the data pages in the same block. The modified DBMS still writes the data inside memory the same way it used to, but it also maintains an in-memory log recording the changes. There is no need to write the dirty data back to flash, as the in-memory log contains all the modifications; furthermore, modifications are not immediately committed to the in-flash log, allowing the scheme to avoid writes as much as possible.

Figure 9 shows an example of how IPL may be implemented: a 128 KB data block is split into 15 8 KB data pages, and the last page is used to store 512-byte log sectors. When a block runs out of log space, a merge operation is performed on the data and log pages.

Fig. 9: The design of In-Page Logging (IPL) [5].

As a result of the IPL scheme, read operations become more complicated. When reading from a block, the log sectors must be read as well, to search for possible updates and construct the actual content. This leads to higher overhead, but the authors argue that as IPL avoids writes and erases, overall performance is improved despite the cost of the extra reads. The method requires the allocation of an in-memory log area, but the authors estimate the extra memory cost associated with the method at only 1.3%.

A merge operation takes a free erase unit, uses the contents of the data and log information inside a block to construct the actual data, writes it to the free erase unit, and then erases the old block (a sketch follows at the end of this subsection). The mapping structures have to be updated as well, in order to point to the new block address. The merge operation can be seen in Figure 10.

Fig. 10: The IPL merge operation [5].

For evaluation, a number of specific queries measuring the read and write performance are tested, and a workload is also generated for the TPC-C benchmark. As the size of the log area is increased, the number of merge operations decreases, but the total amount of data that can be stored in this scheme is reduced as well (5). The authors attribute this effect to the uneven distribution of updates: hot pages use up their allocated log space far faster, leading to an increase in merge and erase operations, and thus the increase in log area was effective in decreasing the number of merge operations.

5. Unlike the FAST algorithm, the log area is not a storage area for data but a form of journal, leading to a lower capacity for the flash device. For example, the arrangement in Fig. 9 holds 6.25% less data than the total flash capacity.
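The merge of Figure 10 can be sketched as follows, under the sizes assumed in Fig. 9 (15 data pages of 8 KB plus one page of 512-byte log sectors). The log record format and the device helpers are hypothetical; the paper does not specify them.

    #include <stdint.h>
    #include <string.h>

    #define DATA_PAGES  15
    #define PAGE_SIZE   8192
    #define LOG_SECTORS 16          /* 16 x 512 B in the last 8 KB page */

    /* Hypothetical log record: an in-page byte range and its new bytes. */
    struct ipl_rec { uint8_t page; uint16_t off, len; const uint8_t *data; };

    /* Assumed device helpers. */
    void flash_read_page (int blk, int page, uint8_t *buf);
    void flash_write_page(int blk, int page, const uint8_t *buf);
    void flash_erase_block(int blk);
    int  read_log_records(int blk, struct ipl_rec *recs); /* from log page */

    /* Merge block 'old_blk' into the free block 'new_blk' (cf. Fig. 10). */
    void ipl_merge(int old_blk, int new_blk)
    {
        static uint8_t buf[PAGE_SIZE];
        struct ipl_rec recs[LOG_SECTORS * 8];   /* assumed upper bound    */
        int n = read_log_records(old_blk, recs);

        for (int p = 0; p < DATA_PAGES; p++) {
            flash_read_page(old_blk, p, buf);
            for (int i = 0; i < n; i++)         /* replay this page's log */
                if (recs[i].page == p)
                    memcpy(buf + recs[i].off, recs[i].data, recs[i].len);
            flash_write_page(new_blk, p, buf);  /* up-to-date page out    */
        }
        flash_erase_block(old_blk);             /* old block becomes free */
        /* The block-mapping entry must now be redirected to new_blk.     */
    }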

The paper also discusses the possibilities for recovery support in IPL. The log information can be used for recovery purposes; however, it is necessary for extra information, such as the state of transactions, to be saved by the DBMS. A database recovery mechanism is also forced to make use of a log or journal, constantly forcing writes to the media. The authors claim they can force writes to the flash device with little performance loss (due to the access-time characteristics of flash storage), and, unlike recovery mechanisms that are forced to perform REDO on the operations in the journal, such an operation is implicitly performed by IPL. However, aborting transactions will still be a complicated operation under IPL.

Ideas on The Paper

The following list presents different ideas to improve IPL performance:

Make the log region size change dynamically: if the log page is full, a free page of the block is added to the log region.

The above suggestion may evolve into another idea, in which the blocks of the flash memory are partitioned into classes, where the blocks of each class have a particular log region size. In fact, we could use the hot and cold partitioning of the DFTL algorithm: blocks with a smaller log region would hold cold data pages, and blocks with a larger log region would hold warm and hot pages.

Another possibility is to balance block updates by positioning hot and cold pages together inside a single block. This way, the update rate of every block converges to a moderate or small value.

4 SELECTED PAPER AND MAIN PROJECT IDEAS

Due to technical and implementation reasons, we selected our idea of using PRAM in DFTL, namely PDFTL, to improve the performance of this algorithm. This section provides the details of the idea, and the following sections provide the details of the implementation process and the experimental results.

4.1 Using PRAM in The Implementation of DFTL

If the page mappings are kept inside flash memory, then, since flash requires the FTL to do out-of-place updates, not only are extra costs incurred by having to do read-modify-write operations on the translation blocks, but the cells used for translation also wear out, so wear-leveling issues must be addressed for them as well. PRAM, on the other hand, supports in-place updates and is far more durable than a flash device. In addition, PRAM read/write times are much lower, and the designer has freedom in choosing the read/write unit size on PRAM.

4.2 PRAM Technology

The memory cells of PRAM are constructed using a chalcogenide-based material such as Ge2Sb2Te5 or Ge2Sb2Te4 (GST) [12]. The data storage basis of a PRAM cell is its electrical resistivity: the resistivities of the crystalline and amorphous states of chalcogenide materials differ greatly and can be used to represent logical values. The amorphous state shows high resistance and is assumed to represent logical 0, while the crystalline state shows low resistance and is assumed to represent logical 1. The state of the GST is changed by the SET/RESET processes [13]:

RESET: A high current is conducted through the GST for a short period of time; the GST is heated to a high temperature, it melts, and the crystalline structure is destroyed. The melted material then freezes into the amorphous state, which shows high electrical resistance.

SET: To change the amorphous material back to the crystalline state, a low current of a constant value is conducted through the GST for a long period of time. This way, the GST crystalline structure is formed, which shows low electrical resistance.

Currently, PRAM is used in small sizes due to its high cost and manufacturing limitations. There are different works in the literature leveraging small PRAM blocks to improve the performance of NAND flash-based SSDs [3]; in this work, we use a similar idea to improve DFTL performance. Table 4 provides a short comparison of PRAM and NAND flash properties, as presented in [3].

TABLE 4: Comparison of PRAM and NAND-flash memory technologies [3].

4.3 Using PRAM to Store Mapping Information

As previously mentioned, we are trying to use PRAM to store the translation blocks of DFTL. There are challenges in implementing this idea:

PRAM Data Transfer Unit Size (PDUS): In a normal NAND flash-based SSD, all transfers are performed in page units (2 KB or larger). However, when utilizing PRAM, we have the chance to use different values for the PDUS, since PRAM banks of different word lengths can be manufactured. The selection of the PDUS has a great impact on overall performance; hence, we do not insist on any pre-determined value, so that we can test PDFTL under different sizes and study its behavior under various traffic traces.

PRAM endurance: As mentioned in Table 4, PRAM is limited to about 10^8 write cycles, and any design using this technology must consider endurance and lifetime issues. To overcome this problem, we suggest using the simple and easily implementable Start-Gap wear-leveling algorithm [11], which we describe in the following section. We then only have to guarantee that the number of updates in the whole PRAM is not more than 10^3 times higher than the number of writes in the flash storage. This is directly related to the choice of PDUS. In fact, if the PDUS is equal to one mapping entry (e.g., one 32-bit word at a time), then the number of updates of the PRAM cells is equal to that of the flash storage cells, and hence there is no doubt about PRAM endurance. However, larger choices of PDUS may result in higher update rates; as an example, when PDUS = 128, updating a single mapping entry may lead to the update of 127 non-updated entries. If we guarantee the following condition, where 10^3 is the ratio of the PRAM and flash write endurances (10^8 / 10^5):

PDUS <= 10^3    (1)

then the PRAM endurance problem is deterministically solved (the check is restated in code below). Nevertheless, in cases where Equation (1) is not satisfied, we can still guarantee PRAM endurance by leveraging special methods such as lazy writes.
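The endurance argument reduces to a one-line check. The sketch below restates Equation (1) in code; the constants come from Table 4, and the worst-case assumption (every flash write rewrites a whole PDUS-sized group of mapping entries) is ours.

    #include <stdbool.h>

    #define PRAM_WRITE_CYCLES  100000000UL   /* ~10^8, Table 4          */
    #define FLASH_WRITE_CYCLES 100000UL      /* ~10^5, Table 4          */

    /* Equation (1): PDUS <= 10^3.  In the worst case, one flash write
     * rewrites PDUS mapping entries, so the PRAM wears PDUS times
     * faster; it still outlives the flash as long as its endurance
     * advantage (10^8 / 10^5) covers that factor. */
    bool pdus_is_safe(unsigned long pdus_entries)
    {
        return pdus_entries <= PRAM_WRITE_CYCLES / FLASH_WRITE_CYCLES;
    }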

As mentioned above, we suggest using the Start-Gap algorithm to guarantee PRAM endurance; this section specifies the details of the algorithm.

The Start-Gap Algorithm

Introduced by Qureshi et al. [11], Start-Gap specifies an algorithm for address mapping as well as wear-leveling of PRAM. An important design decision in this algorithm was to avoid the use of table-based structures. Table-based methods, such as the block-mapping and page-mapping algorithms, use a table to hold the wear counters (6) as well as the mapping information. The amount of space required for these structures increases linearly with the data capacity, meaning that a lot of memory space would be wasted. Unlike flash-based storage, PRAM allows in-place updates, meaning that the implementation can be far less complicated; PRAM is also expected to have a far greater endurance than flash-based storage.

Start-Gap is implemented on an architecture where a 16 GB PRAM is used as main memory with a 256 MB write-back DRAM cache. The goal is to create a method that uses algebra and a minimal amount of storage to perform wear-leveling and address mapping.

Fig. 11: The Start-Gap algorithm, taken from [11].

The memory in Start-Gap is organized into multiple lines. Assuming there are N lines in the PRAM to use, there is one extra line at the (N+1)-th position, the gap line. This means that there are always N+1 lines of storage, where N lines hold data and one line is empty, a.k.a. the gap. The algorithm also makes use of two registers, the start and the gap registers. The organization process is as follows (see Figure 11 for an illustration):

Initially, the lines are in order and the gap line is empty. The start register points to the first line and the gap register points to the last line (the gap line).

Every ψ writes, the gap is moved one place towards the top; in other words, the contents of the line above the current gap position are written to the gap, moving the gap upwards. Initially, the value of 100 is chosen for ψ; the choice of an appropriate value is discussed later.

6. It should be noted that the erase counter is always kept per block, taking considerably less space than per-page data.

The gap pointer eventually wraps around, pointing to the last line again (the (N+1)-th line). At this point, the start register is moved down one spot.

In the described algorithm, the method for calculating the physical address from the logical address is specified in Equation (2):

physical address = (start register + logical address) % N    (2)

The physical address is also incremented by 1 if the gap position is less than or equal to the initially calculated address; in other words, we have to account for the gap displacing the lines that follow it by one slot. A sketch of this remapping follows.
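The sketch below implements Equation (2) with the gap adjustment, following the formulation in [11]; the register names and the struct layout are ours.

    #include <stdint.h>

    /* State of one Start-Gap region: N data lines plus one gap line. */
    struct start_gap {
        uint32_t n;       /* number of data lines (N)                     */
        uint32_t start;   /* start register: rotation of the whole region */
        uint32_t gap;     /* gap register: index of the empty line        */
    };

    /* Equation (2) plus the gap adjustment. */
    uint32_t sg_remap(const struct start_gap *sg, uint32_t logical)
    {
        uint32_t pa = (sg->start + logical) % sg->n;
        if (pa >= sg->gap)      /* lines at or past the gap sit one lower */
            pa++;
        return pa;              /* result in 0..N, skipping the gap line  */
    }

    /* Every psi writes: copy the line above the gap into the gap slot
     * and move the gap up; when it wraps, advance the start register.
     * (The actual PRAM line copy is omitted here.) */
    void sg_move_gap(struct start_gap *sg)
    {
        if (sg->gap == 0) {
            sg->gap   = sg->n;                  /* gap back to the bottom */
            sg->start = (sg->start + 1) % sg->n;
        } else {
            sg->gap--;                          /* data[gap] <- data[gap-1] */
        }
    }

After N+1 gap movements, every line has been shifted by one slot, which is exactly the rotation that the start register records.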

The authors estimate the average endurance of a PRAM-based memory without any wear-leveling to be around 5% of the maximum possible endurance under their average workload. The basic Start-Gap implementation yielded 50% of the maximum possible lifetime (with a ψ of 100). The authors then set out to find the reason for this difference: an analysis of the access frequency of different regions displays an uneven frequency, with writes being focused on certain areas. As the Start-Gap method can only move the gap in one direction, this results in regions being worn out unevenly, which is undesirable.

Fig. 12: Feistel networks, one of the shuffling mechanisms used in conjunction with Start-Gap [11].

The first solution selected by the authors is to rearrange the address space in a random or semi-random manner. This is done by adding an intermediate stage of address translation just before the physical address, where the addresses are shuffled in a reversible manner. These shuffling algorithms always rearrange the addresses in the same order, meaning that their parameters are fixed. Two shuffling methods are introduced:

Feistel networks: These networks were used by IBM as part of the DES algorithm. A three-stage network (Figure 12) is used in the Start-Gap study. At every stage, the input is split into two halves; one half is fed, together with a key, to the round function, and the result is XOR-ed with the other half. The half that is used by the round function alternates in successive stages. The address is shuffled using this network in a reversible manner, and the keys are static throughout the network's operation (a sketch of such a network closes this subsection).

Random Invertible Binary Matrix (RIB): The address is arranged as a vector and is multiplied by a static, random, invertible matrix (7). As this matrix is invertible, the target address can be used to regenerate the source address.

The evaluation results indicate that Start-Gap with either Feistel networks or RIB comes very close to the maximum possible endurance.

The authors then focus on the possibility of a user intentionally making continuous writes to a fixed address with the purpose of exhausting the lifetime of a certain region, thus reducing the lifetime of the storage; with no wear-leveling, the storage would be exhausted in a surprisingly short time. The authors consider the tuning of ψ to resist this attack. To this end, they split the PRAM address space into multiple regions, each managed by a separate Start-Gap implementation, with K lines in each region. If the gap is to move past a block before the malicious user can wear it out, Equation (3) must be satisfied:

K < W_max / ψ    (3)

The amount of extra space needed by this implementation is negligible. A Delayed Write Policy can additionally merge multiple outstanding writes to the same address by using a queue to hold the recent write requests; the number of requests that can be merged into a single request is called the Delayed Write Factor (DWF). The Delayed Write Policy effectively multiplies the amount of time needed to exhaust the write endurance of the PRAM by the DWF.

7. A binary matrix is invertible if and only if its determinant is non-zero.
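As an illustration of the Feistel shuffling stage, here is a minimal three-stage network over a 32-bit line address. The round function and the keys are placeholders chosen by us, not those used in [11]; [11] only requires a fixed, reversible permutation.

    #include <stdint.h>

    /* Placeholder round function: any deterministic mixing works, since
     * Feistel reversibility does not depend on F being invertible. */
    static uint16_t feistel_f(uint16_t half, uint16_t key)
    {
        uint16_t x = (uint16_t)(half ^ key);
        x = (uint16_t)(x * 40503u);        /* arbitrary odd multiplier */
        return (uint16_t)(x ^ (x >> 7));
    }

    /* Three-stage Feistel network over a 32-bit address.  Each round is
     * invertible (an XOR with a recomputable value), so the shuffle is a
     * bijection on the address space; the keys stay fixed for the whole
     * lifetime of the device so the mapping never changes. */
    uint32_t feistel_shuffle(uint32_t addr, const uint16_t key[3])
    {
        uint16_t left  = (uint16_t)(addr >> 16);
        uint16_t right = (uint16_t)addr;
        for (int round = 0; round < 3; round++) {
            uint16_t tmp = right;
            right = (uint16_t)(left ^ feistel_f(right, key[round]));
            left  = tmp;                   /* alternate halves per stage */
        }
        return ((uint32_t)left << 16) | right;
    }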

5 IMPLEMENTATION PROCESS

We use Microsoft's ssdmodel patch for the Disksim simulator to implement the ideas mentioned in Section 4, and we test our implementation using IOZone, Postmark, TPC-C, and TPC-E traces. In this section, we explain the main changes made to the ssdmodel code to implement DFTL.

5.1 DFTL Implementation in Disksim 4.0

Since ssdmodel is based on a normal page-mapping architecture, we only have to add special code for the CMT mechanism. To this end, we downloaded the DFTL source code provided for Disksim 3.0 [10] and annotated it to understand its main implementation ideas. As we mentioned before, the DFTL sources are based on very simple assumptions about SSD operation and are completely different from Microsoft's model. We added a pair of header and source files, namely cmt.h and cmt.c, to the ssdmodel sources. In these files, we implemented the functions and variables corresponding to CMT operation. The next subsections are dedicated to the description of the contents of these files.

Contents of cmt.h

Here, we define the important parameters of the cmt.h file and ignore secondary and less important ones that do not help in understanding the implementation method. The parameters can be classified into four groups:

Cache status parameters: This set of parameters is related to the physical and logical properties of the cache. As depicted in Figure 13a, there are three parameters related to the cache entry status, namely:

MAP_INVALID
MAP_REAL
MAP_GHOST

These values correspond to the Segmented LRU replacement policy that was depicted in Figure 4. In addition, there are three parameters defining the CMT characteristics:

MAP_REAL_MAX_ENTRIES_DFTL_VALUE
MAP_GHOST_MAX_ENTRIES_DFTL_VALUE
MAP_ENTRIES_PER_PAGE_DFTL

The other parameters listed in Figure 13a are used in the cache management process.

Flash physical attribute parameters: This set of parameters contains information about physical flash attributes, such as the page read latency, the page write latency, and the number of mapping entries inside each page. Figure 13b illustrates the definition of this set of parameters.

Fig. 13: Different parts of the cmt.h file: (a) cache-related parameters, (b) NAND flash physical attribute parameters, (c) cache entry structures, (d) statistics-gathering parameters.

Cache entry structures: This set of constructs defines the data structures and arrays used to simulate the cache behavior (Figure 13c):

opm_entry: This structure defines a single mapping-table entry; such an entry can belong either to an on-flash translation block or to the CMT.

omap_dir: This structure is used to simulate the GTD behavior.

ghost_arr_DFTL: This array holds the CMT entries on the ghost side (see Figure 4).

real_arr_DFTL: This array holds the CMT entries on the real side (see Figure 4).

Statistical parameters: This set of parameters is used to calculate statistical information such as the CMT hit/miss ratio and the average response time.

Contents of cmt.c

In this section, we describe the main contents of the cmt.c file, which simulates the DFTL operation. The following list describes them:

opm_init: As depicted in Figure 14a, this function is used to initialize the variables and structures.
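To give a concrete picture of the constructs listed above, here is a plausible C reconstruction of the core declarations of cmt.h. The field names, types, and sizes are our guesses from the descriptions in this section, not the actual file contents.

    /* cmt.h -- illustrative reconstruction, not the actual file. */

    #define MAP_INVALID 0                 /* entry not cached              */
    #define MAP_REAL    1                 /* protected segment of the SLRU */
    #define MAP_GHOST   2                 /* probationary segment          */

    #define MAP_REAL_MAX_ENTRIES_DFTL_VALUE   2048   /* assumed sizes */
    #define MAP_GHOST_MAX_ENTRIES_DFTL_VALUE  512
    #define MAP_ENTRIES_PER_PAGE_DFTL         512    /* 2 KB page / 4 B */

    /* One mapping-table entry: a line of an on-flash translation page
     * or a CMT entry. */
    typedef struct {
        unsigned ppn;                     /* physical page number          */
        char     state;                   /* MAP_INVALID / REAL / GHOST    */
        char     dirty;                   /* must be written back on evict */
    } opm_entry;

    /* Global Translation Directory: one physical page number per
     * on-flash translation page. */
    typedef struct {
        unsigned *tpage_ppn;              /* indexed by translation page   */
    } omap_dir;

    /* CMT membership, ordered for the Segmented LRU of Fig. 4. */
    extern int ghost_arr_DFTL[MAP_GHOST_MAX_ENTRIES_DFTL_VALUE];
    extern int real_arr_DFTL[MAP_REAL_MAX_ENTRIES_DFTL_VALUE];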


Understanding endurance and performance characteristics of HP solid state drives

Understanding endurance and performance characteristics of HP solid state drives Understanding endurance and performance characteristics of HP solid state drives Technology brief Introduction... 2 SSD endurance... 2 An introduction to endurance... 2 NAND organization... 2 SLC versus

More information

File System Management

File System Management Lecture 7: Storage Management File System Management Contents Non volatile memory Tape, HDD, SSD Files & File System Interface Directories & their Organization File System Implementation Disk Space Allocation

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Nasir Memon Polytechnic Institute of NYU

Nasir Memon Polytechnic Institute of NYU Nasir Memon Polytechnic Institute of NYU SSD Drive Technology Overview SSD Drive Components NAND FLASH Microcontroller SSD Drive Forensics Challenges Overview SSD s are fairly new to the market Whereas

More information

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/)

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/) COS 318: Operating Systems Storage Devices Kai Li Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Today s Topics Magnetic disks Magnetic disk performance

More information

With respect to the way of data access we can classify memories as:

With respect to the way of data access we can classify memories as: Memory Classification With respect to the way of data access we can classify memories as: - random access memories (RAM), - sequentially accessible memory (SAM), - direct access memory (DAM), - contents

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

SOLID STATE DRIVES AND PARALLEL STORAGE

SOLID STATE DRIVES AND PARALLEL STORAGE SOLID STATE DRIVES AND PARALLEL STORAGE White paper JANUARY 2013 1.888.PANASAS www.panasas.com Overview Solid State Drives (SSDs) have been touted for some time as a disruptive technology in the storage

More information

Boosting Database Batch workloads using Flash Memory SSDs

Boosting Database Batch workloads using Flash Memory SSDs Boosting Database Batch workloads using Flash Memory SSDs Won-Gill Oh and Sang-Won Lee School of Information and Communication Engineering SungKyunKwan University, 27334 2066, Seobu-Ro, Jangan-Gu, Suwon-Si,

More information

Offline Deduplication for Solid State Disk Using a Lightweight Hash Algorithm

Offline Deduplication for Solid State Disk Using a Lightweight Hash Algorithm JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.5, OCTOBER, 2015 ISSN(Print) 1598-1657 http://dx.doi.org/10.5573/jsts.2015.15.5.539 ISSN(Online) 2233-4866 Offline Deduplication for Solid State

More information

Accelerating Server Storage Performance on Lenovo ThinkServer

Accelerating Server Storage Performance on Lenovo ThinkServer Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER

More information

HybridLog: an Efficient Hybrid-Mapped Flash Translation Layer for Modern NAND Flash Memory

HybridLog: an Efficient Hybrid-Mapped Flash Translation Layer for Modern NAND Flash Memory HybridLog: an Efficient Hybrid-Mapped Flash Translation Layer for Modern NAND Flash Memory Mong-Ling Chiao and Da-Wei Chang Abstract A Flash Translation Layer (FTL) emulates a block device interface on

More information

Technologies Supporting Evolution of SSDs

Technologies Supporting Evolution of SSDs Technologies Supporting Evolution of SSDs By TSUCHIYA Kenji Notebook PCs equipped with solid-state drives (SSDs), featuring shock and vibration durability due to the lack of moving parts, appeared on the

More information

Speeding Up Cloud/Server Applications Using Flash Memory

Speeding Up Cloud/Server Applications Using Flash Memory Speeding Up Cloud/Server Applications Using Flash Memory Sudipta Sengupta Microsoft Research, Redmond, WA, USA Contains work that is joint with B. Debnath (Univ. of Minnesota) and J. Li (Microsoft Research,

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer Disks and RAID Profs. Bracy and Van Renesse based on slides by Prof. Sirer 50 Years Old! 13th September 1956 The IBM RAMAC 350 Stored less than 5 MByte Reading from a Disk Must specify: cylinder # (distance

More information

Integrating Flash-based SSDs into the Storage Stack

Integrating Flash-based SSDs into the Storage Stack Integrating Flash-based SSDs into the Storage Stack Raja Appuswamy, David C. van Moolenbroek, Andrew S. Tanenbaum Vrije Universiteit, Amsterdam April 19, 2012 Introduction: Hardware Landscape $/GB of flash

More information

Flash Memory Technology in Enterprise Storage

Flash Memory Technology in Enterprise Storage NETAPP WHITE PAPER Flash Memory Technology in Enterprise Storage Flexible Choices to Optimize Performance Mark Woods and Amit Shah, NetApp November 2008 WP-7061-1008 EXECUTIVE SUMMARY Solid state drives

More information

Spatial Data Management over Flash Memory

Spatial Data Management over Flash Memory Spatial Data Management over Flash Memory Ioannis Koltsidas 1 and Stratis D. Viglas 2 1 IBM Research, Zurich, Switzerland iko@zurich.ibm.com 2 School of Informatics, University of Edinburgh, UK sviglas@inf.ed.ac.uk

More information

SSD Performance Tips: Avoid The Write Cliff

SSD Performance Tips: Avoid The Write Cliff ebook 100% KBs/sec 12% GBs Written SSD Performance Tips: Avoid The Write Cliff An Inexpensive and Highly Effective Method to Keep SSD Performance at 100% Through Content Locality Caching Share this ebook

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

Impact of Flash Memory on Video-on-Demand Storage: Analysis of Tradeoffs

Impact of Flash Memory on Video-on-Demand Storage: Analysis of Tradeoffs Impact of Flash Memory on Video-on-Demand Storage: Analysis of Tradeoffs Moonkyung Ryu College of Computing Georgia Institute of Technology Atlanta, GA, USA mkryu@gatech.edu Hyojun Kim College of Computing

More information

File Systems for Flash Memories. Marcela Zuluaga Sebastian Isaza Dante Rodriguez

File Systems for Flash Memories. Marcela Zuluaga Sebastian Isaza Dante Rodriguez File Systems for Flash Memories Marcela Zuluaga Sebastian Isaza Dante Rodriguez Outline Introduction to Flash Memories Introduction to File Systems File Systems for Flash Memories YAFFS (Yet Another Flash

More information

An Overview of Flash Storage for Databases

An Overview of Flash Storage for Databases An Overview of Flash Storage for Databases Vadim Tkachenko Morgan Tocker http://percona.com MySQL CE Apr 2010 -2- Introduction Vadim Tkachenko Percona Inc, CTO and Lead of Development Morgan Tocker Percona

More information

FASS : A Flash-Aware Swap System

FASS : A Flash-Aware Swap System FASS : A Flash-Aware Swap System Dawoon Jung, Jin-Soo Kim, Seon-Yeong Park, Jeong-Uk Kang, and Joonwon Lee Division of Computer Science Korea Advanced Institute of Science and Technology 373-1 Guseongdong,

More information

How To Write On A Flash Memory Flash Memory (Mlc) On A Solid State Drive (Samsung)

How To Write On A Flash Memory Flash Memory (Mlc) On A Solid State Drive (Samsung) Using MLC NAND in Datacenters (a.k.a. Using Client SSD Technology in Datacenters) Tony Roug, Intel Principal Engineer SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA.

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

A Data De-duplication Access Framework for Solid State Drives

A Data De-duplication Access Framework for Solid State Drives JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 941-954 (2012) A Data De-duplication Access Framework for Solid State Drives Department of Electronic Engineering National Taiwan University of Science

More information

BOOSTING RANDOM WRITE PERFORMANCE OF ENTERPRISE FLASH STORAGE SYSTEMS. A Thesis. Presented to the. Faculty of. San Diego State University

BOOSTING RANDOM WRITE PERFORMANCE OF ENTERPRISE FLASH STORAGE SYSTEMS. A Thesis. Presented to the. Faculty of. San Diego State University BOOSTING RANDOM WRITE PERFORMANCE OF ENTERPRISE FLASH STORAGE SYSTEMS A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree Master of

More information

High-Performance SSD-Based RAID Storage. Madhukar Gunjan Chakhaiyar Product Test Architect

High-Performance SSD-Based RAID Storage. Madhukar Gunjan Chakhaiyar Product Test Architect High-Performance SSD-Based RAID Storage Madhukar Gunjan Chakhaiyar Product Test Architect 1 Agenda HDD based RAID Performance-HDD based RAID Storage Dynamics driving to SSD based RAID Storage Evolution

More information

Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments

Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments Applied Technology Abstract This white paper introduces EMC s latest groundbreaking technologies,

More information

Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark

Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark In-memory database systems (IMDSs) eliminate much of the performance latency associated with traditional on-disk

More information

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems 1 A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems Chul Lee Software Laboratory Samsung Advanced Institute of Technology Samsung Electronics Outline 2 Background

More information

How To Scale Myroster With Flash Memory From Hgst On A Flash Flash Flash Memory On A Slave Server

How To Scale Myroster With Flash Memory From Hgst On A Flash Flash Flash Memory On A Slave Server White Paper October 2014 Scaling MySQL Deployments Using HGST FlashMAX PCIe SSDs An HGST and Percona Collaborative Whitepaper Table of Contents Introduction The Challenge Read Workload Scaling...1 Write

More information

Storage Class Memory and the data center of the future

Storage Class Memory and the data center of the future IBM Almaden Research Center Storage Class Memory and the data center of the future Rich Freitas HPC System performance trends System performance requirement has historically double every 18 mo and this

More information

SynergyFS: A Stackable File System Creating Synergies between Heterogeneous Storage Devices

SynergyFS: A Stackable File System Creating Synergies between Heterogeneous Storage Devices SynergyFS: A Stackable File System Creating Synergies between Heterogeneous Storage Devices Keun Soo Yim and Jae C. Son Samsung Advanced Institute of Technology {keunsoo.yim, jcson}@samsung.com Abstract

More information

A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems*

A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* Junho Jang, Saeyoung Han, Sungyong Park, and Jihoon Yang Department of Computer Science and Interdisciplinary Program

More information

NAND Flash FAQ. Eureka Technology. apn5_87. NAND Flash FAQ

NAND Flash FAQ. Eureka Technology. apn5_87. NAND Flash FAQ What is NAND Flash? What is the major difference between NAND Flash and other Memory? Structural differences between NAND Flash and NOR Flash What does NAND Flash controller do? How to send command to

More information

On Benchmarking Embedded Linux Flash File Systems

On Benchmarking Embedded Linux Flash File Systems On Benchmarking Embedded Linux Flash File Systems Pierre Olivier Université de Brest, 20 avenue Le Gorgeu, 29285 Brest cedex 3, France pierre.olivier@univbrest.fr Jalil Boukhobza Université de Brest, 20

More information

COS 318: Operating Systems. Storage Devices. Kai Li and Andy Bavier Computer Science Department Princeton University

COS 318: Operating Systems. Storage Devices. Kai Li and Andy Bavier Computer Science Department Princeton University COS 318: Operating Systems Storage Devices Kai Li and Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall13/cos318/ Today s Topics! Magnetic disks!

More information

1 / 25. CS 137: File Systems. Persistent Solid-State Storage

1 / 25. CS 137: File Systems. Persistent Solid-State Storage 1 / 25 CS 137: File Systems Persistent Solid-State Storage Technology Change is Coming Introduction Disks are cheaper than any solid-state memory Likely to be true for many years But SSDs are now cheap

More information

Lab Evaluation of NetApp Hybrid Array with Flash Pool Technology

Lab Evaluation of NetApp Hybrid Array with Flash Pool Technology Lab Evaluation of NetApp Hybrid Array with Flash Pool Technology Evaluation report prepared under contract with NetApp Introduction As flash storage options proliferate and become accepted in the enterprise,

More information

Flash for Databases. September 22, 2015 Peter Zaitsev Percona

Flash for Databases. September 22, 2015 Peter Zaitsev Percona Flash for Databases September 22, 2015 Peter Zaitsev Percona In this Presentation Flash technology overview Review some of the available technology What does this mean for databases? Specific opportunities

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 11 Memory Management Computer Architecture Part 11 page 1 of 44 Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Algorithms and Methods for Distributed Storage Networks 3. Solid State Disks Christian Schindelhauer

Algorithms and Methods for Distributed Storage Networks 3. Solid State Disks Christian Schindelhauer Algorithms and Methods for Distributed Storage Networks 3. Solid State Disks Institut für Informatik Wintersemester 2007/08 Solid State Disks Motivation 2 10 5 1980 1985 1990 1995 2000 2005 2010 PRODUCTION

More information

Comparison of NAND Flash Technologies Used in Solid- State Storage

Comparison of NAND Flash Technologies Used in Solid- State Storage An explanation and comparison of SLC and MLC NAND technologies August 2010 Comparison of NAND Flash Technologies Used in Solid- State Storage By Shaluka Perera IBM Systems and Technology Group Bill Bornstein

More information

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications Energy Efficient Storage Management Cooperated with Large Data Intensive Applications Norifumi Nishikawa #1, Miyuki Nakano #2, Masaru Kitsuregawa #3 # Institute of Industrial Science, The University of

More information

Reduce Latency and Increase Application Performance Up to 44x with Adaptec maxcache 3.0 SSD Read and Write Caching Solutions

Reduce Latency and Increase Application Performance Up to 44x with Adaptec maxcache 3.0 SSD Read and Write Caching Solutions MAXCACHE 3. WHITEPAPER Reduce Latency and Increase Application Performance Up to 44x with Adaptec maxcache 3. SSD Read and Write Caching Solutions Executive Summary Today s data centers and cloud computing

More information

Flash-optimized Data Progression

Flash-optimized Data Progression A Dell white paper Howard Shoobe, Storage Enterprise Technologist John Shirley, Product Management Dan Bock, Product Management Table of contents Executive summary... 3 What is different about Dell Compellent

More information

A Group-Based Wear-Leveling Algorithm for Large-Capacity Flash Memory Storage Systems

A Group-Based Wear-Leveling Algorithm for Large-Capacity Flash Memory Storage Systems A Group-Based Wear-Leveling Algorithm for Large-Capacity Flash Memory Storage Systems Dawoon Jung, Yoon-Hee Chae, Heeseung Jo, Jin-Soo Kim, and Joonwon Lee Computer Science Division Korea Advanced Institute

More information

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System CS341: Operating System Lect 36: 1 st Nov 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati File System & Device Drive Mass Storage Disk Structure Disk Arm Scheduling RAID

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems Jin Kyu Kim 1 Hyung Gyu Lee 1 Shinho Choi 2 Kyoung Il Bahng 2 1 Samsung Advanced Institute of Technology, CTO,

More information

FlashTier: a Lightweight, Consistent and Durable Storage Cache

FlashTier: a Lightweight, Consistent and Durable Storage Cache FlashTier: a Lightweight, Consistent and Durable Storage Cache Mohit Saxena, Michael M. Swift and Yiying Zhang University of Wisconsin-Madison {msaxena,swift,yyzhang}@cs.wisc.edu Abstract The availability

More information

NetApp FAS Hybrid Array Flash Efficiency. Silverton Consulting, Inc. StorInt Briefing

NetApp FAS Hybrid Array Flash Efficiency. Silverton Consulting, Inc. StorInt Briefing NetApp FAS Hybrid Array Flash Efficiency Silverton Consulting, Inc. StorInt Briefing PAGE 2 OF 7 Introduction Hybrid storage arrays (storage systems with both disk and flash capacity) have become commonplace

More information

Flash Memory. Jian-Jia Chen (Slides are based on Yuan-Hao Chang) TU Dortmund Informatik 12 Germany 2015 年 01 月 27 日. technische universität dortmund

Flash Memory. Jian-Jia Chen (Slides are based on Yuan-Hao Chang) TU Dortmund Informatik 12 Germany 2015 年 01 月 27 日. technische universität dortmund 12 Flash Memory Jian-Jia Chen (Slides are based on Yuan-Hao Chang) TU Dortmund Informatik 12 Germany 2015 年 01 月 27 日 These slides use Microsoft clip arts Microsoft copyright restrictions apply Springer,

More information

NAND Flash-based Disk Cache Using SLC/MLC Combined Flash Memory

NAND Flash-based Disk Cache Using SLC/MLC Combined Flash Memory International Workshop on Storage Network Architecture and Parallel I/Os NAND Flash-based Disk Cache Using /MLC Combined Flash Memory Seongcheol Hong School of Information and Communication Engineering

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Don t Let RAID Raid the Lifetime of Your SSD Array

Don t Let RAID Raid the Lifetime of Your SSD Array Don t Let RAID Raid the Lifetime of Your SSD Array Sangwhan Moon Texas A&M University A. L. Narasimha Reddy Texas A&M University Abstract Parity protection at system level is typically employed to compose

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Comparison of Drive Technologies for High-Titan aggregate Performance

Comparison of Drive Technologies for High-Titan aggregate Performance Comparison of Drive Technologies for High-Transaction Databases August 2007 By Solid Data Systems, Inc. Contents: Abstract 1 Comparison of Drive Technologies 1 2 When Speed Counts 3 Appendix 4 5 ABSTRACT

More information

89 Fifth Avenue, 7th Floor. New York, NY 10003. www.theedison.com 212.367.7400. White Paper. HP 3PAR Adaptive Flash Cache: A Competitive Comparison

89 Fifth Avenue, 7th Floor. New York, NY 10003. www.theedison.com 212.367.7400. White Paper. HP 3PAR Adaptive Flash Cache: A Competitive Comparison 89 Fifth Avenue, 7th Floor New York, NY 10003 www.theedison.com 212.367.7400 White Paper HP 3PAR Adaptive Flash Cache: A Competitive Comparison Printed in the United States of America Copyright 2014 Edison

More information

WHITEPAPER It s Time to Move Your Critical Data to SSDs

WHITEPAPER It s Time to Move Your Critical Data to SSDs WHITEPAPER It s Time to Move Your Critical Data to SSDs Table of Contents 1 Introduction 2 3 5 7 Data Storage Challenges: Where Are We Now? Are SSDs Really Worth It? Introducing Micron s Newest SSD: The

More information

UBI with Logging. Brijesh Singh Samsung, India brij.singh@samsung.com. Rohit Vijay Dongre Samsung, India rohit.dongre@samsung.com.

UBI with Logging. Brijesh Singh Samsung, India brij.singh@samsung.com. Rohit Vijay Dongre Samsung, India rohit.dongre@samsung.com. UBI with Logging Brijesh Singh Samsung, India brij.singh@samsung.com Rohit Vijay Dongre Samsung, India rohit.dongre@samsung.com Abstract Flash memory is widely adopted as a novel nonvolatile storage medium

More information

hybridfs: Integrating NAND Flash-Based SSD and HDD for Hybrid File System

hybridfs: Integrating NAND Flash-Based SSD and HDD for Hybrid File System hybridfs: Integrating NAND Flash-Based SSD and HDD for Hybrid File System Jinsun Suk and Jaechun No College of Electronics and Information Engineering Sejong University 98 Gunja-dong, Gwangjin-gu, Seoul

More information

Everything you need to know about flash storage performance

Everything you need to know about flash storage performance Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices

More information

Design of a NAND Flash Memory File System to Improve System Boot Time

Design of a NAND Flash Memory File System to Improve System Boot Time International Journal of Information Processing Systems, Vol.2, No.3, December 2006 147 Design of a NAND Flash Memory File System to Improve System Boot Time Song-Hwa Park*, Tae-Hoon Lee*, and Ki-Dong

More information

Operating Systems, 6 th ed. Test Bank Chapter 7

Operating Systems, 6 th ed. Test Bank Chapter 7 True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one

More information

EMC XtremSF: Delivering Next Generation Performance for Oracle Database

EMC XtremSF: Delivering Next Generation Performance for Oracle Database White Paper EMC XtremSF: Delivering Next Generation Performance for Oracle Database Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

File-System Implementation

File-System Implementation File-System Implementation 11 CHAPTER In this chapter we discuss various methods for storing information on secondary storage. The basic issues are device directory, free space management, and space allocation

More information

FLASH memory is increasingly being used as a storage

FLASH memory is increasingly being used as a storage IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 7, JULY 2010 905 Hydra: A Block-Mapped Parallel Flash Memory Solid-State Disk Architecture Yoon Jae Seong, Eyee Hyun Nam, Jin Hyuk Yoon, Hongseok Kim, Jin-Yong

More information

Storage in Database Systems. CMPSCI 445 Fall 2010

Storage in Database Systems. CMPSCI 445 Fall 2010 Storage in Database Systems CMPSCI 445 Fall 2010 1 Storage Topics Architecture and Overview Disks Buffer management Files of records 2 DBMS Architecture Query Parser Query Rewriter Query Optimizer Query

More information

The Case for Massive Arrays of Idle Disks (MAID)

The Case for Massive Arrays of Idle Disks (MAID) The Case for Massive Arrays of Idle Disks (MAID) Dennis Colarelli, Dirk Grunwald and Michael Neufeld Dept. of Computer Science Univ. of Colorado, Boulder January 7, 2002 Abstract The declining costs of

More information

Page Replacement for Write References in NAND Flash Based Virtual Memory Systems

Page Replacement for Write References in NAND Flash Based Virtual Memory Systems Regular Paper Journal of Computing Science and Engineering, Vol. 8, No. 3, September 2014, pp. 157-172 Page Replacement for Write References in NAND Flash Based Virtual Memory Systems Hyejeong Lee and

More information

Oracle Aware Flash: Maximizing Performance and Availability for your Database

Oracle Aware Flash: Maximizing Performance and Availability for your Database Oracle Aware Flash: Maximizing Performance and Availability for your Database Gurmeet Goindi Principal Product Manager Oracle Kirby McCord Database Architect US Cellular Kodi Umamageswaran Vice President,

More information

Extent Mapping Scheme for Flash Memory Devices

Extent Mapping Scheme for Flash Memory Devices 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems Extent Mapping Scheme for Flash Memory Devices Young-Kyoon Suh, Bongki Moon, Alon Efrat

More information