A Data De-duplication Access Framework for Solid State Drives




JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 941-954 (2012)

A Data De-duplication Access Framework for Solid State Drives

CHIN-HSIEN WU AND HAU-SHAN WU
Department of Electronic Engineering
National Taiwan University of Science and Technology
Taipei, 106 Taiwan
E-mail: {chwu; m9702116}@mail.ntust.edu.tw

With the rapid development of SSDs (Solid State Drives), traditional hard drives have been replaced by SSDs in many applications. Since SSDs consist of NAND flash memory, the main challenge for SSDs is that NAND flash memory is highly sensitive to write requests. A large number of write requests triggers garbage collection to reclaim free space because of the out-of-place update characteristic of flash memory, and frequent garbage collection reduces both the lifetime of flash memory and overall performance. When SSDs are used for data storage, how to significantly decrease the amount of data written therefore becomes an important topic. In this paper, we propose a data de-duplication access framework for SSDs. The objective is to eliminate as much duplicate data as possible and reduce space consumption. We combine a file-based de-duplication scheme and a static chunking de-duplication scheme to achieve complete data de-duplication, and we exploit application-based locality and file-name locality to locate duplicate data. According to the experimental results, the proposed framework can efficiently identify duplicate data and significantly decrease the amount of data written, while keeping the overhead reasonable.

Keywords: embedded systems, flash memory, solid state drives, data de-duplication, storage systems

Received May 31, 2011; accepted March 31, 2012. Communicated by Jiman Hong, Junyoung Heo and Tei-Wei Kuo.

1. INTRODUCTION

At present, most applications such as consumer electronics and embedded systems have adopted NAND flash memory as their main storage media. NAND flash memory is non-volatile, shock-resistant, and power-economical. Since the capacity of NAND flash-memory chips grows rapidly, SSDs (Solid State Drives), which are composed of NAND flash memory, have become popular. For example, OCZ Storage has released large-capacity NAND flash-based SSDs (e.g., 1 TB or above) on the market. Until now, some desktops, notebooks, embedded systems, and servers have also adopted SSDs. We will refer to NAND flash memory as flash memory hereafter.

The management of flash-memory storage is significantly different from that of main memory and disks. The unit of an erase operation is a block, and the unit of a read/write operation is a page. Because flash memory has the out-of-place update characteristic, a page cannot be updated in place; the block containing it must be erased before the page can be written again. When a large amount of data is written or updated on flash memory, many pages are invalidated and must eventually be erased. Therefore, free space on flash memory can run low after a number of writes, and garbage collection must be performed from time to time to recycle the available space on flash memory.

However, garbage collection is considered overhead in flash-memory management because it copies live data. Under current technology, a flash-memory block has a limited erase cycle count; for example, a block can be erased about 100,000 times [1], after which the worn-out block can suffer from frequent write errors. A wear-leveling policy erases all blocks on flash memory evenly so that a longer overall lifetime can be achieved. Obviously, garbage collection must reduce its overhead as much as possible and also consider the wear-leveling policy. Overall, write operations cause larger overhead than read operations, since writes may trigger garbage collection and decrease the lifetime of flash memory through erase operations.

Traditional hard disk drives have been gradually replaced by SSDs on personal computers, embedded systems, and even server systems. Since SSDs consist of flash memory, the large number of write requests generated by these systems leads to performance degradation and reliability problems for SSDs. Fig. 1 shows the percentage of duplicate data on different systems; duplicate data can occupy about 7%~23% of the storage size. If duplicate data can be identified before writing, many write requests to SSDs can be eliminated and system performance will improve. This motivates our research. In this paper, we propose a data de-duplication access framework for SSDs whose objective is to reduce write requests by eliminating as much duplicate data as possible.

Fig. 1. Percentage of duplicate data on different systems.

The rest of this paper is organized as follows: Section 2 reviews related work on data de-duplication algorithms. Section 3 presents the characteristics of duplicate data. Section 4 describes the proposed data de-duplication access framework. Section 5 shows the experimental results. Section 6 concludes the paper.

2. RELATED WORKS

2.1 Duplicate Data

There are two kinds of duplicate data in storage systems [5]: near-duplicate data and full-duplicate data. As shown in Fig. 2, whenever a part of a file is modified, a large part of the existing data may remain unmodified but will still be re-written, since upper-layer applications (e.g., Word, Vi, and Emacs) cannot identify the unmodified data. As shown in Table 1 [7], there is a large overlap between the new version and the older version of the data. As a result, when a file is updated, the unmodified data is called near-duplicate data. On the other hand, when a file is replicated, a new file with the same content is created; the content of the new file is called full-duplicate data. In a file system, near-duplicate and full-duplicate data always exist and waste storage space. To improve the performance of SSDs, duplicate data should be identified and not written, so that garbage collection activities in SSDs decrease and the lifetime of SSDs increases.

Fig. 2. Near-duplicate data.

Table 1. Overlap between the new and older version data.

New data                Older version   Data size   New data   Overlap
emacs 20.7 source       emacs 20.6      52.1 MB     12.6 MB    76%
Elisp doc. + new page   postscript      4.1 MB      0.4 MB     90%
MSWord doc. + edits     MSWord          1.4 MB      0.4 MB     68%

Athicha Muthitacharoen (2001) measured the amount of new data in directories whose files had been edited [7].

2.2 Data De-Duplication Algorithm

The objective of a data de-duplication algorithm is to identify as much duplicate data as possible. The algorithm usually cuts the data stream into fixed-size or variable-size chunks, and each chunk is given an identifier by a fingerprint hash function. Since duplicate data are identified by comparing these identifiers, the hash function must guarantee a very low collision probability. Only distinct data are actually written to storage, as in single-instance storage [2]. There are two kinds of data de-duplication algorithms: in-line checking and post checking. In-line checking identifies duplicate data before it is written to storage, whereas post checking identifies duplicate data after it has been written. For hard disk drives, post checking may not affect system response time because it can be executed when the system load is light. However, post checking is not suitable for SSDs since it cannot reduce the actual write requests. As a result, in-line checking is used in the proposed method.
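To make the distinction concrete, the following minimal Python sketch illustrates the general idea of in-line checking: the fingerprint of incoming data is computed and looked up before any write is issued. The function and variable names (write_with_inline_dedup, fingerprint_index) are illustrative assumptions for this example, not part of the paper's implementation.

    import hashlib

    # Illustrative in-memory fingerprint index: fingerprint -> location of stored data.
    fingerprint_index = {}
    storage = {}          # stands in for the SSD; maps a location to the stored bytes
    next_location = 0

    def write_with_inline_dedup(data: bytes) -> int:
        """Return the location of 'data', writing it only if no copy exists yet."""
        global next_location
        fp = hashlib.sha1(data).hexdigest()   # fingerprint of the incoming data
        if fp in fingerprint_index:           # duplicate detected before writing
            return fingerprint_index[fp]      # no write request is issued
        location = next_location              # otherwise the new data is written
        storage[location] = data
        next_location += 1
        fingerprint_index[fp] = location
        return location

Under this scheme a duplicate write costs only a hash computation and an index lookup, which is why post checking is unattractive for SSDs: the write would already have been issued before the duplicate is found.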

2.3 Fingerprint Hash Function

The purpose of the fingerprint hash function is to generate a fingerprint for each chunk. Michael O. Rabin proposed the following fingerprinting scheme [3, 4]: assume Ω is the set of all possible objects; for all A, B ∈ Ω, with high probability f(A) = f(B) ⇒ A = B. Note that fingerprinting may still have a collision, i.e., f(A) = f(B) while A ≠ B. A fingerprint hash function must therefore have a very low collision probability. In this paper, chunks are hashed by the one-way hash function SHA-1 [9]. Researchers have shown that the probability of a hash collision with SHA-1 is less than 10^-19 [5]. SHA-1 produces a 160-bit digest from a message with a maximum length of (2^64 - 1) bits.

2.4 Data Chunking

Since the data de-duplication algorithm cuts the data stream into fixed-size or variable-size chunks, there are two chunking approaches: static chunking and variable chunking [5, 6, 8].

2.4.1 Static chunking

Static chunking is fixed-size partitioning. In this approach, the data stream is divided into fixed-size chunks, and the de-duplication algorithm can compare the fixed-size chunks with low complexity. However, the effectiveness of this approach is highly sensitive to data modifications, since even a one-byte insertion at the beginning of a file changes the content of all subsequent fixed-size chunks. For example, Fig. 3 shows one byte inserted at offset 5; because the following data are shifted, the content of every chunk after offset 5 changes. This is called the data shifting problem.

Fig. 3. Two data chunking approaches: static chunking and variable chunking.
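As a concrete illustration of static chunking and the data shifting problem, the short Python sketch below (our own example; the names fixed_chunks and CHUNK_SIZE are not taken from the paper) cuts a byte stream into fixed-size chunks and fingerprints each chunk with SHA-1.

    import hashlib

    CHUNK_SIZE = 8  # a tiny chunk size, chosen only to keep the example readable

    def fixed_chunks(data: bytes):
        """Split 'data' into fixed-size chunks and return their SHA-1 fingerprints."""
        return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
                for i in range(0, len(data), CHUNK_SIZE)]

    old = b"0123456789abcdefghijklmnop"
    new = old[:5] + b"X" + old[5:]   # one byte inserted at offset 5

    # Every chunk at or after the insertion point gets a different fingerprint,
    # so static chunking finds no duplicate chunks here (the data shifting problem).
    shared = set(fixed_chunks(old)) & set(fixed_chunks(new))
    print(len(shared))   # prints 0

Variable chunking, discussed next, avoids this effect by deriving chunk boundaries from the data content rather than from fixed offsets.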

2.4.2 Variable chunking

Variable chunking is variable-size partitioning. In this approach, the data stream is divided into variable-size chunks, which solves the data shifting problem caused by static chunking. Whenever a chunk of a file is modified, variable chunking dynamically changes the size of the corresponding chunk, as shown in Fig. 3. The overhead of this approach is the need to maintain mapping information between a variable-size chunk and one or more physical blocks. Although this causes management overhead, variable chunking can efficiently eliminate duplicate data because it solves the data shifting problem.

3. CHARACTERISTICS OF DUPLICATE DATA

To find duplicate data effectively, we conducted a series of experiments to observe the characteristics of duplicate data. In the experiments, we set up an 8 GB SSD storage system with Linux and Windows XP. The duplicate-checking program is Easy Duplicate Finder, and the checking level is file-based. Based on the experimental results, two characteristics of duplicate data are described below.

3.1 Application-based Locality

Fig. 4 shows that 86% of duplicate data exist in the same directory, because a large proportion of duplicate data are generated by the same application; such duplicate data are usually temporary files or log files. We call this characteristic application-based locality. Using application-based locality, duplicate data can be identified efficiently and the overhead of the data de-duplication algorithm can be reduced.

Fig. 4. Two characteristics of duplicate data: application-based locality and file-name locality.

3.2 File-name Locality

Fig. 4 also shows that 60% of duplicate data have the same file name (e.g., a file copied to another directory with the same name), 20% have a similar file name (e.g., system.loga and system.logb could contain duplicate data), and the remaining 20% are unrelated in terms of file name. We call this characteristic file-name locality. Using file-name locality, duplicate data can be identified efficiently and the overhead of the data de-duplication algorithm can be reduced.

4. DATA DE-DUPLICATION ACCESS FRAMEWORK

In this section, the data de-duplication (DDD) access framework is described. The proposed framework can efficiently identify duplicate data and significantly decrease the amount of data written, while keeping the overhead reasonable. There are four parts in the framework:

- Meta table
- Eliminate full-duplicate data
- Eliminate near-duplicate data
- Reference count

The proposed framework resides above the VFS (Virtual File System) layer and is implemented as library code that can be called by upper-layer applications. Section 4.1 describes the meta table, which maintains the meta-data required by the data de-duplication algorithm; each entry in the meta table denotes a file currently maintained by the framework. Sections 4.2 and 4.3 discuss how to efficiently eliminate full-duplicate data and near-duplicate data. Section 4.4 introduces the reference count, which maintains the sharing relationship so that files are not deleted incorrectly.

4.1 Meta Table

The design goal of the meta table is to maintain the meta-data required by the data de-duplication algorithm.

Table 2. An entry in the meta table.

Field                    Example
Name Key                 535541
File Name                FileD
File-Based Fingerprint   a9993e364706816aba3...
Reference Count          1
Physical Location        FileD(0, 4096) FileD_log(0, 2048) FileD(4097, 6144)

To find duplicate data, the framework maintains an entry in the meta table for each file. Each entry contains a name key, a file name, a file-based fingerprint, a reference count, and a physical location; Table 2 shows an example. The file name indicates that the file is currently maintained by the framework. The name key is used to quickly identify files with similar file names; the framework generates it with a hash function that sums all characters of the file name in ASCII code, a design that causes little overhead yet still has a high probability of finding similar file names. The file-based fingerprint is computed by applying SHA-1 to the entire file and can be used to quickly determine whether the file's content is identical to another file's. Since a file can be referenced by other entries, a reference count is kept; its use is explained in Section 4.4. The physical location denotes where the file's data reside. In this example, FileD(0, 4096) FileD_log(0, 2048) FileD(4097, 6144) means that the latest version of FileD at offsets 4097 to 6145 is located in FileD_log at offsets 0 to 2048.
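As a rough illustration only, an entry such as the one in Table 2 could be represented by a structure like the following Python sketch; the class and field names (MetaEntry, Extent) are assumptions made for this example and are not defined in the paper.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Extent:
        """One piece of a file's physical location: a byte range within 'source'."""
        source: str   # e.g., "FileD" or its log file "FileD_log"
        start: int
        end: int

    @dataclass
    class MetaEntry:
        name_key: int                 # sum of the file name's ASCII codes (Section 4.1)
        file_name: str
        file_fingerprint: str         # SHA-1 digest of the whole file
        reference_count: int
        physical_location: List[Extent] = field(default_factory=list)

    # The example entry of Table 2 (the name-key value is copied from the table):
    entry = MetaEntry(535541, "FileD", "a9993e364706816aba3...", 1,
                      [Extent("FileD", 0, 4096), Extent("FileD_log", 0, 2048),
                       Extent("FileD", 4097, 6144)])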

4.2 Eliminate Full-duplicate Data

According to application-based locality, a large portion of duplicate data is generated by the same application; according to file-name locality, files with similar file names are likely to share a large portion of duplicate data. Therefore, it is efficient to eliminate duplicate data based on these localities. As shown in Fig. 5, we use an example to explain how duplicate data are eliminated with file-name locality. Assume that FileD is about to be created and written.

(1) In the meta table, an entry is generated for FileD, and its file name, name key, and file-based fingerprint are created accordingly.
(2) Once FileD's name key (NK) is created, the files whose name keys fall within the range [NK - k, NK + k] can be found quickly, where k is a threshold. Since NK (i.e., 'F' + 'i' + 'l' + 'e' + 'D') is created by summing the characters of FileD in ASCII code, k can be used to find files with similar file names.
(3) After the files with similar file names (e.g., FileA and FileC) are found, FileD's fingerprint is compared with theirs. If one of these files (e.g., FileA) has the same fingerprint as FileD, FileD is not written to the SSD, and the physical location in FileD's meta-table entry keeps the information about FileA.

Fig. 5. An example of eliminating full-duplicate data.
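The name-key computation and the similar-name lookup of the steps above can be sketched as follows. This is a minimal illustration under the assumptions of Section 4.2; the helper names (name_key, find_similar_names, eliminate_full_duplicate) and the default threshold value k are ours, not the paper's.

    import hashlib

    def name_key(file_name: str) -> int:
        """Sum of the file name's characters in ASCII code (the paper's name-key hash)."""
        return sum(ord(c) for c in file_name)

    def find_similar_names(meta_table, nk: int, k: int):
        """Return entries whose name key lies within [nk - k, nk + k]."""
        return [e for e in meta_table if nk - k <= e["name_key"] <= nk + k]

    def eliminate_full_duplicate(meta_table, file_name: str, data: bytes, k: int = 8):
        """Steps (1)-(3): create an entry, search similar names, compare fingerprints."""
        nk = name_key(file_name)
        fp = hashlib.sha1(data).hexdigest()
        for cand in find_similar_names(meta_table, nk, k):
            if cand["fingerprint"] == fp:     # full duplicate: do not write the data
                entry = {"name_key": nk, "file_name": file_name, "fingerprint": fp,
                         "ref": 1, "location": cand["file_name"]}
                cand["ref"] += 1              # the existing file is now shared
                meta_table.append(entry)
                return entry, False           # False: no bytes written to the SSD
        entry = {"name_key": nk, "file_name": file_name, "fingerprint": fp,
                 "ref": 1, "location": file_name}   # unique content: write normally
        meta_table.append(entry)
        return entry, True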

According to our observations, a small portion of files (e.g., 10%~20%) that have no similar file name may also contain duplicate data. However, finding these files would require a large amount of computing time. As a result, for the sake of system performance, files without application-based locality or file-name locality are not examined.

4.3 Eliminate Near-duplicate Data

In this section, the de-duplication algorithm for near-duplicate data is described. Section 4.3.1 presents the chunk fingerprint table, which reduces the checking time of fingerprinting. Section 4.3.2 discusses the log-file access scheme, which solves the data shifting problem. Section 4.3.3 explains how the modified area of an updated file is identified. Since invalid data can accumulate in the log file, Section 4.3.4 describes a merge operation that reclaims the invalid data.

4.3.1 Chunk fingerprint table

If a file has been updated recently, it is likely to be updated again soon. To quickly identify near-duplicate data, the related fingerprints of such a file should be recorded, because these fingerprints help identify the near-duplicate data. The chunk fingerprint table maintains a file's chunk fingerprints when the file is updated; each entry in the table contains a file name and a 160-bit (SHA-1) fingerprint for each chunk of the file.

4.3.2 Log file

A log file is used to solve the data shifting problem under static chunking. As shown in Fig. 6 (2), a file may be modified by a data insertion. When the file is closed, everything from the modified area to the last chunk might normally be written to the SSD. However, if the modified area can be located, as shown in Fig. 6 (3), only the modified area is written to the log file, and the new mapping is added to the corresponding physical location in the meta table. As shown in Fig. 6 (4), the physical location then records where the valid area is in the original file and where the modified area is in the log file.

Fig. 6. Log file: an example of data insertion.
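To connect the log-file idea with the physical-location mapping of Section 4.1, the hedged sketch below appends only the modified area to a log file and records a new mapping in the paper's FileX(start, end) style. The function name write_modified_area_to_log and the exact mapping layout are assumptions for illustration, not the paper's implementation.

    import os

    def write_modified_area_to_log(entry: dict, log_path: str, modified: bytes) -> str:
        """Append only the modified area to the log file and record the new mapping
        in the meta-table entry."""
        start = os.path.getsize(log_path) if os.path.exists(log_path) else 0
        with open(log_path, "ab") as log:
            log.write(modified)
        mapping = "%s(%d, %d)" % (os.path.basename(log_path), start, start + len(modified))
        entry.setdefault("physical_location", []).append(mapping)
        return mapping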

The physical location format was described in Section 4.1. Since only the modified area is written, unnecessary (redundant) writes can be avoided. The following subsection presents how to efficiently identify the modified area when a file is updated.

4.3.3 Identification of near-duplicate data

In this section, we propose an identification of near-duplicate data under fixed-size chunks. Assume that FileD is the old version of a file and FileD' is the new version after FileD is updated, and that the modified area consists of two parts, as shown in Fig. 7. The identification executes forward order checking and backward order checking. Forward order checking means that each chunk in FileD' is compared with the corresponding chunk in FileD by their fingerprints in forward sequential order from the beginning. Backward order checking is similar except that it is executed in backward sequential order from the end. Since FileD and FileD' might have different file lengths, forward order checking and backward order checking can locate different starting and ending offsets for the different parts of the modified area. Note that forward order checking and backward order checking continue until all modified areas are found. After the modified area is identified and written, the corresponding physical location is also updated. As shown in Fig. 8, since the modified area consists of two parts, FileD_log(0, 2048) and FileD_log(2049, 4097) denote the new data mappings and are added to the corresponding physical location.

Fig. 7. Identification of near-duplicate data.
Fig. 8. The relocation of the de-duplication algorithm.
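A minimal sketch of the forward/backward comparison over fixed-size chunk fingerprints is given below; it assumes both versions have already been chunked and fingerprinted (as in Section 2.4.1), and the function name locate_modified_bounds is ours. Unlike the paper's full scheme, which keeps checking until every modified part is separated, this sketch only locates the outer bounds of the modified region.

    def locate_modified_bounds(old_fps: list, new_fps: list):
        """Compare chunk fingerprints forward from the beginning and backward from
        the end; return (matching_prefix_chunks, matching_suffix_chunks)."""
        # Forward order checking: stop at the first chunk whose fingerprints differ.
        forward = 0
        while (forward < len(old_fps) and forward < len(new_fps)
               and old_fps[forward] == new_fps[forward]):
            forward += 1
        # Backward order checking: count matching chunks from the end, without
        # crossing the forward position (the two files may differ in length).
        backward = 0
        while (backward < len(old_fps) - forward and backward < len(new_fps) - forward
               and old_fps[-1 - backward] == new_fps[-1 - backward]):
            backward += 1
        return forward, backward   # chunks [forward, len(new_fps) - backward) are modified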

4.3.4 Merge operation

Since data are written to the log file in an appending manner, the log file can accumulate invalid data over time. A merge operation is required to reclaim the invalid data; it is similar to garbage collection on flash memory. As shown in Fig. 9, FileD and its log file FileD_log can be merged into a new file by copying the valid data according to the corresponding physical location in the meta table. Obviously, the merge operation causes overhead on SSDs, but it increases space utilization. To balance this trade-off, the merge operation is executed only when the ratio of invalid data to valid data is larger than a threshold; this ratio is maintained whenever an update is executed.

Fig. 9. An example of merge operation.

4.4 Reference Count

The reference count is used to maintain the sharing relationship in the framework. The process of deleting files must check the corresponding reference count, and a referenced file can actually be deleted only when its reference count reaches 0. As shown in Fig. 10, FileB is referred to by FileA, so their reference counts are 2 and 1, respectively. If FileA is deleted, the reference count of FileB becomes 1, and FileA itself is actually deleted since its reference count drops to 0. On the other hand, if FileB is deleted, it cannot actually be removed, because FileA still refers to it.

Fig. 10. The process of deleting files.
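The deletion rule of Section 4.4 can be sketched as follows; this is a simplified illustration in which the function name delete_file and the entry layout are our assumptions.

    def delete_file(meta_table: dict, name: str):
        """Decrease reference counts on deletion; a file is physically removed only
        when its own reference count reaches 0."""
        entry = meta_table[name]
        target = entry.get("refers_to")       # the file this entry shares data with, if any
        entry["ref"] -= 1
        if entry["ref"] == 0:                 # nobody shares this entry's data any more
            del meta_table[name]              # the file can actually be deleted
        if target is not None and target in meta_table:
            meta_table[target]["ref"] -= 1    # the referenced file loses one reference
            if meta_table[target]["ref"] == 0:
                del meta_table[target]

    # Fig. 10 scenario: FileA refers to FileB, so their counts are 1 and 2.
    meta_table = {"FileA": {"ref": 1, "refers_to": "FileB"},
                  "FileB": {"ref": 2, "refers_to": None}}
    delete_file(meta_table, "FileA")   # FileA is removed; FileB's count drops to 1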

5. EVALUATION

In the experimental environment, the processor is a Pentium Dual-Core CPU E2180 at 2.00 GHz and the main memory size is 3 GB. The SSD storage is a Transcend 8 GB SLC solid-state disk (TS8GSSD25S-S). In the experiments, we use several well-known file systems, FAT32, NTFS, and Ext3, as comparison baselines. FAT32 and NTFS are the file systems of Windows XP; Ext3 is the most popular file system in Linux and was built on Ubuntu 8.10 in the experiments. The benchmark io_profile [15] was used in the experiments; it generates a large number of open-seek-write-close operations at random locations in a set of files and can measure performance without the influence of the operating system's buffer cache. We create a set of test files (about 60 files), each 4 MB in size, set the random write size from 512/4096 bytes to 131,072 bytes, and execute 100 repetitions for each random write size. In the experiments, we measure the average throughput, the average latency, and the number of bytes written.

5.1 Throughput

The experiment measured the average throughput under different random write sizes from 512/4096 bytes to 131,072 bytes. Note that the smallest sector size is 512 bytes for FAT32/NTFS and 4096 bytes for Ext3. The results are shown in Fig. 11. The x-axis represents the random write size for each repetition and the y-axis represents the average throughput. When the random write size was small, the access framework could spend a lot of time finding only a small amount of duplicate data, so the average throughput was not good. However, when the random write size was increased (e.g., above 2048 bytes), the average throughput became better than that without the access framework, because a large amount of duplicate data could be identified for the large write sizes. As a result, the access framework benefits applications that tend to write a large amount of data.

Fig. 11. Average throughput.

5.2 Latency

The experiment measured the average latency, i.e., how long a write request takes to finish, which also reflects the overhead of the access framework. The result is shown in Fig. 12. The x-axis represents the random write size for each repetition and the y-axis represents the average latency. As described in the previous section, high throughput also corresponds to short latency: a large write size provides a better opportunity to identify a large amount of duplicate data, so short latency can be achieved. Overall, a file system with the access framework achieves short latency.

Fig. 12. Average latency.

5.3 Bytes Written

The experiment measured the actual number of bytes written to the SSD. As shown in Fig. 13, the x-axis represents the random write size for each repetition and the y-axis represents the number of bytes written. When an application issues a save command for a 4 MB file without the identification of duplicate data, writing 4 MB of data to the SSD could be required. However, the access framework decreases the actual number of bytes written significantly compared with the case without the identification of duplicate data.

Fig. 13. Bytes written.

6. CONCLUSION

In this paper, we propose a data de-duplication access framework for SSDs. The framework eliminates as much duplicate data as possible so that system performance improves and the lifetime of SSDs increases. The access framework consists of four parts: (1) the meta table; (2) elimination of full-duplicate data; (3) elimination of near-duplicate data; and (4) the reference count. All meta-data required by the data de-duplication algorithm are stored in the meta table. Full-duplicate and near-duplicate data can be identified efficiently through application-based locality and file-name locality. In particular, the reference count maintains the sharing relationship so that files are not deleted incorrectly. According to the experimental results, the access framework eliminates duplicate data efficiently, so the average throughput and latency are improved significantly, while the overhead caused by fingerprint checking remains reasonable. Future research should further examine the characteristics of duplicate data, especially when different applications are executed. With a more refined approach that incorporates application designs and access patterns, a real prototype will be designed and built to provide fast and efficient SSDs.

REFERENCES

1. Samsung Electronics, NAND Flash-Memory Datasheet and SmartMedia Data Book, 2011.
2. Microsoft, "Single instance storage in Microsoft Windows Storage Server 2003 R2," Technical White Paper, 2006.
3. R. M. Karp and M. O. Rabin, "Efficient randomized pattern-matching algorithms," IBM Journal of Research and Development, Vol. 31, 1987, pp. 249-260.
4. M. O. Rabin, "Fingerprinting by random polynomials," Center for Research in Computing Technology, Harvard University, 1981.
5. A. Brinkmann, "Data deduplication," Theoretical Aspects of Storage Systems, http://pc2.uni-paderborn.de/fileadmin/pc2/media/staffweb/andre_brinkmann/courses/wroclaw_storage_systems/wroclaw_chapter_3_-deduplication.pdf.
6. D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki, "Improving duplicate elimination in storage systems," ACM Transactions on Storage, Vol. 2, 2006, pp. 424-448.
7. A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001, pp. 174-187.
8. J. Kubiatowicz, et al., "OceanStore: An architecture for global-scale persistent storage," in Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 190-201.
9. National Institute of Standards and Technology, "Secure hash standard," Federal Information Processing Standards Publication 180-1, 1995.

Chin-Hsien Wu received his B.S. degree in Computer Science from National Chung Cheng University in 1999. He received his M.S. and Ph.D. degrees in Computer Science from National Taiwan University in 2001 and 2006, respectively. He is now an Associate Professor in the Department of Electronic Engineering at National Taiwan University of Science and Technology. He is also a member of the ACM and the IEEE. His research interests include embedded systems, real-time systems, ubiquitous computing, and flash-memory storage systems.

Hau-Shan Wu received his B.S. degree in Computer Science from Tamkang University in 2008. He received his M.S. degree in Electronic Engineering from National Taiwan University of Science and Technology in 2010. His research interests include embedded systems and flash-memory storage systems.