The assignment of chunk size according to the target data characteristics in deduplication backup system

Mikito Ogata
Hitachi Information and Telecommunication Engineering, Ltd.
781 Sakai, Nakai-machi, Ashigarakami-gun, Kanagawa

Norihisa Komoda
Osaka University
2-1, Yamadaoka, Suita, Osaka

Abstract

This paper focuses on the trade-off between the deduplication rate and the processing penalty in a backup system that uses a conventional variable chunking method. The trade-off is a non-linear negative correlation when the chunk size is fixed. In order to analyze the trade-off quantitatively over all of the factors, a simulation approach is taken; it clarifies the correlations among chunk sizes, densities, and average lengths of the differed parts. It then shows that dynamically assigning an appropriate chunk size based on the data characteristics weakens the trade-off and provides higher efficiency than the conventional way.

Keywords: Deduplication, Backup, Archive, Capacity Optimization, Enterprise Storage

1 Introduction

Due to the explosive increase of data in IT systems, backup operations are becoming a burden because of resource usage, processing time, and management cost, while they remain indispensable for recovering data in case of an unpredictable disaster. The major requirements for backup operations are to shorten the processing time and to reduce resource usage, especially the storage capacity needed to keep data for a long term. Recently, the technology called deduplication has become popular to reduce this burden. Deduplication eliminates duplication within the backup target data and stores only the unique data in the storage. The reduction of stored data lowers not only the backup storage cost but also the other resources the workload consumes. Various techniques have been proposed so far to provide more reduction with less processing time, from the viewpoint of cost-effective backup operation.
However, because the deduplication rate and the processing time have a non-linear negative correlation, it is difficult to improve both simultaneously if only one chunk size is used. This paper shows how to assign an appropriate chunk size according to the data characteristics. The assignment weakens the trade-off of the single chunk-size method and therefore provides more efficient deduplication than the conventional approach.

2 Deduplication backup system

2.1 Variable chunking algorithm

In typical customer environments, backup target data includes various types of files, and how much duplication remains and where it is located depends on the environment and applications: some data is scarcely duplicated because it has been heavily edited, changed, or updated, while some is densely duplicated because it has rarely been changed, only replicated or copied. In this paper, all changed or updated parts of the data are called differed areas and all unchanged parts are called identical areas.

A deduplication backup system has a processing module that eliminates duplication within the target data. The module reads the target data, divides it into many small segments, distinguishes the duplicated areas from the unique or newly updated areas, then transfers and stores only the unique data in the storage. Figure 1 shows the typical operational flow of the deduplication process, from reading the target data to finally storing the unique data. The sizes of the boxes are not to scale.

Figure 1. Deduplication process.

The module divides the target data into small segments called chunks, which are the units of reduction and storing (Chunking). Two dividing algorithms are widely implemented: fixed length chunking and variable length chunking. Fixed length chunking divides the data into chunks of the same length, defined in advance. Variable length chunking divides the data into chunks of variable length determined by the data patterns. A typical variable length chunking algorithm scans the data from the beginning of the file to the end one byte at a time by shifting a fixed length window. The algorithm generates a special value, called a signature, from the window using Rabin's algorithm. When the signature matches a predefined value, called the anchor, the end byte of the window is set as the end of the chunk; otherwise the window is shifted by one byte and a signature is generated again. The average chunk size is determined by how many bits are taken as the signature. For example, if the length of the signature is 12 bits and the signature calculation is assumed to generate completely random values, 2^12 patterns are possible, which results in a 4KB average chunk size.

Next, the module calculates a unique code from the chunk data in order to analyze the similarity of chunks. A hashing algorithm such as SHA-1 or SHA-256 is commonly used to calculate the code. This code is called a fingerprint (Fingerprinting). The module then decides the uniqueness of each chunk using the fingerprint (Decision): a chunk is duplicated if the same chunk has already been stored, or not duplicated if all previously stored chunks are different. Finally, the module writes only the not-duplicated chunks to the storage (Writing).

2.2 Issues

Many approaches have been proposed and implemented to increase the efficiency of deduplication. For customers, two criteria are important when discussing efficiency: the deduplication rate and the processing time.
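As an illustration, the anchor-based chunking and fingerprinting steps above can be sketched in Python as follows. This is a hypothetical sketch, not the product's implementation: a production chunker updates a Rabin rolling hash in O(1) per byte, while this sketch simply rehashes the window, and it omits the minimum/maximum chunk bounds discussed later.

```python
import hashlib

def chunk_boundaries(data: bytes, sig_bits: int = 12, window: int = 48):
    """Return end offsets of chunks found by anchor matching."""
    mask = (1 << sig_bits) - 1
    anchor = 0                       # the predefined anchor value
    boundaries = []
    for i in range(window, len(data) + 1):
        # A real implementation uses a Rabin rolling hash updated in O(1);
        # rehashing the whole window here keeps the sketch short.
        sig = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if sig & mask == anchor:
            boundaries.append(i)     # chunk ends at the last byte of the window
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def fingerprints(data: bytes, boundaries):
    """SHA-1 fingerprint of each chunk, used for the duplicate decision."""
    prev, fps = 0, []
    for b in boundaries:
        fps.append(hashlib.sha1(data[prev:b]).hexdigest())
        prev = b
    return fps
```

With `sig_bits` = 12, a boundary is expected every 2^12 bytes on average, matching the 4KB example above.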
The deduplication rate should be as high as possible and the processing time as short as possible. Choosing an adequate chunk size for the algorithm is one key factor, because it heavily affects the efficiency. In many cases, the chunk size is determined so as to balance performance and deduplication capability for an assumed hypothetical environment.

To gain more reduction, assigning a smaller chunk size is generally effective: a smaller chunk size can pick out the differed areas more precisely than a bigger one. However, the effectiveness of a chunk size, that is, how sharply it cuts off the duplication, depends strongly on the distribution of differed areas within the data, such as the length of each differed area and the distance between differed areas. When a small chunk size is applied to long differed areas, many consecutive chunks may be wasted covering the not-duplicated area. When differed areas are located at distances similar to the chunk size, many chunks may contain both differed and identical areas, which results in a poor deduplication rate. When differed areas are distributed at much smaller or much larger distances than the chunk size, the effectiveness improves.

The processing time, that is, how much time or resource utilization is necessary to reduce the duplication, also depends on the chunk size, among other factors. A smaller chunk size requires more CPU- and IO-intensive processing; a bigger one requires less. The more data is deduplicated, the fewer storing operations go to the storage. Furthermore, the correlation between processing time and chunk size is non-linear: for smaller chunk sizes, the processing time increases more steeply as the size decreases. This non-linearity makes the trade-off hard to overcome. In any conventional way, only one chunk size is assigned for all data and all environments, which is not effective.
If the chunk size is assigned so as to gain a better deduplication rate with a smaller processing time according to the characteristics of the data or the environment, it is beneficial for the system.

3 Simulation

3.1 Implementation

The following are the operation and assumptions of our simulator, which we call SIM for convenience.
(1) Backup data

The target backup data is implemented as a list of 1024*1024 entries, where one entry represents 64 bytes of real data; in total it represents 64MB of real data. In the stream there are many identical areas, identical to previously stored areas, and many differed areas, not identical to any previously stored area. SIM reduces the identical areas as much as it can and always retains all of the differed areas. The ratio of the amount of all identical areas to the total amount of data is defined as the duplication rate r, with domain 0 <= r <= 1. In other words, the ratio of all differed areas is 1 - r. The differed areas are spread with equal density over the stream.

(2) Chunking

SIM assumes an anchor algorithm to separate the data into chunks. The signature length that controls the average chunk size is set from 11 to 15 bits, corresponding to average chunk sizes from 2 to 32KB. In this case, the chunk sizes follow a geometric distribution with parameter p = 1/2^n, where n is the signature length in bits. We denote the case where the average chunk size is m by the notation (m) for convenience. For practical industrial reasons, SIM sets two boundaries, a minimum and a maximum size: the minimum is half the average chunk size and the maximum is one and a half times the average chunk size. For example, when the average chunk size is 8KB, the minimum size is 4KB and the maximum size is 12KB. Table 1 lists the parameters of the simulation.

Table 1. Parameters of simulation
Parameter                     | Values
Duplication rate: r           | 0.2, 0.3, 0.4, ...
Length of differed area [KB]  | 2, 4, 8, ...
Average chunk size: m [KB]    | 2, 3, 4, ...

(3) Decision

SIM checks whether a chunk includes a differed area. When the chunk includes no differed area at all, SIM decides that the chunk should be removed.
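The chunk-size distribution described above can be sketched as follows. This is an assumed Python re-implementation, not the paper's R code: it draws one chunk size from a geometric distribution with parameter p = 1/2^n by inverse-transform sampling and clamps it to the [0.5m, 1.5m] bounds.

```python
import math
import random

def draw_chunk_size(sig_bits: int, rng: random.Random) -> int:
    """Draw one chunk size in bytes for a signature of `sig_bits` bits."""
    m = 2 ** sig_bits              # average chunk size, e.g. 12 bits -> 4096 B
    p = 1.0 / m                    # geometric parameter p = 1/2**n
    # Inverse-transform sample from a geometric distribution with mean 1/p.
    u = rng.random()
    size = int(math.log(1.0 - u) / math.log(1.0 - p)) + 1
    # Clamp to the industrial bounds described above: [0.5*m, 1.5*m].
    return max(m // 2, min(size, m + m // 2))
```

For 12 signature bits, every drawn size falls between 2048 and 6144 bytes, matching the 4KB-average example in the text.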
When the chunk includes a fully or partially differed area, SIM decides that the chunk should be retained. The deduplication rate is calculated as the ratio of the total amount of chunks decided to be removed to the total amount of data. In addition, the fulfillment rate is calculated as the ratio of the deduplication rate to the duplication rate. The number of reduced chunks is counted as the number of identical, and therefore removed, chunks. Since the processing time of deduplication varies greatly with the resource configuration and usage in the system, SIM evaluates the total number of removed chunks as a uniform substitute.

(4) Environment

SIM is coded in R and runs on hardware with an Intel Core i7 2.6GHz CPU and 8GB (1600MHz, DDR3) of memory.

3.2 Evaluation of the validity of SIM against measured data

First, the validity of SIM is evaluated by comparison with equivalent data gathered from measurements on real hardware. Our experimental hardware is a Linux-based NAS (Network Attached Storage) product connected to HDD storage with an NFS (Network File System) interface, in which a variable chunking mechanism is coded. The chunk size is controlled as an average chunk size parameter for the NAS and as 1/p for SIM; these parameters are called operational chunk sizes here. The total amount of data in the measurement is 1GB and the length of the differed areas is fixed at 32KB. The differed areas are randomly distributed over the data with the duplication rate as a control parameter, assumed to reflect practical data characteristics. The NAS reports the deduplication rate, the average chunk size, and the number of reduced chunks for each experiment.

Table 2 shows the comparison of the two chunk sizes gathered from the experiments: the chunk size into which the NAS divides in the measurement, and the one into which SIM divides in the simulation. The simulated chunk sizes match the measured chunk sizes well for smaller operational chunk sizes.
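The decision rule and the two rates defined above can be sketched as follows. This is a hypothetical Python sketch, not the paper's R code; `differed_ranges` is assumed to hold half-open byte intervals of the differed areas.

```python
def evaluate(boundaries, differed_ranges, total_len, duplication_rate):
    """Apply SIM's decision rule: a chunk is removed only when it overlaps
    no differed area at all. Returns (dedup rate, fulfillment rate,
    number of removed chunks)."""
    removed_bytes, removed_chunks, prev = 0, 0, 0
    for b in boundaries:
        # Chunk [prev, b) survives if it intersects any differed range.
        dirty = any(s < b and e > prev for s, e in differed_ranges)
        if not dirty:
            removed_bytes += b - prev
            removed_chunks += 1
        prev = b
    dedup_rate = removed_bytes / total_len            # removed / total
    fulfillment = dedup_rate / duplication_rate       # removed / removable
    return dedup_rate, fulfillment, removed_chunks
```

For example, with two 4-byte chunks over 8 bytes where the differed area covers bytes [0, 4), only the second chunk is removed, giving a deduplication rate of 0.5 and a fulfillment rate of 1.0 at a duplication rate of 0.5.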
The simulated chunk sizes become gradually shorter than the measured ones as the operational chunk size increases. This comes from the difference that the NAS has multiple chunking algorithms installed, while SIM uses only one variable chunking algorithm.

Table 2. Comparison of simulated and measured chunk size
Operational chunk size | Measured [KB] | Simulated [KB] | Difference [%]
2KB                    | ...           | ...            | ...
...                    | ...           | ...            | ...

Table 3 and Figure 2 show the comparison of deduplication rates. The rates match for smaller chunk sizes, with a 2% difference in the case of 8KB. The difference becomes larger as the chunk size increases, and the rate becomes worse for the NAS as the duplication rate increases. This comes from the difference that the NAS embeds some additional information into the stream to improve data integrity, while SIM does not. The embedded information spoils the deduplication efficiency and lowers the rate as the chunk size increases. The rate is better for the NAS in the case that the duplication rate equals 0.5. This comes from the difference that the NAS uses multiple chunking algorithms, which results in a higher deduplication rate, while at the same time the NAS has the penalty that the embedded information lowers the rate, which becomes bigger as the duplication increases.

Table 3. Comparison of simulated and measured deduplication rate
Operational chunk size | Measured | Simulated | Difference [%]
...                    | ...      | ...       | ...

Figure 2. Deduplication rate by single layer deduplication system.

4 Evaluation

4.1 Dependency on duplication rate

Figure 3 shows the correlation between the deduplication rate and the number of reduced chunks when the duplication rate is varied over 0.2, 0.4, 0.5, 0.6, 0.8 under the same differed-area length of 32KB. The number of chunks removed by SIM increases as the duplication rate increases or the chunk size decreases. A criterion is useful to analyze the effects of changing chunk sizes. As shown in the figure, as the chunk size decreases starting from 32KB, the number of reduced chunks becomes 3.2 times as large from (32) to (16) while the deduplication rate improves by 60%. Then the number of reduced chunks becomes 2.5 times as large from (16) to (8) while the deduplication rate improves by 30%, and successively 2.0 times as large with only a 3% improvement from (4) to (2). The improvement ratio decreases as the chunk size decreases. This means that decrementing the chunk size from a bigger chunking situation gains more improvement with less penalty. At the same time, the improvement is clearer as the duplication rate decreases.

Figure 3. Reduced number of chunks by duplication rate (length of differed area = 32KB).
The deduplication rate and the fulfillment rate for each chunk size increase as the duplication rate increases. This is because the deduplication works more efficiently as the potential for a mixture of identical and differed areas within the same chunk becomes lower with increasing duplication rate.

4.2 Dependency on length of differed area

Figure 4 shows the correlation between the deduplication rate and the number of reduced chunks when the length of the differed areas is varied over 2, 4, 8, 16, 32KB under the same duplication rate of 0.6. As the length of the differed areas increases, the corresponding number of reduced chunks increases and the deduplication rate also increases for the same chunk size. This comes from the decrease of deduplication inefficiency, because the potential for a mixture of identical and differed areas in a chunk becomes lower as the length of the differed areas increases.

The improvement of the deduplication rate with decreasing chunk size becomes smaller as the length of the differed areas increases, because the precision advantage of reduction with small chunks degrades. The number of reduced chunks, and therefore the deduplication rate, increases as the length of the differed areas increases for all chunk sizes, and the improvement becomes smaller for small chunk sizes. For example, when the length of the differed areas increases from 2KB to 32KB, the improvement of the deduplication rate is 1.8 times in the case of (2), 3.3 times for (4), 9.5 times for (8), and so on. This indicates that assigning a bigger chunk size is more effective than a smaller one when the length of the differed areas varies widely within the data.

Figure 4. Reduced number of chunks by length of differed area (duplication rate = 0.6).
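An end-to-end sweep that produces one (deduplication rate, reduced chunks) point per chunk size, in the style of the trade-off curves of Figures 3 and 4, can be sketched as follows. This is an assumed Python sketch, not the paper's code: the differed-area layout generator is hypothetical, and fixed-size chunks stand in for the variable chunker to keep the sketch short.

```python
def sweep(total_len=1 << 20, dup_rate=0.6, differed_len=32768,
          sizes=(2048, 4096, 8192, 16384, 32768)):
    """Return {chunk size: (dedup rate, removed chunk count)}."""
    # Evenly spaced differed areas covering roughly (1 - dup_rate) of the stream.
    n_areas = int(total_len * (1 - dup_rate)) // differed_len
    gap = total_len // max(n_areas, 1)
    differed = [(i * gap, i * gap + differed_len) for i in range(n_areas)]
    results = {}
    for m in sizes:
        removed_bytes = removed_chunks = 0
        # Fixed-size chunks stand in for the variable chunker here.
        for start in range(0, total_len, m):
            end = min(start + m, total_len)
            # Remove the chunk only if it overlaps no differed area.
            if not any(s < end and e > start for s, e in differed):
                removed_bytes += end - start
                removed_chunks += 1
        results[m] = (removed_bytes / total_len, removed_chunks)
    return results
```

Plotting the resulting pairs for each size reproduces the qualitative shape of the curves: smaller chunks yield a higher deduplication rate but many more removed chunks, i.e. a higher processing penalty.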
4.3 Assignment of optimal chunk size

As described, there is a trade-off between the number of reduced chunks and the deduplication rate. Therefore, when the problem of assigning the optimal chunk size is defined as a programming problem with two objectives, maximizing the deduplication rate and minimizing the number of reduced chunks, it generates Pareto-optimal solutions, because there exists no feasible solution in which one objective can be improved without a simultaneous degradation of the other. In Figures 3 and 4, each solid line represents one corresponding set of Pareto-optimal solutions.

In general, which chunk size is best depends on the customer's choice, considering various factors and their own priorities. Without neglecting practical priorities, this paper simplifies the customer's factors and priorities into two costs. One factor is the operational cost, which includes the cost of resources such as servers, LAN (Local Area Network) and SAN (Storage Area Network) equipment, utilities, and the cost of management. The other factor is the capital cost, such as storage and maintenance. The priority is to minimize the total cost. Here the cost is defined by the following function, and the optimal chunk size is assigned to minimize this cost:

C = A * ReducedChunks - B * DedupedRate

where ReducedChunks is the number of reduced chunks and DedupedRate is the deduplication rate.

Figure 3 shows an example of the cost function. The dotted line indicates the cost for the corresponding deduplication rate and number of reduced chunks; A is set to 1 and B to ... In the case of a duplication rate of 0.6, the chunk size that provides the minimum cost is 8KB.

Figure 5 shows the chunk size that minimizes the cost. The optimal chunk size increases as the duplication rate increases. This comes from the characteristic of bigger chunks, which reduce coarsely with a short processing time. Furthermore, the optimal chunk size increases as the length of the differed areas increases, for the same reason. In addition, the difference between the optimal chunk sizes increases as the duplication rate increases, because the effectiveness of using bigger chunks becomes weak as the duplication rate increases when the differed areas are short.

Figure 5. Optimal chunk size to minimize the cost.

In commercially available products, 4KB is a popular chunk size. Table 4 shows the improvement for duplication rates of 0.2, 0.4, 0.6, 0.8 and differed-area lengths of 8, 32, 64KB. The improvement of the optimal chunks increases as the duplication rate and the length of the differed areas increase, reaching 33% in the case of 0.8 and 64KB.

Table 4. Improvement of cost value by the optimal chunk size relative to the conventional 4KB
Length of differed area |         | r=0.2 | r=0.4 | r=0.6 | r=0.8
8KB                     | Optimal | (3)   | (4)   | (5)   | (8)
                        | Improve | 27%   | ...   | ...   | ...
32KB                    | Optimal | (4)   | (7)   | (12)  | (15)
                        | Improve | ...   | ...   | ...   | ...
64KB                    | Optimal | (6)   | (8)   | (13)  | (19)
                        | Improve | ...   | ...   | ...   | 33%

5 Conclusion

This paper simulates a variable chunking algorithm in a backup deduplication system. First, the simulation clarifies that changing the chunk size is more beneficial for bigger chunks than for smaller chunks, gaining more improvement with less penalty; in addition, the improvement increases as the duplication rate decreases. Next, assigning a bigger chunk size is more effective than a smaller one when the length of the differed areas varies widely within the data. Finally, under a reasonable assumption of minimizing the cost, the optimal chunk size is 3KB in the case of a 0.2 duplication rate and an 8KB differed-area length, with a 27% improvement over a fixed 4KB chunk size; similarly, the optimal chunk size is 19KB in the case of 0.8 and 64KB, with an improvement of 33%.

References

- Y. Tan et al.
Dam: A data ownership-aware multi-layered de-duplication scheme. In Proceedings of the 2010 Fifth IEEE International Conference on Networking, Architecture and Storage.
- Y. Won, J. Ban, J. Min, L. Hur, S. Oh, and J. Lee. Efficient index lookup for de-duplication backup system. In MASCOTS: IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems (poster presentation).
- U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference.
- A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), Banff, Canada.
- M. O. Rabin. Fingerprinting by random polynomials. Technical report, Department of Computer Science, Harvard University.
- D. Meister and A. Brinkmann. Multi-level comparison of data deduplication in a backup scenario. In Proceedings of SYSTOR 2009: The 2nd Annual International Systems and Storage Conference. ACM.
- G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies.
- J. Min, D. Yoon, and Y. Won. Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers, Vol. 60, No. 6.
- M. Ogata and N. Komoda. Optimized assignment of deduplication backup methods using integer programming. In Proceedings of JCIS 2011: The 4th Japan-China Joint Symposium on Information Systems.