
Cloud De-duplication Cost Model

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Christopher Scott Hocker

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Master's Examination Committee:

Dr. Gagan Agrawal, Advisor

Dr. Christopher Stewart

Copyright by Christopher Scott Hocker 2012

Abstract

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing the scalability needed to meet the elastic nature of application and infrastructure requirements. Using de-duplication algorithms within cloud resources is the next logical step to increase the efficiency and reduce the cost of cloud computing. Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory availability in different types of instances, we also develop a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Dedication

Dedicated to those who supported me throughout my academic career: my Wife, Parents, Brother, Sister, and Friends.

Acknowledgments

First, I would like to thank my advisor, Dr. Gagan Agrawal, for challenging me and providing guidance from the beginning of my time at Ohio State. The support and advice he provided was invaluable. Additionally, I would like to thank my thesis committee member, Dr. Christopher Stewart, for his time and participation during this work. I would also like to thank my Wife for her support and understanding of the long hours during this work and throughout my entire academic career. Finally, I would like to thank the rest of my support system, my Parents, Brother, Sister, and Friends, who were always there to provide a word of encouragement and motivation.

Vita

Vandalia Butler High School
B.S. CS, Wright State University
2008 to present: M.S. CSE, The Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: De-duplication Algorithms
Chapter 3: Memory Prediction
Chapter 4: Experimental Evaluation
Chapter 5: Related Research
References

List of Tables

Table 1: fs-c Algorithm Chunk Selection
Table 2: Fixed Index Memory Estimates vs. Actual
Table 3: Variable Index Memory Estimates vs. Actual
Table 4: Small Dataset Results
Table 5: EC2 m1.small Instance Small Dataset Results
Table 6: EC2 c1.medium Instance Small Dataset Results
Table 7: EC2 Instance Large Dataset Results
Table 8: AWS EC2 Pricing
Table 9: AWS S3 Pricing
Table 10: Instance Cost Assessment
Table 11: m1_small vs. m1_large Instance Cost
Table 12: c1_medium vs. m1_large Instance Cost

List of Figures

Figure 1: Out-of-Band vs. In-band De-duplication
Figure 2: Basic Sliding Window Algorithm [8]
Figure 3: TTTD Pseudo Code [8]
Figure 4: Chunk Distribution of TTTD Algorithm [9]
Figure 5: TTTD-S Chunk Distribution Improvements [9]
Figure 6: TTTD-S Algorithm Pseudo Code
Figure 7: De-duplication Ratio and Percent Savings [19]
Figure 8: FS-C Chunk Distributions

Chapter 1: Introduction

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing the scalability needed to meet the elastic nature of application and infrastructure requirements.

As companies begin to transition their data and infrastructure to the cloud, the cost-saving methods they are accustomed to, such as de-duplication, should transition as well. The increased efficiency in usable capacity gained through de-duplication translates into a positive impact on the cloud pay-as-you-go model. De-duplication does come with a tradeoff of additional compute resources required to analyze the data for duplicates, so selecting the right instance type to run a given de-duplication algorithm is an important aspect. Therefore, we examine the resources and cost factors related to cloud environments and how de-duplication can be effectively integrated.

Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms.

Since the main factor impacting the computing cost is the memory availability in different types of instances, we also develop a methodology for estimating the memory requirements for executing a given algorithm on a particular dataset. Through experiments, we show that running a more aggressive de-duplication algorithm on a larger cloud compute instance can maximize cost savings compared with running a less aggressive algorithm on a smaller instance type. In some cases the dataset size does not warrant a larger instance type to run more granular de-duplication algorithms, since the index memory requirements are satisfied by the entry-level instance types. In these situations there is no benefit to choosing a larger instance type in an effort to reduce cloud resource cost.

Thesis Statement

Integrating de-duplication effectively and efficiently into a cloud environment requires an understanding of the resource requirements, specifically the memory requirements and the tradeoff in compute cost for processing data for duplicates at various levels of granularity.

Contributions

This thesis makes the following contributions:

1. Proposes a methodology to predict the required cloud instance type, based on memory requirements, to run popular de-duplication algorithms on a given dataset.

2. Analyzes cloud compute requirements for running de-duplication algorithms at varying chunk granularity.

3. Evaluates cost factors associated with running de-duplication in a cloud environment, including compute instance types, de-duplication algorithm chunk granularity, and data storage durations.

Chapter 2: De-duplication Algorithms

The implementation of data de-duplication technologies varies in terms of de-duplication placement, the timing of the de-duplication of data, and the granularity at which data is analyzed to find duplicate data. The placement of the de-duplication process can occur at the client or the target device [4]; additionally, a hybrid approach using both the client and target also exists. The timing of the de-duplication process is either in-band, as the data is received/sent, or out-of-band at a scheduled time.

Figure 1: Out-of-Band vs. In-band De-duplication

The placement and timing are key components of de-duplication algorithms; however, in this research we focus on client-based de-duplication algorithms and out-of-band timing, which allow us to integrate de-duplication into an existing cloud environment and analyze the data in place. We will explore in more detail the granularity at which duplicates are detected, specifically at fixed and variable block levels. Finally, we will provide a brief explanation of de-duplication ratios and the contributing factors to the ratio.

Duplicate Detection

For duplicate detection there are three main approaches: whole file (often called single instance storage), sub-file chunking (comprised of fixed and variable block hashing), and delta encoding.

For whole file detection, entire files are given a hash signature using MD5 or SHA-1 [6]. Files with identical hash signatures are assigned a pointer to the single file instance previously stored. In certain algorithms a byte-by-byte comparison is performed to eliminate the potential for hash collisions, which is often a concern with hash-based comparisons [27]. Lower de-duplication ratios are generally obtained because the unit of matching is large; any small change in the file alters the hash and breaks any previous match.

The second and more popular approach is sub-file chunking [4]; fixed and variable block hashing are the two types. Fixed block de-duplication chunks and hashes a byte stream based on a fixed block size. The hash signatures are stored and referenced in a global index, which is implemented using a bloom filter type data structure to quickly identify new unique block segments [14]. If the block signature already exists in the index, a pointer to the existing block is created; otherwise the signature is stored and the block is written to disk. In contrast, variable block algorithms use methods such as Rabin fingerprinting [1], a hashing algorithm that uses a sliding window to determine the natural block boundaries with the highest probability of matching other blocks. Variable block algorithms still employ the bloom filter based data structure for the in-memory index [11].
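To make the exact matching lookup concrete, the following is a minimal Python sketch of fixed block duplicate detection. It is not the implementation used in this work: the chunk size, the file path, and the use of an ordinary set in place of a bloom-filter-backed index are illustrative assumptions.

import hashlib

def fixed_block_dedup(path, chunk_size=8192):
    # Scan a file in fixed-size chunks and report total vs. unique bytes.
    index = set()                 # stands in for the bloom-filter-backed chunk index
    unique_bytes = total_bytes = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
            signature = hashlib.sha1(chunk).digest()   # 20-byte SHA-1 signature
            if signature not in index:                 # unique chunk: record and "store" it
                index.add(signature)
                unique_bytes += len(chunk)
            # otherwise a duplicate: a real system would store only a pointer
    return total_bytes, unique_bytes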

Variable block hashing proves to be a more efficient approach compared to fixed and whole file hashing, even in the presence of the slight variations or offsets that exist due to modifications in similar files and blocks [4].

Exact and similarity matching techniques are used in the sub-file hashing algorithms. Exact matching algorithms examine the chunk hash index for an exact signature match. For exact matching algorithms, the block size has a direct impact on the hash index size, which can present a problem when storing the index in memory [29]. An example was provided in [2], where the index required for 1 Petabyte of de-duplicated data, assuming an average block size of 8KB, would require 128GB of memory at a minimum. Similarity matching techniques address the index size by increasing the block size, to 16MB in [2], which in turn reduces the index size to 4GB for the 1 Petabyte of data. A similarity signature approach consists of a number of block signatures based on a subset of the chunk bytes. If a similarity signature matches more than some threshold of block signatures, then there is a reasonable probability that the two chunks will share common block-based signatures [2]. Thus a match is found, and the new similarity chunk is compared and de-duplicated against the similar chunks [29]. The tradeoff with the similarity matching technique is that de-duplication performance is highly dependent upon the speed of the de-duplication repository storage, as the repository is referenced for similarity block signature matches. Also, since each chunk is only compared against a limited number of other chunks in similarity matching, occasionally duplicates are stored [29].

Finally, delta encoding (also called data differencing) processes files against a reference file for differences, storing the deltas in patch files [17].

Selecting the previously stored reference file is a key operation in delta encoding algorithms; it is often selected based on a fingerprinting technique similar to that of the whole file and sub-file chunking algorithms [4].

The sub-file chunking approach with exact matching, as used in variable block algorithms, is able to detect the varying offsets of data blocks. This addresses the boundary shifting concerns of fixed block or whole file algorithms. The tradeoff comes in terms of the additional resources required to maintain the metadata associated with the increased number of chunks seen with variable block approaches. Additionally, index lookups increase during the variable chunk detection process. Even with the increased resource requirements, the majority of current algorithms use variable block sub-file hashing with exact matching to maximize efficiency and overall de-duplication ratios.

De-duplication Implementations

The process is straightforward for whole file hashing and fixed block hashing: perform the hash index lookup based on the SHA-1 [6] or MD5 hash value of the file or fixed block to determine if the data is unique. If a duplicate is detected in the hash lookup, optionally perform a byte-by-byte comparison, then modify the file metadata to reference the previously stored data. If the data is unique, record the hash value, perform local compression (e.g., LZ, gzip) on the unique block, and store only that data [13].

For the variable block, otherwise known as content-based chunking, approaches there are several algorithms that vary in their implementation, overall performance, and effectiveness in identifying duplicate chunks.

The main motivation behind the variable block algorithms is elimination of the boundary shifting problem [18]. If a small modification is made to a file, the chunk boundaries for whole file and fixed block chunking shift, causing poor duplicate detection against the file or data. The low bandwidth network filesystem [18] first introduced the basic sliding window (BSW) approach, which takes three parameters as inputs: a fixed window size W, an integer divisor D, and an integer remainder R. A window of size W shifts one byte at a time from the beginning of the file to the end. Fingerprints (h) of the window contents are generated with Rabin fingerprinting; Rabin introduced the idea of detecting the natural block boundaries in a byte stream and assigning the variable chunks a signature [1]. The algorithm then tests whether (h mod D) = R; if true, a D-match has been found and the current position is set as a breakpoint for that chunk. The parameter D can be configured to make the chunk size as close to the data expectations as possible to maximize the de-duplication. The parameter R must be between 0 and D-1, and is most often configured as D-1. Figure 2 provides a visual representation of the basic sliding window approach.

Figure 2: Basic Sliding Window Algorithm [8]
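The sketch below illustrates the basic sliding window idea in Python. A simple byte-sum rolling hash is used only as a stand-in for Rabin fingerprinting, and the window, divisor, and remainder values are illustrative defaults rather than the parameters used in [18].

def bsw_breakpoints(data, window=48, divisor=2048, remainder=2047):
    # Slide a fixed-size window one byte at a time and declare a breakpoint
    # wherever (hash of window contents) mod divisor == remainder (a D-match).
    breakpoints = []
    rolling = 0
    for pos in range(len(data)):
        rolling += data[pos]                 # byte entering the window
        if pos >= window:
            rolling -= data[pos - window]    # byte leaving the window
        if rolling % divisor == remainder:   # D-match: current position becomes a breakpoint
            breakpoints.append(pos)
    return breakpoints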

Problems presented by the basic sliding window approach include the large chunk sizes generated when a match is not detected and the data is chunked at the window size. This leads to boundary shifting problems when small modifications are made, making the large chunk matches more difficult. Additional improvements resulted in the introduction of the two divisor (TD) algorithm [8]. This algorithm addressed the issue with the basic sliding window algorithm by introducing a second divisor (S) that is smaller than D, which increases the chance of a match. Both D and S are calculated at each byte shift to increase the chances of a chunk match, decreasing the number of large chunks. Using the BSW or TD algorithms, the chunk size is only upper bounded, so chunks can vary greatly in size. Small chunk sizes greatly increase the number of chunks, which is directly related to the memory overhead required for exact matching techniques [29].

The two threshold two divisor (TTTD) algorithm was developed to address the range of chunk sizes generated during duplicate detection [8]. TTTD added two threshold parameters to the BSW and TD algorithms which control the upper (Tmax) and lower (Tmin) bounds of chunk sizes [8]. Data fingerprints are not generated until the minimum byte threshold is met, addressing the overhead issues related to small chunk sizes while still addressing the boundary shifting concerns of chunking at large chunk sizes.

int p = 0;            // current position
int l = 0;            // position of last breakpoint
int backupbreak = 0;  // position of backup breakpoint

for ( ; !endoffile(input); p++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (p - l < tmin) {
        // not at minimum size yet
        continue;
    }
    if ((hash % Ddash) == Ddash - 1) {
        // secondary divisor: remember a backup breakpoint
        backupbreak = p;
    }
    if ((hash % D) == D - 1) {
        // we found a breakpoint before the maximum threshold
        addbreakpoint(p);
        backupbreak = 0;
        l = p;
        continue;
    }
    if (p - l < tmax) {
        // we have failed to find a breakpoint, but we are not at the maximum yet
        continue;
    }
    // When we reach here, we have not found a breakpoint with the main divisor
    // and we are at the threshold. If there is a backup breakpoint, use it;
    // otherwise impose a hard threshold.
    if (backupbreak != 0) {
        addbreakpoint(backupbreak);
        l = backupbreak;
        backupbreak = 0;
    } else {
        addbreakpoint(p);
        l = p;
        backupbreak = 0;
    }
}

Figure 3: TTTD Pseudo Code [8]

Additional studies found that when the maximum threshold (Tmax) of TTTD is reached, only the most recent secondary divisor (S) breakpoint is used for chunking. Therefore, all earlier secondary divisor breakpoints are not considered, causing a large distribution of chunk sizes. See Figure 4 for the chunk distribution of the TTTD algorithm.

Figure 4: Chunk Distribution of TTTD Algorithm [9]

The chunk distribution contains two groupings (as seen in the figure above): the first around the chunk size expected from the main divisor (D), and the second near the maximum chunk threshold, where a match was not discovered and the previous secondary divisor (S) breakpoint was used. TTTD-S [9] improves upon the large spread of the second chunk grouping by introducing a switchP value that is set to 1.6 times the expected chunk size [9]. Once the current chunk size has reached switchP, the divisors are reduced by half to shorten the match process. This in turn helps to find a breakpoint before the maximum chunk threshold is reached. Additionally, it improves the distribution and brings the second chunk grouping closer to the average chunk size detected by the main divisor. Figure 5 illustrates the improvements in the chunk distribution made by the switchP parameter introduced in the TTTD-S algorithm, further reducing the chances of boundary shifting conditions arising from data modifications.

Figure 5: TTTD-S Chunk Distribution Improvements [9]

int currp = 0, lastp = 0, backupbreak = 0;

for ( ; !endoffile(input); currp++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (currp - lastp < mint) {
        continue;
    }
    if (currp - lastp > switchp) {
        switchdivisor();
    }
    if ((hash % secondd) == secondd - 1) {
        backupbreak = currp;
    }
    if ((hash % maind) == maind - 1) {
        addbreakpoint(currp);
        backupbreak = 0;
        lastp = currp;
        resetdivisor();
        continue;
    }
    if (currp - lastp < maxt) {
        continue;
    }
    if (backupbreak != 0) {
        addbreakpoint(backupbreak);
        lastp = backupbreak;
        backupbreak = 0;
        resetdivisor();
    } else {
        addbreakpoint(currp);
        lastp = currp;
        backupbreak = 0;
        resetdivisor();
    }
}

Figure 6: TTTD-S Algorithm Pseudo Code

Additional algorithms exist that use a hybrid approach, incorporating variable and fixed block techniques as well as small chunk merging techniques to reduce the number of small chunks, thereby reducing the associated overhead. Compression is also often used in conjunction with de-duplication to increase storage space utilization.

Data de-duplication algorithms have been extensively researched. The techniques available today vary depending on the de-duplication placement, the timing of the detection process, and the granularity at which duplicates are discovered [4]. Regardless of the technique, the overall effectiveness of any de-duplication algorithm remains data dependent [30]. The fs-c algorithm [7] used in our research is based on the TTTD algorithm, using Rabin [1] fingerprinting to generate the natural chunk boundaries.

De-duplication Savings

In addition to the inner workings of the de-duplication algorithms, the data characteristics help in understanding the space savings obtained by the various algorithms. Several factors, such as data type, scope of the de-duplication, and data storage period, play a role in overall de-duplication savings. De-duplication savings are often stated as a ratio of the number of input bytes to the de-duplication process divided by the number of bytes of output [19]. Figure 7 depicts the conversion of de-duplication ratios to percentages. In our studies we will use percentages to eliminate any confusion and make the overall savings more apparent.
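For reference, the relationship depicted in Figure 7 can be written directly; the byte counts below are made-up values used only to illustrate the conversion.

def dedup_ratio(bytes_in, bytes_out):
    # ratio of input bytes to output bytes, e.g. 4.0 is usually written as 4:1
    return bytes_in / bytes_out

def percent_savings(bytes_in, bytes_out):
    # fraction of the input that did not have to be stored
    return (1.0 - bytes_out / bytes_in) * 100.0

# Example: 1000 GB presented to the de-duplication process and 250 GB stored
# gives a 4:1 ratio, which is the same as 75% space savings.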

Figure 7: De-duplication Ratio and Percent Savings [19]

Data file types are one component that has an impact on de-duplication savings expectations. For example, files generated by humans in applications such as text documents, spreadsheets, and presentations often contain a large amount of redundant data, while data generated by a computer system, such as images, media, and archived files, often has less redundancy due to the random nature of the data [19]. The scope of de-duplication refers to the range of datasets examined during duplicate detection. For example, global de-duplication allows detection of duplicates across multiple data sources, which can span multiple storage systems or locations [19]. Conversely, de-duplication across just a single appliance or within a single client's data only looks at the data contained within that appliance or client, creating silos of de-duplication stores. In general, the larger the data scope for duplicate detection, the higher the expected space savings. Data storage periods affect the de-duplication savings by increasing the chance of exploiting temporal redundancy.

For example, in a backup type scenario, where temporal data is accumulated over time due to the versioning nature of backup applications, ratios are expected to be higher. In a primary storage scenario, spatial data exists across a broad spectrum of data types, which leads to lower de-duplication ratios overall.

This chapter has outlined data de-duplication approaches in terms of de-duplication placement, timing, and the granularity at which data is analyzed to find duplicates. These approaches, combined with the algorithms outlined above, provide a foundation for investigating de-duplication resource considerations within a cloud environment.

Chapter 3: Memory Prediction

To examine the tradeoffs between compute and storage cost with the addition of de-duplication, we need a method to estimate the instance type required to execute a given de-duplication algorithm. Since the main factor impacting the computing cost is the memory availability in different types of instances, we developed a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Estimation Method

With de-duplication there is no one-size-fits-all configuration. Depending on the application and the resources available, certain algorithms might be more effective than others. One resource consideration is the total index memory required to store the index of unique data signatures. For both fixed and variable block exact matching implementations, an in-memory index is used for data signature lookups when determining duplicate blocks. Some techniques use similarity signatures and increase the chunk size to control the index memory size, the tradeoff being an increased reliance on the speed and responsiveness of the de-duplication store for data comparisons. Therefore, our focus when utilizing cloud resources is on the exact matching techniques, where the memory size of the index is a concern. Providing a means to estimate the index size is important when sizing system requirements for a de-duplication implementation.

Memory size estimates for both fixed and variable block algorithms follow the same formula, varying only in how the specific variables are derived. We provide the following formula for the basic memory requirement estimate:

Memory Size Estimate = (Data Size / Chunk Size) * (1 - De-duplication %) * (Signature Bytes)

To estimate the index memory size for a given dataset, the following variables have to be determined or estimated:

Data size - the total dataset size that is targeted for de-duplication.

Chunk size - for a fixed chunking algorithm, the size of the chunk used in the de-duplication implementation.

De-duplication percentage - based on the percentage seen during a sample run on a subset of the dataset. From our testing, a sample size of 10 to 15% provides a good sample for the various data types we tested. The de-duplication percentage estimates are on par with similar measurement results in other studies for the given data types [4] [7].

Signature bytes - the number of bytes used for a chunk's signature hash. In most implementations a 20 byte SHA-1 hash signature for each chunk is used for the collision resistant properties that SHA-1 provides [6].

Variable block index memory estimates are more complex since the chunk sizes are not static but form a distribution between the minimum and maximum chunk sizes set at execution time. Figure 8 shows the distribution based on the testing performed using the fs-c algorithm [7] on multiple datasets. The fs-c algorithm uses the TTTD approach to variable block de-duplication.

The CDC32 (content defined chunking) algorithm has an expected (average) block size of 32KB, a lower threshold (Tmin) of 8KB, and an upper threshold (Tmax) of 128KB. The threshold proportions remain consistent for the CDC16, CDC8, and CDC4 algorithms. The following table outlines the different fixed and variable algorithms used in the fs-c algorithm [7] tests.

Chunker | Type     | Average Chunk Size (bytes) | Minimum Chunk Size (bytes) | Maximum Chunk Size (bytes)
Fixed8  | Fixed    | 8192                       | -                          | -
Fixed16 | Fixed    | 16384                      | -                          | -
Fixed32 | Fixed    | 32768                      | -                          | -
CDC4    | Variable | 4096                       | 1024                       | 16384
CDC8    | Variable | 8192                       | 2048                       | 32768
CDC16   | Variable | 16384                      | 4096                       | 65536
CDC32   | Variable | 32768                      | 8192                       | 131072

Table 1: fs-c Algorithm Chunk Selection

Figure 8: FS-C Chunk Distributions (percentage of chunks by block size, from the minimum to the maximum, for the Office 1 and Office 3 datasets under the CDC32, CDC16, CDC8, and CDC4 chunkers)

Based on the Figure 8 distributions, the percentage of data chunks between the minimum and average block size is 50% to 55% of the total unique chunks, which in terms of the total data size is 20-25%. We can derive the total number of chunks based on these observations. In the worst case scenario (largest number of chunks), we assume that 25% of the data is chunked at the minimum block size and the remaining 75% of the data chunks just above the average chunk size. In the best case scenario (smallest number of chunks), 25% of the data chunks at the average chunk size and the remaining 75% at the maximum chunk size.

Worst Case Total Chunks = ((0.25 * DataSize) / Min Block Size) + ((0.75 * DataSize) / Average Block Size)

Best Case Total Chunks = ((0.25 * DataSize) / Average Block Size) + ((0.75 * DataSize) / Max Block Size)

As an example, for a dataset size of 100GB (107,374,182,400 bytes), chunking at a variable block size of 16KB (4KB lower threshold, 64KB upper threshold), with an estimated de-duplication percentage of 25% and a signature size of 20 bytes, the memory requirement range is:

Worst Case Total Chunks = ((0.25 * 107,374,182,400) / 4096) + ((0.75 * 107,374,182,400) / 16384) = 6,553,600 + 4,915,200 = 11,468,800 Chunks

Best Case Total Chunks = ((0.25 * 107,374,182,400) / 16384) + ((0.75 * 107,374,182,400) / 65536) = 1,638,400 + 1,228,800 = 2,867,200 Chunks

From the worst and best case chunk estimates, we can now utilize the memory estimation formula presented earlier to estimate the minimum and maximum memory requirements for the index when running the CDC16 algorithm against the 100GB dataset.

Minimum Memory Requirements = 2,867,200 * (1 - 0.25) * (20) = 43,008,000 bytes ~ 42MB

Maximum Memory Requirements = 11,468,800 * (1 - 0.25) * (20) = 172,032,000 bytes ~ 165MB

Therefore, the memory requirements for our 100GB dataset are in the range of approximately 42MB to 165MB.
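The estimation procedure above can be collected into a short Python sketch. It simply encodes the formulas from this chapter; the assumption that Tmin and Tmax sit at one quarter and four times the average chunk size mirrors the fs-c parameter proportions described earlier.

def index_memory_fixed(data_size, chunk_size, dedup_pct, sig_bytes=20):
    # Basic index memory estimate for a fixed chunk size, in bytes.
    return (data_size / chunk_size) * (1 - dedup_pct) * sig_bytes

def index_memory_variable(data_size, avg_chunk, dedup_pct, sig_bytes=20):
    # Best/worst case index memory range (in bytes) for a TTTD-style variable
    # chunker with Tmin = avg/4 and Tmax = 4*avg, using the observed
    # 25% / 75% split of the data between the small and large chunk groupings.
    t_min, t_max = avg_chunk // 4, avg_chunk * 4
    worst_chunks = 0.25 * data_size / t_min + 0.75 * data_size / avg_chunk
    best_chunks = 0.25 * data_size / avg_chunk + 0.75 * data_size / t_max
    per_chunk = (1 - dedup_pct) * sig_bytes
    return best_chunks * per_chunk, worst_chunks * per_chunk

# The 100GB CDC16 example: 16KB average chunks, 25% estimated de-duplication.
low, high = index_memory_variable(100 * 2**30, 16 * 1024, 0.25)
# low is 43,008,000 bytes and high is 172,032,000 bytes, the roughly
# 42MB to 165MB index memory range worked out above.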

Validation of Method

We performed experiments with small (150GB or less) and large (500GB or more) datasets using both the fixed and variable algorithms to test how well the memory estimation formula applies to real world scenarios.

Fixed Block Index Memory Requirements

Table 2: Fixed Index Memory Estimates vs. Actual (Dataset Size (GB), Algorithm, Estimated Minimum Memory (MB), Actual Memory (MB), % Error)

Using the fs-c [7] fixed chunking algorithms, we tested an office type dataset extracted from a corporate office file share environment. To obtain our memory estimates we assumed the de-duplication percentage to be at or around the 5% mark for both the small and large datasets. This percentage was obtained from a sample run of the fixed algorithm on a dataset a fraction of the size. Additionally, the SHA-1 [6] data signature size of 20 bytes was selected at execution time. Based on our assumptions of the de-duplication percentage and the parameters selected at run time (signature size, average chunk size), the memory estimates calculated from the formula presented previously were within 8% of the actual memory requirements. The estimate error is dependent on the assumed versus actual percentage of de-duplication, and is only improved by using a larger sample size in the de-duplication percentage estimate [30].

The variable block chunking experiments again used the same dataset as the fixed block tests and assumed the chunk distribution discussed previously to obtain the estimated range for the index memory. The de-duplication percentage estimates used for the CDC16 and CDC8 algorithms were 15% and 20%, respectively. These estimates were obtained from local sample runs on the dataset. The SHA-1 data signature size was again set to 20 bytes at execution time [6]. The minimum and maximum block thresholds set by the fs-c [7] algorithms were 4KB and 64KB for the CDC16 algorithm and 2KB and 32KB for the CDC8 algorithm.

Variable Block Index Memory Requirements

Table 3: Variable Index Memory Estimates vs. Actual (Dataset Size (GB), Algorithm, Minimum Memory (MB), Maximum Memory (MB), Actual Memory (MB))

Based on the assumptions, the chunk distributions, and the parameters set at execution time, the actual memory requirements for the variable block executions on the small and large datasets were within the estimated range for the index memory, trending toward the higher end of the range for both datasets. For the variable algorithms, the index memory estimates are based not only on the de-duplication percentage estimate but also on the best and worst case chunk distribution estimates.

Resource considerations regarding cloud instance type selection, centered on the required index memory, have been examined in relation to the chunking algorithm selected for duplicate detection. A methodology for estimating memory requirements was presented and tested against real world datasets. From our tests performed on the corporate file share datasets, the index memory estimates presented for both fixed and variable block algorithms provide good guidance for sizing the compute instance required to perform de-duplication at sub-file granularity. We can now proceed with our experimental evaluation of the cost tradeoffs between compute and storage when introducing de-duplication algorithms in a cloud environment.

Chapter 4: Experimental Evaluation

In our experimental evaluation of de-duplication in a cloud-based environment, we look at the following factors: the dataset size, the cloud compute instance requirements, and the length of time the data is going to be retained in the cloud, in order to analyze the potential cost avoidance from performing fixed and variable de-duplication detection on a given dataset. We performed our analysis on the Amazon Web Services offerings, using Elastic Compute Cloud (EC2) for the compute platform and Simple Storage Service (S3) for the storage infrastructure. The standard small and large instance types, along with the high-CPU medium instance, were used in our testing. Below is a recap of the resource specifications:

o Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of local instance storage, 32-bit platform [26]

o Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform [26]

o High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of local instance storage, 32-bit platform [26]

Amazon defines one EC2 Compute Unit (ECU) as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [26]. Additionally, Amazon's Linux AMI operating system was selected for the instance builds.

The cost analysis is based on Amazon's pricing for the US East region, where our testing was performed.

We used the fs-c algorithm developed by [7], with reporting and statistics-gathering modifications, as our de-duplication engine. The fs-c algorithm has both fixed and variable chunking options. Chunk size options vary from 2KB to 32KB for both fixed and variable algorithms. The variable chunking approach is based on the two threshold two divisor algorithm [8], using Rabin fingerprinting [1] to determine the natural content boundaries. Additionally, the fs-c approach is an out-of-band approach to de-duplication, which allows the analysis of data in place.

Our initial evaluation centered on small datasets extracted from a corporate file share environment that were 300GB or less in size. We grouped our datasets into the following data classifications:

Office data types: Microsoft Word (doc, docx), Excel (xls, xlsx), PowerPoint (ppt, pptx), Adobe Portable Document Format (pdf), rich text documents (rtf)

Database file types: Microsoft SQL master database files (mdf), Microsoft Access (mdb)

Virtual machine data files: VMware virtual machine (vmdk) files

Media files: JPEG, GIF, PNG, MP3, MP4, WAV

As a first step we performed testing on a local system on the above datasets to gauge de-duplication percentages and instance type requirements. This allowed us to determine the dataset to focus on when moving the testing to cloud resources.

Below is a summary of the results from the first dataset of each type against both fixed and variable (CDC) algorithms using various block sizes. The algorithm names follow a format of chunk type followed by a number that indicates the average or fixed chunk size used. For example, the cdc8 algorithm uses an average chunk size of 8KB with a lower threshold of 2KB and an upper threshold of 32KB. The lower and upper bound thresholds remain proportionally consistent with the average chunk size for the other variable (CDC) algorithms. Refer to Table 1 for algorithm specifications. The local system specifications used in our initial testing were as follows:

Hardware: HP DL580 G5
CPU: 2 x Dual Core Intel(R) Xeon(R) 3.00GHz
Memory: 8GB
Hard drives: 2 x 73GB 10K SAS drives, RAID 1 for OS
OS: Ubuntu (x64)
Data storage: EMC VNXe 3100

The datasets were stored on an EMC VNXe 3100 and accessed via NFS. Each dataset was run in isolation from the others, eliminating any competition for resources.

Table 4: Small Dataset Results (Algorithm, # of Chunks, Memory Requirements (MBs), % De-duplication, Execution Time, Total Size for the Office, VMDK, DB, and Media datasets under the cdc and fixed chunkers; execution times were roughly 32-34 minutes for the Office dataset, 57-64 minutes for VMDK, 44-47 minutes for DB, and 30-35 minutes for Media)

As expected, the variable algorithms overall were able to find more redundancy within each dataset type.

The office dataset had the largest de-duplication percentage change between the fixed and variable block algorithms. Surprisingly, the execution time did not vary when changing the algorithm chunking granularity or between the fixed and variable block algorithms. We examined this more closely and discovered that the bottleneck was not the CPU processing the fixed or variable block chunks but the disk I/O when processing the data out-of-band. We recorded high I/O wait times during each execution, which caused the CPU to wait on the I/O to finish. This explains the consistency of the execution time regardless of the algorithm. Additionally, the VMDK de-duplication percentage is the highest, owing to the data redundancy inherent across similar operating system builds. The DB percentage remains the same for all the tests performed because the SQL database files were extracted from a system that had the allocation unit size set to 64K; therefore no additional duplicates would be discovered by reducing the chunk size below 64K. Finally, as expected based on our research, the more random data types, such as media formats, produced the lowest de-duplication percentages.

The small dataset memory requirements are within the resources available on the small and medium cloud compute instance types. Additionally, from our local testing the office dataset provides the most interesting analysis given its range of de-duplication percentages; therefore, moving forward we will focus solely on office type datasets. Also, to ensure result consistency, we collected another office dataset of roughly the same size for our remaining small dataset testing.

After completing the initial testing on our local system, our remaining testing used Amazon's cloud resources. With the small dataset, our testing focused on the small and medium instance types, which differ in the amount of available ECUs [26].

The dataset was transferred to Amazon S3 storage in its original form to perform the out-of-band de-duplication testing. Our motivation for the small dataset test using cloud resources is to gauge the execution time differences between the small and medium instance types and analyze any cost savings. Again, all tests were run in isolation on separate instance types, and only a single test accessed the S3 storage bucket [26] at one time.

Table 5: EC2 m1.small Instance Small Dataset Results (Algorithm, # of Chunks, Memory Required (MBs), % De-duplication, Execution Time, Total Size; execution times ranged from 137 to 152 minutes on Office1 and from 203 to 214 minutes on Office2 across the cdc and fixed chunkers)

Table 6: EC2 c1.medium Instance Small Dataset Results (same columns as Table 5; execution times ranged from 40 to 50 minutes on Office1 and from 67 to 70 minutes on Office2 across the cdc and fixed chunkers)

Based on the results of the cloud testing on the small office datasets, the execution time difference going from the small instance to the medium instance is in line with the cost difference based on Amazon's EC2 pricing at the time of this publication. Also, since the memory resources are the same on the small and medium instance types, a more aggressive algorithm cannot be used as a differentiator in terms of space and cost savings. Therefore, there is little to no cost savings when comparing the execution times and the related compute cost differences of the small and medium size instances on a small dataset. One interesting aspect of this testing is the relative consistency in the percentage of additional redundancy detected between the fixed and variable block algorithms for both office datasets.

Transitioning to the larger dataset of 500GB and greater, we again focus our attention on an office type dataset extracted from a corporate file share environment. The goal of the large dataset is to examine more aggressive algorithms that exhaust the memory resources available in the small and medium instance types for the global chunk index. This allows us to explore the cost model and the tradeoffs associated with choosing a more aggressive algorithm and a large instance type versus a less aggressive algorithm and a smaller instance type over varying storage durations.

Using a dataset size of 764GB on the small and medium instance types, the fixed16 and cdc16 were the most aggressive algorithms able to run within the 1.7GB memory constraint of those instances, after memory for the operating system and the execution of the de-duplication algorithm was allocated. The execution times within a particular instance type are again controlled by the large I/O wait times experienced while processing the data. We again see notable increases in duplicate detection with the variable algorithms over the fixed. Using CDC4, a more aggressive algorithm, on the larger instance, an additional 5 percent of redundancy was detected over that of the CDC16 algorithm on the smaller instances. This translates into approximately 41GB of additional redundant data eliminated. The execution times on the large instance with the more aggressive algorithms are slightly longer compared with the medium instance.

Table 7: EC2 Instance Large Dataset Results (Chunker, # of Chunks, Memory Requirements (MBs), % De-duplication, Execution Time (min), Total Size for runs on the m1.small, c1.medium, and m1.large instances)

Using these results we are now able to construct and analyze a cost model for the tradeoff between selecting a smaller instance type with less aggressive algorithms versus selecting a larger instance type with a more aggressive algorithm. We also looked at the cost model when storing the data for varying lengths of time, from one month to one year, and the effect storage duration has on the cost savings and on the decision in selecting an instance type. To recap, the factors of our cost analysis are: instance type (m1_small, c1_medium, m1_large), de-duplication algorithm (fixed16, fixed8, cdc16, cdc8, cdc4), and storage duration (1 month, 3 months, 6 months, 1 year). Comparisons will be performed using the small and medium instance types against the large instance type. As discovered with the small dataset testing, the cost savings are nonexistent or insignificant when comparing the small and medium instances against each other.

To start, we will look at the cost breakdowns of the Amazon EC2 and S3 offerings.

Amazon EC2 compute costs are based on per instance hour used and data transfer in and out of the EC2 environment. Partially consumed instance hours are billed as full hours, so all execution times will be rounded up to the next hour for cost comparison. As for the data transfer into the EC2 environment, this cost is excluded from our analysis since it does not change depending on the instance type selected. Amazon's S3 cost model is based on the following factors: standard storage pricing, which is the pricing for the amount of storage used; request pricing, the cost for the number of PUT, COPY, POST, LIST, or GET operations performed on your S3 storage bucket; and data transfer cost, the cost to transfer data into and out of S3. Since we are using EC2 to communicate with S3, there is no data transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same region; in our case both the EC2 instances and the S3 bucket are within the East region [26]. Additionally, we are focusing on the standard storage pricing as opposed to reduced redundancy storage, which introduces a risk of data loss. In the case of de-duplication, data protection is critical due to the large percentage of files that can reference a single data block; the reduced redundancy option is available as a lower cost option for data that is reproducible [26]. Below are two tables with the breakdown of the EC2 and S3 pricing for the East region at the time of publication.

AWS EC2 Compute Pricing
Type | $ Cost/Hr
Small (m1_small) | 0.08
Medium (c1_medium) |
Large (m1_large) | 0.32

Table 8: AWS EC2 Pricing

AWS S3 Storage Pricing
Tiers | $ Cost/GB
First 1TB / Month | 0.125
Next 49TB | 0.11
Next 450TB |
Request Cost per 1,000 requests |

Table 9: AWS S3 Pricing

Based on the execution times seen on the three instance types, we begin by breaking out the compute and storage costs associated with the small instance type running the CDC16 algorithm. For the compute cost we look at the execution time, which translates to 26 hours after rounding up to the nearest hour. The compute cost is a straight calculation: the 26 hours multiplied by the per hour cost of the small instance type of $0.08 per hour, which equals $2.08. The storage cost has a couple of factors to take into account. One is the storage cost of $0.125 per GB for the first TB stored. After running the CDC16 algorithm, 22% redundant data was removed, leaving approximately 596GB, which has an associated cost of $74.50 per month. The second component of the storage cost is the request pricing. The request pricing is based on PUT, COPY, POST, LIST, or GET requests. The pricing for PUT, COPY, POST, or LIST requests is $0.01 per 1,000 requests, while GET and other requests are $0.01 per 10,000 requests [26]. In order to calculate the number of requests, we need to determine the number of files that make up our dataset, which translates into the number of object PUT requests. The 764GB dataset is comprised of 450,990 files and directories, which translates to an estimated 450,990 PUT operations with an associated cost of $4.51. Using these figures we are able to calculate the cost for one month, three month, six month, and one year storage periods. The calculations are the same for the medium and large instance types, with the exception of the values for execution time and de-duplication percentage. The request cost remains the same, as the dataset and number of files remain consistent across instance tests.
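The cost figures in Table 10 follow directly from these components. The sketch below reproduces the CDC16 on m1.small calculation; the prices are the US East values quoted above, partial instance hours are rounded up to whole hours as in Amazon's billing model, and the single first-TB storage tier and one PUT per object are simplifying assumptions.

import math

def dedup_cloud_cost(exec_hours, instance_rate, data_gb, dedup_pct,
                     object_count, months, storage_rate=0.125,
                     put_cost_per_1000=0.01):
    # Compute cost: partially consumed instance hours are billed as full hours.
    compute = math.ceil(exec_hours) * instance_rate
    # Storage cost: only the capacity remaining after de-duplication is stored,
    # priced at the first-TB S3 tier for the requested number of months.
    stored_gb = data_gb * (1 - dedup_pct)
    storage = stored_gb * storage_rate * months
    # Request cost: one PUT operation per file or directory, paid once.
    requests = object_count / 1000.0 * put_cost_per_1000
    return compute + storage + requests

# CDC16 on m1.small: 26 billed hours at $0.08/hr, 764GB at 22% de-duplication,
# 450,990 objects, one month of storage.
total = dedup_cloud_cost(26, 0.08, 764, 0.22, 450990, 1)
# compute is $2.08 and storage plus requests come to roughly $79, in line
# with the first column of Table 10.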

Algorithm / Instance | Storage Timeframe

CDC16 on Small Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $2.08 | $2.08 | $2.08 | $2.08
Storage Cost | $79.01 | | |

CDC16 on Medium Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $2.48 | $2.48 | $2.48 | $2.48
Storage Cost | $79.01 | | |

CDC8 on Large Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $4.80 | $4.80 | $4.80 | $4.80
Storage Cost | $76.01 | | |

CDC4 on Large Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $5.12 | $5.12 | $5.12 | $5.12
Storage Cost | $73.76 | | |

Table 10: Instance Cost Assessment

With the varying storage durations, the following assumptions were made. Compute cost: the compute cost was only calculated for the initial data de-duplication process; any subsequent data accesses are not taken into account, and the data access frequency is independent of the instance type. Another aspect that was not taken into account is additional data being added to or deleted from the cloud instance over the storage duration; data additions, based on our research, have a positive impact on the cost savings seen with the more aggressive algorithms.

Looking at the cost savings of the large and small instances, the use of the large instance type with the aggressive CDC4 algorithm over a one year storage timeframe produces a cost savings of 6.15%, or $58.37, compared with running CDC16 on the smaller instance type. The shorter storage timeframes also produce cost savings, ranging from 2.5% for the first month to 5.82% at the six month mark. When comparing CDC8 on the large instance, a cost savings is not realized immediately, with the first month savings at less than 1%. Therefore, in order to maximize the cost savings in the


More information

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores

More information

EMC VNXe File Deduplication and Compression

EMC VNXe File Deduplication and Compression White Paper EMC VNXe File Deduplication and Compression Overview Abstract This white paper describes EMC VNXe File Deduplication and Compression, a VNXe system feature that increases the efficiency with

More information

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression Philip Shilane, Mark Huang, Grant Wallace, and Windsor Hsu Backup Recovery Systems Division EMC Corporation Abstract

More information

MySQL and Virtualization Guide

MySQL and Virtualization Guide MySQL and Virtualization Guide Abstract This is the MySQL and Virtualization extract from the MySQL Reference Manual. For legal information, see the Legal Notices. For help with using MySQL, please visit

More information

Understanding Enterprise NAS

Understanding Enterprise NAS Anjan Dave, Principal Storage Engineer LSI Corporation Author: Anjan Dave, Principal Storage Engineer, LSI Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA

More information

3Gen Data Deduplication Technical

3Gen Data Deduplication Technical 3Gen Data Deduplication Technical Discussion NOTICE: This White Paper may contain proprietary information protected by copyright. Information in this White Paper is subject to change without notice and

More information

Reducing Backups with Data Deduplication

Reducing Backups with Data Deduplication The Essentials Series: New Techniques for Creating Better Backups Reducing Backups with Data Deduplication sponsored by by Eric Beehler Reducing Backups with Data Deduplication... 1 Explaining Data Deduplication...

More information

09'Linux Plumbers Conference

09'Linux Plumbers Conference 09'Linux Plumbers Conference Data de duplication Mingming Cao IBM Linux Technology Center cmm@us.ibm.com 2009 09 25 Current storage challenges Our world is facing data explosion. Data is growing in a amazing

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM ESSENTIALS HIGH-SPEED, SCALABLE DEDUPLICATION Up to 58.7 TB/hr performance Reduces protection storage requirements by 10 to 30x CPU-centric scalability DATA INVULNERABILITY ARCHITECTURE Inline write/read

More information

The assignment of chunk size according to the target data characteristics in deduplication backup system

The assignment of chunk size according to the target data characteristics in deduplication backup system The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,

More information

Hardware and Software Requirements. Release 7.5.x PowerSchool Student Information System

Hardware and Software Requirements. Release 7.5.x PowerSchool Student Information System Release 7.5.x PowerSchool Student Information System Released October 2012 Document Owner: Documentation Services This edition applies to Release 7.5.x of the PowerSchool software and to all subsequent

More information

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Jonathan Halstuch, COO, RackTop Systems JHalstuch@racktopsystems.com Big Data Invasion We hear so much on Big Data and

More information

Amazon Elastic Compute Cloud Getting Started Guide. My experience

Amazon Elastic Compute Cloud Getting Started Guide. My experience Amazon Elastic Compute Cloud Getting Started Guide My experience Prepare Cell Phone Credit Card Register & Activate Pricing(Singapore) Region Amazon EC2 running Linux(SUSE Linux Windows Windows with SQL

More information

How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda

How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda 1 Outline Build a cost-efficient Swift cluster with expected performance Background & Problem Solution Experiments

More information

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP Dilip N Simha (Stony Brook University, NY & ITRI, Taiwan) Maohua Lu (IBM Almaden Research Labs, CA) Tzi-cker Chiueh (Stony

More information

CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY

CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY White Paper CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY DVTel Latitude NVMS performance using EMC Isilon storage arrays Correct sizing for storage in a DVTel Latitude physical security

More information

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Table of Contents Introduction... 3 Shortest Possible Backup Window... 3 Instant

More information

Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card

Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card Version 1.0 April 2011 DB15-000761-00 Revision History Version and Date Version 1.0, April 2011 Initial

More information

A Data De-duplication Access Framework for Solid State Drives

A Data De-duplication Access Framework for Solid State Drives JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 941-954 (2012) A Data De-duplication Access Framework for Solid State Drives Department of Electronic Engineering National Taiwan University of Science

More information

Multi-level Metadata Management Scheme for Cloud Storage System

Multi-level Metadata Management Scheme for Cloud Storage System , pp.231-240 http://dx.doi.org/10.14257/ijmue.2014.9.1.22 Multi-level Metadata Management Scheme for Cloud Storage System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Chuck Yoo 2 and Young Woong Ko 1

More information

We look beyond IT. Cloud Offerings

We look beyond IT. Cloud Offerings Cloud Offerings cstor Cloud Offerings As today s fast-moving businesses deal with increasing demands for IT services and decreasing IT budgets, the onset of cloud-ready solutions has provided a forward-thinking

More information

EMC BACKUP-AS-A-SERVICE

EMC BACKUP-AS-A-SERVICE Reference Architecture EMC BACKUP-AS-A-SERVICE EMC AVAMAR, EMC DATA PROTECTION ADVISOR, AND EMC HOMEBASE Deliver backup services for cloud and traditional hosted environments Reduce storage space and increase

More information

Inline Deduplication

Inline Deduplication Inline Deduplication binarywarriors5@gmail.com 1.1 Inline Vs Post-process Deduplication In target based deduplication, the deduplication engine can either process data for duplicates in real time (i.e.

More information

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study Creating Value Delivering Solutions Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study Chris Zajac, NJDOT Bud Luo, Ph.D., Michael Baker Jr., Inc. Overview

More information

TECHNICAL BRIEF. Primary Storage Compression with Storage Foundation 6.0

TECHNICAL BRIEF. Primary Storage Compression with Storage Foundation 6.0 TECHNICAL BRIEF Primary Storage Compression with Storage Foundation 6.0 Technical Brief Primary Storage Compression with Storage Foundation 6.0 Contents Introduction... 4 What is Compression?... 4 Differentiators...

More information

UBUNTU DISK IO BENCHMARK TEST RESULTS

UBUNTU DISK IO BENCHMARK TEST RESULTS UBUNTU DISK IO BENCHMARK TEST RESULTS FOR JOYENT Revision 2 January 5 th, 2010 The IMS Company Scope: This report summarizes the Disk Input Output (IO) benchmark testing performed in December of 2010 for

More information

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside Managing the information that drives the enterprise STORAGE Buying Guide: DEDUPLICATION inside What you need to know about target data deduplication Special factors to consider One key difference among

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Cloud Computing on Amazon's EC2

Cloud Computing on Amazon's EC2 Technical Report Number CSSE10-04 1. Introduction to Amazon s EC2 Brandon K Maharrey maharbk@auburn.edu COMP 6330 Parallel and Distributed Computing Spring 2009 Final Project Technical Report Cloud Computing

More information

Deduplication has been around for several

Deduplication has been around for several Demystifying Deduplication By Joe Colucci Kay Benaroch Deduplication holds the promise of efficient storage and bandwidth utilization, accelerated backup and recovery, reduced costs, and more. Understanding

More information

Read Performance Enhancement In Data Deduplication For Secondary Storage

Read Performance Enhancement In Data Deduplication For Secondary Storage Read Performance Enhancement In Data Deduplication For Secondary Storage A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Pradeep Ganesan IN PARTIAL FULFILLMENT

More information

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation WHITE PAPER Permabit Albireo Data Optimization Software Benefits of Albireo for Virtual Servers January 2012 Permabit Technology Corporation Ten Canal Park Cambridge, MA 02141 USA Phone: 617.252.9600 FAX:

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM EMC DATA DOMAIN OPERATING SYSTEM Powering EMC Protection Storage ESSENTIALS High-Speed, Scalable Deduplication Up to 58.7 TB/hr performance Reduces requirements for backup storage by 10 to 30x and archive

More information

Veritas Backup Exec 15: Deduplication Option

Veritas Backup Exec 15: Deduplication Option Veritas Backup Exec 15: Deduplication Option Who should read this paper Technical White Papers are designed to introduce IT professionals to key technologies and technical concepts that are associated

More information

Comparison of Windows IaaS Environments

Comparison of Windows IaaS Environments Comparison of Windows IaaS Environments Comparison of Amazon Web Services, Expedient, Microsoft, and Rackspace Public Clouds January 5, 215 TABLE OF CONTENTS Executive Summary 2 vcpu Performance Summary

More information

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Quanqing XU Quanqing.Xu@nicta.com.au YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Outline Motivation YuruBackup s Architecture Backup Client File Scan, Data

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Byte-index Chunking Algorithm for Data Deduplication System

Byte-index Chunking Algorithm for Data Deduplication System , pp.415-424 http://dx.doi.org/10.14257/ijsia.2013.7.5.38 Byte-index Chunking Algorithm for Data Deduplication System Ider Lkhagvasuren 1, Jung Min So 1, Jeong Gun Lee 1, Chuck Yoo 2 and Young Woong Ko

More information

Metadata Feedback and Utilization for Data Deduplication Across WAN

Metadata Feedback and Utilization for Data Deduplication Across WAN Zhou B, Wen JT. Metadata feedback and utilization for data deduplication across WAN. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(3): 604 623 May 2016. DOI 10.1007/s11390-016-1650-6 Metadata Feedback

More information

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose Abhirupa Chatterjee 1, Divya. R. Krishnan 2, P. Kalamani 3 1,2 UG Scholar, Sri Sairam College Of Engineering, Bangalore. India

More information

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009 ESG REPORT : Evaluating Software- vs. Hardware-Based Approaches By Lauren Whitehouse April, 2009 Table of Contents ESG REPORT Table of Contents... i Introduction... 1 External Forces Contribute to IT Challenges...

More information

How to recover a failed Storage Spaces

How to recover a failed Storage Spaces www.storage-spaces-recovery.com How to recover a failed Storage Spaces ReclaiMe Storage Spaces Recovery User Manual 2013 www.storage-spaces-recovery.com Contents Overview... 4 Storage Spaces concepts and

More information

A Survey on Deduplication Strategies and Storage Systems

A Survey on Deduplication Strategies and Storage Systems A Survey on Deduplication Strategies and Storage Systems Guljar Shaikh ((Information Technology,B.V.C.O.E.P/ B.V.C.O.E.P, INDIA) Abstract : Now a day there is raising demands for systems which provide

More information

ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication

ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication ExaGrid Product Description Cost-Effective Disk-Based Backup with Data Deduplication 1 Contents Introduction... 3 Considerations When Examining Disk-Based Backup Approaches... 3 ExaGrid A Disk-Based Backup

More information

Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds. Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage

Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds. Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage UCSD MIT UCSD UCSD Today s talk in one slide Third-party

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

E-Guide. Sponsored By:

E-Guide. Sponsored By: E-Guide An in-depth look at data deduplication methods This E-Guide will discuss the various approaches to data deduplication. You ll learn the pros and cons of each, and will benefit from independent

More information

Turnkey Deduplication Solution for the Enterprise

Turnkey Deduplication Solution for the Enterprise Symantec NetBackup 5000 Appliance Turnkey Deduplication Solution for the Enterprise Mayur Dewaikar Sr. Product Manager, Information Management Group White Paper: A Deduplication Appliance Solution for

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE

VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE 1 VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE JEFF MAYNARD, CORPORATE SYSTEMS ENGINEER 2 ROADMAP INFORMATION DISCLAIMER EMC makes no representation and undertakes no obligations with regard to product

More information

Case Studies. Data Sheets : White Papers : Boost your storage buying power... use ours!

Case Studies. Data Sheets : White Papers : Boost your storage buying power... use ours! TM TM Data Sheets : White Papers : Case Studies For over a decade Coolspirit have been supplying the UK s top organisations with storage products and solutions so be assured we will meet your requirements

More information

Release 8.2 Hardware and Software Requirements. PowerSchool Student Information System

Release 8.2 Hardware and Software Requirements. PowerSchool Student Information System Release 8.2 Hardware and Software Requirements PowerSchool Student Information System Released January 2015 Document Owner: Documentation Services This edition applies to Release 8.2 of the PowerSchool

More information

Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642

Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642 Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642 Announcements Take- home final versus in- class Homework

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team

Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture Dell Compellent Product Specialist Team THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Effective Planning and Use of TSM V6 Deduplication

Effective Planning and Use of TSM V6 Deduplication Effective Planning and Use of IBM Tivoli Storage Manager V6 Deduplication 08/17/12 1.0 Authors: Jason Basler Dan Wolfe Page 1 of 42 Document Location This is a snapshot of an on-line document. Paper copies

More information

Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving

Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving November, 2013 Saqib Jang Abstract This white paper demonstrates how to increase profitability by reducing the operating costs of backup

More information

Understanding data deduplication ratios June 2008

Understanding data deduplication ratios June 2008 June 2008 Mike Dutch Data Management Forum Data Deduplication & Space Reduction SIG Co-Chair EMC Senior Technologist Table of Contents Optimizing storage capacity...3 The impact on storage utilization...3

More information

Amazon EC2 XenApp Scalability Analysis

Amazon EC2 XenApp Scalability Analysis WHITE PAPER Citrix XenApp Amazon EC2 XenApp Scalability Analysis www.citrix.com Table of Contents Introduction...3 Results Summary...3 Detailed Results...4 Methods of Determining Results...4 Amazon EC2

More information

Part 1: Price Comparison Among The 10 Top Iaas Providers

Part 1: Price Comparison Among The 10 Top Iaas Providers Part 1: Price Comparison Among The 10 Top Iaas Providers Table of Contents Executive Summary 3 Estimating Cloud Spending 3 About the Pricing Report 3 Key Findings 3 The IaaS Providers 3 Provider Characteristics

More information

Wide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton)

Wide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton) Wide-area Network Acceleration for the Developing World Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton) POOR INTERNET ACCESS IN THE DEVELOPING WORLD Internet access is a scarce

More information

DeltaStor Data Deduplication: A Technical Review

DeltaStor Data Deduplication: A Technical Review White Paper DeltaStor Data Deduplication: A Technical Review DeltaStor software is a next-generation data deduplication application for the SEPATON S2100 -ES2 virtual tape library that enables enterprises

More information

Contents. WD Arkeia Page 2 of 14

Contents. WD Arkeia Page 2 of 14 Contents Contents...2 Executive Summary...3 What Is Data Deduplication?...4 Traditional Data Deduplication Strategies...5 Deduplication Challenges...5 Single-Instance Storage...5 Fixed-Block Deduplication...6

More information

An Oracle White Paper June 2011. Oracle Database Firewall 5.0 Sizing Best Practices

An Oracle White Paper June 2011. Oracle Database Firewall 5.0 Sizing Best Practices An Oracle White Paper June 2011 Oracle Database Firewall 5.0 Sizing Best Practices Introduction... 1 Component Overview... 1 Database Firewall Deployment Modes... 2 Sizing Hardware Requirements... 2 Database

More information

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant DISCOVER HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant HP StorageWorks Data Protection Solutions HP has it covered Near continuous data protection Disk Mirroring Advanced Backup

More information

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud StACC: St Andrews Cloud Computing Co laboratory A Performance Comparison of Clouds Amazon EC2 and Ubuntu Enterprise Cloud Jonathan S Ward StACC (pronounced like 'stack') is a research collaboration launched

More information