Cloud De-duplication Cost Model

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Christopher Scott Hocker
Graduate Program in Computer Science and Engineering

The Ohio State University
2012

Master's Examination Committee:
Dr. Gagan Agrawal, Advisor
Dr. Christopher Stewart

Copyright by Christopher Scott Hocker 2012

Abstract

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy that arises from the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing scalability to meet the elastic nature of application and infrastructure requirements. Using de-duplication algorithms within cloud resources is the next logical step to increase the efficiency and reduce the cost of cloud computing. Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and the reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory available in different types of instances, we also develop a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Dedication

Dedicated to those who supported me throughout my academic career: my Wife, Parents, Brother, Sister, and Friends.

Acknowledgments

First, I would like to thank my advisor, Dr. Gagan Agrawal, for challenging me and providing guidance from the beginning of my time at Ohio State. The support and advice he provided were invaluable. Additionally, I would like to thank my thesis committee member, Dr. Christopher Stewart, for his time and participation during this work. I would also like to thank my Wife for her support and understanding of the long hours during this work and throughout my entire academic career. Finally, I would like to thank the rest of my support system, my Parents, Brother, Sister, and Friends, who were always there to provide a word of encouragement and motivation.

Vita

2001 ...................... Vandalia Butler High School
2007 ...................... B.S. CS, Wright State University
2008 to present ...... M.S. CSE, The Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: De-duplication Algorithms
Chapter 3: Memory Prediction
Chapter 4: Experimental Evaluation
Chapter 5: Related Research
References

List of Tables

Table 1: fs-c Algorithm Chunk Selection
Table 2: Fixed Index Memory Estimates vs. Actual
Table 3: Variable Index Memory Estimates vs. Actual
Table 4: Small Dataset Results
Table 5: EC2 m1.small Instance Small Dataset Results
Table 6: EC2 c1.medium Instance Small Dataset Results
Table 7: EC2 Instance Large Dataset Results
Table 8: AWS EC2 Pricing
Table 9: AWS S3 Pricing
Table 10: Instance Cost Assessment
Table 11: m1_small vs. m1_large Instance Cost
Table 12: c1_medium vs. m1_large Instance Cost

List of Figures

Figure 1: Out-of-Band vs. In-band De-duplication
Figure 2: Basic Sliding Window Algorithm [8]
Figure 3: TTTD Pseudo Code [8]
Figure 4: Chunk Distribution of TTTD algorithm [9]
Figure 5: TTTD-S Chunk Distribution Improvements [9]
Figure 6: TTTD-S algorithm pseudo code
Figure 7: De-duplication ratio and percent savings [19]
Figure 8: FS-C Chunk Distributions

Chapter 1: Introduction

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy that arises from the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing scalability to meet the elastic nature of application and infrastructure requirements. As companies begin to transition their data and infrastructure to the cloud, the cost-saving methods they are accustomed to, such as de-duplication, should transition as well. The increased efficiency in usable capacity gained through de-duplication translates into a positive impact on the cloud pay-as-you-go model. De-duplication does come with a tradeoff of additional compute resources required to analyze the data for duplicates, so selecting the right instance type to run a given de-duplication algorithm is an important consideration. Therefore, we examine the resource and cost factors related to cloud environments and how de-duplication can be effectively integrated. Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and the reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory available in

different types of instances, we also develop a methodology for estimating the memory requirements for executing a given algorithm on a particular dataset. Through experiments, we show that running a more aggressive de-duplication algorithm on a larger cloud compute instance can maximize cost savings compared with a smaller instance type running a less aggressive algorithm. In some cases, however, the dataset size does not warrant a larger instance type for more granular de-duplication algorithms, since the index memory requirements are already satisfied by the entry-level instance types. In these situations there is no benefit to choosing a larger instance type in an effort to reduce cloud resource cost.

Thesis Statement

Integrating de-duplication effectively and efficiently into a cloud environment requires an understanding of the resource requirements, specifically the memory requirements, and the tradeoff in compute cost for processing data for duplicates at various levels of granularity.

Contributions

This thesis makes the following contributions:

1. Proposes a methodology to predict the required cloud instance type, based on memory requirements, for running popular de-duplication algorithms on a given dataset.

2. Analyzes cloud compute requirements for running de-duplication algorithms at varying chunk granularities.

3. Evaluates cost factors associated with running de-duplication in a cloud environment, including compute instance types, de-duplication algorithm chunk granularity, and data storage durations.

Chapter 2: De-duplication Algorithms

The implementation of data de-duplication technologies varies in terms of de-duplication placement, the timing of the de-duplication of data, and the granularity at which data is analyzed to find duplicate data. The placement of the de-duplication process can occur at the client or the target device [4]; additionally, hybrid approaches that combine client and target exist. The timing of the de-duplication process is either in-band, as the data is received or sent, or out-of-band at a scheduled time.

Figure 1: Out-of-Band vs. In-band De-duplication

The placement and timing are key components of de-duplication algorithms; however, in this research we focus on client-based de-duplication algorithms with out-of-band timing, which allows us to integrate de-duplication into an existing cloud environment and analyze the data in place. We explore in more detail the granularity at which duplicates are detected, specifically at fixed and variable block levels. Finally, we provide a brief explanation of de-duplication ratios and the contributing factors to the ratio.

Duplicate Detection

For duplicate detection there are three main approaches: whole file (often called single-instance storage), sub-file chunking (comprising fixed and variable block hashing), and delta encoding. For whole-file detection, entire files are given a hash signature using MD5 or SHA-1 [6]. Files with identical hash signatures are assigned a pointer to the single file instance previously stored. In certain algorithms a byte-by-byte comparison is performed to eliminate the potential for hash collisions, which is often a concern with hash-based comparisons [27]. Lower de-duplication ratios are generally obtained because the entire file has to match; any small change in the file will alter the hash and defeat any previous match. The second and more popular approach is sub-file chunking [4]. Fixed and variable block hashing are the two main types of sub-file chunking. Fixed block de-duplication chunks and hashes a byte stream based on a fixed block size. The hash signatures are stored and referenced in a global index, which is implemented using a Bloom-filter-type data structure to quickly identify new unique block segments [14]. If the block signature already exists in the index, a pointer to the existing block is created; otherwise the signature is stored and the block is written to disk. In contrast, variable block algorithms use methods such as Rabin fingerprinting [1], a hashing algorithm that uses a sliding window to determine natural block boundaries with the highest probability of matching other blocks. Variable block algorithms still employ the Bloom-filter-based data structure for the in-memory index [11]. Variable block proves to be a

more efficient approach compared to fixed and whole-file hashing, since it tolerates the slight variations or offsets that exist due to modifications in similar files and blocks [4]. Both exact and similarity matching techniques are used in sub-file hashing algorithms. Exact matching algorithms examine the chunk hash index for an exact signature match. For exact matching algorithms, the block size has a direct impact on the hash index size, which can present a problem when the index is stored in memory [29]. An example was provided in [2], where the index required for 1 Petabyte of de-duplicated data, assuming an average block size of 8KB, would need at least 128GB of memory. Similarity matching techniques address the index size by increasing the block size, to 16MB in [2], which reduces the index size to 4GB for the same 1 Petabyte of data. A similarity signature approach consists of a number of block signatures based on a subset of the chunk bytes. If a similarity signature matches more than some threshold of block signatures, then there is a reasonable probability that the two chunks will share common block-based signatures [2]. Thus a match is found, and the new similarity chunk is compared and de-duplicated against the similar chunks [29]. The tradeoff with the similarity matching technique is that de-duplication performance is highly dependent upon the speed of the de-duplication repository storage, as the repository is referenced for similarity block signature matches. Also, since each chunk is only compared against a limited number of other chunks in similarity matching, duplicates are occasionally stored [29]. Finally, delta encoding (also called data differencing) processes files against a reference file for differences, storing the deltas in a patch file [17]. Selecting the

reference file previously stored is a key operation in delta encoding algorithms, and the reference is often selected based on a fingerprinting technique similar to that of the whole-file and sub-file chunking algorithms [4]. The sub-file chunking approach with exact matching, used with variable block algorithms, is able to detect the varying offsets of data blocks. This addresses the boundary shifting concerns of fixed or whole-file algorithms. The tradeoff comes in terms of the additional resources required to maintain the metadata for the increased number of chunks seen with variable block approaches. Additionally, index lookups increase during the variable chunk detection process. Even with the increased resource requirements, the majority of current algorithms use variable block sub-file hashing with exact matching to maximize efficiency and overall de-duplication ratios.

De-duplication Implementations

The process is straightforward for whole-file hashing and fixed block hashing: perform a hash index lookup based on the SHA-1 [6] or MD5 hash value of the file or fixed block to determine whether the data is unique. If a duplicate is detected in the hash lookup, optionally perform a byte-by-byte comparison, then modify the file metadata to reference the previously stored data. If the data is unique, record the hash value, perform local compression (e.g., LZ, gzip) on the unique block, and store only that data [13]. For variable block, otherwise known as content-based chunking algorithms, there are several algorithms that vary in their implementation, overall performance, and effectiveness in identifying duplicate chunks. The main motivation behind the variable

block algorithms is the elimination of the boundary shifting problem [18]. If a small modification is made to a file, the chunk boundaries for whole-file and fixed block chunking shift, causing poor duplicate detection against the file or data. The low-bandwidth network filesystem [18] first introduced the basic sliding window (BSW) approach, using three parameters as inputs: a fixed window size W, an integer divisor D, and an integer remainder R. The window shifts one byte at a time, up to the maximum window size W, from the beginning of the file to the end. Fingerprints (h) of the window contents are generated with Rabin fingerprinting; Rabin introduced the idea of detecting natural block boundaries in a byte stream and assigning the variable chunks a signature [1]. The algorithm then tests whether (h mod D) = R; if true, a D-match has been found and the current position is set as a breakpoint for that chunk. The parameter D can be configured to make the expected chunk size as close to the data expectations as possible to maximize de-duplication. The parameter R must be between 0 and D-1, and is most often configured as D-1. Figure 2 provides a visual representation of the basic sliding window approach.

Figure 2: Basic Sliding Window Algorithm [8]
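To make the mechanism concrete, the following sketch applies the (h mod D) = R breakpoint test and feeds the resulting chunks into an exact-matching SHA-1 index. It is a minimal illustration in Python, not the chunking code of fs-c: the parameter values are arbitrary, and a simple polynomial rolling hash stands in for true Rabin fingerprinting.

import hashlib
import random

# Illustrative parameters (not taken from [18] or fs-c): window size W,
# divisor D, and remainder R = D - 1, giving an expected chunk size near D bytes.
W, D, R = 48, 8192, 8191
BASE, MOD = 257, (1 << 61) - 1   # polynomial rolling hash, a stand-in for Rabin fingerprints

def bsw_chunks(data):
    """Yield chunks using the basic sliding window test (h mod D) == R."""
    h, start, window = 0, 0, []
    drop_weight = pow(BASE, W - 1, MOD)          # weight of the byte leaving the window
    for pos in range(len(data)):
        byte = data[pos]
        if len(window) == W:                     # window full: drop the oldest byte first
            h = (h - window.pop(0) * drop_weight) % MOD
        window.append(byte)
        h = (h * BASE + byte) % MOD              # slide the window forward by one byte
        if h % D == R:                           # D-match found: set a breakpoint here
            yield data[start:pos + 1]
            start, h, window = pos + 1, 0, []
    if start < len(data):
        yield data[start:]                       # remainder of the stream

def deduplicate(data):
    """Exact matching: keep one copy per unique 20-byte SHA-1 chunk signature."""
    index, stored = set(), 0
    for chunk in bsw_chunks(data):
        sig = hashlib.sha1(chunk).digest()
        if sig not in index:
            index.add(sig)
            stored += len(chunk)                 # only unique chunks are "written to disk"
    return stored, len(index)

random.seed(0)
block = bytes(random.randrange(256) for _ in range(65536))   # 64KB of synthetic data
sample = block + b"one small edit" + block                    # the same data appears twice
stored, unique = deduplicate(sample)
print(f"input={len(sample)} bytes, stored={stored} bytes, unique chunks={unique}")

Because the boundaries are content-defined, the second copy of the repeated data chunks at the same positions and most of it is detected as duplicate, even though the inserted edit shifts its byte offsets.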

Problems with the basic sliding window approach include the large chunks generated when a match is not detected and the data is chunked at the window size. This leads to boundary shifting problems when small modifications are made, making matches on these large chunks more difficult. Additional improvements resulted in the introduction of the two divisor (TD) algorithm [8]. This algorithm addressed the issue with the basic sliding window algorithm by introducing a second divisor (S) that is smaller than D, which increases the chance of a match. Both D and S are evaluated at each byte shift to increase the chance of a chunk match, decreasing the number of large chunks. Using the BSW or TD algorithms, the chunk size is only upper bounded, and chunks can vary greatly in size. Small chunk sizes greatly increase the number of chunks, which is directly related to the memory overhead required for exact matching techniques [29]. The two threshold two divisor (TTTD) algorithm was developed to address the range of chunk sizes generated during duplicate detection [8]. TTTD added two threshold parameters to the BSW and TD algorithms, which control the upper (Tmax) and lower (Tmin) bounds of chunk sizes [8]. Data fingerprints are not generated until the minimum byte threshold is met, addressing the overhead issues related to small chunk sizes while still addressing the boundary shifting concerns of chunking at large chunk sizes.

int p = 0;            // current position
int l = 0;            // position of last breakpoint
int backupbreak = 0;  // position of backup breakpoint

for (; !endoffile(input); p++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (p - l < tmin) {                  // not at minimum size yet
        continue;
    }
    if ((hash % Ddash) == Ddash - 1) {   // secondary divisor
        backupbreak = p;
    }
    if ((hash % D) == D - 1) {           // we found a breakpoint before the maximum threshold
        addbreakpoint(p);
        backupbreak = 0;
        l = p;
        continue;
    }
    if (p - l < tmax) {                  // we have failed to find a breakpoint,
        continue;                        // but we are not at the maximum yet
    }
    // When we reach here, we have not found a breakpoint with the main
    // divisor, and we are at the threshold. If there is a backup
    // breakpoint, use it. Otherwise impose a hard threshold.
    if (backupbreak != 0) {
        addbreakpoint(backupbreak);
        l = backupbreak;
        backupbreak = 0;
    } else {
        addbreakpoint(p);
        l = p;
        backupbreak = 0;
    }
}

Figure 3: TTTD Pseudo Code [8]

Additional studies found that when the maximum threshold (Tmax) of TTTD was reached, only the last secondary-divisor match (the backup breakpoint) was used for chunking. Therefore, all other secondary-divisor breakpoints that had been calculated were not considered, causing a wide distribution of chunk sizes. See Figure 4 for the chunk distribution of the TTTD algorithm.

Figure 4: Chunk Distribution of TTTD algorithm [9]

The chunk distribution contains two groupings (as seen in the figure above): the first around the chunk size detected by the expected main divisor (D), the second near the maximum chunk threshold, where a match was not discovered and the previous secondary divisor (S) match was used. TTTD-S [9] improves upon the wide spread of the chunk groupings by introducing a switchp value that is set to 1.6 times the expected chunk size [9]. Once the window size has reached switchp, the divisors are reduced by half to shorten the match process. This in turn helps shorten the process of finding a breakpoint before the maximum chunk threshold is reached. Additionally, it can improve the distribution and bring the second chunk grouping closer to the average chunk size detected by the main divisor. Figure 5 illustrates the improvements in the chunk distribution made by the switchp parameter introduced in the TTTD-S algorithm, further reducing the chances of boundary shifting conditions with data modifications.

Figure 5: TTTD-S Chunk Distribution Improvements [9]

int currp = 0, lastp = 0, backupbreak = 0;

for (; !endoffile(input); currp++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (currp - lastp < mint) {              // not at minimum size yet
        continue;
    }
    if (currp - lastp > switchp) {           // past switchp: halve the divisors
        switchdivisor();
    }
    if ((hash % secondd) == secondd - 1) {   // secondary divisor
        backupbreak = currp;
    }
    if ((hash % maind) == maind - 1) {       // breakpoint found with the main divisor
        addbreakpoint(currp);
        backupbreak = 0;
        lastp = currp;
        resetdivisor();
        continue;
    }
    if (currp - lastp < maxt) {              // no breakpoint yet, but not at the maximum
        continue;
    }
    if (backupbreak != 0) {                  // at the maximum: use the backup breakpoint if one exists
        addbreakpoint(backupbreak);
        lastp = backupbreak;
        backupbreak = 0;
        resetdivisor();
    } else {                                 // otherwise impose a hard threshold
        addbreakpoint(currp);
        lastp = currp;
        backupbreak = 0;
        resetdivisor();
    }
}

Figure 6: TTTD-S algorithm pseudo code

Additional algorithms exist that use a hybrid approach incorporating variable and fixed block techniques, as well as small-chunk merging techniques to reduce the

number of small chunks and, therefore, the overhead associated with them. Compression is also often used in conjunction with de-duplication to increase storage space utilization. Data de-duplication algorithms have been extensively researched. The techniques available today vary in de-duplication placement, the timing of the detection process, and the granularity at which duplicates are discovered [4]. Regardless of the technique, the overall effectiveness of any de-duplication algorithm remains data dependent [30]. The fs-c algorithm [7] used in our research is based on the TTTD algorithm, using Rabin fingerprinting [1] to generate the natural chunk boundaries.

De-duplication Savings

In addition to the inner workings of the de-duplication algorithms, the data characteristics help in understanding the space savings obtained by the various algorithms. Several factors, such as data type, the scope of the de-duplication, and the data storage period, play a role in overall de-duplication savings. De-duplication savings are often stated in terms of ratios, where the ratio is the number of bytes input to the de-duplication process divided by the number of bytes output [19]. Figure 7 depicts the conversion of de-duplication ratios to percentages. In our studies we use percentages to eliminate any confusion and make the overall savings more apparent.
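For reference, the conversion summarized in Figure 7 can be written as a single identity (a standard conversion, not reproduced verbatim from [19]):

Percent Savings = (1 - 1 / De-duplication Ratio) * 100%

For example, a 2:1 ratio corresponds to 50% savings, a 4:1 ratio to 75%, and a 10:1 ratio to 90%.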

Figure 7: De-duplication ratio and percent savings [19]

Data file type is one component that has an impact on de-duplication savings expectations. For example, files generated by humans in applications such as text documents, spreadsheets, and presentations often contain a large amount of redundant data, while data generated by a computer system, such as images, media, and archived files, often has less redundancy due to the random nature of the data [19]. The scope of de-duplication refers to the range of datasets examined during duplicate detection. For example, global de-duplication allows detection of duplicates across multiple data sources, which can span multiple storage systems or locations [19]. Conversely, de-duplication across just a single appliance, or within a single client's data, only looks at the data contained within that appliance or client, creating silos of de-duplication stores. In general, the larger the data scope for duplicate detection, the higher the expected space savings. Data storage periods affect de-duplication savings by increasing the chance of exploiting temporal redundancy. For example, in a backup-type scenario

where temporal data accumulates over time due to the versioning nature of backup applications, ratios are expected to be higher. In a primary storage scenario, spatial data exists across a broad spectrum of data types, which leads to lower de-duplication ratios overall. This chapter has outlined data de-duplication approaches in terms of de-duplication placement, timing, and the granularity at which data is analyzed to find duplicates. These approaches, combined with the algorithms outlined above, provide a foundation for investigating de-duplication resource considerations within a cloud environment.

Chapter 3: Memory Prediction

To examine the tradeoffs between compute and storage cost with the addition of de-duplication, we need a method to estimate the instance type required to execute a given de-duplication algorithm. Since the main factor impacting the computing cost is the memory available in different types of instances, we developed a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Estimation Method

With de-duplication there is no one-size-fits-all configuration. Depending on the application and the resources available, certain algorithms may be more effective than others. One resource consideration is the total memory required to store the index of unique data signatures. For both fixed and variable block exact matching implementations, an in-memory index is used for data signature lookups to determine duplicate blocks. Some techniques use similarity signatures and increase the chunk size to control the index size, the tradeoff being an increased reliance on the speed and responsiveness of the de-duplication store for data comparisons. Therefore, when utilizing cloud resources our focus is on the exact matching techniques, where the memory size of the index is a concern. Providing a means to estimate the index size is important when sizing system requirements for a de-duplication implementation. Memory size estimates for both fixed and variable block algorithms follow the same formula, varying only in how the specific

variables are derived. We provide the following formula for the basic memory requirement estimation:

Memory Size Estimate = (Data Size / Chunk Size) * (1 - De-duplication %) * (Signature Bytes)

To estimate the index memory size for a given dataset, the following variables have to be determined or estimated:

Data size - the total dataset size that is targeted for de-duplication.

Chunk size - for a fixed chunking algorithm, the size of the chunk used in the de-duplication implementation.

De-duplication percentage - based on the percentage seen during a sample run on a subset of the dataset. From our testing, a sample of 10 to 15% of the dataset provides a good estimate for the various data types we tested. The de-duplication percentage estimates are on par with similar measurement results in other studies for the given data types [4] [7].

Signature bytes - the number of bytes used for a chunk's signature hash. In most implementations a 20-byte SHA-1 hash signature for each chunk is used for the collision-resistant properties that SHA-1 provides [6].

Variable block index memory estimates are more complex since the chunk sizes are not static but form a distribution between the minimum and maximum chunk sizes set at execution time. Figure 8 shows the distribution based on testing performed using the fs-c algorithm [7] on multiple datasets. The fs-c algorithm uses the TTTD approach to variable block de-duplication. The CDC32 (content defined chunking) configuration has an expected

(average) block size of 32KB, a lower threshold (Tmin) of 8KB, and an upper threshold (Tmax) of 128KB. The threshold proportions remain consistent for the CDC16, CDC8, and CDC4 algorithms. The following table outlines the different fixed and variable algorithms used in the fs-c [7] tests.

Chunker   Type      Average Chunk Size (bytes)   Minimum Chunk Size (bytes)   Maximum Chunk Size (bytes)
Fixed8    Fixed     8192                         -                            -
Fixed16   Fixed     16384                        -                            -
Fixed32   Fixed     32768                        -                            -
CDC4      Variable  4096                         1024                         16384
CDC8      Variable  8192                         2048                         32768
CDC16     Variable  16384                        4096                         65536
CDC32     Variable  32768                        8192                         131072

Table 1: fs-c Algorithm Chunk Selection

Figure 8: FS-C Chunk Distributions (chunk-size distributions for the Office 1 and Office 3 datasets under the CDC4, CDC8, CDC16, and CDC32 chunkers)

Based on the Figure 8 distributions, the percentage of data chunks falling between the minimum and average block size is 50% to 55% of the total unique chunks, which in terms of the total data size is 20-25%. We can derive the total number of chunks from these observations. In the worst-case scenario (largest number of chunks), we assume that 25% of the data is chunked at the minimum block size and the remaining 75% of the data is chunked just above the average chunk size. In the best-case scenario (smallest number of chunks), 25% of the data is chunked at the average chunk size and the remaining 75% at the maximum chunk size.

Worst Case Total Chunks = ((0.25 * Data Size) / Min Block Size) + ((0.75 * Data Size) / Average Block Size)

Best Case Total Chunks = ((0.25 * Data Size) / Average Block Size) + ((0.75 * Data Size) / Max Block Size)

As an example, for a dataset size of 100GB (107374182400 bytes), chunking at a variable block size of 16KB (4KB lower threshold, 64KB upper threshold), with an

estimated de-duplication percentage of 25% and a signature size of 20 bytes, the memory requirement range is:

Worst Case Total Chunks = ((0.25 * 107374182400) / 4096) + ((0.75 * 107374182400) / 16384) = 11468800 chunks

Best Case Total Chunks = ((0.25 * 107374182400) / 16384) + ((0.75 * 107374182400) / 65536) = 2867200 chunks

From the worst- and best-case chunk estimates, we can now use the memory estimation formula presented earlier to estimate the minimum and maximum memory requirements for the index when running the CDC16 algorithm against the 100GB dataset.

Minimum Memory Requirement = 2867200 * (1 - 0.25) * 20 = 43008000 bytes ~ 42MB

Maximum Memory Requirement = 11468800 * (1 - 0.25) * 20 = 172032000 bytes ~ 165MB

Therefore, the memory requirements for our 100GB dataset are in the range of 42MB to 165MB.
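The estimation procedure above is simple enough to express directly in code. The following sketch (Python; the function names and structure are ours, not part of fs-c) reproduces the 100GB CDC16 example:

def estimate_chunks_variable(data_bytes, min_size, avg_size, max_size):
    # Worst case: 25% of the data at the minimum chunk size, 75% near the average.
    worst = (0.25 * data_bytes) / min_size + (0.75 * data_bytes) / avg_size
    # Best case: 25% of the data at the average chunk size, 75% at the maximum.
    best = (0.25 * data_bytes) / avg_size + (0.75 * data_bytes) / max_size
    return int(best), int(worst)

def index_memory_bytes(total_chunks, dedup_fraction, signature_bytes=20):
    # Memory = chunks * (1 - de-duplication %) * signature size (20-byte SHA-1 by default).
    return total_chunks * (1 - dedup_fraction) * signature_bytes

data = 100 * 2**30                                   # 100GB = 107374182400 bytes
best, worst = estimate_chunks_variable(data, 4 * 2**10, 16 * 2**10, 64 * 2**10)
lo = index_memory_bytes(best, 0.25)                  # 43,008,000 bytes, the ~42MB figure above
hi = index_memory_bytes(worst, 0.25)                 # 172,032,000 bytes, the ~165MB figure above
print(f"CDC16 on 100GB: {best:,}-{worst:,} chunks, index {lo:,.0f}-{hi:,.0f} bytes")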

Validation of Method

We performed experiments with small (150GB or less) and large (500GB or more) datasets, with both the fixed and variable algorithms, to test how well the memory estimation formula applies to real-world scenarios.

Fixed Block Index Memory Requirements

Dataset Size (GB)   Algorithm   Estimated Minimum Memory (MB)   Actual Memory (MB)   % Error
111.35              FIXED16     132.228125                      139.4957161          5%
111.35              FIXED8      267.5920949                     278.7025261          4%
924.41              FIXED16     1097.736875                     1034.24              6%
924.41              FIXED8      2221.507036                     2058.24              8%

Table 2: Fixed Index Memory Estimates vs. Actual

Using the fs-c [7] fixed chunking algorithms, we tested an office-type dataset extracted from a corporate office file share environment. To obtain our memory estimates we assumed the de-duplication percentage to be at or around 5% for both the small and large datasets. This percentage was obtained from a sample run of the fixed algorithm on a dataset a fraction of the size. Additionally, the SHA-1 [6] data signature size of 20 bytes was selected at execution time. Based on our assumptions of the de-duplication percentage and the parameters selected at run time (signature size, average chunk size), the memory estimates calculated from the formula presented previously were within 8% of the actual memory requirements. The estimate error depends on the de-duplication percentage assumed versus the actual, and is improved only by using a larger sample size for the de-duplication percentage estimate [30].
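A minimal check of the fixed-block form of the formula against the first row of Table 2 (Python; it assumes the roughly 5% de-duplication estimate stated above and binary GB/MB units, which appears to be how the table values were computed):

def fixed_index_estimate_mb(data_gb, chunk_bytes, dedup_fraction, signature_bytes=20):
    data_bytes = data_gb * 2**30                      # dataset size in bytes (binary GB)
    total_chunks = data_bytes / chunk_bytes           # fixed chunking: data size / chunk size
    unique_chunks = total_chunks * (1 - dedup_fraction)
    return unique_chunks * signature_bytes / 2**20    # index size in MB (binary)

# 111.35GB office dataset, FIXED16 (16KB chunks), ~5% de-duplication:
print(round(fixed_index_estimate_mb(111.35, 16384, 0.05), 6))   # 132.228125, as in Table 2

The remaining rows use the per-algorithm de-duplication percentages sampled in each run, so they are not reproduced exactly by a single assumed 5% value.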

The variable block chunking experiments again used the same datasets as the fixed experiments and assumed the chunk distribution discussed previously to obtain the estimated range for the index memory. The de-duplication percentage estimates used for the CDC16 and CDC8 algorithms were 15% and 20%, respectively. These estimates were obtained from local sample runs on the dataset. The SHA-1 data signature size was again set to 20 bytes at execution time [6]. The minimum and maximum block thresholds set by the fs-c [7] algorithms were 4KB and 64KB for the CDC16 algorithm and 2KB and 32KB for the CDC8 algorithm.

Variable Block Index Memory Requirements

Dataset Size (GB)   Algorithm   Minimum Memory (MB)   Maximum Memory (MB)   Actual Memory (MB)
111.35              CDC16       51.76035156           118.309375            115.6596375
111.35              CDC8        98.09142787           222.7                 224.4574738
924.41              CDC16       429.7062109           982.185625            741.23
924.41              CDC8        814.3394417           1848.82               1433.417511

Table 3: Variable Index Memory Estimates vs. Actual

Based on the assumptions, the chunk distributions, and the parameters set at execution time, the actual memory requirements for the variable block executions on the small and large datasets were within the estimated range for the index memory, trending toward the higher end of the range for both datasets. For the variable algorithms, the index memory estimates depend not only on the de-duplication percentage estimate but also on the best- and worst-case chunk distribution estimates.

Resource considerations for cloud instance type selection, centered on the required index memory, have been examined in relation to the chunking algorithm selected for duplicate detection. A methodology for estimating memory requirements was presented and tested against real-world datasets. From our real-world tests on the corporate file share datasets, the index memory estimates presented for both fixed and variable block algorithms provide good guidance for sizing the compute instance required to perform de-duplication at sub-file granularity. We can now proceed with our experimental evaluation of the cost tradeoffs between compute and storage when introducing de-duplication algorithms in a cloud environment.

Chapter 4: Experimental Evaluation

In our experimental evaluation of de-duplication in a cloud-based environment, we look at the following factors: the dataset size, the cloud compute instance requirements, and the length of time the data will be retained in the cloud, in order to analyze the potential cost avoidance of performing fixed and variable de-duplication detection on a given dataset. We performed our analysis on the Amazon Web Services offerings, using Elastic Compute Cloud (EC2) for the compute platform and Simple Storage Service (S3) for the storage infrastructure. The standard small and large instance types, along with the high-CPU medium instance, were used in our testing. Below is a recap of the resource specifications:

o Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of local instance storage, 32-bit platform [26]

o Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform [26]

o High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of local instance storage, 32-bit platform [26]

Amazon defines one EC2 Compute Unit (ECU) as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [26]. Additionally, Amazon's Linux AMI operating system was selected for the instance builds. The cost

analysis is based on Amazon's pricing for the US East region, where our testing was performed. We used the fs-c algorithm developed in [7], with reporting and statistics-gathering modifications, as our de-duplication engine. The fs-c algorithm has both fixed and variable chunking options, with chunk size options from 2KB to 32KB for both. The variable chunking approach is based on the two threshold two divisor algorithm [8], using Rabin fingerprinting [1] to determine the natural content boundaries. Additionally, fs-c takes an out-of-band approach to de-duplication, which allows the analysis of data in place. Our initial evaluation centered on small datasets, 300GB or less in size, extracted from a corporate file share environment. We grouped our datasets around the following data classifications:

Office data types: Microsoft Word (doc, docx), Excel (xls, xlsx), PowerPoint (ppt, pptx), Adobe Portable Document Format (pdf), rich text documents (rtf)

Database file types: Microsoft SQL master database files (mdf), Microsoft Access (mdb)

Virtual machine data files: VMware virtual machine (vmdk) files

Media files: JPEG, GIF, PNG, MP3, MP4, WAV

As a first step we performed testing on a local system with the above datasets to gauge de-duplication percentages and instance type requirements. This allowed us to determine the dataset to focus on when moving the testing to cloud resources.

Below is a summary of the results from the first dataset of each type against both the fixed and variable (CDC) algorithms using various block sizes. The algorithm names follow a format of chunk type followed by a number that indicates the average or fixed chunk size used. For example, the cdc8 algorithm uses an average chunk size of 8KB with a lower threshold of 2KB and an upper threshold of 32KB. The lower and upper bound thresholds remain proportionally consistent with the average chunk size for the other variable (CDC) algorithms. Refer to Table 1 for algorithm specifications. The local system specifications used in our initial testing were as follows:

Hardware Brand: HP DL580 G5
CPU: 2 dual-core Intel(R) Xeon(R) CPU @ 3.00GHz
Memory: 8GB
Hard Drives: 2 x 73GB 10K SAS drives, RAID 1 for OS
OS: Ubuntu 11.04 (x64)
Data Storage: EMC VNXe 3100

The datasets were stored on an EMC VNXe 3100 and accessed via NFS. Each dataset was run in isolation from the others, eliminating any competition for resources.

Dataset   Algorithm   # of Chunks   Memory Requirements (MB)   % De-duplication   Execution Time   Total Size
Office    cdc4        23582739      365.286681                 18.79%             34 min           111.35G
          cdc8        11768036      186.9730756                16.70%             32 min           111.35G
          cdc16       6063896       98.99308369                14.41%             32 min           111.35G
          cdc32       2980966       49.93786693                12.17%             33 min           111.35G
          fixed4      29205380      540.1141443                3.04%              32 min           111.35G
          fixed8      14612039      271.2054281                2.69%              32 min           111.35G
          fixed16     7313593       136.0362223                2.48%              33 min           111.35G
          fixed32     3662809       68.28364404                2.26%              32 min           111.35G
VMDK      cdc4        32066793      116.5758275                80.94%             57 min           274.08G
          cdc8        15870107      61.29639945                79.75%             64 min           274.08G
          cdc16       7952461       32.62661669                78.49%             63 min           274.08G
          cdc32       4007976       17.36854834                77.28%             64 min           274.08G
          fixed8      35923993      181.7139233                73.48%             64 min           274.08G
          fixed16     17962030      98.59985798                71.22%             63 min           274.08G
          fixed32     8981049       51.73257442                69.80%             64 min           274.08G
DB        cdc4        21235250      187.2049818                53.78%             45 min           159.77G
          cdc8        11162546      98.42767746                53.77%             44 min           159.77G
          cdc16       6015166       53.05123822                53.76%             45 min           159.77G
          cdc32       3363838       29.66763861                53.76%             44 min           159.77G
          fixed8      20940832      184.6494033                53.77%             46 min           159.77G
          fixed16     10470416      92.32470163                53.77%             47 min           159.77G
          fixed32     5235208       46.16235081                53.77%             46 min           159.77G
Media     cdc4        27980371      492.8564571                7.65%              30 min           129.20G
          cdc8        13944798      245.8149397                7.58%              35 min           129.20G
          cdc16       6990914       123.3405199                7.50%              31 min           129.20G
          cdc32       3525336       62.29155103                7.36%              34 min           129.20G
          fixed8      16978508      302.7250152                6.52%              35 min           129.20G
          fixed16     8508362       151.7843433                6.47%              35 min           129.20G
          fixed32     4279530       76.38519619                6.42%              35 min           129.20G

Table 4: Small Dataset Results

As expected, overall the variable algorithms were able to find more redundancy within each dataset type. The office dataset had the largest de-duplication percent change

between the fixed and variable block algorithms. Surprisingly, the execution time did not vary when changing the algorithm chunking granularity or between the fixed and variable block algorithms. We examined this more closely and discovered that the bottleneck was not the CPU processing the fixed or variable block chunks but the disk I/O when processing the data out-of-band. We recorded high I/O wait times during each execution, which caused the CPU to wait on the I/O to finish. This explained the consistency in execution time regardless of the algorithm. Additionally, the VMDK de-duplication percentage is the highest, based on the data redundancy inherent across similar operating system builds. The DB percentage remains the same for all the tests performed because the SQL database files were extracted from a system that had the allocation unit size set to 64K; therefore no additional duplicates would be discovered by reducing the chunk size below 64K. Finally, as expected based on our research, the more random data types, such as media formats, produced the lowest de-duplication percentages. The small dataset memory requirements are within the resources available on the small and medium cloud compute instance types. Additionally, from our local testing the office dataset provides the most interesting analysis given its range of de-duplication percentages; therefore, moving forward we focus solely on office-type datasets. Also, to ensure result consistency, we collected another office dataset of roughly the same size for our remaining small-dataset testing. After completing the initial testing on our local system, our remaining testing used Amazon's cloud resources. With the small dataset, our testing focuses on the small and medium instance types, which differ in the number of available ECUs [26]. The

dataset was transferred to the Amazon S3 storage in its original form to perform the out-of-band de-duplication testing. Our motivation for the small dataset test using cloud resources is to gauge the execution time differences between the small and medium instance types in order to analyze any cost savings. Again, all tests were run in isolation on separate instance types, and only a single test was accessing the S3 storage bucket [26] at one time.

Dataset   Algorithm   # of Chunks   Memory Required (MB)   % De-duplication   Execution Time   Total Size
Office1   cdc4        23582739      365.286681             18.79%             152 min          111.35G
          cdc8        11768036      186.9730756            16.70%             137 min          111.35G
          cdc16       6063896       98.99308369            14.41%             147 min          111.35G
          cdc32       2980966       49.93786693            12.17%             147 min          111.35G
          fixed4      29205380      540.1141443            3.04%              140 min          111.35G
          fixed8      14612039      271.2054281            2.69%              138 min          111.35G
          fixed16     7313593       136.0362223            2.48%              145 min          111.35G
          fixed32     3662809       68.28364404            2.26%              150 min          111.35G
Office2   cdc4        24453549      305.4548119            34.51%             203 min          113.61G
          cdc8        12056195      155.8396025            32.23%             206 min          113.61G
          cdc16       6073155       81.40970867            29.72%             205 min          113.61G
          cdc32       3072219       42.46005797            27.54%             212 min          113.61G
          fixed4      29825063      471.1933076            17.17%             205 min          113.61G
          fixed8      14934682      237.6557387            16.57%             210 min          113.61G
          fixed16     7489809       119.7711156            16.16%             206 min          113.61G
          fixed32     3767320       60.54580368            15.74%             214 min          113.61G

Table 5: EC2 m1.small Instance Small Dataset Results

Dataset   Algorithm   # of Chunks   Memory Required (MB)   % De-duplication   Execution Time   Total Size
Office1   cdc4        23582739      365.286681             18.79%             50 min           111.35G
          cdc8        11768036      186.9730756            16.70%             50 min           111.35G
          cdc16       6063896       98.99308369            14.41%             49 min           111.35G
          cdc32       2980966       49.93786693            12.17%             46 min           111.35G
          fixed4      29205380      540.1141443            3.04%              50 min           111.35G
          fixed8      14612039      271.2054281            2.69%              40 min           111.35G
          fixed16     7313593       136.0362223            2.48%              46 min           111.35G
          fixed32     3662809       68.28364404            2.26%              46 min           111.35G
Office2   cdc4        24453549      305.4548119            34.51%             67 min           113.61G
          cdc8        12056195      155.8396025            32.23%             68 min           113.61G
          cdc16       6073155       81.40970867            29.72%             69 min           113.61G
          cdc32       3072219       42.46005797            27.54%             69 min           113.61G
          fixed4      29825063      471.1933076            17.17%             69 min           113.61G
          fixed8      14934682      237.6557387            16.57%             67 min           113.61G
          fixed16     7489809       119.7711156            16.16%             70 min           113.61G
          fixed32     3767320       60.54580368            15.74%             70 min           113.61G

Table 6: EC2 c1.medium Instance Small Dataset Results

Based on the results of the cloud testing on the small office datasets, the execution time difference between the small and medium instances is in line with the cost difference based on Amazon's EC2 pricing at the time of this publication. Also, since the memory resources are the same on the small and medium instance types, a more aggressive algorithm cannot be used as a differentiator in terms of space and cost savings. Therefore there is little to no cost savings when comparing the execution times and the related compute cost differences of the small and medium size instances on a small dataset. One interesting aspect of this testing is the relative consistency in the percentage of additional redundancy detected between the fixed and variable block algorithms for both office datasets.

Transitioning to the larger datasets of 500GB and more, we again focus our attention on an office-type dataset extracted from a corporate file share environment. The goal of the large dataset is to examine more aggressive algorithms that exhaust the memory resources available in the small and medium instance types for the global chunk index. This allows us to explore the cost model and the tradeoffs associated with choosing a more aggressive algorithm on a larger instance type versus a less aggressive algorithm on a smaller instance type over varying storage durations. Using a dataset size of 764GB, the fixed16 and cdc16 algorithms were the most aggressive able to run on the small and medium instance types within their 1.7GB memory constraint, after memory for the operating system and the execution of the de-duplication algorithm was allocated. The execution times within a particular instance type are again dominated by the long I/O wait times experienced while processing the data. We again see notable increases in duplicate detection with the variable algorithms over the fixed. When using CDC4, a more aggressive algorithm, on the larger instance, an additional 5 percent of redundancy was detected over that of the CDC16 algorithm on the smaller instances. This translates into approximately 41GB of additional redundant data eliminated. The execution times on the large instance with the more aggressive algorithms are slightly longer compared with the medium instance.

Instance    Chunker   # of Chunks   Memory Requirements (MB)   % De-duplication   Execution Time (min)   Total Size
m1.small    fixed16   43828883      835.9696007                12.66%             1552.283333            764.04G
            cdc16     30729169      586.1123848                22.00%             1515.566667            764.04G
c1.medium   fixed16   43828883      835.9696007                12.66%             875.7333333            764.04G
            cdc16     30729169      586.1123848                22.00%             896.2833333            764.04G
m1.large    fixed16   43828883      835.9696007                12.66%             916.5666667            764.04G
            fixed8    87026664      1659.901886                13.10%             964.7333333            764.04G
            cdc16     30729169      586.1123848                22.00%             940.5333333            764.04G
            cdc8      58972915      1124.819088                25.05%             853.1333333            764.04G
            cdc4      116120895     2214.830303                27.37%             959.8833333            764.04G

Table 7: EC2 Instance Large Dataset Results

Using these results we are now able to construct and analyze a cost model for the tradeoff between selecting a smaller instance type with a less aggressive algorithm and selecting a larger instance type with a more aggressive algorithm. We also looked at the cost model when storing the data for varying lengths of time, from one month to one year, and the effect the storage duration has on the cost savings and on the choice of instance type. To recap, the factors of our cost analysis are: instance type (m1_small, c1_medium, m1_large), de-duplication algorithm (fixed16, fixed8, cdc16, cdc8, cdc4), and storage duration (1 month, 3 months, 6 months, 1 year). Comparisons will be performed using the small and medium instance types against the large instance type. As discovered with the small dataset testing, the cost savings are nonexistent or insignificant when comparing the small and medium instances against each other. To start, we will look at the cost breakdowns of the Amazon EC2 and S3 offerings. The Amazon EC2 compute costs are based on per-instance-hour usage and data

transfer in and out of the EC2 environment. Partially consumed instance hours are billed as full hours, so all execution times will be rounded up to the next hour for cost comparison. As for the data transfer into the EC2 environment, this cost is excluded from our analysis since it does not change with the instance type selected. Amazon's S3 cost model is based on the following factors: standard storage pricing, which is the pricing for the amount of storage used; request pricing, the cost for the number of PUT, COPY, POST, LIST, or GET operations performed on your S3 storage bucket; and data transfer cost, the cost to transfer data into and out of S3. Since we are using EC2 to communicate with S3, there is no data transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same region; in our case both EC2 and S3 are within the East region [26]. Additionally, we focus on the standard storage pricing as opposed to the reduced redundancy storage, which introduces a risk of data loss. In the case of de-duplication, data protection is critical due to the large percentage of files that can reference a single data block. The reduced redundancy option is available as a lower-cost option for data that is reproducible [26]. Below are two tables with the breakdown of the EC2 and S3 pricing for the East region at the time of publication.

AWS EC2 Compute Pricing
Type                 $ Cost/Hr
Small (m1_small)     0.08
Medium (c1_medium)   0.165
Large (m1_large)     0.32

Table 8: AWS EC2 Pricing

AWS S3 Storage Pricing
Tiers                    $ Cost/GB
First 1TB / Month        0.125
Next 49TB                0.11
Next 450TB               0.095
Request Cost per 1,000   0.01

Table 9: AWS S3 Pricing

Based on the execution times seen on the three instance types, we begin by breaking out the compute and storage costs associated with the small instance type running the CDC16 algorithm. For the compute cost we look at the execution time, which is 1552.283333 minutes, translating to 26 hours after rounding up to the nearest hour. The compute cost is a straight calculation: 26 hours multiplied by the per-hour cost of the small instance type of $0.08 per hour, which equals $2.08. The storage cost has a couple of factors to take into account. One is the storage cost of $0.125 per GB for the first TB stored. After running the CDC16 algorithm, 22% redundant data was removed, leaving approximately 596GB, which has an associated cost of $74.50 per month. The second component of the storage cost is the request pricing. The request pricing is based on PUT, COPY, POST, LIST, or GET requests. The pricing for PUT, COPY, POST, or LIST requests is $0.01 per 1,000 requests, while GET and other requests are $0.01 per 10,000 requests [26]. In order to calculate the number of requests, we need to determine the number of files that make up our dataset, which translates into the number of object PUT requests. The 764.04GB dataset comprises 450,990 files and directories, which translates to an estimated 450,990 PUT operations with an associated cost of $4.51. Using these figures we are able to calculate the cost for one-month, three-month, six-month, and one-year storage periods. The calculations are the same for the medium and large instance types, with the exception of the values for execution time and de-duplication percentage. The request cost remains the same, as the dataset and number of files remain consistent across the instance tests.
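As a worked illustration of the cost arithmetic above, the following sketch (Python) reproduces the CDC16-on-m1.small example. The function and constant names are ours, and the prices are simply those quoted in Tables 8 and 9, not current AWS rates.

import math

# Pricing from Tables 8 and 9 (US East, at the time of the thesis).
EC2_PER_HOUR = {"m1_small": 0.08, "c1_medium": 0.165, "m1_large": 0.32}
S3_PER_GB_MONTH = 0.125          # first 1TB storage tier
PUT_COST_PER_1000 = 0.01

def dedup_cloud_cost(instance, exec_minutes, dataset_gb, dedup_pct, num_puts, months):
    """Compute cost for one de-duplication run plus storage cost for the retention period."""
    compute = math.ceil(exec_minutes / 60) * EC2_PER_HOUR[instance]   # partial hours bill as full hours
    stored_gb = dataset_gb * (1 - dedup_pct)                          # capacity left after de-duplication
    storage = stored_gb * S3_PER_GB_MONTH * months
    requests = (num_puts / 1000) * PUT_COST_PER_1000                  # one PUT per file/directory
    return compute, storage + requests

# CDC16 on m1.small (Table 7): 1552.28 min, 22% de-duplication, 450,990 objects, 1 month.
compute, storage = dedup_cloud_cost("m1_small", 1552.283333, 764.04, 0.22, 450990, 1)
print(f"compute ${compute:.2f}, one-month storage ${storage:.2f}")
# -> $2.08 and ~$79.00; Table 10 lists $79.01, which uses the rounded 596GB figure.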

Algorithm / Instance       Cost            1 Month   3 Months   6 Months   1 Year
CDC16 on Small Instance    Compute Cost    $2.08     $2.08      $2.08      $2.08
                           Storage Cost    $79.01    $237.03    $474.06    $948.12
CDC16 on Medium Instance   Compute Cost    $2.48     $2.48      $2.48      $2.48
                           Storage Cost    $79.01    $237.03    $474.06    $948.12
CDC8 on Large Instance     Compute Cost    $4.80     $4.80      $4.80      $4.80
                           Storage Cost    $76.01    $228.03    $456.06    $912.12
CDC4 on Large Instance     Compute Cost    $5.12     $5.12      $5.12      $5.12
                           Storage Cost    $73.76    $221.28    $442.56    $885.12

Table 10: Instance Cost Assessment

With the varying storage durations, the following assumptions were made: the compute cost was only calculated for the initial data de-duplication process; any subsequent data accesses are not taken into account, and the data access frequency is independent of the instance type. Another aspect not taken into account is additional data being added to or deleted from the cloud instance over the storage duration. Data additions, based on our research, have a positive impact on the cost savings seen with the more aggressive algorithms. Looking at the cost savings of the large and small instances, the use of the large instance type with the aggressive CDC4 algorithm over a one-year storage time frame produces a cost savings of 6.15%, or $58.37, compared with running CDC16 on the smaller instance type. The shorter storage timeframes also produce cost savings, ranging from 2.5% for the first month to 5.82% at the six-month mark. When comparing CDC8 on the large instance, a cost savings is not realized immediately, with the first-month savings at less than 1%. Therefore, in order to maximize the cost savings in the