
Cloud De-duplication Cost Model

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Christopher Scott Hocker

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Master's Examination Committee:

Dr. Gagan Agrawal, Advisor

Dr. Christopher Stewart

Copyright by Christopher Scott Hocker 2012

Abstract

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing the scalability needed to meet the elastic nature of application and infrastructure requirements. Using de-duplication algorithms within cloud resources is the next logical step to increase the efficiency and reduce the cost of cloud computing. Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms. Since the main factor impacting the computing cost is the memory availability in different types of instances, we also develop a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Dedication

Dedicated to those who supported me throughout my academic career: my Wife, Parents, Brother, Sister, and Friends.

Acknowledgments

First, I would like to thank my advisor, Dr. Gagan Agrawal, for challenging me and providing guidance from the beginning of my time at Ohio State. The support and advice he provided was invaluable. Additionally, I would like to thank my thesis committee member, Dr. Christopher Stewart, for his time and participation during this work. I would also like to thank my Wife for her support and understanding of the long hours during this work and throughout my entire academic career. Finally, I would like to thank the rest of my support system, my Parents, Brother, Sister, and Friends, who were always there to provide a word of encouragement and motivation.

Vita

Vandalia Butler High School
B.S. CS, Wright State University
2008 to present: M.S. CSE, The Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: De-duplication Algorithms
Chapter 3: Memory Prediction
Chapter 4: Experimental Evaluation
Chapter 5: Related Research
References

List of Tables

Table 1: fs-c Algorithm Chunk Selection
Table 2: Fixed Index Memory Estimates vs. Actual
Table 3: Variable Index Memory Estimates vs. Actual
Table 4: Small Dataset Results
Table 5: EC2 m1.small Instance Small Dataset Results
Table 6: EC2 c1.medium Instance Small Dataset Results
Table 7: EC2 Instance Large Dataset Results
Table 8: AWS EC2 Pricing
Table 9: AWS S3 Pricing
Table 10: Instance Cost Assessment
Table 11: m1_small vs. m1_large Instance Cost
Table 12: c1_medium vs. m1_large Instance Cost

List of Figures

Figure 1: Out-of-Band vs. In-band De-duplication
Figure 2: Basic Sliding Window Algorithm [8]
Figure 3: TTTD Pseudo Code [8]
Figure 4: Chunk Distribution of TTTD Algorithm [9]
Figure 5: TTTD-S Chunk Distribution Improvements [9]
Figure 6: TTTD-S Algorithm Pseudo Code
Figure 7: De-duplication Ratio and Percent Savings [19]
Figure 8: FS-C Chunk Distributions

Chapter 1: Introduction

De-duplication algorithms have been used extensively in the backup and recovery realm to exploit the temporal data redundancy created by the versioning nature of backup applications. More recently, de-duplication algorithms have progressed into the primary storage area, where more spatial redundancy exists; in either case they increase the efficiency of the usable storage capacity. In parallel, another industry trend is the use of cloud resources to provide a utility model for software and infrastructure services. Cloud resources decrease the provisioning time for applications and infrastructure while increasing the scalability needed to meet the elastic nature of application and infrastructure requirements.

As companies begin to transition their data and infrastructure to the cloud, the cost-saving methods they are accustomed to, such as de-duplication, should transition as well. The increased efficiency in usable capacity gained through de-duplication translates into a positive impact on the cloud pay-as-you-go model. De-duplication does come with a tradeoff of additional compute resources required to analyze the data for duplicates, so selecting the right instance type to run a given de-duplication algorithm is an important aspect. Therefore, we examine the resources and cost factors related to cloud environments and how de-duplication can be effectively integrated.

Cloud environments involve pricing models for both computing and storage. In this work, we examine the tradeoffs between computing costs and reduction in storage costs (due to higher de-duplication) for a number of popular de-duplication algorithms.

Since the main factor impacting the computing cost is the memory availability in different types of instances, we also develop a methodology for estimating the memory requirements for executing a given algorithm on a particular dataset. Through experiments, we show that running a more aggressive de-duplication algorithm on a larger cloud compute instance can maximize cost savings compared with running a less aggressive algorithm on a smaller instance type. In some cases the dataset size does not warrant a larger instance type to run more granular de-duplication algorithms, since the index memory requirements are satisfied by the entry-level instance types. In these situations there is no benefit to choosing a larger instance type in an effort to reduce cloud resource cost.

Thesis Statement

Integrating de-duplication effectively and efficiently into a cloud environment requires an understanding of the resource requirements, specifically the memory requirements and the tradeoff in compute cost for processing data for duplicates at various levels of granularity.

Contributions

This thesis makes the following contributions:

1. Proposes a methodology to predict the required cloud instance type, based on memory requirements, to run popular de-duplication algorithms on a given dataset.

2. Analyzes cloud compute requirements for running de-duplication algorithms at varying chunk granularity.

3. Evaluates cost factors associated with running de-duplication in a cloud environment, including compute instance types, de-duplication algorithm chunk granularity, and data storage durations.

Chapter 2: De-duplication Algorithms

The implementation of data de-duplication technologies varies in terms of de-duplication placement, the timing of the de-duplication of data, and the granularity at which data is analyzed to find duplicate data. The placement of the de-duplication process can occur at the client or the target device [4]; additionally, a hybrid approach using both the client and target also exists. The timing of the de-duplication process is either in-band, as the data is received/sent, or out-of-band at a scheduled time.

Figure 1: Out-of-Band vs. In-band De-duplication

The placement and timing are key components of de-duplication algorithms; however, in this research we focus on client-based de-duplication algorithms and out-of-band timing, which allow us to integrate de-duplication into an existing cloud environment and analyze the data in place. We will explore in more detail the granularity at which duplicates are detected, specifically at fixed and variable block levels. Finally, we will provide a brief explanation of de-duplication ratios and the contributing factors to the ratio.

Duplicate Detection

For duplicate detection there are three main approaches: whole file (often called single instance storage), sub-file chunking (comprised of fixed and variable block hashing), and delta encoding.

For whole file detection, entire files are given a hash signature using MD5 or SHA-1 [6]. Files with identical hash signatures are assigned a pointer to the single file instance previously stored. In certain algorithms a byte-by-byte comparison is performed to eliminate the potential for hash collisions, which is often a concern with hash-based comparisons [27]. Lower de-duplication ratios are generally obtained because the unit of matching is large; any small change in the file alters the hash and breaks any previous match.

The second and more popular approach is sub-file chunking [4]; fixed and variable block hashing are the two types. Fixed block de-duplication chunks and hashes a byte stream based on a fixed block size. The hash signatures are stored and referenced in a global index, which is implemented using a bloom filter type data structure to quickly identify new unique block segments [14]. If the block signature already exists in the index, a pointer to the existing block is created; otherwise the signature is stored and the block is written to disk. In contrast, variable block algorithms use methods such as Rabin fingerprinting [1], a hashing algorithm that uses a sliding window to determine the natural block boundaries with the highest probability of matching other blocks. Variable block algorithms still employ the bloom filter based data structure for the in-memory index [11].
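To make the exact matching lookup concrete, the following is a minimal Python sketch of fixed block duplicate detection. It is not the implementation used in this work: the chunk size, the file path, and the use of an ordinary set in place of a bloom-filter-backed index are illustrative assumptions.

import hashlib

def fixed_block_dedup(path, chunk_size=8192):
    # Scan a file in fixed-size chunks and report total vs. unique bytes.
    index = set()                 # stands in for the bloom-filter-backed chunk index
    unique_bytes = total_bytes = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
            signature = hashlib.sha1(chunk).digest()   # 20-byte SHA-1 signature
            if signature not in index:                 # unique chunk: record and "store" it
                index.add(signature)
                unique_bytes += len(chunk)
            # otherwise a duplicate: a real system would store only a pointer
    return total_bytes, unique_bytes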

Variable block hashing proves to be a more efficient approach compared to fixed and whole file hashing, even in the presence of the slight variations or offsets that exist due to modifications in similar files and blocks [4].

Exact and similarity matching techniques are used in the sub-file hashing algorithms. Exact matching algorithms examine the chunk hash index for an exact signature match. For exact matching algorithms, the block size has a direct impact on the hash index size, which can present a problem when storing the index in memory [29]. An example was provided in [2], where the index required for 1 Petabyte of de-duplicated data, assuming an average block size of 8KB, would require 128GB of memory at a minimum. Similarity matching techniques address the index size by increasing the block size, to 16MB in [2], which in turn reduces the index size to 4GB for the 1 Petabyte of data. A similarity signature approach consists of a number of block signatures based on a subset of the chunk bytes. If a similarity signature matches more than some threshold of block signatures, then there is a reasonable probability that the two chunks will share common block-based signatures [2]. Thus a match is found, and the new similarity chunk is compared and de-duplicated against the similar chunks [29]. The tradeoff with the similarity matching technique is that de-duplication performance is highly dependent upon the speed of the de-duplication repository storage, as the repository is referenced for similarity block signature matches. Also, since each chunk is only compared against a limited number of other chunks in similarity matching, occasionally duplicates are stored [29].

Finally, delta encoding (also called data differencing) processes files against a reference file for differences, storing the deltas in patch files [17].

Selecting the previously stored reference file is a key operation in delta encoding algorithms; it is often selected based on a fingerprinting technique similar to that of the whole file and sub-file chunking algorithms [4].

The sub-file chunking approach with exact matching, as used in variable block algorithms, is able to detect the varying offsets of data blocks. This addresses the boundary shifting concerns of fixed block or whole file algorithms. The tradeoff comes in terms of the additional resources required to maintain the metadata associated with the increased number of chunks seen with variable block approaches. Additionally, index lookups increase during the variable chunk detection process. Even with the increased resource requirements, the majority of current algorithms use variable block sub-file hashing with exact matching to maximize efficiency and overall de-duplication ratios.

De-duplication Implementations

The process is straightforward for whole file hashing and fixed block hashing: perform the hash index lookup based on the SHA-1 [6] or MD5 hash value of the file or fixed block to determine if the data is unique. If a duplicate is detected in the hash lookup, optionally perform a byte-by-byte comparison, then modify the file metadata to reference the previously stored data. If the data is unique, record the hash value, perform local compression (e.g., LZ, gzip) on the unique block, and store only that data [13].

For the variable block, otherwise known as content-based chunking, approaches there are several algorithms that vary in their implementation, overall performance, and effectiveness in identifying duplicate chunks.

The main motivation behind the variable block algorithms is elimination of the boundary shifting problem [18]. If a small modification is made to a file, the chunk boundaries for whole file and fixed block chunking shift, causing poor duplicate detection against the file or data. The low bandwidth network filesystem [18] first introduced the basic sliding window (BSW) approach, which takes three parameters as inputs: a fixed window size W, an integer divisor D, and an integer remainder R. A window of size W shifts one byte at a time from the beginning of the file to the end. Fingerprints (h) of the window contents are generated with Rabin fingerprinting; Rabin introduced the idea of detecting the natural block boundaries in a byte stream and assigning the variable chunks a signature [1]. The algorithm then tests whether (h mod D) = R; if true, a D-match has been found and the current position is set as a breakpoint for that chunk. The parameter D can be configured to make the chunk size as close to the data expectations as possible to maximize the de-duplication. The parameter R must be between 0 and D-1, and is most often configured as D-1. Figure 2 provides a visual representation of the basic sliding window approach.

Figure 2: Basic Sliding Window Algorithm [8]
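The sketch below illustrates the basic sliding window idea in Python. A simple byte-sum rolling hash is used only as a stand-in for Rabin fingerprinting, and the window, divisor, and remainder values are illustrative defaults rather than the parameters used in [18].

def bsw_breakpoints(data, window=48, divisor=2048, remainder=2047):
    # Slide a fixed-size window one byte at a time and declare a breakpoint
    # wherever (hash of window contents) mod divisor == remainder (a D-match).
    breakpoints = []
    rolling = 0
    for pos in range(len(data)):
        rolling += data[pos]                 # byte entering the window
        if pos >= window:
            rolling -= data[pos - window]    # byte leaving the window
        if rolling % divisor == remainder:   # D-match: current position becomes a breakpoint
            breakpoints.append(pos)
    return breakpoints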

Problems presented by the basic sliding window approach include the large chunk sizes generated when a match is not detected and the data is chunked at the window size. This leads to boundary shifting problems when small modifications are made, making the large chunk matches more difficult. Additional improvements resulted in the introduction of the two divisor (TD) algorithm [8]. This algorithm addressed the issue with the basic sliding window algorithm by introducing a second divisor (S) that is smaller than D, which increases the chance of a match. Both D and S are calculated at each byte shift to increase the chances of a chunk match, decreasing the number of large chunks. Using the BSW or TD algorithms, the chunk size is only upper bounded, so chunks can vary greatly in size. Small chunk sizes greatly increase the number of chunks, which is directly related to the memory overhead required for exact matching techniques [29].

The two threshold two divisor (TTTD) algorithm was developed to address the range of chunk sizes generated during duplicate detection [8]. TTTD added two threshold parameters to the BSW and TD algorithms which control the upper (Tmax) and lower (Tmin) bounds of chunk sizes [8]. Data fingerprints are not generated until the minimum byte threshold is met, addressing the overhead issues related to small chunk sizes while still addressing the boundary shifting concerns of chunking at large chunk sizes.

int p = 0;            // current position
int l = 0;            // position of last breakpoint
int backupbreak = 0;  // position of backup breakpoint

for ( ; !endoffile(input); p++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (p - l < tmin) {
        // not at minimum size yet
        continue;
    }
    if ((hash % Ddash) == Ddash - 1) {
        // secondary divisor: remember a backup breakpoint
        backupbreak = p;
    }
    if ((hash % D) == D - 1) {
        // we found a breakpoint before the maximum threshold
        addbreakpoint(p);
        backupbreak = 0;
        l = p;
        continue;
    }
    if (p - l < tmax) {
        // we have failed to find a breakpoint, but we are not at the maximum yet
        continue;
    }
    // When we reach here, we have not found a breakpoint with the main divisor
    // and we are at the threshold. If there is a backup breakpoint, use it;
    // otherwise impose a hard threshold.
    if (backupbreak != 0) {
        addbreakpoint(backupbreak);
        l = backupbreak;
        backupbreak = 0;
    } else {
        addbreakpoint(p);
        l = p;
        backupbreak = 0;
    }
}

Figure 3: TTTD Pseudo Code [8]

Additional studies found that when the maximum threshold (Tmax) of TTTD is reached, only the most recent secondary divisor (S) breakpoint is used for chunking. Therefore, all earlier secondary divisor breakpoints are not considered, causing a large distribution of chunk sizes. See Figure 4 for the chunk distribution of the TTTD algorithm.

Figure 4: Chunk Distribution of TTTD Algorithm [9]

The chunk distribution contains two groupings (as seen in the figure above): the first around the chunk size expected from the main divisor (D), and the second near the maximum chunk threshold, where a match was not discovered and the previous secondary divisor (S) breakpoint was used. TTTD-S [9] improves upon the large spread of the second chunk grouping by introducing a switchP value that is set to 1.6 times the expected chunk size [9]. Once the current chunk size has reached switchP, the divisors are reduced by half to shorten the match process. This in turn helps to find a breakpoint before the maximum chunk threshold is reached. Additionally, it improves the distribution and brings the second chunk grouping closer to the average chunk size detected by the main divisor. Figure 5 illustrates the improvements in the chunk distribution made by the switchP parameter introduced in the TTTD-S algorithm, further reducing the chances of boundary shifting conditions arising from data modifications.

Figure 5: TTTD-S Chunk Distribution Improvements [9]

int currp = 0, lastp = 0, backupbreak = 0;

for ( ; !endoffile(input); currp++) {
    unsigned char c = getnextbyte(input);
    unsigned int hash = updatehash(c);
    if (currp - lastp < mint) {
        continue;
    }
    if (currp - lastp > switchp) {
        switchdivisor();
    }
    if ((hash % secondd) == secondd - 1) {
        backupbreak = currp;
    }
    if ((hash % maind) == maind - 1) {
        addbreakpoint(currp);
        backupbreak = 0;
        lastp = currp;
        resetdivisor();
        continue;
    }
    if (currp - lastp < maxt) {
        continue;
    }
    if (backupbreak != 0) {
        addbreakpoint(backupbreak);
        lastp = backupbreak;
        backupbreak = 0;
        resetdivisor();
    } else {
        addbreakpoint(currp);
        lastp = currp;
        backupbreak = 0;
        resetdivisor();
    }
}

Figure 6: TTTD-S Algorithm Pseudo Code

Additional algorithms exist that use a hybrid approach, incorporating variable and fixed block techniques as well as small chunk merging techniques to reduce the number of small chunks, thereby reducing the associated overhead. Compression is also often used in conjunction with de-duplication to increase storage space utilization.

Data de-duplication algorithms have been extensively researched. The techniques available today vary depending on the de-duplication placement, the timing of the detection process, and the granularity at which duplicates are discovered [4]. Regardless of the technique, the overall effectiveness of any de-duplication algorithm remains data dependent [30]. The fs-c algorithm [7] used in our research is based on the TTTD algorithm, using Rabin [1] fingerprinting to generate the natural chunk boundaries.

De-duplication Savings

In addition to the inner workings of the de-duplication algorithms, the data characteristics help in understanding the space savings obtained by the various algorithms. Several factors, such as data type, scope of the de-duplication, and data storage period, play a role in overall de-duplication savings. De-duplication savings are often stated as a ratio of the number of input bytes to the de-duplication process divided by the number of bytes of output [19]. Figure 7 depicts the conversion of de-duplication ratios to percentages. In our studies we will use percentages to eliminate any confusion and make the overall savings more apparent.
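For reference, the relationship depicted in Figure 7 can be written directly; the byte counts below are made-up values used only to illustrate the conversion.

def dedup_ratio(bytes_in, bytes_out):
    # ratio of input bytes to output bytes, e.g. 4.0 is usually written as 4:1
    return bytes_in / bytes_out

def percent_savings(bytes_in, bytes_out):
    # fraction of the input that did not have to be stored
    return (1.0 - bytes_out / bytes_in) * 100.0

# Example: 1000 GB presented to the de-duplication process and 250 GB stored
# gives a 4:1 ratio, which is the same as 75% space savings.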

Figure 7: De-duplication Ratio and Percent Savings [19]

Data file types are one component that has an impact on de-duplication savings expectations. For example, files generated by humans in applications such as text documents, spreadsheets, and presentations often contain a large amount of redundant data, while data generated by a computer system, such as images, media, and archived files, often has less redundancy due to the random nature of the data [19]. The scope of de-duplication refers to the range of datasets examined during duplicate detection. For example, global de-duplication allows detection of duplicates across multiple data sources, which can span multiple storage systems or locations [19]. Conversely, de-duplication across just a single appliance or within a single client's data only looks at the data contained within that appliance or client, creating silos of de-duplication stores. In general, the larger the data scope for duplicate detection, the higher the expected space savings. Data storage periods affect the de-duplication savings by increasing the chance of exploiting temporal redundancy.

For example, in a backup type scenario, where temporal data is accumulated over time due to the versioning nature of backup applications, ratios are expected to be higher. In a primary storage scenario, spatial data exists across a broad spectrum of data types, which leads to lower de-duplication ratios overall.

This chapter has outlined data de-duplication approaches in terms of de-duplication placement, timing, and the granularity at which data is analyzed to find duplicates. These approaches, combined with the algorithms outlined above, provide a foundation for investigating de-duplication resource considerations within a cloud environment.

Chapter 3: Memory Prediction

To examine the tradeoffs between compute and storage cost with the addition of de-duplication, we need a method to estimate the instance type required to execute a given de-duplication algorithm. Since the main factor impacting the computing cost is the memory availability in different types of instances, we developed a methodology for predicting the memory requirements for executing a given algorithm on a particular dataset.

Estimation Method

With de-duplication there is no one-size-fits-all configuration. Depending on the application and the resources available, certain algorithms might be more effective than others. One resource consideration is the total index memory required to store the index of unique data signatures. For both fixed and variable block exact matching implementations, an in-memory index is used for data signature lookups when determining duplicate blocks. Some techniques use similarity signatures and increase the chunk size to control the index memory size, the tradeoff being an increased reliance on the speed and responsiveness of the de-duplication store for data comparisons. Therefore, our focus when utilizing cloud resources is on the exact matching techniques, where the memory size of the index is a concern. Providing a means to estimate the index size is important when sizing system requirements for a de-duplication implementation.

Memory size estimates for both fixed and variable block algorithms follow the same formula, varying only in how the specific variables are derived. We provide the following formula for the basic memory requirement estimate:

Memory Size Estimate = (Data Size / Chunk Size) * (1 - De-duplication %) * (Signature Bytes)

To estimate the index memory size for a given dataset, the following variables have to be determined or estimated:

Data size - the total dataset size that is targeted for de-duplication.

Chunk size - for a fixed chunking algorithm, the size of the chunk used in the de-duplication implementation.

De-duplication percentage - based on the percentage seen during a sample run on a subset of the dataset. From our testing, a sample size of 10 to 15% provides a good sample for the various data types we tested. The de-duplication percentage estimates are on par with similar measurement results in other studies for the given data types [4] [7].

Signature bytes - the number of bytes used for a chunk's signature hash. In most implementations a 20 byte SHA-1 hash signature for each chunk is used for the collision resistant properties that SHA-1 provides [6].

Variable block index memory estimates are more complex since the chunk sizes are not static but form a distribution between the minimum and maximum chunk sizes set at execution time. Figure 8 shows the distribution based on the testing performed using the fs-c algorithm [7] on multiple datasets. The fs-c algorithm uses the TTTD approach to variable block de-duplication.

The CDC32 (content defined chunking) algorithm has an expected (average) block size of 32KB, a lower threshold (Tmin) of 8KB, and an upper threshold (Tmax) of 128KB. The threshold proportions remain consistent for the CDC16, CDC8, and CDC4 algorithms. The following table outlines the different fixed and variable algorithms used in the fs-c algorithm [7] tests.

Chunker | Type     | Average Chunk Size (bytes) | Minimum Chunk Size (bytes) | Maximum Chunk Size (bytes)
Fixed8  | Fixed    | 8192                       | -                          | -
Fixed16 | Fixed    | 16384                      | -                          | -
Fixed32 | Fixed    | 32768                      | -                          | -
CDC4    | Variable | 4096                       | 1024                       | 16384
CDC8    | Variable | 8192                       | 2048                       | 32768
CDC16   | Variable | 16384                      | 4096                       | 65536
CDC32   | Variable | 32768                      | 8192                       | 131072

Table 1: fs-c Algorithm Chunk Selection

Figure 8: FS-C Chunk Distributions (percentage of chunks by block size, from the minimum to the maximum, for the Office 1 and Office 3 datasets under the CDC32, CDC16, CDC8, and CDC4 chunkers)

Based on the Figure 8 distributions, the percentage of data chunks between the minimum and average block size is 50% to 55% of the total unique chunks, which in terms of the total data size is 20-25%. We can derive the total number of chunks based on these observations. In the worst case scenario (largest number of chunks), we assume that 25% of the data is chunked at the minimum block size and the remaining 75% of the data chunks just above the average chunk size. In the best case scenario (smallest number of chunks), 25% of the data chunks at the average chunk size and the remaining 75% at the maximum chunk size.

Worst Case Total Chunks = ((0.25 * DataSize) / Min Block Size) + ((0.75 * DataSize) / Average Block Size)

Best Case Total Chunks = ((0.25 * DataSize) / Average Block Size) + ((0.75 * DataSize) / Max Block Size)

As an example, for a dataset size of 100GB (107,374,182,400 bytes), chunking at a variable block size of 16KB (4KB lower threshold, 64KB upper threshold), with an estimated de-duplication percentage of 25% and a signature size of 20 bytes, the memory requirement range is:

Worst Case Total Chunks = ((0.25 * 107,374,182,400) / 4096) + ((0.75 * 107,374,182,400) / 16384) = 6,553,600 + 4,915,200 = 11,468,800 Chunks

Best Case Total Chunks = ((0.25 * 107,374,182,400) / 16384) + ((0.75 * 107,374,182,400) / 65536) = 1,638,400 + 1,228,800 = 2,867,200 Chunks

From the worst and best case chunk estimates, we can now utilize the memory estimation formula presented earlier to estimate the minimum and maximum memory requirements for the index when running the CDC16 algorithm against the 100GB dataset.

Minimum Memory Requirements = 2,867,200 * (1 - 0.25) * (20) = 43,008,000 bytes ~ 42MB

Maximum Memory Requirements = 11,468,800 * (1 - 0.25) * (20) = 172,032,000 bytes ~ 165MB

Therefore, the memory requirements for our 100GB dataset are in the range of approximately 42MB to 165MB.
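The estimation procedure above can be collected into a short Python sketch. It simply encodes the formulas from this chapter; the assumption that Tmin and Tmax sit at one quarter and four times the average chunk size mirrors the fs-c parameter proportions described earlier.

def index_memory_fixed(data_size, chunk_size, dedup_pct, sig_bytes=20):
    # Basic index memory estimate for a fixed chunk size, in bytes.
    return (data_size / chunk_size) * (1 - dedup_pct) * sig_bytes

def index_memory_variable(data_size, avg_chunk, dedup_pct, sig_bytes=20):
    # Best/worst case index memory range (in bytes) for a TTTD-style variable
    # chunker with Tmin = avg/4 and Tmax = 4*avg, using the observed
    # 25% / 75% split of the data between the small and large chunk groupings.
    t_min, t_max = avg_chunk // 4, avg_chunk * 4
    worst_chunks = 0.25 * data_size / t_min + 0.75 * data_size / avg_chunk
    best_chunks = 0.25 * data_size / avg_chunk + 0.75 * data_size / t_max
    per_chunk = (1 - dedup_pct) * sig_bytes
    return best_chunks * per_chunk, worst_chunks * per_chunk

# The 100GB CDC16 example: 16KB average chunks, 25% estimated de-duplication.
low, high = index_memory_variable(100 * 2**30, 16 * 1024, 0.25)
# low is 43,008,000 bytes and high is 172,032,000 bytes, the roughly
# 42MB to 165MB index memory range worked out above.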

Validation of Method

We performed experiments with small (150GB or less) and large (500GB or more) datasets using both the fixed and variable algorithms to test how well the memory estimation formula applies to real world scenarios.

Fixed Block Index Memory Requirements

Table 2: Fixed Index Memory Estimates vs. Actual (Dataset Size (GB), Algorithm, Estimated Minimum Memory (MB), Actual Memory (MB), % Error)

Using the fs-c [7] fixed chunking algorithms, we tested an office type dataset extracted from a corporate office file share environment. To obtain our memory estimates we assumed the de-duplication percentage to be at or around the 5% mark for both the small and large datasets. This percentage was obtained from a sample run of the fixed algorithm on a dataset a fraction of the size. Additionally, the SHA-1 [6] data signature size of 20 bytes was selected at execution time. Based on our assumptions of the de-duplication percentage and the parameters selected at run time (signature size, average chunk size), the memory estimates calculated from the formula presented previously were within 8% of the actual memory requirements. The estimate error is dependent on the assumed versus actual percentage of de-duplication, and is only improved by using a larger sample size in the de-duplication percentage estimate [30].

The variable block chunking experiments again used the same dataset as the fixed block tests and assumed the chunk distribution discussed previously to obtain the estimated range for the index memory. The de-duplication percentage estimates used for the CDC16 and CDC8 algorithms were 15% and 20%, respectively. These estimates were obtained from local sample runs on the dataset. The SHA-1 data signature size was again set to 20 bytes at execution time [6]. The minimum and maximum block thresholds set by the fs-c [7] algorithms were 4KB and 64KB for the CDC16 algorithm and 2KB and 32KB for the CDC8 algorithm.

Variable Block Index Memory Requirements

Table 3: Variable Index Memory Estimates vs. Actual (Dataset Size (GB), Algorithm, Minimum Memory (MB), Maximum Memory (MB), Actual Memory (MB))

Based on the assumptions, the chunk distributions, and the parameters set at execution time, the actual memory requirements for the variable block executions on the small and large datasets were within the estimated range for the index memory, trending toward the higher end of the range for both datasets. For the variable algorithms, the index memory estimates are based not only on the de-duplication percentage estimate but also on the best and worst case chunk distribution estimates.

Resource considerations regarding cloud instance type selection, centered on the required index memory, have been examined in relation to the chunking algorithm selected for duplicate detection. A methodology for estimating memory requirements was presented and tested against real world datasets. From our tests performed on the corporate file share datasets, the index memory estimates presented for both fixed and variable block algorithms provide good guidance for sizing the compute instance required to perform de-duplication at sub-file granularity. We can now proceed with our experimental evaluation of the cost tradeoffs between compute and storage when introducing de-duplication algorithms in a cloud environment.

Chapter 4: Experimental Evaluation

In our experimental evaluation of de-duplication in a cloud-based environment, we look at the following factors: the dataset size, the cloud compute instance requirements, and the length of time the data is going to be retained in the cloud, in order to analyze the potential cost avoidance from performing fixed and variable de-duplication detection on a given dataset. We performed our analysis on the Amazon Web Services offerings, using Elastic Compute Cloud (EC2) for the compute platform and Simple Storage Service (S3) for the storage infrastructure. The standard small and large instance types, along with the high-CPU medium instance, were used in our testing. Below is a recap of the resource specifications:

o Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of local instance storage, 32-bit platform [26]

o Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform [26]

o High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of local instance storage, 32-bit platform [26]

Amazon defines one EC2 Compute Unit (ECU) as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [26]. Additionally, Amazon's Linux AMI operating system was selected for the instance builds.

The cost analysis is based on Amazon's pricing for the US East region, where our testing was performed.

We used the fs-c algorithm developed by [7], with reporting and statistics-gathering modifications, as our de-duplication engine. The fs-c algorithm has both fixed and variable chunking options. Chunk size options vary from 2KB to 32KB for both fixed and variable algorithms. The variable chunking approach is based on the two threshold two divisor algorithm [8], using Rabin fingerprinting [1] to determine the natural content boundaries. Additionally, the fs-c approach is an out-of-band approach to de-duplication, which allows the analysis of data in place.

Our initial evaluation centered on small datasets extracted from a corporate file share environment that were 300GB or less in size. We grouped our datasets into the following data classifications:

Office data types: Microsoft Word (doc, docx), Excel (xls, xlsx), PowerPoint (ppt, pptx), Adobe Portable Document Format (pdf), rich text documents (rtf)

Database file types: Microsoft SQL master database files (mdf), Microsoft Access (mdb)

Virtual machine data files: VMware virtual machine (vmdk) files

Media files: JPEG, GIF, PNG, MP3, MP4, WAV

As a first step we performed testing on a local system on the above datasets to gauge de-duplication percentages and instance type requirements. This allowed us to determine the dataset to focus on when moving the testing to cloud resources.

Below is a summary of the results from the first dataset of each type against both fixed and variable (CDC) algorithms using various block sizes. The algorithm names follow a format of chunk type followed by a number that indicates the average or fixed chunk size used. For example, the cdc8 algorithm uses an average chunk size of 8KB with a lower threshold of 2KB and an upper threshold of 32KB. The lower and upper bound thresholds remain proportionally consistent with the average chunk size for the other variable (CDC) algorithms. Refer to Table 1 for algorithm specifications. The local system specifications used in our initial testing were as follows:

Hardware: HP DL580 G5
CPU: 2 x Dual Core Intel(R) Xeon(R) 3.00GHz
Memory: 8GB
Hard drives: 2 x 73GB 10K SAS drives, RAID 1 for OS
OS: Ubuntu (x64)
Data storage: EMC VNXe 3100

The datasets were stored on an EMC VNXe 3100 and accessed via NFS. Each dataset was run in isolation from the others, eliminating any competition for resources.

Table 4: Small Dataset Results (Algorithm, # of Chunks, Memory Requirements (MBs), % De-duplication, Execution Time, Total Size for the Office, VMDK, DB, and Media datasets under the cdc and fixed chunkers; execution times were roughly 32-34 minutes for the Office dataset, 57-64 minutes for VMDK, 44-47 minutes for DB, and 30-35 minutes for Media)

As expected, the variable algorithms overall were able to find more redundancy within each dataset type.

The office dataset had the largest de-duplication percentage change between the fixed and variable block algorithms. Surprisingly, the execution time did not vary when changing the algorithm chunking granularity or between the fixed and variable block algorithms. We examined this more closely and discovered that the bottleneck was not the CPU processing the fixed or variable block chunks but the disk I/O when processing the data out-of-band. We recorded high I/O wait times during each execution, which caused the CPU to wait on the I/O to finish. This explains the consistency of the execution time regardless of the algorithm. Additionally, the VMDK de-duplication percentage is the highest, owing to the data redundancy inherent across similar operating system builds. The DB percentage remains the same for all the tests performed because the SQL database files were extracted from a system that had the allocation unit size set to 64K; therefore no additional duplicates would be discovered by reducing the chunk size below 64K. Finally, as expected based on our research, the more random data types, such as media formats, produced the lowest de-duplication percentages.

The small dataset memory requirements are within the resources available on the small and medium cloud compute instance types. Additionally, from our local testing the office dataset provides the most interesting analysis given its range of de-duplication percentages; therefore, moving forward we will focus solely on office type datasets. Also, to ensure result consistency, we collected another office dataset of roughly the same size for our remaining small dataset testing.

After completing the initial testing on our local system, our remaining testing used Amazon's cloud resources. With the small dataset, our testing focused on the small and medium instance types, which differ in the amount of available ECUs [26].

The dataset was transferred to Amazon S3 storage in its original form to perform the out-of-band de-duplication testing. Our motivation for the small dataset test using cloud resources is to gauge the execution time differences between the small and medium instance types and analyze any cost savings. Again, all tests were run in isolation on separate instance types, and only a single test accessed the S3 storage bucket [26] at one time.

Table 5: EC2 m1.small Instance Small Dataset Results (Algorithm, # of Chunks, Memory Required (MBs), % De-duplication, Execution Time, Total Size; execution times ranged from 137 to 152 minutes on Office1 and from 203 to 214 minutes on Office2 across the cdc and fixed chunkers)

Table 6: EC2 c1.medium Instance Small Dataset Results (same columns as Table 5; execution times ranged from 40 to 50 minutes on Office1 and from 67 to 70 minutes on Office2 across the cdc and fixed chunkers)

Based on the results of the cloud testing on the small office datasets, the execution time difference going from the small instance to the medium instance is in line with the cost difference based on Amazon's EC2 pricing at the time of this publication. Also, since the memory resources are the same on the small and medium instance types, a more aggressive algorithm cannot be used as a differentiator in terms of space and cost savings. Therefore, there is little to no cost savings when comparing the execution times and the related compute cost differences of the small and medium size instances on a small dataset. One interesting aspect of this testing is the relative consistency in the percentage of additional redundancy detected between the fixed and variable block algorithms for both office datasets.

Transitioning to the larger dataset of 500GB and greater, we again focus our attention on an office type dataset extracted from a corporate file share environment. The goal of the large dataset is to examine more aggressive algorithms that exhaust the memory resources available in the small and medium instance types for the global chunk index. This allows us to explore the cost model and the tradeoffs associated with choosing a more aggressive algorithm and a large instance type versus a less aggressive algorithm and a smaller instance type over varying storage durations.

Using a dataset size of 764GB on the small and medium instance types, the fixed16 and cdc16 were the most aggressive algorithms able to run within the 1.7GB memory constraint of those instances, after memory for the operating system and the execution of the de-duplication algorithm was allocated. The execution times within a particular instance type are again controlled by the large I/O wait times experienced while processing the data. We again see notable increases in duplicate detection with the variable algorithms over the fixed. Using CDC4, a more aggressive algorithm, on the larger instance, an additional 5 percent of redundancy was detected over that of the CDC16 algorithm on the smaller instances. This translates into approximately 41GB of additional redundant data eliminated. The execution times on the large instance with the more aggressive algorithms are slightly longer compared with the medium instance.

Table 7: EC2 Instance Large Dataset Results (Chunker, # of Chunks, Memory Requirements (MBs), % De-duplication, Execution Time (min), Total Size for runs on the m1.small, c1.medium, and m1.large instances)

Using these results we are now able to construct and analyze a cost model for the tradeoff between selecting a smaller instance type with less aggressive algorithms versus selecting a larger instance type with a more aggressive algorithm. We also looked at the cost model when storing the data for varying lengths of time, from one month to one year, and the effect storage duration has on the cost savings and on the decision in selecting an instance type. To recap, the factors of our cost analysis are: instance type (m1_small, c1_medium, m1_large), de-duplication algorithm (fixed16, fixed8, cdc16, cdc8, cdc4), and storage duration (1 month, 3 months, 6 months, 1 year). Comparisons will be performed using the small and medium instance types against the large instance type. As discovered with the small dataset testing, the cost savings are nonexistent or insignificant when comparing the small and medium instances against each other.

To start, we will look at the cost breakdowns of the Amazon EC2 and S3 offerings.

Amazon EC2 compute costs are based on per instance hour used and data transfer in and out of the EC2 environment. Partially consumed instance hours are billed as full hours, so all execution times will be rounded up to the next hour for cost comparison. As for the data transfer into the EC2 environment, this cost is excluded from our analysis since it does not change depending on the instance type selected. Amazon's S3 cost model is based on the following factors: standard storage pricing, which is the pricing for the amount of storage used; request pricing, the cost for the number of PUT, COPY, POST, LIST, or GET operations performed on your S3 storage bucket; and data transfer cost, the cost to transfer data into and out of S3. Since we are using EC2 to communicate with S3, there is no data transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same region; in our case both the EC2 instances and the S3 bucket are within the East region [26]. Additionally, we are focusing on the standard storage pricing as opposed to reduced redundancy storage, which introduces a risk of data loss. In the case of de-duplication, data protection is critical due to the large percentage of files that can reference a single data block; the reduced redundancy option is available as a lower cost option for data that is reproducible [26]. Below are two tables with the breakdown of the EC2 and S3 pricing for the East region at the time of publication.

AWS EC2 Compute Pricing
Type | $ Cost/Hr
Small (m1_small) | 0.08
Medium (c1_medium) |
Large (m1_large) | 0.32

Table 8: AWS EC2 Pricing

AWS S3 Storage Pricing
Tiers | $ Cost/GB
First 1TB / Month | 0.125
Next 49TB | 0.11
Next 450TB |
Request Cost per 1,000 requests |

Table 9: AWS S3 Pricing

Based on the execution times seen on the three instance types, we begin by breaking out the compute and storage costs associated with the small instance type running the CDC16 algorithm. For the compute cost we look at the execution time, which translates to 26 hours after rounding up to the nearest hour. The compute cost is a straight calculation: the 26 hours multiplied by the per hour cost of the small instance type of $0.08 per hour, which equals $2.08. The storage cost has a couple of factors to take into account. One is the storage cost of $0.125 per GB for the first TB stored. After running the CDC16 algorithm, 22% redundant data was removed, leaving approximately 596GB, which has an associated cost of $74.50 per month. The second component of the storage cost is the request pricing. The request pricing is based on PUT, COPY, POST, LIST, or GET requests. The pricing for PUT, COPY, POST, or LIST requests is $0.01 per 1,000 requests, while GET and other requests are $0.01 per 10,000 requests [26]. In order to calculate the number of requests, we need to determine the number of files that make up our dataset, which translates into the number of object PUT requests. The 764GB dataset is comprised of 450,990 files and directories, which translates to an estimated 450,990 PUT operations with an associated cost of $4.51. Using these figures we are able to calculate the cost for one month, three month, six month, and one year storage periods. The calculations are the same for the medium and large instance types, with the exception of the values for execution time and de-duplication percentage. The request cost remains the same, as the dataset and number of files remain consistent across instance tests.
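The cost figures in Table 10 follow directly from these components. The sketch below reproduces the CDC16 on m1.small calculation; the prices are the US East values quoted above, partial instance hours are rounded up to whole hours as in Amazon's billing model, and the single first-TB storage tier and one PUT per object are simplifying assumptions.

import math

def dedup_cloud_cost(exec_hours, instance_rate, data_gb, dedup_pct,
                     object_count, months, storage_rate=0.125,
                     put_cost_per_1000=0.01):
    # Compute cost: partially consumed instance hours are billed as full hours.
    compute = math.ceil(exec_hours) * instance_rate
    # Storage cost: only the capacity remaining after de-duplication is stored,
    # priced at the first-TB S3 tier for the requested number of months.
    stored_gb = data_gb * (1 - dedup_pct)
    storage = stored_gb * storage_rate * months
    # Request cost: one PUT operation per file or directory, paid once.
    requests = object_count / 1000.0 * put_cost_per_1000
    return compute + storage + requests

# CDC16 on m1.small: 26 billed hours at $0.08/hr, 764GB at 22% de-duplication,
# 450,990 objects, one month of storage.
total = dedup_cloud_cost(26, 0.08, 764, 0.22, 450990, 1)
# compute is $2.08 and storage plus requests come to roughly $79, in line
# with the first column of Table 10.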

Algorithm / Instance | Storage Timeframe

CDC16 on Small Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $2.08 | $2.08 | $2.08 | $2.08
Storage Cost | $79.01 | | |

CDC16 on Medium Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $2.48 | $2.48 | $2.48 | $2.48
Storage Cost | $79.01 | | |

CDC8 on Large Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $4.80 | $4.80 | $4.80 | $4.80
Storage Cost | $76.01 | | |

CDC4 on Large Instance | 1 Month | 3 Month | 6 Month | 1 Year
Compute Cost | $5.12 | $5.12 | $5.12 | $5.12
Storage Cost | $73.76 | | |

Table 10: Instance Cost Assessment

With the varying storage durations, the following assumptions were made. Compute cost: the compute cost was only calculated for the initial data de-duplication process; any subsequent data accesses are not taken into account, and the data access frequency is independent of the instance type. Another aspect that was not taken into account is additional data being added to or deleted from the cloud instance over the storage duration; data additions, based on our research, have a positive impact on the cost savings seen with the more aggressive algorithms.

Looking at the cost savings of the large and small instances, the use of the large instance type with the aggressive CDC4 algorithm over a one year storage timeframe produces a cost savings of 6.15%, or $58.37, compared with running CDC16 on the smaller instance type. The shorter storage timeframes also produce cost savings, ranging from 2.5% for the first month to 5.82% at the six month mark. When comparing CDC8 on the large instance, a cost savings is not realized immediately, with the first month savings at less than 1%. Therefore, in order to maximize the cost savings in the


More information

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores

More information

EMC VNXe File Deduplication and Compression

EMC VNXe File Deduplication and Compression White Paper EMC VNXe File Deduplication and Compression Overview Abstract This white paper describes EMC VNXe File Deduplication and Compression, a VNXe system feature that increases the efficiency with

More information

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression Philip Shilane, Mark Huang, Grant Wallace, and Windsor Hsu Backup Recovery Systems Division EMC Corporation Abstract

More information

MySQL and Virtualization Guide

MySQL and Virtualization Guide MySQL and Virtualization Guide Abstract This is the MySQL and Virtualization extract from the MySQL Reference Manual. For legal information, see the Legal Notices. For help with using MySQL, please visit

More information

Understanding Enterprise NAS

Understanding Enterprise NAS Anjan Dave, Principal Storage Engineer LSI Corporation Author: Anjan Dave, Principal Storage Engineer, LSI Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA

More information

3Gen Data Deduplication Technical

3Gen Data Deduplication Technical 3Gen Data Deduplication Technical Discussion NOTICE: This White Paper may contain proprietary information protected by copyright. Information in this White Paper is subject to change without notice and

More information

Reducing Backups with Data Deduplication

Reducing Backups with Data Deduplication The Essentials Series: New Techniques for Creating Better Backups Reducing Backups with Data Deduplication sponsored by by Eric Beehler Reducing Backups with Data Deduplication... 1 Explaining Data Deduplication...

More information

09'Linux Plumbers Conference

09'Linux Plumbers Conference 09'Linux Plumbers Conference Data de duplication Mingming Cao IBM Linux Technology Center cmm@us.ibm.com 2009 09 25 Current storage challenges Our world is facing data explosion. Data is growing in a amazing

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM ESSENTIALS HIGH-SPEED, SCALABLE DEDUPLICATION Up to 58.7 TB/hr performance Reduces protection storage requirements by 10 to 30x CPU-centric scalability DATA INVULNERABILITY ARCHITECTURE Inline write/read

More information

The assignment of chunk size according to the target data characteristics in deduplication backup system

The assignment of chunk size according to the target data characteristics in deduplication backup system The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,

More information

Hardware and Software Requirements. Release 7.5.x PowerSchool Student Information System

Hardware and Software Requirements. Release 7.5.x PowerSchool Student Information System Release 7.5.x PowerSchool Student Information System Released October 2012 Document Owner: Documentation Services This edition applies to Release 7.5.x of the PowerSchool software and to all subsequent

More information

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Jonathan Halstuch, COO, RackTop Systems JHalstuch@racktopsystems.com Big Data Invasion We hear so much on Big Data and

More information

Amazon Elastic Compute Cloud Getting Started Guide. My experience

Amazon Elastic Compute Cloud Getting Started Guide. My experience Amazon Elastic Compute Cloud Getting Started Guide My experience Prepare Cell Phone Credit Card Register & Activate Pricing(Singapore) Region Amazon EC2 running Linux(SUSE Linux Windows Windows with SQL

More information

How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda

How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda 1 Outline Build a cost-efficient Swift cluster with expected performance Background & Problem Solution Experiments

More information

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP Dilip N Simha (Stony Brook University, NY & ITRI, Taiwan) Maohua Lu (IBM Almaden Research Labs, CA) Tzi-cker Chiueh (Stony

More information

CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY

CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY White Paper CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY DVTel Latitude NVMS performance using EMC Isilon storage arrays Correct sizing for storage in a DVTel Latitude physical security

More information

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Table of Contents Introduction... 3 Shortest Possible Backup Window... 3 Instant

More information

Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card

Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card Best Practices for Optimizing SQL Server Database Performance with the LSI WarpDrive Acceleration Card Version 1.0 April 2011 DB15-000761-00 Revision History Version and Date Version 1.0, April 2011 Initial

More information

A Data De-duplication Access Framework for Solid State Drives

A Data De-duplication Access Framework for Solid State Drives JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 941-954 (2012) A Data De-duplication Access Framework for Solid State Drives Department of Electronic Engineering National Taiwan University of Science

More information

Multi-level Metadata Management Scheme for Cloud Storage System

Multi-level Metadata Management Scheme for Cloud Storage System , pp.231-240 http://dx.doi.org/10.14257/ijmue.2014.9.1.22 Multi-level Metadata Management Scheme for Cloud Storage System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Chuck Yoo 2 and Young Woong Ko 1

More information

We look beyond IT. Cloud Offerings

We look beyond IT. Cloud Offerings Cloud Offerings cstor Cloud Offerings As today s fast-moving businesses deal with increasing demands for IT services and decreasing IT budgets, the onset of cloud-ready solutions has provided a forward-thinking

More information

EMC BACKUP-AS-A-SERVICE

EMC BACKUP-AS-A-SERVICE Reference Architecture EMC BACKUP-AS-A-SERVICE EMC AVAMAR, EMC DATA PROTECTION ADVISOR, AND EMC HOMEBASE Deliver backup services for cloud and traditional hosted environments Reduce storage space and increase

More information

Inline Deduplication

Inline Deduplication Inline Deduplication binarywarriors5@gmail.com 1.1 Inline Vs Post-process Deduplication In target based deduplication, the deduplication engine can either process data for duplicates in real time (i.e.

More information

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study Creating Value Delivering Solutions Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study Chris Zajac, NJDOT Bud Luo, Ph.D., Michael Baker Jr., Inc. Overview

More information

TECHNICAL BRIEF. Primary Storage Compression with Storage Foundation 6.0

TECHNICAL BRIEF. Primary Storage Compression with Storage Foundation 6.0 TECHNICAL BRIEF Primary Storage Compression with Storage Foundation 6.0 Technical Brief Primary Storage Compression with Storage Foundation 6.0 Contents Introduction... 4 What is Compression?... 4 Differentiators...

More information

UBUNTU DISK IO BENCHMARK TEST RESULTS

UBUNTU DISK IO BENCHMARK TEST RESULTS UBUNTU DISK IO BENCHMARK TEST RESULTS FOR JOYENT Revision 2 January 5 th, 2010 The IMS Company Scope: This report summarizes the Disk Input Output (IO) benchmark testing performed in December of 2010 for

More information

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside Managing the information that drives the enterprise STORAGE Buying Guide: DEDUPLICATION inside What you need to know about target data deduplication Special factors to consider One key difference among

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Cloud Computing on Amazon's EC2

Cloud Computing on Amazon's EC2 Technical Report Number CSSE10-04 1. Introduction to Amazon s EC2 Brandon K Maharrey maharbk@auburn.edu COMP 6330 Parallel and Distributed Computing Spring 2009 Final Project Technical Report Cloud Computing

More information

Deduplication has been around for several

Deduplication has been around for several Demystifying Deduplication By Joe Colucci Kay Benaroch Deduplication holds the promise of efficient storage and bandwidth utilization, accelerated backup and recovery, reduced costs, and more. Understanding

More information

Read Performance Enhancement In Data Deduplication For Secondary Storage

Read Performance Enhancement In Data Deduplication For Secondary Storage Read Performance Enhancement In Data Deduplication For Secondary Storage A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Pradeep Ganesan IN PARTIAL FULFILLMENT

More information

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation WHITE PAPER Permabit Albireo Data Optimization Software Benefits of Albireo for Virtual Servers January 2012 Permabit Technology Corporation Ten Canal Park Cambridge, MA 02141 USA Phone: 617.252.9600 FAX:

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM EMC DATA DOMAIN OPERATING SYSTEM Powering EMC Protection Storage ESSENTIALS High-Speed, Scalable Deduplication Up to 58.7 TB/hr performance Reduces requirements for backup storage by 10 to 30x and archive

More information

Veritas Backup Exec 15: Deduplication Option

Veritas Backup Exec 15: Deduplication Option Veritas Backup Exec 15: Deduplication Option Who should read this paper Technical White Papers are designed to introduce IT professionals to key technologies and technical concepts that are associated

More information

Comparison of Windows IaaS Environments

Comparison of Windows IaaS Environments Comparison of Windows IaaS Environments Comparison of Amazon Web Services, Expedient, Microsoft, and Rackspace Public Clouds January 5, 215 TABLE OF CONTENTS Executive Summary 2 vcpu Performance Summary

More information

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Quanqing XU Quanqing.Xu@nicta.com.au YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Outline Motivation YuruBackup s Architecture Backup Client File Scan, Data

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Byte-index Chunking Algorithm for Data Deduplication System

Byte-index Chunking Algorithm for Data Deduplication System , pp.415-424 http://dx.doi.org/10.14257/ijsia.2013.7.5.38 Byte-index Chunking Algorithm for Data Deduplication System Ider Lkhagvasuren 1, Jung Min So 1, Jeong Gun Lee 1, Chuck Yoo 2 and Young Woong Ko

More information

Metadata Feedback and Utilization for Data Deduplication Across WAN

Metadata Feedback and Utilization for Data Deduplication Across WAN Zhou B, Wen JT. Metadata feedback and utilization for data deduplication across WAN. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(3): 604 623 May 2016. DOI 10.1007/s11390-016-1650-6 Metadata Feedback

More information

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose Abhirupa Chatterjee 1, Divya. R. Krishnan 2, P. Kalamani 3 1,2 UG Scholar, Sri Sairam College Of Engineering, Bangalore. India

More information

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009 ESG REPORT : Evaluating Software- vs. Hardware-Based Approaches By Lauren Whitehouse April, 2009 Table of Contents ESG REPORT Table of Contents... i Introduction... 1 External Forces Contribute to IT Challenges...

More information

How to recover a failed Storage Spaces

How to recover a failed Storage Spaces www.storage-spaces-recovery.com How to recover a failed Storage Spaces ReclaiMe Storage Spaces Recovery User Manual 2013 www.storage-spaces-recovery.com Contents Overview... 4 Storage Spaces concepts and

More information

A Survey on Deduplication Strategies and Storage Systems

A Survey on Deduplication Strategies and Storage Systems A Survey on Deduplication Strategies and Storage Systems Guljar Shaikh ((Information Technology,B.V.C.O.E.P/ B.V.C.O.E.P, INDIA) Abstract : Now a day there is raising demands for systems which provide

More information

ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication

ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication ExaGrid Product Description Cost-Effective Disk-Based Backup with Data Deduplication 1 Contents Introduction... 3 Considerations When Examining Disk-Based Backup Approaches... 3 ExaGrid A Disk-Based Backup

More information

Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds. Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage

Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds. Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage Hey, You, Get Off of My Cloud! Exploring Information Leakage in Third-Party Clouds Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage UCSD MIT UCSD UCSD Today s talk in one slide Third-party

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

E-Guide. Sponsored By:

E-Guide. Sponsored By: E-Guide An in-depth look at data deduplication methods This E-Guide will discuss the various approaches to data deduplication. You ll learn the pros and cons of each, and will benefit from independent

More information

Turnkey Deduplication Solution for the Enterprise

Turnkey Deduplication Solution for the Enterprise Symantec NetBackup 5000 Appliance Turnkey Deduplication Solution for the Enterprise Mayur Dewaikar Sr. Product Manager, Information Management Group White Paper: A Deduplication Appliance Solution for

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE

VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE 1 VNX HYBRID FLASH BEST PRACTICES FOR PERFORMANCE JEFF MAYNARD, CORPORATE SYSTEMS ENGINEER 2 ROADMAP INFORMATION DISCLAIMER EMC makes no representation and undertakes no obligations with regard to product

More information

Case Studies. Data Sheets : White Papers : Boost your storage buying power... use ours!

Case Studies. Data Sheets : White Papers : Boost your storage buying power... use ours! TM TM Data Sheets : White Papers : Case Studies For over a decade Coolspirit have been supplying the UK s top organisations with storage products and solutions so be assured we will meet your requirements

More information

Release 8.2 Hardware and Software Requirements. PowerSchool Student Information System

Release 8.2 Hardware and Software Requirements. PowerSchool Student Information System Release 8.2 Hardware and Software Requirements PowerSchool Student Information System Released January 2015 Document Owner: Documentation Services This edition applies to Release 8.2 of the PowerSchool

More information

Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642

Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642 Cloud security CS642: Computer Security Professor Ristenpart h9p://www.cs.wisc.edu/~rist/ rist at cs dot wisc dot edu University of Wisconsin CS 642 Announcements Take- home final versus in- class Homework

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team

Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture Dell Compellent Product Specialist Team THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Effective Planning and Use of TSM V6 Deduplication

Effective Planning and Use of TSM V6 Deduplication Effective Planning and Use of IBM Tivoli Storage Manager V6 Deduplication 08/17/12 1.0 Authors: Jason Basler Dan Wolfe Page 1 of 42 Document Location This is a snapshot of an on-line document. Paper copies

More information

Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving

Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving Riverbed Whitewater/Amazon Glacier ROI for Backup and Archiving November, 2013 Saqib Jang Abstract This white paper demonstrates how to increase profitability by reducing the operating costs of backup

More information

Understanding data deduplication ratios June 2008

Understanding data deduplication ratios June 2008 June 2008 Mike Dutch Data Management Forum Data Deduplication & Space Reduction SIG Co-Chair EMC Senior Technologist Table of Contents Optimizing storage capacity...3 The impact on storage utilization...3

More information

Amazon EC2 XenApp Scalability Analysis

Amazon EC2 XenApp Scalability Analysis WHITE PAPER Citrix XenApp Amazon EC2 XenApp Scalability Analysis www.citrix.com Table of Contents Introduction...3 Results Summary...3 Detailed Results...4 Methods of Determining Results...4 Amazon EC2

More information

Part 1: Price Comparison Among The 10 Top Iaas Providers

Part 1: Price Comparison Among The 10 Top Iaas Providers Part 1: Price Comparison Among The 10 Top Iaas Providers Table of Contents Executive Summary 3 Estimating Cloud Spending 3 About the Pricing Report 3 Key Findings 3 The IaaS Providers 3 Provider Characteristics

More information

Wide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton)

Wide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton) Wide-area Network Acceleration for the Developing World Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton) POOR INTERNET ACCESS IN THE DEVELOPING WORLD Internet access is a scarce

More information

DeltaStor Data Deduplication: A Technical Review

DeltaStor Data Deduplication: A Technical Review White Paper DeltaStor Data Deduplication: A Technical Review DeltaStor software is a next-generation data deduplication application for the SEPATON S2100 -ES2 virtual tape library that enables enterprises

More information

Contents. WD Arkeia Page 2 of 14

Contents. WD Arkeia Page 2 of 14 Contents Contents...2 Executive Summary...3 What Is Data Deduplication?...4 Traditional Data Deduplication Strategies...5 Deduplication Challenges...5 Single-Instance Storage...5 Fixed-Block Deduplication...6

More information

An Oracle White Paper June 2011. Oracle Database Firewall 5.0 Sizing Best Practices

An Oracle White Paper June 2011. Oracle Database Firewall 5.0 Sizing Best Practices An Oracle White Paper June 2011 Oracle Database Firewall 5.0 Sizing Best Practices Introduction... 1 Component Overview... 1 Database Firewall Deployment Modes... 2 Sizing Hardware Requirements... 2 Database

More information

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant DISCOVER HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant HP StorageWorks Data Protection Solutions HP has it covered Near continuous data protection Disk Mirroring Advanced Backup

More information

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud StACC: St Andrews Cloud Computing Co laboratory A Performance Comparison of Clouds Amazon EC2 and Ubuntu Enterprise Cloud Jonathan S Ward StACC (pronounced like 'stack') is a research collaboration launched

More information