E-Guide: An in-depth look at data deduplication methods

This E-Guide discusses the various approaches to data deduplication. You'll learn the pros and cons of each, and you'll benefit from independent expert insight to help you get the most out of the approach you take with data deduplication technology.

Sponsored by: FalconStor Software
Table of Contents:

The pros and cons of file-level vs. block-level data deduplication technology
Data deduplication methods: Block-level versus byte-level dedupe
Resources from FalconStor Software
The pros and cons of file-level vs. block-level data deduplication technology
Lauren Whitehouse

Data deduplication has dramatically improved the value proposition of disk-based data protection, as well as WAN-based remote- and branch-office backup consolidation and disaster recovery (DR) strategies. It identifies duplicate data, removing redundancies and reducing the overall capacity of data transferred and stored. Some deduplication approaches operate at the file level, while others go deeper to examine data at a sub-file, or block, level. Determining uniqueness at either level offers benefits, though results vary. The differences lie in how much reduction each approach produces and how long each takes to determine what's unique.

File-level deduplication

Also commonly referred to as single-instance storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved, and subsequent copies are replaced with a "stub" that points to the original file.

Block-level deduplication

Block-level data deduplication operates at the sub-file level. As its name implies, the file is broken down into segments -- chunks or blocks -- that are examined for redundancy against previously stored information. The most popular approach for determining duplicates is to assign an identifier to a chunk of data -- using a hash algorithm, for example, that generates a unique ID or "fingerprint" for that block. The unique ID is then compared with a central index. If the ID already exists, the data segment has been processed and stored before, so only a pointer to the previously stored data needs to be saved.
If the ID is new, the block is unique: the ID is added to the index and the chunk is stored.

The size of the chunk examined varies from vendor to vendor. Some products use fixed block sizes, while others use variable block sizes (and, to make it even more confusing, a few allow end users to vary the size of the fixed block). Fixed blocks might be 8 KB or perhaps 64 KB; the smaller the chunk, the more likely it is to be identified as redundant, which in turn means greater reduction because even less data is stored. The catch with fixed blocks is that when a file is modified and the deduplication product reuses the same fixed boundaries as the last inspection, it may fail to detect redundant segments: as the blocks in the file are changed or moved, everything downstream of the change shifts, throwing off the rest of the comparisons. Variable-sized blocks increase the odds that a common segment will be detected even after a file is modified.
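The fingerprint-index-pointer flow described above can be sketched minimally in Python. This is an illustration, not any vendor's implementation: SHA-256 stands in for whatever fingerprint a real product uses, and fixed-size chunking is assumed for simplicity.

```python
import hashlib

class BlockDedupeStore:
    """Toy block-level deduplication: fingerprint each chunk and keep
    an index mapping fingerprints to the unique blocks actually stored."""

    def __init__(self, block_size: int = 8192):
        self.block_size = block_size
        self.index = {}    # fingerprint -> position in self.blocks
        self.blocks = []   # unique chunks actually written to "disk"

    def write(self, data: bytes):
        pointers = []
        for i in range(0, len(data), self.block_size):
            chunk = data[i:i + self.block_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.index:            # new, unique block: store it
                self.index[fp] = len(self.blocks)
                self.blocks.append(chunk)
            pointers.append(self.index[fp])     # duplicate: save only a pointer
        return pointers

    def read(self, pointers):
        # "Reassembly": resolve each pointer back to its unique block.
        return b"".join(self.blocks[p] for p in pointers)
```

Writing a stream with a repeated chunk stores that chunk once and records two pointers to it, which is the entire capacity saving in miniature.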
The variable-length approach finds natural patterns or break points that occur in a file and segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The tradeoff? A variable-length approach may require a vendor to track and compare more than one unique ID per segment, which can affect index size and computational time.

The differences between file- and block-level deduplication go beyond how they operate; each approach has advantages and disadvantages.

File-level approaches can be less efficient than block-based deduplication:

A change within a file causes the whole file to be saved again. A file such as a PowerPoint presentation can have something as simple as the title page changed to reflect a new presenter or date, and the entire file will be saved a second time; block-based deduplication would save only the blocks that changed between one version of the file and the next.

Reduction ratios may be only 5:1 or less, whereas block-based deduplication has been shown to reduce capacity in the 20:1 to 50:1 range for stored data.

File-level approaches can be more efficient than block-based deduplication:

Indexes for file-level deduplication are significantly smaller, which takes less computational time when duplicates are being determined, so backup performance is less affected by the deduplication process.

File-level processes require less processing power due to the smaller index and reduced number of comparisons, so the impact on the systems performing the inspection is lower.

The impact on recovery time is lower. Block-based deduplication requires "reassembly" of chunks based on the master index that maps the unique segments and the pointers to them.
Since file-based approaches store unique files and pointers to existing unique files, there is less to reassemble.
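The fixed-versus-variable tradeoff discussed above can be demonstrated with a toy experiment: insert one byte at the front of a stream, then chunk it both ways. The window size, mask, and "hash" here are illustrative assumptions, not any product's algorithm.

```python
import random

def fixed_blocks(data: bytes, size: int = 8):
    # Fixed-size chunking: boundaries fall at fixed offsets.
    return [data[i:i + size] for i in range(0, len(data), size)]

WINDOW = 4

def variable_blocks(data: bytes, mask: int = 0x0F):
    # Toy content-defined chunking: cut wherever a value computed over the
    # last WINDOW bytes matches a bit pattern, so boundaries follow content.
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        if (sum(data[i - WINDOW + 1:i + 1]) & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

random.seed(7)
data = bytes(random.randrange(256) for _ in range(400))
shifted = b"!" + data  # a one-byte insertion at the front of the stream

# Fixed blocks: everything downstream of the edit shifts, so almost no
# block from the original stream is found again.
fixed_common = set(fixed_blocks(data)) & set(fixed_blocks(shifted))
# Content-defined blocks: boundaries depend only on nearby bytes, so most
# chunks after the edit are rediscovered unchanged.
cdc_common = set(variable_blocks(data)) & set(variable_blocks(shifted))
```

Because the content-defined cut points depend only on a small local window, the chunking resynchronizes shortly after the insertion, which is exactly the "increased odds" the article describes.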
"We selected FalconStor because we were confident they could offer a highly scalable VTL solution that provides data de-duplication, offsite replication, and tape integration, with zero impact to our backup performance."
-- Henry Denis, IT Director, Epiq Systems

FalconStor VirtualTape Library (VTL) provides disk-based data protection and de-duplication to vastly improve the reliability, speed, and predictability of backups. To learn more about our industry-leading VTL solutions with de-duplication, contact FalconStor at 866-NOW-FALC (866-669-3252) or visit www.falconstor.com
Data deduplication methods: Block-level versus byte-level dedupe
Lauren Whitehouse

Data deduplication identifies duplicate data, removing redundancies and reducing the overall capacity of data transferred and stored. In my last article, I reviewed the differences between file-level and block-level data deduplication. In this article, I'll assess byte-level versus block-level deduplication. Byte-level deduplication provides a more granular inspection of data than block-level approaches, ensuring more accuracy, but it often requires more knowledge of the backup stream to do its job.

Block-level approaches

Block-level data deduplication segments data streams into blocks and inspects each block to determine whether it has been encountered before (typically by generating a digital signature, or unique identifier, for the block via a hash algorithm). If the block is unique, it is written to disk and its identifier is stored in an index; otherwise, only a pointer to the original, unique block is stored. Replacing repeated blocks with much smaller pointers, rather than storing the block again, saves disk storage space.

The criticisms of block-based approaches are that 1) using a hash algorithm to calculate the unique ID brings the risk of generating a false positive; and 2) storing unique IDs in an index can slow the inspection process as the index grows and requires disk I/O (unless the index size is kept in check and data comparison occurs in memory). Hash collisions could produce a false positive when a hash-based algorithm is used to determine duplicates. Hash algorithms such as MD5 and SHA-1 generate a unique number for the chunk of data being examined. While hash collisions, and the resulting data corruption, are possible, the chances that a collision will occur are slim.

Byte-level data deduplication

Analyzing data streams at the byte level is another approach to deduplication.
By performing a byte-by-byte comparison of new data streams against previously stored ones, a higher level of accuracy can be delivered. Deduplication products that use this method have one thing in common: because it's likely that the incoming backup data stream has been seen before, it is reviewed for matches against similar data received in the past.

Products leveraging a byte-level approach are typically "content aware," meaning the vendor has done some reverse engineering of the backup application's data stream to understand how to retrieve information such as the file name, file type, and date/time stamp. This reduces the amount of computation required to determine unique versus duplicate data. The caveat? This approach typically occurs post-process -- performed on backup data once the backup has completed. Backup jobs therefore complete at full disk performance, but a reserve of disk cache is required to perform the deduplication process. It's also likely that the deduplication process is limited to a backup stream from a single backup set and not applied "globally" across backup sets.
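As a rough sketch of the byte-by-byte idea (ignoring the content-aware stream parsing a real product performs), a delta between a previously stored stream and a new one might be computed and reapplied like this; the function names and format are hypothetical.

```python
def byte_delta(old: bytes, new: bytes):
    # Walk both streams in lockstep and record (offset, bytes) runs
    # where the new stream differs from the previously stored one.
    deltas, i = [], 0
    n = max(len(old), len(new))
    while i < n:
        if i < len(old) and i < len(new) and old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < n and not (i < len(old) and i < len(new) and old[i] == new[i]):
            i += 1
        deltas.append((start, new[start:i]))
    return deltas

def apply_delta(old: bytes, deltas, new_len: int) -> bytes:
    # Reconstitute the new stream from the old stream plus the stored runs.
    out = bytearray(old[:new_len].ljust(new_len, b"\x00"))
    for offset, chunk in deltas:
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)
```

Only the differing runs need to be kept; the unchanged bytes are recovered from the prior stream, which is why a matching earlier stream makes this approach so space-efficient.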
Once the deduplication process is complete, the solution reclaims disk space by deleting the duplicate data. Before space reclamation is performed, an integrity check can be run to ensure that the deduplicated data matches the original data objects. The last full backup can also be maintained so that recovery does not depend on reconstituting deduplicated data, enabling rapid recovery.

Which approach is best?

Both block- and byte-level methods deliver the benefit of optimized storage capacity. When, where, and how the processes work should be reviewed against your backup environment and its specific requirements before selecting one approach over another. Your vetting process should also include references from organizations with similar characteristics and requirements.
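The "slim chances" of a hash collision noted in the block-level discussion can be quantified with the standard birthday-bound approximation. The figures below (1 PB of data in 8 KB blocks against a 160-bit hash such as SHA-1) are illustrative assumptions, not vendor specifications.

```python
def collision_probability(n_blocks: int, hash_bits: int) -> float:
    # Birthday-bound approximation: p ~ n^2 / 2^(b+1) for n items
    # hashed into a b-bit space.
    return (n_blocks ** 2) / 2 ** (hash_bits + 1)

# One petabyte (2^50 bytes) chunked into 8 KB blocks = 2^37 blocks.
n = (2 ** 50) // (8 * 1024)
# Probability that any two distinct blocks share a 160-bit fingerprint.
p = collision_probability(n, 160)
```

The result is on the order of 10^-27, vastly smaller than the rate of undetected disk or memory errors, which is the usual basis for the "chances are slim" claim.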
Resources from FalconStor Software

Book Chapter: SAN For Dummies, Chapter 13: Using Data De-duplication to Lighten the Load
White Paper: Demystifying Data De-Duplication: Choosing the Best Solution
Webcast: Enhancing Disk-to-Disk Backup with Data Deduplication

About FalconStor Software

FalconStor Software, Inc. (NASDAQ: FALC) is the premier provider of TOTALLY Open data protection solutions. We deliver proven, comprehensive data protection solutions that facilitate the continuous availability of business-critical data with speed, integrity, and simplicity. Our technology-independent solutions, built upon the award-winning IPStor virtualization platform, include the industry's leading Virtual Tape Library (VTL) with deduplication, Continuous Data Protector (CDP), File-interface Deduplication System, and Network Storage Server (NSS), each enabled with WAN-optimized replication for disaster recovery and remote office protection. Our products are available from major OEMs and solution providers and are deployed by thousands of customers worldwide, from small businesses to Fortune 1000 enterprises. For more information visit www.falconstor.com.