Deduplication Demystified: How to determine the right approach for your business
Presented by Charles Keiper, Senior Product Manager, Data Protection, Quest Software
Session Objective: To answer burning questions about data deduplication and the things you should consider
- Does this approach deliver maximum disk savings?
- Will this approach limit my business flexibility in any way?
- How effective will it be at shrinking my backup window?
Deduplication Demystified
- Deduplication is one of today's hottest data protection technologies; it has gone from nice-to-have to must-have
- Deduplication can be confusing: Which technique is best for my environment? Is a software- or appliance-based approach best?
- Deduplication can be very effective for your business: implemented properly, it can help you reduce storage costs, maximize flexibility, and reduce backup windows
What is Deduplication?
- "The process of examining a data set or byte stream at the sub-file level and storing and/or sending only unique data." [1]
- Single Instance Storage is not deduplication
- Duplicate data segments are replaced with a pointer to the first occurrence of the data
- Example: If your backup is 100 GB and 40 GB of it is duplicate data, you store only the unique 60 GB. With compression, there's even more disk space savings.
1. Storage Networking Industry Association (SNIA)
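To make the pointer idea concrete, here is a minimal sketch of a deduplicating store, assuming fixed 4 KB segments and SHA-256 fingerprints (both hypothetical choices, not any vendor's design). Each file is kept as a recipe of segment hashes, and each unique segment is stored only once:

    import hashlib

    SEGMENT_SIZE = 4096  # hypothetical fixed segment size

    class DedupStore:
        def __init__(self):
            self.segments = {}   # hash -> segment bytes (first occurrence only)
            self.recipes = {}    # file name -> ordered list of segment hashes

        def put(self, name, data):
            recipe = []
            for i in range(0, len(data), SEGMENT_SIZE):
                seg = data[i:i + SEGMENT_SIZE]
                h = hashlib.sha256(seg).hexdigest()
                # A duplicate segment becomes just a pointer (its hash);
                # only unseen segments consume new space.
                self.segments.setdefault(h, seg)
                recipe.append(h)
            self.recipes[name] = recipe

        def get(self, name):
            return b"".join(self.segments[h] for h in self.recipes[name])

Storing two largely identical backups through put() keeps only one copy of each shared segment, which is where the 100 GB-to-60 GB reduction in the example above comes from.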
Storing Only Unique Data
(Block) Size Matters
Fixed-length Blocks:
- When data in a file shifts, all subsequent blocks in the file will be rewritten and are likely to be considered unique
- Compression is less significant
Variable-length Segments:
- A more advanced approach is to anchor variable-length segments based on their interior data patterns
- Solves the data-shifting problem of the fixed-size block approach
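A minimal sketch of how variable-length segments can be anchored on interior data patterns, assuming a Rabin-Karp-style rolling hash over a 48-byte window and hypothetical minimum, average, and maximum chunk sizes; production systems use tuned fingerprints, but the idea is the same: because cut points depend only on nearby bytes, an insertion early in a file shifts only the segments around the edit, and the boundaries resynchronize afterward.

    B = 31                # rolling-hash base (hypothetical)
    M = 1 << 32           # hash modulus
    W = 48                # bytes in the sliding window
    B_W = pow(B, W, M)    # B**W mod M, used to drop the byte leaving the window

    def chunk(data, mask=0x1FFF, min_len=2048, max_len=65536):
        """Yield variable-length segments; cut where the window hash hits the mask."""
        start, h = 0, 0
        for i in range(len(data)):
            h = (h * B + data[i]) % M
            if i >= W:
                h = (h - data[i - W] * B_W) % M   # keep the hash over the last W bytes
            length = i - start + 1
            if (length >= min_len and (h & mask) == 0) or length >= max_len:
                yield data[start:i + 1]           # anchor found: emit one segment
                start = i + 1
        if start < len(data):
            yield data[start:]                    # tail segment

The mask value of 0x1FFF gives roughly 8 KB average segments; the hash is deliberately not reset at cut points, so boundaries remain purely content-defined.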
Fixed- vs. Variable-Block Size
Inline vs. Post-Process
Inline Deduplication:
- Data is deduplicated before it is written to disk (inline)
- Significantly reduces the raw disk capacity needed in the system, since the full, not-yet-deduplicated data set is never written to disk
- If replication is supported by the inline deduplication process, time-to-DR readiness is also optimized
Post-Process:
- Analyzes and reduces data after it has been stored to disk
- Waits for data to be committed to disk before initiating the deduplication process
- Requires a greater initial capacity overhead than inline solutions
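The difference is simply where deduplication sits in the write path. A hedged sketch, reusing the hypothetical DedupStore from above: inline dedupe fingerprints each stream before anything hits disk, while post-process lands the raw stream first and reduces it later.

    def inline_write(store, name, stream):
        # Inline: data is deduplicated in the write path, so the raw,
        # not-yet-deduplicated stream never lands on disk.
        store.put(name, stream)

    def post_process_write(landing_area, name, stream):
        # Post-process: commit the full stream to a landing area first...
        landing_area[name] = stream

    def post_process_scan(landing_area, store):
        # ...then a background job deduplicates committed data and frees
        # the raw copies. The landing area is the extra initial capacity
        # overhead that inline solutions avoid.
        for name, stream in list(landing_area.items()):
            store.put(name, stream)
            del landing_area[name]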
Source-side vs. Target-side
Source-side Dedupe:
- Ensures that data on the source system is deduplicated before being moved to a target
- The system periodically scans new file segments, creating hashes, and compares them to the hashes of existing target segments
- When hashes match, the segment is dropped and the file points to the duplicate segment on the target
Target-side Dedupe:
- The process of removing duplicates of data in the secondary store
- Generally this will be a backup store, such as a disk-based data repository or a NAS appliance
- Less overhead on the primary system, but more data is sent over the bus or network
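A minimal sketch of the source-side exchange, with hypothetical names throughout (missing_hashes and store_segment are illustrative, not a real product API): the source fingerprints its segments, asks the target which fingerprints are new, and ships only those, so duplicate segments never cross the network.

    import hashlib

    def source_side_send(segments, target):
        """Send only segments whose hashes the target does not already hold."""
        hashed = [(hashlib.sha256(s).hexdigest(), s) for s in segments]
        missing = target.missing_hashes([h for h, _ in hashed])  # one round trip
        for h, seg in hashed:
            if h in missing:
                target.store_segment(h, seg)  # only unique data goes over the wire
            # otherwise the segment is dropped; the file recipe points at h

    class Target:
        """Toy target; a real one would be a remote appliance or repository."""
        def __init__(self):
            self.segments = {}
        def missing_hashes(self, hashes):
            return {h for h in hashes if h not in self.segments}
        def store_segment(self, h, seg):
            self.segments[h] = seg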
Dedupe Ratios
- Deduplication is typically reported as a ratio, calculated as ratio = bytes in / bytes out (e.g., 12:1 or 12x)
- Each vendor in the deduplication market has its own set of tests, which leads to different ratios from vendor to vendor; ratios are only meaningful if compared using the same set of assumptions
- Even low deduplication ratios provide significant space savings, but beyond 10:1, higher ratios yield marginally less additional space reduction
- Deduplication solutions must be evaluated on factors other than deduplication ratios
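The diminishing returns fall directly out of the arithmetic: the fraction of disk saved is 1 - 1/ratio, so moving from no dedupe to 10:1 saves 90% of the space, while doubling again to 20:1 adds only 5 more points.

    def space_saved(ratio):
        """Fraction of disk saved: 1 - bytes_out / bytes_in = 1 - 1/ratio."""
        return 1 - 1 / ratio

    for r in (2, 5, 10, 20, 50):
        print(f"{r}:1  ->  {space_saved(r):.1%} saved")
    # 2:1 -> 50.0%, 5:1 -> 80.0%, 10:1 -> 90.0%, 20:1 -> 95.0%, 50:1 -> 98.0%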
Types of Data
- The data type is a good indicator of how much it will benefit from deduplication
- Good: files created by office workers and frequently distributed or copied
- Bad: data derived from natural sources (audio/video) and most types of executable files
Data Change Rates
- In general, the deduplication ratio will be higher when the change rate is lower
- When the change rate is higher, there's a good chance that the new data is unique, resulting in a lower deduplication ratio
Backup Types
- Types of backups: full, incremental, differential
- Full, incremental, and differential backups do not scan for uniqueness; regardless of the backup type, deduplication can significantly reduce the amount of redundant data
- The actual deduplication rate depends on a number of variables (data growth, rate of change); 10x to 20x-plus is typical
Backup Job Retention
- The more data that is examined, the greater the likelihood that duplicate data will be found, increasing your disk space savings
- When you deduplicate your backups, each additional week of backups can be retained using less incremental disk space
Backup Job Retention
- Deduplication can reduce the amount of physical disk needed for backup. Users can use this reclaimed space for:
  1. Bringing other backup data onto disk
  2. Lengthening the retention periods of data backed up to disk
- Deduplication allows users to leverage disk as a backup target for more data, and data can be kept on disk for longer periods of time (see the sketch below)
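A back-of-the-envelope sketch of why extra retention is cheap once backups are deduplicated, using hypothetical numbers (1 TB weekly fulls with a 5% week-over-week change rate): each additional retained week stores only its unique data.

    # Hypothetical workload: 1 TB weekly fulls, 5% of data changes each week.
    FULL_TB = 1.0
    CHANGE_RATE = 0.05

    def disk_needed(weeks, deduplicated):
        if deduplicated:
            # First full stored whole; each extra week adds only unique data.
            return FULL_TB + (weeks - 1) * FULL_TB * CHANGE_RATE
        return weeks * FULL_TB   # every full stored in its entirety

    for weeks in (1, 4, 12):
        raw, deduped = disk_needed(weeks, False), disk_needed(weeks, True)
        print(f"{weeks:>2} weeks: raw {raw:.2f} TB vs deduped {deduped:.2f} TB")
    # 1 week: 1.00 vs 1.00; 4 weeks: 4.00 vs 1.15; 12 weeks: 12.00 vs 1.55

Under these assumptions, tripling retention from 4 to 12 weeks costs only about 0.4 TB of additional deduplicated disk.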
Deduplication Benefits
- Storage-based deduplication reduces the amount of storage needed for most data types; this is generally an appliance-based approach
- When used with backup solutions, it improves data recovery times (i.e., more disk-based recoveries)
- Network data deduplication reduces the absolute number of bytes transferred between endpoints, reducing the amount of bandwidth required
- Virtual machines benefit by allowing the system files for each virtual server to be coalesced into a single storage space
Questions?