NETAPP WHITE PAPER
Looking Beyond the Hype: Evaluating Data Deduplication Solutions
Larry Freeman, Network Appliance, Inc.
September 2007 WP-7028-0907

Table of Contents

The Deduplication Hype
What Is Deduplication?
How Does Deduplication Work?
Deduplication Design Considerations
    Hashing
    Beyond the Hype: What You Should Know about Hashes
    Indexing
    Beyond the Hype: What You Should Know about Indexing
    Inline or Postprocessing
    Beyond the Hype: What You Should Know about Inline and Postprocessing
    Source or Destination Deduplication
    Beyond the Hype: What You Should Know about Source and Destination
Deduplication Space Savings
    Beyond the Hype: What You Should Know about Space Savings
Summary

The Deduplication Hype

Deduplication is without a doubt one of this year's hottest topics in data storage. The rationale behind deduplication is simple: eliminate your duplicate data and reduce the capacity needed during backups and other data copy activities. Unfortunately, the many different deduplication approaches from various vendors, with much hype about their unique benefits, can leave users bewildered. As they consider the variety of deduplication offerings, they often fail to understand the basic design nuances that are important to them. This paper looks beyond the hype and focuses on the important design aspects of deduplication, giving evaluators the information they need to make informed decisions when examining deduplication solutions.

What Is Deduplication?

Deduplication is the process of unduplicating data. The term deduplication was coined by database administrators many years ago as a way of describing the process of removing duplicate database records after two databases have been merged. In the context of disk storage, deduplication refers to any algorithm that searches for duplicate data objects, such as blocks, chunks, or files, and discards these duplicates. When a duplicate object is detected, its reference pointers are modified so that the object can still be located and retrieved, but it shares its physical location with other identical objects. This data sharing is the foundation of all types of data deduplication.

How Does Deduplication Work?

Regardless of operating system, application, or file system type, all data objects are written to a storage system using a data reference pointer, without which the data could not be referenced or retrieved. In traditional (non-deduplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. In Figure 1, five identical objects are stored in a file system, each with a separate data pointer. Although all five data objects are identical, each is stored as a separate instance and each consumes physical disk space.

Figure 1) Non-deduplicated data: each identical object has its own reference pointer and consumes its own physical disk space.

In a deduplicated file system, two new and important concepts are introduced:

- A catalog of all data objects is maintained. This catalog contains a record of all data objects, using a hash that identifies the unique contents of each object. Hashing is discussed in detail in Deduplication Design Considerations, later in this paper.
- The file system is capable of allowing many data pointers to reference the same physical data object.

Cataloging data objects, comparing the objects, and redirecting reference pointers form the basis of the deduplication algorithm. As shown in Figure 2, referencing several identical objects with a single master object allows the space that is normally occupied by the duplicate objects to be given back to the storage system.

Figure 2) Deduplicating identical objects creates free space.

Deduplication Design Considerations

Given that all deduplication vendors must maintain some form of catalog and must support some form of block referencing, there is a surprising variety of implementations (and they all have subtle differences that allow them all to be patented). The following sections explain the methods that vendors use when designing deduplication.

Hashing

Data deduplication begins with a comparison of two data objects. It would be impractical (and very arduous) to scan an entire data volume for duplicate objects each time a new object is written to that volume. For that reason, deduplication vendors create a small hash value for each new object and store these values in a catalog. A hash value, also called a digital fingerprint or digital signature, is a small number that is generated from a longer string of data. A hash value is substantially smaller than the data object itself, and is generated by a mathematical formula in such a way that it is unlikely (although not impossible) for two nonidentical data objects to produce the same hash value. A hash value can be as simple as a parity calculation or as elaborate as a SHA-1 or MD5 cryptographic hash. In any case, once the hash values have been created, they can be easily compared and deduplication candidates can be identified.
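To make the catalog and shared-pointer concepts concrete, the following is a minimal sketch in Python. It is an illustration only; the class and field names are invented for this paper's explanation and do not represent any vendor's (or NetApp's) implementation.

    # Minimal sketch of a deduplicating object store: a catalog of fingerprints
    # plus reference pointers that may all share one physical copy.
    import hashlib

    class DedupStore:
        def __init__(self):
            self.catalog = {}    # fingerprint -> physical block id
            self.blocks = {}     # physical block id -> data (simulated disk space)
            self.pointers = {}   # logical object name -> physical block id
            self.next_id = 0

        def write(self, name: str, data: bytes) -> None:
            fp = hashlib.sha1(data).hexdigest()     # the object's digital fingerprint
            if fp in self.catalog:                  # duplicate: share the existing copy
                block_id = self.catalog[fp]
            else:                                   # new content: consume physical space
                block_id = self.next_id
                self.next_id += 1
                self.blocks[block_id] = data
                self.catalog[fp] = block_id
            self.pointers[name] = block_id

        def read(self, name: str) -> bytes:
            return self.blocks[self.pointers[name]]

    store = DedupStore()
    for i in range(5):                              # the five identical objects of Figures 1 and 2
        store.write(f"object{i}", b"identical payload")
    print(len(store.pointers), "pointers ->", len(store.blocks), "physical copy")

A production design would also validate candidate matches and handle reference counting when objects are deleted; those details are omitted from this sketch.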

If matching hash values are discovered, there are two approaches to handling these candidates. First, you can assume that identical hash values always indicate that the data objects are identical, and move straight to the deduplication phase. Alternatively, you can add a secondary operation that scans each data object to validate that the objects are indeed identical before performing the deduplication.

Figure 3) A hash value is a digital fingerprint that represents a much larger object.

Beyond the Hype: What You Should Know about Hashes

Understanding a vendor's hashing algorithm is a major criterion when evaluating deduplication. If the vendor depends solely on the hash match to decide whether two objects are to be deduplicated, then you are accepting the possibility that false positive matches could occur, with data corruption as the likely result. If any possibility of data corruption, however small, is not acceptable, you should make sure that the vendor has taken steps to perform secondary data object validation after the hash compare is complete. Alternatively, if you are willing to accept that false positive hash matches may occur, and speed of deduplication is a greater priority, then a trusted hash design may be adequate for your needs.
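The trade-off can be stated in a few lines of code. This is a hypothetical sketch of the two policies, not a vendor's logic; SHA-1 stands in for whatever hash a given product uses.

    # Two hash-match policies: trust the fingerprint, or verify byte for byte.
    import hashlib

    def fingerprint(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()   # stand-in for the vendor's chosen hash

    def should_deduplicate(new_obj: bytes, existing_obj: bytes, trusted_hash: bool) -> bool:
        if fingerprint(new_obj) != fingerprint(existing_obj):
            return False                     # fingerprints differ: not duplicates
        if trusted_hash:
            return True                      # fast path: accepts a small false-positive risk
        return new_obj == existing_obj       # secondary validation: full byte-for-byte compare

With trusted_hash=True, a hash collision would silently map two different objects onto one stored block; the secondary compare removes that risk at the cost of an extra read of the existing object.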

Indexing

When the duplicate objects have been identified (and optionally validated), it's time to deduplicate. As you might expect, different vendors employ varying methods when modifying the data pointer structure. However, all forms of data pointer indexing fall into two broad categories:

- Hash catalog: A catalog of hash values is used to identify candidates for deduplication. A system process identifies duplicates, and data pointers are modified accordingly. The advantage of catalog deduplication is that the catalog is used only to identify duplicate objects; it is not accessed during the actual reading or writing of the data objects. That task is still handled by the normal file system data structure.

Figure 4) Catalog indexing: the file system controls block sharing of duplicate blocks.

- Lookup table: A lookup table extends the hash catalog so that it also indexes each deduplicated object's parent data pointer. The advantage of a lookup table is that it can be used on file systems that do not support multiple block referencing; a single data object can be stored and referenced many times by way of the lookup table.

Figure 5) Lookup table indexing: block sharing is managed and controlled by the lookup table.

Beyond the Hype: What You Should Know about Indexing

A seemingly subtle design difference, deduplication indexing is actually an important consideration in vendor evaluation, particularly when it comes to resiliency. When lookup tables are used to index data objects, the table itself becomes a single point of failure: any corruption of the table is likely to render the entire file system unusable. Catalog-based deduplication, on the other hand, is used only for discovery and is not involved in the actual reading and writing of objects. Catalog deduplication, however, requires that the native file system support multiple block referencing. If resiliency is paramount, catalog-based deduplication should be your first choice. If breadth of product support, including remote systems, is more important, then lookup-table deduplication may be your only alternative.
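The practical difference shows up on the read path. The following schematic sketch is heavily simplified and hypothetical (the class names are invented for this comparison), but it shows which structure each style depends on at read time.

    # Schematic contrast of the two indexing styles.

    class CatalogIndexedFS:
        """Catalog indexing: the catalog is consulted only by a background
        deduplication process; normal reads use the file system's own pointers."""
        def __init__(self):
            self.block_map = {}      # file system pointer -> data (blocks may be shared)
            self.hash_catalog = {}   # fingerprint -> pointer, used only for discovery

        def read(self, pointer):
            return self.block_map[pointer]        # catalog is NOT on the read path

    class LookupTableFS:
        """Lookup-table indexing: every read resolves through the table, which
        therefore becomes a single point of failure if it is corrupted."""
        def __init__(self):
            self.blocks = {}         # fingerprint -> single stored master copy
            self.lookup_table = {}   # logical pointer -> fingerprint of the master copy

        def read(self, pointer):
            fp = self.lookup_table[pointer]       # table IS on the read path
            return self.blocks[fp]

In this sketch, losing the lookup table loses every pointer that resolves through it, whereas losing the hash catalog leaves the data readable and the catalog can be rebuilt by rehashing the objects.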

Inline or Postprocessing

Another vendor design distinction is when to perform deduplication. Again, there are two options:

- Inline deduplication: Deduplication is performed as the data is written to the storage system. With inline deduplication, the entire hash catalog is usually placed in system memory to facilitate fast object comparisons. The advantage of inline deduplication is that it does not require the duplicate data to actually be written to disk; the duplicate object is hashed, compared, and re-referenced on the fly. The disadvantage of this method is that significant system resources are required to handle the entire deduplication operation in milliseconds. Duplicate object validation beyond a quick hash compare is generally not feasible when done inline; therefore these systems usually rely on trusted hash compares without validating that the objects are indeed identical.
- Postprocessing deduplication: Deduplication is performed after the data is written to the storage system. With postprocessing, deduplication can be performed at a more leisurely pace, and it typically does not require heavy utilization of system resources. The disadvantage of postprocessing is that all duplicate data must first be written to the storage system, requiring additional, although temporary, physical space on the system.

Beyond the Hype: What You Should Know about Inline and Postprocessing

The decision regarding inline versus postprocessing deduplication has more to do with your application than with any technical advantages or disadvantages. When performing data backups, the user's objective is completion of backups within an allowed time window. When adding deduplication to the backups, the user's objective is to free up the redundant storage space required for these backups. These two objectives should not compete: the additional time required for deduplication should not drive backups beyond their allotted time window. An assessment should be made to determine whether the time penalty of deduplication is offset by the space savings realized after deduplication, regardless of whether the deduplication is performed inline or postprocessing.

Other applications, such as primary storage and data archiving, do not lend themselves well to inline deduplication. In the case of primary storage, systems rarely have the performance headroom to facilitate inline deduplication. In the case of archival data, users may simply want to sweep their file systems of duplicate data during periods of low activity, similar to other occasional storage housekeeping chores. In these environments, postprocessing deduplication is the preferred method.

The bottom line is that you should evaluate your application environment and the cost of deduplication. If your priority is high-speed data backups with optimal space conservation, choose inline deduplication. If you are interested in deduplicating primary storage or archival data, postprocessing deduplication is probably your best bet.

Note: Some vendors have recently designed their products with the ability to perform either inline or postprocessing deduplication. If you are evaluating a product that makes this claim, make sure that the deduplication mode is user selectable, so that you can set the mode based on your application's characteristics and the storage system workload.
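As a rough illustration of the timing difference, the hypothetical sketch below (not a product design) applies the same fingerprint check either on the write path or in a later sweep.

    # Inline vs. postprocessing timing on the same simulated volume.
    import hashlib

    class Volume:
        def __init__(self, inline: bool):
            self.inline = inline
            self.catalog = {}      # fingerprint -> block id of the master copy
            self.blocks = {}       # block id -> data (simulated physical space)
            self.next_id = 0

        def write(self, data: bytes) -> int:
            fp = hashlib.sha1(data).hexdigest()
            if self.inline and fp in self.catalog:
                return self.catalog[fp]            # deduplicated on the fly: nothing written
            block_id, self.next_id = self.next_id, self.next_id + 1
            self.blocks[block_id] = data           # postprocessing always lands on disk first
            self.catalog.setdefault(fp, block_id)
            return block_id

        def sweep(self) -> None:
            """Postprocess pass run during idle periods: reclaim duplicate blocks.
            A real system would also redirect logical pointers to the master copy."""
            for block_id, data in list(self.blocks.items()):
                fp = hashlib.sha1(data).hexdigest()
                if self.catalog[fp] != block_id:
                    del self.blocks[block_id]      # temporary duplicate space reclaimed

The inline volume never stores the duplicate, at the cost of a hash lookup in the write path; the postprocessing volume needs the temporary space until sweep() runs.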

Source or Destination Deduplication

The final design criterion in choosing a deduplication vendor is where the deduplication occurs. Once again you have two choices: at the source or at the destination.

Source deduplication refers to the comparison of data objects at the source, before they are sent to a destination (usually a data backup destination). The advantage of source deduplication is that less data needs to be transmitted to and stored at the destination point. The disadvantage is that the deduplication catalog and indexing components are dispersed over the network, so deduplication potentially becomes more difficult to administer.

Destination deduplication refers to the comparison of data objects after they arrive at the destination point. The advantage of destination deduplication is that all the deduplication management components are centralized. The disadvantage is that the entire data object must be transmitted over the network before deduplication.

Beyond the Hype: What You Should Know about Source and Destination

The decision to select source or destination deduplication is determined by your objectives. If your main objective is to reduce the amount of network traffic when copying files, source deduplication is the only option. If your goal is to simplify the management of deduplication and to minimize the amount of storage required at the destination, destination deduplication is the preferred choice.
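The network saving of source deduplication comes from asking before sending. The toy exchange below is a hypothetical protocol invented for illustration, not any vendor's wire format: the source sends fingerprints first, and only the objects the destination lacks cross the network.

    # Toy source-side deduplication exchange.
    import hashlib

    def fingerprint(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    class Destination:
        def __init__(self):
            self.store = {}                                 # fingerprint -> data

        def missing(self, fingerprints):
            return [fp for fp in fingerprints if fp not in self.store]

        def receive(self, objects):
            for data in objects:
                self.store[fingerprint(data)] = data

    def backup(source_objects, destination):
        fps = [fingerprint(obj) for obj in source_objects]
        needed = set(destination.missing(fps))              # ask before sending
        to_send = [obj for obj in source_objects if fingerprint(obj) in needed]
        destination.receive(to_send)
        return len(to_send)

    dest = Destination()
    night1 = [b"block A", b"block B", b"block C"]
    night2 = [b"block A", b"block B", b"block D"]           # mostly unchanged data
    print(backup(night1, dest))   # 3 objects cross the network
    print(backup(night2, dest))   # only 1 new object crosses the network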

Deduplication Space Savings

Now that you have carefully evaluated all aspects of deduplication design and made your selection, what should you expect for space savings? Deduplication vendors often claim that their products offer 20:1, 50:1, or even greater data reduction ratios. These claims actually refer to the time-based space savings effect of deduplication on repetitive data backups. Figure 6 shows one vendor's theoretical space savings calculation over time. The space savings shown in this figure depend on repeating backups of highly redundant data. Because the backups contain mostly unchanged data, once the first full backup has been stored, all subsequent full backups see a very high occurrence of deduplication. But what if you don't retain 64 backup copies? What if your backups have a higher change rate than the example in Figure 6? Realizing that space savings numbers from a vendor's marketing department often don't represent a real-life environment, what should you expect for space savings on your backup data sets?

Figure 6) Deduplication ratios increase as backups accumulate over time. (Chart: deduplication ratio and percentage savings plotted against the number of backups retained, from 1 to 64, assuming a 2% change rate on each backup instance.)

Another area to consider is your nonbackup data volumes, such as primary storage and archival data, where the rules of time-based data reduction ratios do not apply. In those environments, volumes do not receive a steady supply of redundant data backups, but they may still contain a large number of duplicate data objects. The ability to reduce space in these volumes through deduplication is measured in spatial terms. In other words, if a 500GB data archival volume can be reduced to 400GB through deduplication, the spatial (volume) reduction is 100GB, or 20%.

Figure 7) Users should look for opportunities to reduce both backup and nonbackup data. (Chart: typical deduplication efficiency ranges on a common ratio scale, roughly 5:1 to 20:1 for backup data and roughly 1.25:1 to 1.75:1 for nonbackup data.)
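As a sanity check on charts like these, a simplified model can show how the ratio grows with retention. The assumptions here are mine, for illustration only: every backup is a full backup, a constant fraction of the data changes between backups, and all unchanged data deduplicates perfectly.

    # Simplified model of time-based deduplication ratios for repeated full backups.
    def backup_dedup_ratio(num_backups: int, change_rate: float) -> float:
        logical = num_backups * 1.0                        # data presented to the system
        physical = 1.0 + (num_backups - 1) * change_rate   # first full plus changed data only
        return logical / physical

    for n in (4, 8, 20, 64):
        ratio = backup_dedup_ratio(n, change_rate=0.02)    # 2% change per backup
        savings = 1 - 1 / ratio
        print(f"{n:>2} backups: about {ratio:4.1f}:1 ({savings:.0%} saved)")

    # Spatial (nonbackup) savings are simpler: a 500GB archive volume reduced to
    # 400GB is a 100GB, or 20%, reduction (a 1.25:1 ratio).

Under these assumptions the model lands roughly in the 5:1 to 20:1 range discussed in the next section for realistic retention counts, well short of the largest marketing figures.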

Beyond the Hype: What You Should Know about Space Savings

When evaluating deduplication space savings, you should have two goals. First, examine your backup data. As shown in Figure 7, it is reasonable to expect 5:1 to 20:1 space savings (over time) with a backup change rate of 2%. If you store more than 20 backup copies on disk, or if your change rate is less than 2%, your deduplication ratio will increase. Conversely, if you retain a smaller number of backup copies, or if your data change rate is higher than 2%, your deduplication space savings ratio will be reduced.

Next, examine your nonbackup data. Are there any opportunities to eliminate duplicate data on those volumes? Generally speaking, if you can reduce this data by 1.25:1 to 1.75:1, deduplication would be economically feasible. Think of it as receiving a storage rebate by reducing the storage capacity of these volumes by 20% to 40%. Enterprise storage systems today can easily exceed $10,000 per TB. Saving just a few terabytes across your enterprise could justify the implementation of volume-based deduplication.

Summary

Data deduplication is an important new technology that is quickly being embraced by users as they struggle to control data proliferation. By eliminating redundant data objects, an immediate benefit is obtained through space efficiencies. When choosing a deduplication product, however, it is important to consider all aspects of design, including hashing, indexing, inline or postprocessing, source or destination, and of course space savings efficiency. Many vendors currently offer deduplication, with more sure to follow, all with various approaches and techniques. It is clear that data deduplication will someday be a requirement for every storage vendor, much as snapshots became a requirement years ago.

Well-designed deduplication must perform without compromising data integrity and reliability. Deduplication magnifies the effect of data corruption: if a deduplicated data object becomes corrupt, it has far-reaching implications, because it is referenced by many other files and applications. Vendors will be required to provide 100% assurance that their design will prevent any such data inconsistencies.

Deduplication must operate seamlessly in existing user environments. Users will not build a storage infrastructure around deduplication; rather, deduplication must fit into their existing environment with minimal disruption. Ultimately, deduplication must be a transparent background process.

Finally, deduplication will be required to have minimal impact on system performance. Users will not implement deduplication if it has a negative impact on their system workloads. This is particularly true as deduplication makes its way from backup applications to more performance-sensitive primary storage environments.

NetApp, a leader in data storage efficiency since 1992, has established A-SIS deduplication as the first deduplication product to be used broadly across many applications, including data backup, data archival, and primary data. A-SIS deduplication combines the benefits of granularity, performance, and resiliency to provide users with significant data deduplication benefits.

© 2007 Network Appliance, Inc. All rights reserved. Specifications subject to change without notice. NetApp and the Network Appliance logo are registered trademarks and Network Appliance is a trademark of Network Appliance, Inc. in the U.S. and other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.