Data Deduplication and Tivoli Storage Manager



Similar documents
Data Deduplication and Tivoli Storage Manager

Data Deduplication in Tivoli Storage Manager. Andrzej Bugowski Spała

Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication

Deduplication Demystified: How to determine the right approach for your business

Effective Planning and Use of TSM V6 Deduplication

Understanding Disk Storage in Tivoli Storage Manager

WHITE PAPER. DATA DEDUPLICATION BACKGROUND: A Technical White Paper

Demystifying Deduplication for Backup with the Dell DR4000

Creating a Cloud Backup Service. Deon George

Data Deduplication Background: A Technical White Paper

Data Deduplication: An Essential Component of your Data Protection Strategy

09'Linux Plumbers Conference

Deduplication and Beyond: Optimizing Performance for Backup and Recovery

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

The Curious Case of Database Deduplication. PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle

DeltaStor Data Deduplication: A Technical Review

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Data Deduplication HTBackup

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM

Enterprise Backup and Restore technology and solutions

A Survey on Deduplication Strategies and Storage Systems

EMC Disk Library with EMC Data Domain Deployment Scenario

A Deduplication-based Data Archiving System

Quanqing XU YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

Tivoli Storage Manager Explained

WHITE PAPER Data Deduplication for Backup: Accelerating Efficiency and Driving Down IT Costs

UNDERSTANDING DATA DEDUPLICATION. Jiří Král, ředitel pro technický rozvoj STORYFLEX a.s.

Availability Digest. Data Deduplication February 2011

Accelerating Backup/Restore with the Virtual Tape Library Configuration That Fits Your Environment

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

ADVANCED DEDUPLICATION CONCEPTS. Larry Freeman, NetApp Inc Tom Pearce, Four-Colour IT Solutions

DXi Accent Technical Background

Cost Effective Backup with Deduplication. Copyright 2009 EMC Corporation. All rights reserved.

Protect Microsoft Exchange databases, achieve long-term data retention

IMPLEMENTATION OF SOURCE DEDUPLICATION FOR CLOUD BACKUP SERVICES BY EXPLOITING APPLICATION AWARENESS

IBM Tivoli Storage Manager 6

3Gen Data Deduplication Technical

Checklist and Tips to Choosing the Right Backup Strategy

Business-Centric Storage FUJITSU Storage ETERNUS CS800 Data Protection Appliance

Don t be duped by dedupe - Modern Data Deduplication with Arcserve UDP

The Business Value of Data Deduplication DDSR SIG

Understanding EMC Avamar with EMC Data Protection Advisor

White. Paper. Addressing NAS Backup and Recovery Challenges. February 2012

EMC Backup Storage Solutions: The Value of EMC Disk Library with TSM

Unitrends Integrated Backup and Recovery of Microsoft SQL Server Environments

Backups in the Cloud Ron McCracken IBM Business Environment

WHY DO I NEED FALCONSTOR OPTIMIZED BACKUP & DEDUPLICATION?

Restoration Technologies. Mike Fishman / EMC Corp.

Data Deduplication for Corporate Endpoints

HP Store Once. Backup to Disk Lösungen. Architektur, Neuigkeiten. rené Loser, Senior Technology Consultant HP Storage Switzerland

How to Get Started With Data

Hardware Configuration Guide

Barracuda Backup Deduplication. White Paper

Deploying De-Duplication on Ext4 File System

Business Benefits of Data Footprint Reduction

IBM Spectrum Protect in the Cloud

How To Make A Backup System More Efficient

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May Copyright 2014 Permabit Technology Corporation

TSM (Tivoli Storage Manager) Backup and Recovery. Richard Whybrow Hertz Australia System Network Administrator

Considerations when Choosing a Backup System for AFS

EMC AVAMAR. a reason for Cloud. Deduplication backup software Replication for Disaster Recovery

IBM Tivoli Storage Manager

Protect Data... in the Cloud

Turnkey Deduplication Solution for the Enterprise

Introduction to Data Protection: Backup to Tape, Disk and Beyond. Michael Fishman, EMC Corporation

Efficient Backup with Data Deduplication Which Strategy is Right for You?

Trends in Data Protection and Restoration Technologies. Mike Fishman, EMC 2 Corporation (Author and Presenter)

Comparison of Cloud vs. Tape Backup Performance and Costs with Oracle Database

Reduce your data storage footprint and tame the information explosion

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Managed Services - A Paradigm for Cloud- Based Business Continuity

<Insert Picture Here> Refreshing Your Data Protection Environment with Next-Generation Architectures

Rapid Data Backup and Restore Using NFS on IBM ProtecTIER TS7620 Deduplication Appliance Express IBM Redbooks Solution Guide

CA ARCserve r Data Deduplication Frequently Asked Questions

A Deduplication File System & Course Review

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose

EMC VNXe File Deduplication and Compression

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

EMC PERSPECTIVE. An EMC Perspective on Data De-Duplication for Backup

NETAPP WHITE PAPER Looking Beyond the Hype: Evaluating Data Deduplication Solutions

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant

Data Reduction Methodologies: Comparing ExaGrid s Byte-Level-Delta Data Reduction to Data De-duplication. February 2007

Understanding EMC Avamar with EMC Data Protection Advisor

RUBRIK CONVERGED DATA MANAGEMENT. Technology Overview & How It Works

Transcription:

Data Deduplication and Tivoli Storage Manager Dave annon Tivoli Storage Manager rchitect March 2009

Topics Tivoli Storage, IM Software Group Deduplication technology Data reduction and deduplication in TSM 2 Data Deduplication and Tivoli Storage Manager

Data Reduction Methods ompression Single instance store (SIS) Data deduplication Encoding of data to reduce size Typically localized, such as to a single file, directory tree or storage volume form of compression, usually applied to a large collection of files in a shared data store Only one instance of a file is retained in the data store Duplicate instances of the file reference the stored instance lso known as redundant file elimination form of compression, usually applied to a large collection of files in a shared data store In contrast to SIS, deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents) Only one instance is stored for each common chunk Duplicate instances of the chunk reference the stored instance This terminology is not used consistently throughout the industry. In particular, the terms SIS and deduplication are sometimes used interchangeably. 3 Data Deduplication and Tivoli Storage Manager

Deduplication oncept G E J H L D H D I J Unique subfiles I K F H G E Duplicate subfiles E J L K F Data source 1 K Data source 2 E Data source 3 Data at source locations (backup client machines) Target data store (backup server) 4 Data Deduplication and Tivoli Storage Manager

How Deduplication Works Data Store Data Store Data Store b a Data Store a a c b 1. Data chunks are evaluated to determine a unique signature for each 2. Signature values are compared to identify all duplicates 3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space This This section section provides a generalized description of of deduplication technology. Individual deduplication products and and solutions may may vary. vary. 5 Data Deduplication and Tivoli Storage Manager

Data Deduplication Value Proposition Potential advantages Reduced storage capacity required for a given amount of data bility to store significantly more data on given amount of disk Restore from disk rather than tape may improve ability to meet recovery time objective (RTO) Network bandwidth savings (some implementations) Lower storage-management and energy costs resulting from reduced storage requirements Potential tradeoffs/limitations Significant PU and I/O resources required for deduplication processing Deduplication might not be compatible with encryption Increased sensitivity to media failure because many files could be affected by loss of common chunk Deduplication may not be suitable for data on tape because increased fragmentation of data could greatly increase access time 6 Data Deduplication and Tivoli Storage Manager

Deduplication Design onsiderations Source-side vs. target-side In-band vs. out-of-band Method used for data chunking How redundant chunks are identified voiding false matches How redundant chunks are eliminated and tracked 7 Data Deduplication and Tivoli Storage Manager

Where Deduplication is Performed pproach dvantages Disadvantages Source-side (client-side) Deduplication performed at the data source (e.g., by a backup client), before transfer to target location Target-side (server-side) Deduplication performed at the target (e.g., by backup software or storage appliance) Deduplication before transmission conserves network bandwidth wareness of data usage and format may allow more effective data reduction Processing at the source may facilitate scale-out No deployment of client software at endpoints Possible use of direct comparison to confirm duplicates Deduplication consumes PU cycles on the file/ application server Requires software deployment at source (and possibly target) endpoints Depending on design, may be subject to security attack via spoofing Deduplication consumes PU cycles on the target server or storage device Data may be discarded after being transmitted to the target 8 Data Deduplication and Tivoli Storage Manager

When Deduplication is Performed pproach dvantages Disadvantages In-band Deduplication performed during data transfer from source to target Immediate data reduction, minimizing disk storage requirement No post-processing May be bottleneck for data ingestion (e.g., longer backup times) Only one deduplication process for each I/O stream May not support deduplication of legacy data on the target server Out-of-band Deduplication performed after data ingestion at the target No impact to data ingestion Potential for deduplication of legacy data Possibility for parallel data deduplication processing Data must be processed twice (during ingestion and subsequent deduplication) Storage needed to retain data until deduplication occurs 9 Data Deduplication and Tivoli Storage Manager

Generalized Deduplication Processing 1. hunk the object Divide object into logical segments called chunks 2. Identify duplicate chunks Hash each chunk to produce unique identifier ompare each chunk identifier with index to determine whether chunk is already stored 3. Eliminate redundant chunks Update index to reference matching chunks Deallocate space for redundant chunks 5 91 17 5 91 17 25 Object hunks hunk identifiers Index 25 hunk identifiers Index 10 Data Deduplication and Tivoli Storage Manager

Data hunking Methods Lowest overhead (PU, I/O, indexing) Whole file chunking Each file is treated as a single chunk No detection of duplicate data at subfile level Fixed-size chunking hunk boundaries occur at fixed intervals, irrespective of data content Method is unable to detect duplicate data if there is an offset difference ecause redundant data has shifted due to insertion/deletion ecause redundant data is embedded within another file or contained in a composite structure Variable-size chunking Rolling hash algorithm is used to determine chunk boundaries to achieve an expected average chunk size an detect redundant data, irrespective of offset differences Often referred to as fingerprinting (e.g., Rabin fingerprinting) Greatest data reduction Format-aware chunking In setting chunk boundaries, algorithm considers data format/structure Examples: awareness of backup stream formatting; awareness of PowerPoint slide boundaries; awareness of file boundaries within a composite 11 Data Deduplication and Tivoli Storage Manager

Identification of Redundant hunks Unique identifier is determined for each chunk Identifiers are typically calculated using a hash function that outputs a digest based on the data in each chunk MD5 (message-digest algorithm) SH (secure hash algorithm) For each chunk, the identifier is compared against an index of identifiers to determine whether that chunk is already in the data store Selection of hash function involves tradeoffs between Processing time to compute hash values Index space required to store hash values Risk of false matches 12 Data Deduplication and Tivoli Storage Manager

False Matches Possibility exists that two different data chunks could hash to the same identifier (such an event is called a collision) Should a collision occur, the chunks could be falsely matched and data loss could result ollision probability can be calculated from the possible number of unique identifiers and the number of chunks in the data store Longer digest More unique identifiers Lower probability of collisions More chunks Higher probability of collisions pproaches to avoiding data loss due to collisions Use a hash function that produces a long digest to increase the possible number of unique identifiers ombine values from multiple hash functions ombine hash value with other information about the chunk Perform byte-wise comparison of chunks in the data store to confirm matches 13 Data Deduplication and Tivoli Storage Manager

Hash Functions Hash functions take a message of arbitrary length as input and output a fixed length digest of L bits. They are published algorithms, normally standardized as RF. Name Output size L (bits) Performance (cycles/byte) Intel Xeon: / assembly* ollision chance 50% (or greater) when these many chunks (or more) are generated ** hance of one collision in a 40 P archive*** (using 4K / chunk) Year of the standard MD5 128 9.4 / 3.7 2 64 10 20 1.5*10-13 1992 SH-1 160 25 / 8.3 2 80 10 24 3.4*10-23 1995 SH-256 256 39 / 20.6 2 128 10 40 4.3*10-52 2002 SH-512 512 135 / 40.2 2 256 10 80 3.7*10-129 2002 Whirlpool 512 112 / 36.5 2 256 10 80 3.7*10-129 2003 * Performance analysis and parallel implementation of dedicated hash functions, Proc. of EURORYPT 2002, pp 165-180, 2002. ** The probability of one collision out of k chunks is p 1-e -(k^2)/2*n, where N=2 L ; when p=0.5, we get k N 1/2 = 2 L/2 (from birthday paradox). Probability of collision is extremely low and can be *** The probability of one hard-drive bit-error is about 10-14. reduced at the expense of performance by using hash function that produces longer digest 14 Data Deduplication and Tivoli Storage Manager

Elimination of Redundant hunks For each redundant chunk, the index is updated to reference the matching chunk Index is updated with metadata indicating how to reconstruct the object from chunks, some of which may be shared with other objects ny space occupied by the redundant chunks can be deallocated and reused Deduplication index is critical Integrity Performance Scalability Protection 15 Data Deduplication and Tivoli Storage Manager

Deduplication Ratios Used to indicate compression achieved by deduplication If deduplication reduces 500 T of data to 100 T, ratio is 5:1 Ratios reflect design tradeoffs involving performance and compression ctual compression ratios will be highly dependent on other variables Data from each source: redundancy, change rate, retention Number of data sources and redundancy of data among those sources ackup methodology: incremental forever, full+incremental, full+differential Whether data encryption occurs prior to deduplication In addition to above variables, some vendors include data reduction achieved by incremental backup and conventional compression Deduplication vendors claim ratios in the range 2:1 to 500:1 Meaningful comparison of ratios is extremely difficult beware of hype! 16 Data Deduplication and Tivoli Storage Manager

Deduplication and Encryption Data source 1 Important text No encryption Important text Data Data encryption encryption prior prior to to deduplication deduplicationprocessing processing can can subvert subvert data data reduction reduction Data source 2 Important text Encryption key 1 txpt tnatroemi Data deduplication Data store Data source 3 Important text Encryption key 2 te tarpixtntom Important text txpt tnatroemi te tarpixtntom 1. Three data sources have the same text file 2. fter encryption, text files do not match 3. Deduplication processing does not detect redundancy 4. Text files are stored without data reduction 17 Data Deduplication and Tivoli Storage Manager

Topics Tivoli Storage, IM Software Group Deduplication technology Data reduction and deduplication in TSM 18 Data Deduplication and Tivoli Storage Manager

TSM Data Reduction lient compression Files compressed by client before transmission onserves network bandwidth and server storage Server-side deduplication dded in TSM 6 onserves server storage Improves recovery time as compared to storage on tape Device compression ompression performed by storage hardware onserves server storage lient Subfile backup Only changed portions of files are transmitted onserves network bandwidth and server storage Incremental forever fter initial backup, file is not backed up again unless it changes onserves network bandwidth and server storage Server Incremental forever greatly reduces data redundancy compared to traditional methodologies. TSM has less potential for data reduction via deduplication as compared to other backup products. Storage Hierarchy ppliance deduplication Deduplication performed by storage appliance (VTL or NS) onserves server storage 19 Data Deduplication and Tivoli Storage Manager

TSM 6 Deduplication Overview lient 1 Deduplication Node File Deduplicated disk storage pool stores unique chunks to reduce disk utilization lient 2 Server lient 1 lient 2 lient 3 Files, and have common data lient 3 TSM Database Tape copy pool stores,, and individually to avoid performance degradation during restore 20 Data Deduplication and Tivoli Storage Manager

Deduplication Highlights Disk storage requirement reduced via optional data deduplication for FILE storage pools Deduplication processing performed on TSM server and tracked in database Reduced redundancy for Identical objects from same or different client nodes (even if file names are different) ommon data chunks (subfiles, extents) in objects from same or different nodes Post-ingestion (out-of-band) detection of duplicate data on TSM server to minimize impact to backup windows Space occupied by duplicate data will be removed during reclamation processing llowed for all data types: backup, archive, HSM, TDP, PI applications Transparent client access to deduplicated objects 21 Data Deduplication and Tivoli Storage Manager

Deduplication Highlights Deployment of new clients or PI applications not required Legacy data stored in or moved to enabled FILE storage pools can be deduplicated Data migrated or copied to tape will be reduplicated to avoid excessive mounting and positioning during subsequent access bility to control number, duration and scheduling of PU-intensive background processes for identification of duplicate data Reporting of space savings in deduplicated storage pools Deduplication processing will skip client-encrypted objects, but should work with storage-device encryption Native TSM implementation, with no dependency on specific hardware 22 Data Deduplication and Tivoli Storage Manager

Deduplication Example 1. lient1 backs up files,, and D. Files and have different names, but the same data. 2. lient2 backs up files E, F and G. File E has data in common with files and G. D E F G lient1 Server lient2 Server Vol1 D Vol1 D Vol2 E F G 3. Server process divides data into chunks and identifies duplicate chunks 1, E2 and G1. 4. Reclamation processing recovers space occupied by duplicate chunks. Vol1 Vol2 Server 0 1 0 1 0 2 D0 1 D1 1 E1 E1 E2 E3 F1 G1 Server Vol3 0 1 0 1 0 2 D0 D1 1 E1 E1 E3 F1 23 Data Deduplication and Tivoli Storage Manager

Design Points for Deduplication in TSM 6 Design point Server-side Out-of-band Index maintained in TSM server database (D2) Variable-size chunking voidance of false matches verage chunk size 256K Space occupied by redundant chunks recovered during reclamation omments voids need for deployment of client software Effective for all types of stored data llows deduplication of legacy data in addition to new data Minimizes impact to backup windows oncurrent processes to identify duplicate data Transactional integrity, scalability, performance, disaster protection Rabin fingerprinting with awareness of TSM data format SH-1 digest for each chunk omparison of chunk size MD5 digest for entire data object (checked during restore) Larger chunks require less database overhead Larger chunks reduce the total number of chunks required for given amount of data and therefore reduce collision probability Smaller chunks could improve deduplication ratio llows coordinated recovery of space occupied by deleted objects and redundant chunks 24 Data Deduplication and Tivoli Storage Manager

omparison of TSM Data Reduction Methods lient compression Incremental forever Subfile backup Deduplication in TSM 6 How data reduction is achieved lient compresses files lient only sends changed files lient only sends changed subfiles Server eliminates redundant data chunks onserves storage pool space? Yes Yes Yes Yes onserves network bandwidth? Yes Yes Yes No Data supported ackup, archive, HSM, PI ackup ackup (Windows only) ackup, archive, HSM, PI Scope of data reduction Redundant data within same file on client node Files that do not change between backups Subfiles that do not change between backups Redundant data from any files in storage pool voids storing identical files renamed, copied, or relocated on client node? No No No Yes Removes redundant data for files from different client nodes? No No No Yes 25 Data Deduplication and Tivoli Storage Manager

onsiderations for Use of TSM Deduplication onsider deduplication if Data recovery would improve by storing more data objects on limited amount of disk Data will remain on disk for extended period of time Much redundancy in data stored by TSM (e.g., for common operating-system or project files) TSM server PU and disk I/O resources are available for intensive processing to identify duplicate chunks Deduplication might not be indicated for Mission-critical data, whose recovery could be delayed by accessing chunks that are not stored contiguously TSM servers that do not have sufficient resources Data that will soon be migrated to tape 26 Data Deduplication and Tivoli Storage Manager

Summary Tivoli Storage, IM Software Group Data deduplication can reduce storage requirements, allowing more data to be retained on disk for fast access Deduplication involves tradeoffs relating to degree of compression, performance, risk of data loss and compatibility with encryption TSM s incremental forever method avoids periodic full backups, reducing the potential for additional data reduction via deduplication Server-side deduplication is a key enhancement in TSM 27 Data Deduplication and Tivoli Storage Manager