Data Deduplication in Tivoli Storage Manager. Andrzej Bugowski 19-05-2011 Spała





Agenda
- Deduplication concepts
- Data deduplication in TSM 6.1
- Planning for data deduplication in TSM V6
- New/changed commands and options for data deduplication in TSM V6

Deduplication Concepts

Deduplication Concept
[Figure: subfiles (A through L) at three data sources (backup client machines) and in the target data store (backup server), with unique and duplicate subfiles marked.]

How Deduplication Works
[Figure: contents of the data store at each of the three steps.]
1. Data chunks are evaluated to determine a unique signature for each.
2. Signature values are compared to identify all duplicates.
3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space.
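
The three steps can be sketched in a few lines of Python. This is only an illustration: the fixed 4-byte chunk size, the use of SHA-256 as the signature, and the names deduplicate and rebuild are assumptions made for the example, not a description of how TSM chunks or hashes data.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}      # signature -> the single stored copy of that chunk
    pointers = []   # one signature per original chunk, in original order
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        signature = hashlib.sha256(chunk).hexdigest()  # step 1: compute a signature
        if signature not in store:                     # step 2: compare signatures
            store[signature] = chunk
        pointers.append(signature)                     # step 3: keep only a pointer
    return store, pointers


def rebuild(store, pointers) -> bytes:
    """Reassemble the original data by following the pointers."""
    return b"".join(store[signature] for signature in pointers)


data = b"AAAABBBBAAAACCCCBBBB"
store, pointers = deduplicate(data)
print(len(pointers), "chunks referenced,", len(store), "chunks stored")
```

Five chunks are referenced but only three are stored, and rebuild(store, pointers) returns the original bytes.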

Data Deduplication Value Proposition
Potential advantages
- Reduced storage capacity required for a given amount of data
- Ability to store significantly more data on a given amount of disk
- Restore from disk rather than tape may improve ability to meet recovery time objective (RTO)
- Network bandwidth savings (some implementations)
- Lower storage-management and energy costs resulting from reduced storage requirements
Potential tradeoffs/limitations
- Significant CPU and I/O resources required for deduplication processing
- Deduplication might not be compatible with encryption
- Increased sensitivity to media failure, because many files could be affected by loss of a common chunk
- Deduplication may not be suitable for data on tape, because increased fragmentation of data could greatly increase access time

Where Deduplication is Performed
Source-side (client-side): deduplication performed at the data source (e.g., by a backup client), before transfer to the target location
  Advantages
  - Deduplication before transmission conserves network bandwidth
  - Awareness of data usage and format may allow more effective data reduction
  - Processing at the source may facilitate scale-out
  Disadvantages
  - Deduplication consumes CPU cycles on the file/application server
  - Requires software deployment at source (and possibly target) endpoints
  - Depending on design, may be subject to security attack via spoofing
Target-side (server-side): deduplication performed at the target (e.g., by backup software or a storage appliance)
  Advantages
  - No deployment of client software at endpoints
  - Possible use of direct comparison to confirm duplicates
  Disadvantages
  - Deduplication consumes CPU cycles on the target server or storage device
  - Data may be discarded after being transmitted to the target
Note: Source-side and target-side deduplication are not mutually exclusive.

When Deduplication is Performed
In-band: deduplication performed during data transfer from source to target
  Advantages
  - Immediate data reduction, minimizing disk storage requirement
  - No post-processing
  Disadvantages
  - May be a bottleneck for data ingestion (e.g., longer backup times)
  - Only one deduplication process for each I/O stream
  - May not support deduplication of legacy data on the target server
Out-of-band: deduplication performed after data ingestion at the target
  Advantages
  - No impact to data ingestion
  - Potential for deduplication of legacy data
  - Possibility for parallel data deduplication processing
  Disadvantages
  - Data must be processed twice (during ingestion and subsequent deduplication)
  - Storage needed to retain data until deduplication occurs
Note: In-band and out-of-band deduplication are not mutually exclusive.

Deduplication Ratios
- Used to indicate the reduction achieved by deduplication
- If deduplication reduces 500 TB of data to 100 TB, the ratio is 5:1
- Ratios reflect design tradeoffs involving performance and reduction
- Actual ratios will be highly dependent on other variables
  - Data from each source: redundancy, change rate, retention
  - Number of data sources and redundancy of data among those sources
  - Backup methodology: incremental forever, full+incremental, full+differential
  - Whether data encryption occurs prior to deduplication
- In addition to the above variables, some vendors include data reduction achieved by incremental backup and conventional compression
- Deduplication vendors claim ratios in the range 2:1 to 500:1
- Meaningful comparison of ratios is extremely problematic; beware of hype!
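
The ratio arithmetic is simple enough to check by hand, but a small sketch makes the relationship between ratio and space saved explicit. The function names are hypothetical and exist only for this example.

```python
def dedup_ratio(before_tb: float, after_tb: float) -> float:
    """Return N for a deduplication ratio of N:1."""
    if after_tb <= 0:
        raise ValueError("size after deduplication must be positive")
    return before_tb / after_tb


def space_saved_percent(before_tb: float, after_tb: float) -> float:
    """Return the percentage of the original capacity that is no longer needed."""
    return 100.0 * (before_tb - after_tb) / before_tb


ratio = dedup_ratio(500, 100)          # the slide's example: 500 TB reduced to 100 TB
saved = space_saved_percent(500, 100)
print(f"{ratio:.0f}:1 ratio, {saved:.0f}% of capacity saved")
```

By the same formula a 2:1 ratio saves 50% of capacity and a 500:1 ratio saves 99.8%, so very large ratios add little extra saving in absolute terms.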

Deduplication and Encryption
Data encryption prior to deduplication processing can subvert data reduction.
[Figure: data sources 1, 2 and 3 each hold the same file containing "Important text". Source 1 sends it with no encryption, source 2 encrypts it with encryption key 1, and source 3 encrypts it with encryption key 2. After data deduplication, the data store holds three different byte sequences.]
1. Three data sources have the same text file.
2. After encryption, the text files do not match.
3. Deduplication processing does not detect the redundancy.
4. The text files are stored without data reduction.
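
The effect can be reproduced with a toy example. The XOR cipher below is deliberately trivial and not secure, and both it and the SHA-256 signature are assumptions chosen for illustration; they do not represent the encryption or hashing used by TSM or any other product.

```python
import hashlib
from itertools import cycle


def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy XOR cipher for illustration only; NOT secure."""
    return bytes(p ^ k for p, k in zip(plaintext, cycle(key)))


def signature(data: bytes) -> str:
    """Signature a deduplicating store might compare (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


text = b"Important text"

# Three sources send the same file without encryption: one signature, one stored copy.
plain_signatures = {signature(text) for _ in range(3)}

# One source sends plaintext, two encrypt with different keys: no signatures match.
mixed_copies = [text, toy_encrypt(text, b"key-1"), toy_encrypt(text, b"key-2")]
mixed_signatures = {signature(copy) for copy in mixed_copies}

print(len(plain_signatures), "unique signature without encryption")
print(len(mixed_signatures), "unique signatures with per-source encryption")
```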

Data Deduplication in TSM V6

Data Reduction with TSM Today
- Client compression: files are compressed by the client before transmission. Conserves network bandwidth and server storage.
- Device compression: compression is performed by storage hardware. Conserves server storage.
- Subfile backup: only changed portions of files are transmitted. Conserves network bandwidth and server storage.
- Incremental forever: after the initial backup, a file is not backed up again unless it changes. Conserves network bandwidth and server storage.
- Appliance deduplication: deduplication is performed by a storage appliance (VTL or NAS). Conserves server storage.

Native Data Deduplication in TSM
- TSM's incremental forever methodology greatly reduces data redundancy compared to traditional methodologies based on periodic full backups.
- Consequently, there is less potential for data reduction via deduplication in TSM than in other backup products.
- Nevertheless, deduplication is an important function for TSM because it allows more data objects to be stored on a given amount of disk for fast access.
- Native deduplication is a key product enhancement in TSM.

TSM 6.1 Deduplication Overview
[Figure: clients 1, 2 and 3 back up files A, B and C to the TSM server, which has a TSM database, a deduplicated disk storage pool and a tape pool.]
- Files A, B and C have common data.
- The deduplicated disk storage pool stores unique chunks to reduce disk utilization.
- The tape pool stores A, B, and C individually to avoid performance degradation during restore.
- Allows more objects to be stored on disk for fast access.

Deduplication Example
1. Client1 backs up files A, B, C and D. Files A and C have different names, but the same data.
2. Client2 backs up files E, F and G. File E has data in common with files B and G.
3. A server process chunks the data and identifies duplicate chunks C1, E2 and G1.
4. Reclamation processing recovers the space occupied by duplicate chunks.
[Figure: contents of server volumes Vol1, Vol2 and Vol3 at each step.]

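Comparison of TSM Data Reduction Methods
How data reduction is achieved
- Client compression: client compresses files
- Incremental forever: client only sends changed files
- Subfile backup: client only sends changed subfiles
- Deduplication in TSM 6.1: server eliminates redundant data chunks
Conserves storage pool space?
- Client compression: Yes
- Incremental forever: Yes
- Subfile backup: Yes
- Deduplication in TSM 6.1: Yes
Conserves network bandwidth?
- Client compression: Yes
- Incremental forever: Yes
- Subfile backup: Yes
- Deduplication in TSM 6.1: No
Data supported
- Client compression: backup, archive, HSM, API
- Incremental forever: backup
- Subfile backup: backup (Windows only)
- Deduplication in TSM 6.1: backup, archive, HSM, API
Scope of data reduction
- Client compression: redundant data within the same file on a client node
- Incremental forever: files that do not change between backups
- Subfile backup: subfiles that do not change between backups
- Deduplication in TSM 6.1: redundant data from any files in the storage pool
Avoids storing identical files renamed, copied, or relocated on client node?
- Client compression: No
- Incremental forever: No
- Subfile backup: No
- Deduplication in TSM 6.1: Yes
Removes redundant data for files from different client nodes?
- Client compression: No
- Incremental forever: No
- Subfile backup: No
- Deduplication in TSM 6.1: Yes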

Planning for Data Deduplication in TSM V6

When Do I Use ProtecTIER vs TSM 6 Built-in Deduplication?
Both solutions offer the benefits of target-side deduplication:
- Greatly reduced storage capacity requirements
- Lower operational costs, energy usage and TCO
- Faster recoveries with more data on disk
Use ProtecTIER when:
- Highest performance and capacity scaling are required
  - Deduplication rates of up to 500 MB/sec (1 GB/sec with 2 nodes) are needed
  - Deduplicated capacities of up to 25 PB are required
- You want deduplication to be done inline during data ingest
- A VTL appliance model is desired
- You are deduplicating across multiple TSM (or other backup) servers
Use TSM 6 built-in deduplication when:
- Sufficient TSM server resources can be made available and you want deduplication operations to be completely integrated within TSM
- The benefits of deduplication are desired without separate hardware or software dependencies or licenses (ships with TSM Extended Edition)
- You want end-to-end data lifecycle management with a minimized data store
Complementary solutions today: they can be used together, but don't deduplicate the same data twice.

Considerations for Use of TSM Deduplication
Consider deduplication if
- Data recovery would improve by storing more data objects on a limited amount of disk
- Data will remain on disk for an extended period of time
- There is much redundancy in the data stored by TSM (e.g., for common operating-system or project files)
- TSM server CPU and disk I/O resources are available for intensive processing to identify duplicate chunks
Deduplication might not be suitable for
- Mission-critical data, whose recovery could be delayed by accessing chunks that are not stored contiguously
- TSM servers that do not have sufficient resources
- Data that will soon be migrated to tape

Planning for TSM Deduplication
How do I want to control data deduplication processes?
- You can have them running all the time and have them process data as transactions commit. This could be more CPU intensive.
- You can run them after backups have completed and then cancel them after the identification of duplicate data has finished. Consider whether you have the extra processing time in the day to run this as a separate step. (See the discussion on controlling the identify duplicates process manually for suggestions.)
Do you want to set up new storage pools or use existing ones?
- You may want only certain nodes to perform data deduplication. These could be updated to new policy domains and management classes that point to new storage pools.

Planning for TSM Deduplication
How can I estimate my space savings with data deduplication in my environment? (2 possible techniques)
- The best way to test this is with a test system you can delete when done.
- Another way: consider backing up your data from a primary storage pool to a temporary copy storage pool that has data deduplication enabled to estimate the space savings. The downside of this technique is that it will increase the DB size.
1. Create a copy stgpool using a devclass type of FILE.
2. Do a BACKUP STGPOOL from the primary stgpool to the copy stgpool.
3. Run IDENTIFY DUPLICATES against the volumes in the copy storage pool.
4. When the IDENTIFY DUPLICATES process goes to idle state, set the reclamation threshold for the storage pool to 1%.
5. After reclamation finishes, issue the q stgpool command against the copy storage pool to check the amount of space that was saved.
6. If the results are satisfactory, update the primary stgpool to specify that deduplication is to be used. (Or, if it is of type DISK, move the data to a new stgpool that is defined with devclass FILE.)

Planning for TSM Deduplication
How much additional log space is required for data deduplication?
- There are many factors in calculating this. See the slide topic "How can I estimate my space savings with data deduplication in my environment?". That test, in addition to looking at the number of logs archived during the identify process, could be used to estimate the log consumption during the data deduplication identification process.
Is data deduplication suitable for all types of disk subsystems?
- Restores from a deduplicated storage pool are random in nature. For disk subsystems that are slower and do not have adequate cache (for example SATA), restores of deduplicated data will be impacted.

Planning for TSM Deduplication
What is the possible impact on restore performance if I implement TSM data deduplication?
- Smaller files (less than 100K) that have been deduplicated will restore more slowly than files that are not deduplicated. Having more sessions doing the restore will improve restore performance.
How many identify processes should I run on my TSM server?
- The identify process is both CPU and I/O intensive. You should run no more than N identify processes for an N-way CPU. Each identify process can use an entire CPU, so if you need CPU for other processes, use fewer.

Planning for TSM Deduplication
What is the impact on database size when implementing data deduplication?
- The average chunk size is 256K, and for each chunk, approximately 500 bytes of metadata are added to the TSM DB. Also, files smaller than 2K are not eligible for data deduplication.
How do I change my daily housekeeping schedules for data deduplication? Where does the identify process fit in the daily cycle?
- If you are going to run the identify process as part of daily housekeeping, run it after BACKUP STGPOOL has completed but before reclamations for devclass FILE storage pools have run.
- You need to know when the identify duplicates process has gone to IDLE state, as these processes are different from other TSM processes. Sample select to use for automation:
  select count(*) from processes where status like '%State: idle%' and process='Identify Process'
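
The database-size figures lend themselves to a back-of-the-envelope estimate. The sketch below assumes that 256K means 256 KiB, that every chunk of data in the pool contributes about 500 bytes of metadata, and that the effect of files under 2K can be ignored; the function name is hypothetical.

```python
AVG_CHUNK_BYTES = 256 * 1024       # average chunk size from the slide, read as 256 KiB
METADATA_BYTES_PER_CHUNK = 500     # approximate metadata per chunk from the slide


def estimated_db_growth_bytes(pool_bytes: int) -> int:
    """Rough estimate of TSM DB growth for a given amount of deduplicated pool data."""
    chunks = -(-pool_bytes // AVG_CHUNK_BYTES)   # ceiling division
    return chunks * METADATA_BYTES_PER_CHUNK


one_tib = 1024 ** 4
growth = estimated_db_growth_bytes(one_tib)
print(f"{growth:,} bytes of metadata per TiB (about {growth / 1024**3:.2f} GiB)")
```

Under these assumptions each TiB of pool data adds roughly 2 GiB of database metadata, which is a starting point for sizing rather than a measured figure.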

Planning for TSM Deduplication
How should I start with data deduplication?
If you are going to use data deduplication on an existing storage pool, consider the following approach:
- Initially set identify processes to 0.
- On a daily basis, run identify duplicates for some duration until all volumes in the storage pool have been processed.
- Then update the storage pool identifyprocess parameter to an appropriate value.
If you are going to use data deduplication in a new storage pool, consider the following approach:
- Initially set identify processes to 1 or 2.
- On a daily basis, do a move nodedata for a few nodes into the new storage pool, and then point them to a new policy domain. Identify will process that set of nodes' data.
- When all nodes have been moved, delete the old storage pool and change copygroups to reflect the new storage hierarchy.
- Then update the storage pool identifyprocess parameter to an appropriate value.

Expected Deduplication Behavior
- Deployment of new clients or API applications is not required (but use of a TSM 6.1 client or higher may improve the deduplication ratio). The TSM 6.1 client separates out the metadata, which will slightly improve data deduplication for the client's data.
- Legacy data stored in or moved to enabled FILE storage pools can be deduplicated.
- Data migrated or copied to tape will be reduplicated to avoid excessive mounting and positioning during subsequent access.
- Ability to control the number, duration and scheduling of CPU-intensive background processes for identification of duplicate data.
- Reporting of space savings in deduplicated storage pools.
- Deduplication processing will skip client-encrypted objects, but should work with storage-device encryption.
- Native TSM implementation, with no dependency on specific hardware.

New/Changed commands and options for Data Deduplication in TSM V6

New Externals
DEFINE / UPDATE STGPOOL stgpoolname DEDUPlicate=No|Yes IDENTIFYPRocess=nn
  (IDENTIFYPRocess: default number of background processes)
- The number of identify processes specified in the storage pool definition run indefinitely, or until you issue the IDENTIFY DUPLICATES command, update the storage pool definition again, or cancel the process.
- The identify process is different from other server processes. When other server processes finish a task, they end. When duplicate identification processes finish, they quiesce and go into an idle state.

New Externals
IDentify DUPlicates stgpoolname DUration=mm NUMPRocess=nn
  (DUration: minutes to run the process; NUMPRocess: overrides the stgpool setting)
- When you set a lower number of identify processes than are currently running, TSM will finish processing the current file before the identify duplicates process completes.
- When the duration expires and the identify process completes, the number of identify processes that run reverts to the number in the storage pool definition.
- The result of the command differs depending on whether the duration parameter is specified. See the following slides for reference. These examples assume you have a storage pool initially defined with identify processes set to a value of 3.

Controlling Duplicate Identify Processes Manually
If you had a critical restore executing, and you wanted to temporarily cut back on the identify processes for 2 hours, you could issue:
  identify duplicates stgpool-name numpr=2 dur=120

Controlling Duplicate Identify Processes Manually
If you had only 1 hour left before reclamation starts, and you were not done with identify duplicates processing, you could issue, for example:
  identify duplicates stgpool-name numpr=4 dur=60

Controlling Duplicate Identify Processes Manually
A suggestion for running identify after backups, assuming an 8-hour backup window: issue the following command at the beginning of backups:
  identify duplicates stgpool-name numpr=0 duration=480
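
The three examples share one pattern: a temporary override that lasts for the given duration and then falls back to the storage pool setting. The sketch below models only that pattern as the slides describe it. The function name and parameters are hypothetical, and it ignores the detail that a process finishes its current file before stopping.

```python
def active_identify_processes(pool_setting: int, numprocess: int,
                              duration_min: int, minutes_since_command: int) -> int:
    """Number of identify processes expected to be running at a given time.

    pool_setting:          IDENTIFYPRocess value in the storage pool definition
    numprocess, duration:  values given on a hypothetical IDENTIFY DUPLICATES call
    """
    if 0 <= minutes_since_command < duration_min:
        return numprocess      # override is in effect
    return pool_setting        # duration expired: revert to the pool definition


POOL_SETTING = 3  # the slides assume a pool defined with 3 identify processes

# Critical restore: cut back to 2 processes for 2 hours.
print(active_identify_processes(POOL_SETTING, 2, 120, 60))    # during the override
print(active_identify_processes(POOL_SETTING, 2, 120, 120))   # after it expires

# Backup window: no identify processes for 8 hours, then back to the pool setting.
print(active_identify_processes(POOL_SETTING, 0, 480, 10))
print(active_identify_processes(POOL_SETTING, 0, 480, 480))
```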


New Server Option
DEDUPREQUIRESBACKUP YES|NO
- Indicates whether a volume in a primary deduplicated storage pool can be reclaimed before it is backed up to a non-deduplicated copy pool. Copying the data to an active-data pool does not meet the backup requirement.
- If YES, a volume in a deduplicated primary pool cannot be reclaimed until it has been backed up via BACKUP STGPOOL. YES is the default.
- If NO, the reclamation criteria remain unchanged (the same as the previous reclamation criteria). Setting this value to NO could cause unrecoverable data loss in the highly unlikely event that an object generates a false-positive match on another extent.
- This option can be changed dynamically with the SETOPT command.
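
The rule reduces to a small decision function. The sketch below models only what the slide states; the function and parameter names are hypothetical, and the real server applies further reclamation criteria that are not covered here.

```python
def volume_may_be_reclaimed(dedup_requires_backup: bool,
                            backed_up_to_copy_pool: bool,
                            meets_reclamation_criteria: bool) -> bool:
    """Decide whether a volume in a deduplicated primary pool may be reclaimed.

    backed_up_to_copy_pool means backed up via BACKUP STGPOOL to a
    non-deduplicated copy pool; a copy in an active-data pool does not count.
    """
    if not meets_reclamation_criteria:
        return False                      # the usual criteria always apply
    if dedup_requires_backup:
        return backed_up_to_copy_pool     # YES (the default): backup is required first
    return True                           # NO: the usual criteria alone decide


# Default setting (YES): a volume that is not yet backed up is held back.
print(volume_may_be_reclaimed(True, False, True))
# Setting NO: the same volume becomes eligible for reclamation.
print(volume_may_be_reclaimed(False, False, True))
```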

Changed Commands/Output
Delete volume, delete filespace
- The count of objects deleted is the number of chunks deleted, not the number of actual files. This may not match the number of objects shown by a q occ command for a filespace with deduplicated data.
Q opt
- Shows whether a file to be deduplicated first requires a backup of the file. (DedupRequiresBackup default Yes)

Changed Commands
Q pr
- Identify duplicate data processes are different from other TSM server processes. After they finish processing all available files, they go into idle state.

Questions?