Effective Planning and Use of TSM V6 Deduplication


Effective Planning and Use of IBM Tivoli Storage Manager V6 Deduplication
08/17/
Authors: Jason Basler, Dan Wolfe

Document Location

This is a snapshot of an on-line document. Paper copies are valid only on the day they are printed. The document is stored at the following location:

Revision History

| Revision Number | Revision Date | Summary of Changes |
| | /17/12 | Initial publication |

Disclaimer

The information contained in this document is distributed on an "as is" basis without any warranty either expressed or implied. This document has been made available as part of the IBM developerWorks wiki, and is hereby governed by the terms of use of the wiki as defined at the following location:

Contents

Document Location
Revision History
Disclaimer
Contents
Introduction
Overview
Description of deduplication technology
Data reduction and data deduplication
Server-side and client-side deduplication
Pre-requisites for configuring TSM deduplication
Choosing between TSM deduplication and appliance deduplication
Conditions for effective use of TSM deduplication
Traditional TSM architectures compared with deduplication architectures
Examples of appropriate use of TSM deduplication
Data characteristics for effective deduplication
When is it not appropriate to use TSM deduplication?
Primary storage of backup data is on VTL or physical tape
No flexibility with the backup processing window
Restore performance considerations
Resource requirements for TSM deduplication
Database and log size requirements
TSM database capacity estimation
TSM database log size estimation
Estimating capacity for deduplicated storage pools
Estimating storage pool capacity requirements
Hardware recommendations and requirements
Hardware requirements for TSM client deduplication
Implementation guidelines
Deciding between client and server deduplication
TSM Deduplication configuration recommendations
Recommendations for deduplicated storage pools
Recommended options for deduplication
Best practices for ordering backup ingestion and data maintenance tasks
Estimating deduplication savings
Factors that influence the effectiveness of deduplication
Characteristics of the data
Impacts from backup strategy decisions
Interaction of compression and deduplication
How deduplication and compression interact with TSM
Considerations related to compression when choosing between client-side and server-side deduplication
Understanding the TSM deduplication tiering implementation
Controls for deduplication tiering
The impact of tiering to deduplication storage reduction
Client controls that optimize deduplication efficiency
What kinds of savings can I expect for different application types
IBM DB
Microsoft SQL
Oracle
VMware
How to determine deduplication results
Simple TSM Server Queries
QUERY STGPOOL
Other server queries affected by deduplication
TSM client reports
TSM deduplication report script

1 Introduction

Data deduplication is a technology that removes redundant data to reduce the storage capacity requirement for retaining the data. When deduplication technology is applied to data protection, it can provide a highly effective means of reducing the overall cost of a data protection solution. Tivoli Storage Manager introduced deduplication technology beginning with TSM V6.1. This document describes the benefits of deduplication and provides guidance on how to make effective use of the TSM deduplication feature as part of a well-designed data protection solution.

Following are key points regarding TSM deduplication:

- TSM deduplication is an effective tool for reducing the overall cost of a backup solution.
- Additional resources (DB capacity, CPU, and memory) must be configured for a TSM server that is configured with TSM deduplication. However, when properly configured, the reduction in storage pool capacity results in a significant cost reduction.
- Cost reduction is the result of data reduction. Deduplication is just one of several methods that TSM provides for data reduction (such as progressive incremental backup). The goal is overall data reduction when all of the techniques are combined, rather than a focus on the deduplication ratio alone.
- TSM deduplication is an appropriate data reduction method for many situations, but some environments may benefit more by using other technologies such as appliance/hardware deduplication.

This document is intended to provide guidance specific to the use of TSM deduplication. The document does not provide comprehensive instruction and guidance for the administration of TSM, and should be used in addition to the TSM product documentation.

1.1 Overview

Description of deduplication technology

Deduplication technology uses a computational technique to detect patterns within data that appear multiple times within the scope of a collection of data. For the purposes of this document, the collection of data consists of TSM backup, archive, and HSM data (all of these types of data will be referred to as backup data throughout this document). The patterns that are detected are represented as a hash value that is much smaller than the original pattern, specifically 20 bytes. Except for the original instance of the pattern, subsequent instances of the chunk are referenced by the hash value. As a result, for a pattern that appears many times throughout a given collection of data, significant reduction in storage can be achieved.

Unlike compression, deduplication can take advantage of a pattern that occurs multiple times within a collection of data. With compression, a single instance of a pattern is represented by a smaller amount of data that is used to algorithmically recreate the original data pattern. Compression cannot take advantage of data redundancy for patterns that reoccur throughout the collection of data, and this significantly reduces the potential reduction capability. However, compression can be combined with deduplication to take advantage of both techniques and further reduce the required amount of data storage beyond what would be required by using just one technique or the other.
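The idea of storing one instance of a pattern and referencing later instances by hash can be illustrated with a minimal Python sketch. It is not the TSM algorithm: the fixed-size chunking, the use of SHA-1 (chosen here only because it produces a 20-byte digest), and the names `deduplicate` and `rehydrate` are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Assumption: fixed-size chunks and SHA-1 are illustrative choices only;
    TSM uses a proprietary, variable-size chunking algorithm.
    """
    store = {}        # digest -> chunk bytes (first instance only)
    references = []   # ordered digests needed to rebuild the data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha1(chunk).digest()  # 20-byte hash value
        store.setdefault(digest, chunk)        # later duplicates are not stored again
        references.append(digest)
    return store, references


def rehydrate(store, references) -> bytes:
    """Rebuild the original data from the chunk store and reference list."""
    return b"".join(store[d] for d in references)


data = b"ABCDABCDABCDWXYZ"
store, refs = deduplicate(data)
restored = rehydrate(store, refs)
```

In this toy input, three of the four chunks are identical, so only two chunks are stored while four references are kept. The savings in a real system depend on how much larger the chunks are than the 20-byte references.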

How does TSM perform deduplication

TSM uses a proprietary algorithm to analyze variable-sized, contiguous segments of data, called chunks, for patterns that are likely to be duplicated within the same TSM storage pool. This process is explained in more detail in a later section in this document. The implementation of TSM deduplication applies only to FILE device class (sequential-access disk) storage pools, and can be used with primary, copy, or active-data pools.

Data reduction and data deduplication

Data deduplication creates substantial opportunity for reduction of storage capacity requirements for backup data. However, it is important to consider deduplication within the context of other data reduction techniques that are available. The deduplication ratio, or percentage of reduction, is commonly treated as the ultimate measurement of deduplication effectiveness. However, it is more important to consider the overall effectiveness of data reduction, including deduplication and other techniques that are available, rather than focus exclusively on deduplication effectiveness.

Unlike other backup products, TSM provides a substantial advantage in data reduction through its incremental-forever technology. Combined with deduplication, compression, exclusion of specified objects, and appropriate retention policies, TSM provides highly effective data reduction. Therefore, the business objectives should be clearly defined and understood when considering how to measure data reduction effectiveness. If reduction of storage and infrastructure costs is the ultimate goal, the focus will be on overall data reduction effectiveness, with data deduplication effectiveness as one component.
The following table provides a summary of the data reduction technologies that TSM offers:

| | Client compression | Incremental forever | Subfile backup | Deduplication |
| How data reduction is achieved | Client compresses files | Client only sends changed files | Client only sends changed regions of a file | Eliminates redundant data chunks |
| Conserves network bandwidth? | Yes | Yes | Yes | When client-side deduplication is used |
| Data supported | Backup, archive, HSM, API | Backup | Backup (Windows only) | Backup, archive, HSM, API (HSM supported only for server-side deduplication) |
| Scope of data reduction | Redundant data within same file on client node | Files that do not change between backups | Unchanged regions within previously backed up files | Redundant data from any data in storage pool |
| Avoids storing identical files renamed, copied, or relocated on client node? | No | No | No | Yes |
| Removes redundant data for files from different client nodes? | No | No | No | Yes |
| Can be used with any type of storage pool configuration? | Yes | Yes | Yes | No |

Server-side and client-side deduplication

TSM provides two options for performing deduplication: client-side and server-side deduplication. Both methods use the same algorithm to identify redundant data; however, the "when" and "where" of the deduplication processing differ.

Server-side deduplication

With server-side deduplication, all of the processing of redundant data occurs on the TSM server, after the data has been backed up. Server-side deduplication is also called target-side deduplication. The key characteristics of server-side deduplication are:

- Duplicate data is identified after backup data has been transferred to the storage pool volume.
- The duplicate identification processing must run regularly on the server, and will consume TSM server CPU and TSM database resources.
- Storage pool data reduction is not realized until data from the deduplication storage pool is moved to another storage pool volume, usually through a reclamation process, but this can also occur during a TSM MOVE DATA process.

Client-side deduplication

Client-side deduplication processes the redundant data during the backup process on the host system where the source data is located. The net results of deduplication are virtually the same as with server-side deduplication, except that the storage savings are realized immediately, since only the unique data needs to be sent to the server in its entirety. Duplicate data requires only a small signature to be sent to the TSM server. Client-side deduplication is especially effective when it is important to conserve bandwidth between the TSM client and server.

Client deduplication cache

Although it is necessary for the backup client to check in with the server to determine whether a chunk is unique or a duplicate, the amount of data transfer is small. The client must query the server for each chunk of data that is processed. The overhead associated with this query process can be reduced substantially by configuring a cache on the client, which allows previously discovered chunks on the client (during the backup session) to be identified without a query to the TSM server. For the backup-archive client (including VMware backup), it is recommended to always configure a cache when using client-side deduplication. For applications that use the TSM API, the deduplication cache should not be used due to the potential for backup failures caused by the cache being out of sync with the TSM server. If multiple, concurrent TSM client sessions are configured (such as with a TSM for VMware vStorage backup server), there must be a separate cache configured for each session.
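The effect of a client-side cache on the number of server queries can be sketched in a few lines of Python. Everything here is hypothetical: `FakeServer`, `has_chunk`, `store_chunk`, and `backup` are illustrative names and do not correspond to any TSM client or server interface, and SHA-1 is again used only as a convenient 20-byte signature.

```python
import hashlib


class FakeServer:
    """Hypothetical stand-in for the server's chunk index (not a TSM API)."""

    def __init__(self):
        self.known = set()
        self.queries = 0

    def has_chunk(self, digest) -> bool:
        self.queries += 1          # every lookup costs one round trip
        return digest in self.known

    def store_chunk(self, digest, chunk) -> None:
        self.known.add(digest)


def backup(chunks, server, cache=None) -> int:
    """Send only unique chunks; return the number of chunk bytes transferred."""
    sent_bytes = 0
    for chunk in chunks:
        digest = hashlib.sha1(chunk).digest()
        if cache is not None and digest in cache:
            continue               # known duplicate: no server query needed
        if not server.has_chunk(digest):
            server.store_chunk(digest, chunk)
            sent_bytes += len(chunk)
        if cache is not None:
            cache.add(digest)      # remember that the server holds this chunk
    return sent_bytes


chunks = [b"aaaa", b"bbbb", b"aaaa", b"aaaa", b"bbbb"]

no_cache_server = FakeServer()
sent_without_cache = backup(chunks, no_cache_server)

cached_server = FakeServer()
sent_with_cache = backup(chunks, cached_server, cache=set())
```

Both runs transfer the same amount of chunk data, but the cached run queries the server only once per distinct chunk. The sketch also shows why a stale cache is risky: a cached digest is trusted without asking the server, so if the server no longer holds that chunk the client would skip data that needs to be sent.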

1.1.4 Pre-requisites for configuring TSM deduplication

This section provides a general description of pre-requisites when using TSM deduplication. For a complete list of pre-requisites refer to the TSM administrator documentation.

Pre-requisites common to client and server-side deduplication

- The destination storage pool must be of type FILE (sequential disk).

Pre-requisites specific to client-side deduplication

When configuring client-side TSM deduplication, the following requirements must be met:

- The client and server must be at version or later. The latest maintenance version should always be used.
- The client must have the client-side deduplication option enabled (DEDUPLICATION YES).
- The server must enable the node for client-side deduplication with the DEDUP=CLIENTORSERVER parameter using either the REGISTER NODE or UPDATE NODE commands.
- The target storage pool must be a deduplication-enabled storage pool.
- Files must be bound to a correct management class whose destination is a deduplication-enabled storage pool.
- Files must not be excluded from client-side deduplication processing (by default all files are included). See the exclude.dedup client option for details.
- Files must be larger than 2 KB, and transactions must be below the value that is specified by the clientdeduptxnlimit option.

The following TSM features are incompatible with TSM client-side deduplication:

- Client encryption
- LAN-free/storage agent
- UNIX HSM client
- Subfile backup
- Simultaneous storage pool write

Choosing between TSM deduplication and appliance deduplication

Deduplication of backup data can also be accomplished by using a deduplicating storage device in the TSM storage pool hierarchy. Virtual Tape Libraries (VTLs) such as IBM's ProtecTIER and EMC's Data Domain provide deduplication capability at the storage device level. NAS devices are also available that provide NFS or CIFS mounted storage that removes redundant data through deduplication.

A choice should be made between TSM deduplication and storage appliance deduplication. Although it is possible to use both deduplication techniques together, it would result in inefficient use of resources. For a deduplicating VTL, the TSM storage pool data would need to be rehydrated before moving to the VTL (as with any tape device), and there would be no data reduction as a result of the TSM deduplication. For a deduplicating NAS device, a FILE device type could be created on the NAS. However, since the data is already deduplicated by TSM, there would be little to no additional data reduction possible by the NAS device.

Factors to consider when deciding between TSM and appliance deduplication

There are three major factors to consider when deciding which deduplication technology to use:

- Scale
- Scope
- Cost

Scale

The software implementation of TSM deduplication makes heavy use of TSM database transactions and also has an impact on daily server processes such as reclamation and storage pool backup. For a specific TSM server hardware configuration (for example, TSM database disk speed, processor and memory capability, and storage pool device speeds), there is a practical limit to the amount of data that can be backed up using deduplication. Deduplication appliances have dedicated resources for deduplication processing and do not have a direct impact on TSM server performance and scalability. Therefore, if the scale of data to back up exceeds the recommended maximum of 300TB of source data, then appliance deduplication should be considered. Source data refers to the original non-deduplicated backup data and all retained versions.
In addition to the scale of data stored, the scale of the daily amount of data backed up will also have a practical limit with TSM, currently 3-4TB of backup data per day (per TSM instance). Although more data can be backed up, post-processing such as reclamation (for server-side deduplication) and other operations such as storage pool backup to tape will be a limiting factor. Deduplicating appliances have far greater throughput capability due to the dedicated resources for deduplication processing, and are limited only by the throughput capacity.

Scope

The scope of TSM deduplication is limited to a single TSM server instance and more precisely within a TSM storage pool. A single, shared deduplication appliance can provide deduplication across multiple TSM servers.

Cost

TSM deduplication functionality is embedded in the product without an additional software license cost. It is important to consider that hardware resources must be appropriately sized and configured. Additional expense should be anticipated when planning a TSM server configuration that will be used with deduplication. However, these additional costs can easily be offset by the savings in disk storage. Also, the software license costs are reduced when capacity-based pricing is in effect. Deduplication appliances are priced for the performance and capability that they provide, and generally are considered more expensive per GB than the hardware requirements for TSM native deduplication. A detailed cost comparison should be done to determine the most cost-effective solution.
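The scale guidance discussed under "Scale" (no more than 300TB of source data and 3-4TB of daily backup data per TSM instance) can be expressed as a simple planning check. This is a rough sketch only: the function name is hypothetical, and using 4TB as the daily ceiling is an assumption that takes the upper end of the stated 3-4TB range.

```python
MAX_SOURCE_TB = 300   # recommended maximum source data per TSM instance
MAX_DAILY_TB = 4      # assumption: upper end of the 3-4TB per day guidance


def within_tsm_dedup_scale(source_tb: float, daily_tb: float) -> bool:
    """Return True if the workload fits the recommended TSM deduplication scale."""
    return source_tb <= MAX_SOURCE_TB and daily_tb <= MAX_DAILY_TB


small_site = within_tsm_dedup_scale(source_tb=64, daily_tb=0.8)
large_site = within_tsm_dedup_scale(source_tb=400, daily_tb=2)
busy_site = within_tsm_dedup_scale(source_tb=100, daily_tb=6)
```

A workload that fails this check is a candidate for appliance deduplication or for splitting across multiple TSM instances. Scope and cost still need to be weighed separately.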

1.2 Conditions for effective use of TSM deduplication

Although TSM deduplication provides a cost-effective and convenient method for reducing the amount of disk storage required for backups, there are specific conditions that can provide the most benefit when using TSM deduplication. Conversely, there are conditions where TSM deduplication will not be effective and in fact may reduce the efficiency of a backup operation.

Conditions that lead to effective use of TSM deduplication include the following:

- Need for reduction of the disk space required for backup storage.
- Need for remote backups over limited bandwidth connections.
- Use of TSM node replication for disaster recovery across geographically dispersed locations.
- Total amount of backup data and data backed up per day are within the recommended limits of less than 300TB total and 3-4TB per day.
- Either a disk-to-disk backup should be configured (where the final destination of backup data is on a deduplicating disk storage pool), or data should reside in the FILE storage pool for a significant time (e.g., 30 days), or until expiration. The deduplication storage pools should not be used as a temporary staging pool before moving to tape or another non-deduplicating storage pool since this can be highly inefficient.
- Backup data should be a good candidate for data reduction through deduplication. This topic is covered in greater detail in later sections.
- High performance disk must be used for the TSM database to provide acceptable TSM deduplication performance.

Traditional TSM architectures compared with deduplication architectures

A traditional TSM architecture ingests data into disk storage pools, and moves this data to tape on a frequent basis to maintain adequate free space on disk for continued ingestion. An architecture that includes deduplication changes this model to store the primary copy of data in a sequential file storage pool for its entire life cycle. Deduplication provides enough storage savings to make keeping the primary copy on disk an affordable possibility. Tape storage pools still have a place in this architecture for maintaining a secondary storage pool backup copy for disaster recovery purposes. Other architectures are possible where data remains in deduplicated storage pools for only a portion of its life cycle, but this requires reconstructing the deduplicated objects and can defeat the purpose of spending the processing resources that are required to deduplicate the data.

Tip: Avoid architectures where data is moved from a deduplicated storage pool to a non-deduplicated storage pool, which will force the deduplicated data to be reconstructed and lose the storage savings that were previously gained.

1.2.2 Examples of appropriate use of TSM deduplication

This section contains examples of TSM architectures that can make the most effective use of TSM deduplication.

Deduplication with a secondary storage pool backup architecture

In this example the primary storage pool is a file-sequential disk storage pool configured for TSM deduplication. The deduplication storage pool is backed up to a tape library copy storage pool.

Deduplication with node replication copy

The TSM 6.3 release provides a node replication capability, which allows for an alternative architecture where deduplicated data is replicated to a second server in an incremental fashion that takes advantage of deduplication, and avoids reconstructing the data.

Disk-to-disk backup

Disk-to-disk backup refers to the scenario where the preferred backup storage device is disk-based, as opposed to tape or a virtual tape library (VTL). Disk-based backup has become more popular as the unit cost of disk storage has fallen. It has also become more common as companies distinguish between backup data, which is kept for a relatively short amount of time, and archive data, which has long term retention. Disk-to-disk backup still requires a backup of the storage pool data, and the backup or copy destination may be tape or disk. However, with disk-to-disk backup, the primary storage pool data remains on disk until it expires. A significant reduction of disk storage can be achieved if the primary storage pool is configured for deduplication.

Data characteristics for effective deduplication

When considering the use of TSM deduplication, you should assess whether the characteristics of the backup data are appropriate for deduplication. A more detailed description of data characteristics for deduplication is provided in the section on estimating deduplication efficiency. General types of structured and unstructured data are good candidates for deduplication, but if your backup data consists mostly of unique binary images or encrypted data, you may wish to exclude these data types from a management class that uses a deduplicated storage pool.

1.3 When is it not appropriate to use TSM deduplication?

TSM deduplication can provide significant benefits and cost savings, but it does not apply to all situations. The following situations are not appropriate for using TSM deduplication:

Primary storage of backup data is on VTL or physical tape

Movement to tape requires rehydration of the deduplicated data. This takes extra time and requires processing resources. If regular migration to tape is required, the benefits of using TSM deduplication may be reduced, since the goal is to reduce disk storage as the primary location of the backup data.

No flexibility with the backup processing window

TSM deduplication processing requires additional resources, which can extend backup windows or server processing times for daily backup activities. For example, a duplicate identification process must run for server-side deduplication. Additional reclamation activity is required to remove the duplicate data from a storage pool after the duplicate identification processing completes. For client-side deduplication, the client backup speed will generally be reduced for local clients (remote clients may not be impacted if there is a bandwidth constraint). If the backup window has already reached the limit for service level agreements, TSM deduplication could possibly impact the backup window further unless careful planning is done.

Restore performance considerations

Restore performance from deduplicated storage pools is slower than from a comparable disk storage pool that does not use deduplication. However, restore from a deduplicated storage pool can compare favorably to restore from tape devices for certain workloads. If fastest restore performance from disk is a high priority, then restore performance benchmarking should be done to determine whether the effects of deduplication can be accommodated. The following table compares the restore performance of small and large object workloads across several storage scenarios.

| Storage pool type | Small object workload | Large object workload |
| Tape | Typically slower due to tape mounts and seeks | Typically faster due to streaming capabilities of modern tape drives |
| Non-deduplicated disk | Typically faster due to absence of tape mounts and quick seek times | Comparable to or slightly slower than tape |
| Deduplicated disk | Faster than tape, slower than non-deduplicated disk | Slowest since data must be rehydrated, when compared to tape which is fast for streaming large objects that are not spread across many tapes |

2 Resource requirements for TSM deduplication

TSM deduplication provides significant benefits as a result of its data reduction technology, particularly when combined with other data reduction techniques available with TSM. However, the use of deduplication in TSM adds requirements for hardware and database/log storage, which are essential for a successful implementation. When configuring TSM to use deduplication, you must ensure that proper resources have been allocated to support the use of the technology. The resources include hardware requirements necessary to meet the additional processing performed during deduplication, additional storage requirements for handling the TSM database records used to store the deduplication catalog, and additional storage requirements for the TSM server database logs.

The TSM internal database plays a central role in enabling the deduplication technology. Deduplication requires additional database capacity to be available. In addition, there is a significant increase in the frequency of references to records in the database during many TSM operations including backup, restore, duplicate identification, and reclamation. These demands on the database require that the database disk storage be capable of sustaining higher rates of I/O operations than would be required without the use of deduplication. As a result, planning for resources used by the TSM database is critical for a successful deduplication deployment. This section guides you through the estimation of resource requirements to support TSM deduplication.

2.1 Database and log size requirements

TSM database capacity estimation

Use of TSM deduplication significantly increases the capacity requirements of the TSM database. This section provides some guidelines for estimating the capacity requirements of the database. It is important to plan ahead for the database capacity so an adequate amount of higher-performing disk can be reserved for the database (refer to the next section for performance requirements). The estimation guidelines are approximate, since actual requirements will depend on many factors including ones that cannot be predicted ahead of time (for example, a change in the data backup rate, the exact amount of backup data, and other factors).

Planning database space requirements

The use of deduplication in TSM requires more storage space in the TSM server database than without the use of deduplication. One important point to note is that when using deduplication, the TSM database grows proportionally to the amount of data that is stored in deduplicated storage pools. This is because each chunk of data that is stored in a deduplicated storage pool is referenced by an entry in the database. Without deduplication, each backed-up object (typically a file) is referenced by a database entry, and the database grows proportionally to the number of objects that are stored. With deduplication, the database grows proportionally to the total amount of data backed up.

The document "Determining the impact of deduplication on TSM server database and storage pools" provides detailed information for estimating the amount of disk storage that will be required for your TSM database. The document provides formulas for estimating database size based on the volume of data to be stored. As a simplified rule of thumb for a rough estimate, you can plan for 150GB of database storage for every 10TB of data that will be protected in deduplicated storage pools.
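The rule of thumb above (150GB of database storage per 10TB of protected data) reduces to a one-line calculation. The sketch below is only a rough planning aid, and the function name is hypothetical; the referenced sizing document should be used for a detailed estimate.

```python
DB_GB_PER_10TB = 150  # rule of thumb from the text above


def estimate_db_gb(protected_tb: float) -> float:
    """Rough TSM database size (GB) for data held in deduplicated storage pools."""
    return protected_tb * DB_GB_PER_10TB / 10


db_gb_for_40tb = estimate_db_gb(40)
db_gb_for_300tb = estimate_db_gb(300)
```

For example, 40TB of protected data suggests roughly 600GB of database storage, and the recommended 300TB maximum suggests roughly 4.5TB.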

2.1.2 TSM database log size estimation

The use of deduplication adds requirements for the TSM server database, active log, and archive log storage. Properly sizing the storage capacity for these components is essential for a successful implementation of deduplication.

Planning active log space requirements

The database active log stores information about database transactions that are in progress. With deduplication, transactions can run longer, requiring more space to store the active transactions.

Tip: Use the maximum allowed size for the active log, which is 128GB.

Planning archive log space requirements

The archive log stores older log files for completed transactions until they are cleaned up as part of the TSM server database backup processing. The file system holding the archive log must be given sufficient capacity to avoid running out of space, which can cause the TSM server to be halted. Space is freed in the archive log every time a full backup is performed of the TSM server's database. See the document on "Sizing the TSM archive log" for detailed information on how to carefully calculate the space requirements for the TSM server archive log.

Tip: A file system with 500GB of free space has proven to be more than adequate for a large-scale TSM server that ingests several terabytes a day of new data into deduplicated storage pools and performs a full TSM database backup once a day.

2.2 Estimating capacity for deduplicated storage pools

TSM deduplication ratios typically range from 2:1 (50% reduction) to 15:1 (93% reduction), and are data dependent. Lower ratios are associated with backups of unique data (such as progressive incremental data), and higher ratios are associated with backups that are repeated, such as repeated full backups of databases or virtual machine images. Mixtures of unique and repeated data will result in ratios within that range.
If you aren't sure of what type of data you have and how well it will reduce, use 3:1 for planning purposes when comparing with non-deduplicated TSM storage pool occupancy. This ratio corresponds to an overall data reduction ratio of over 15:1 when factoring in the data reduction benefits of progressive incremental backups.

Estimating storage pool capacity requirements

Delayed release of storage pool data

Due to the latency for deletion of data chunks with multiple references, there is a need for transient storage associated with data chunks that must remain in a storage pool volume even though their associated file or object is deleted or expired. As a result of this behavior, storage pool capacity sizing must account for some percentage of data that is retained because of references by other objects. This latency results in the delayed deletion of a storage pool volume if it contains a single chunk that is still being referenced.

Delayed effect of post-identification processing

Storage reduction does not always occur immediately with TSM deduplication. In the case of server-side deduplication, sufficient storage pool capacity is required to ingest the full amount of daily backup data. With server-side deduplication, removal of redundant data does not occur until after storage pool reclamation completes, which in turn may not complete until after a storage pool backup is done. If client-side deduplication is used, this delay will not apply. Sufficient storage pool free capacity must be maintained to accommodate continued backup ingestion.

Estimating storage pool capacity requirements

You can roughly estimate storage pool capacity requirements for a deduplicated storage pool using the following technique:

1. Estimate the base size of the source data.
2. Estimate the daily backup size, using an estimated change and growth rate.
3. Determine retention requirements.
4. Estimate the total amount of source data by factoring in the base size, daily backup size, and retention requirements.
5. Apply the deduplication ratio factor.
6. Uplift the estimate to consider transient storage pool usage.

The following example illustrates the estimation method:

| Parameter | Value | Notes |
| Base size of the source data | 40TB | Data from all clients that will be backed up to the deduplicated storage pool |
| Estimated daily change rate | 2% | Includes new and changed data |
| Retention requirement | 30 days | |
| Estimated deduplication ratio | 3:1 | 3:1 assumes compression is used with client-side deduplication |
| Uplift for transient storage pool volumes | 30% | |

Computed values:

| Parameter | Computation | Result |
| Base source data | 40TB | 40TB |
| Estimated daily backup amount | 40TB * 0.02 change rate | 0.8TB |
| Total changed data retained | 30 * 0.8TB daily backup | 24TB |
| Total data retained | 40TB base data + 24TB retained | 64TB |
| Retained data after deduplication (3:1 ratio) | 64TB / 3 | 21.3TB |
| Uplift for delays in chunk deletion (30%) | 21.3TB * 1.3 | 27.69TB |
| Add full daily backup amount | 27.69TB + 0.8TB | 28.49TB |
| Round up: Storage pool capacity requirement | | 29TB |
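The estimation steps above can be captured in a short Python function so that the same arithmetic can be rerun with different inputs. The function and parameter names are hypothetical. The sketch carries full precision through each step, so its unrounded result differs slightly from the worked example, which rounds the post-deduplication figure to 21.3TB before applying the uplift.

```python
import math


def estimate_pool_capacity_tb(base_tb: float,
                              daily_change_rate: float,
                              retention_days: int,
                              dedup_ratio: float,
                              uplift: float) -> float:
    """Rough deduplicated storage pool capacity estimate, in TB."""
    daily_backup_tb = base_tb * daily_change_rate            # daily backup amount
    total_retained_tb = base_tb + retention_days * daily_backup_tb
    after_dedup_tb = total_retained_tb / dedup_ratio         # apply deduplication ratio
    with_uplift_tb = after_dedup_tb * (1 + uplift)           # transient chunk retention
    return with_uplift_tb + daily_backup_tb                  # room for one full daily ingest


required_tb = estimate_pool_capacity_tb(
    base_tb=40, daily_change_rate=0.02, retention_days=30,
    dedup_ratio=3.0, uplift=0.30)
rounded_tb = math.ceil(required_tb)   # round up to whole terabytes
```

With the example inputs, the rounded result matches the 29TB requirement computed in the table.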

2.3 Hardware recommendations and requirements

The use of deduplication requires additional processing, which increases the TSM server hardware requirements beyond what is required without the use of deduplication. The most critical hardware requirement when using deduplication is the I/O capability of the disk system that is used for the TSM database. You should begin by understanding the base hardware recommendations for the TSM server, which are described in the following documents: AIX, HPUX, Linux x86, Linux on Power, Linux on system Z, Solaris, Windows. Additional hardware recommendations are made in the TSM Version 6 deployment guide: TSM V6 Deployment Recommendations

Database I/O requirements

For optimal performance, fast disk storage is always recommended for the TSM database, as measured in terms of Input/Output Operations Per Second (IOPS). Due to the random access I/O patterns of the TSM database, minimizing the latency of operations that access the database volumes is critical for optimizing the performance of the TSM server. The large tables used for storing deduplication information in the TSM database place an even greater demand on disk storage that can handle a large number of IOPS. In general, systems based on solid-state disk technology and SAS/FC provide the best capabilities in terms of increased IOPS. Because the claims of disk manufacturers are not always reliable, we recommend measuring the actual IOPS of a disk system before implementing a new TSM database. Details about how to configure high-performing disk storage are beyond the scope of this document.

The following key points should be considered when configuring disk storage for the TSM database:

- The disk used for the TSM database should be configured according to best practices for a transactional database.
- Low-latency, enterprise-class disk devices or storage subsystems should be used for the TSM database.
- Disk devices or storage systems that are capable of a minimum of approximately 3000 IOPS are suggested for the TSM database disk device. An additional 1000 IOPS per TB of daily ingested data (pre-deduplication) should be considered. Lower-performing disk devices can be used, but performance may not be optimal. Refer to the Deduplication FAQs for an example configuration.
- Disk I/O should be distributed over as many disk devices and controllers as possible.
- The TSM database and logs should be configured on separate disk volumes (LUNs), and should not share disk volumes with the TSM storage pool or any other application or file system.

CPU

The use of deduplication requires additional CPU resources on the TSM server, particularly for performing the task of duplicate identification. You should consider using at least 8 processor cores (2.2 GHz or equivalent) in any TSM server that is configured for deduplication.
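The database IOPS guideline above (a base of approximately 3000 IOPS plus 1000 IOPS per TB of daily ingest) can be written as a one-line calculation. The following is a minimal sketch; the function name and the 4TB daily ingest figure are hypothetical and chosen only for illustration.

```python
def suggested_db_iops(daily_ingest_tb, base_iops=3000, iops_per_tb=1000):
    """Suggested TSM database disk IOPS for a given pre-deduplication daily ingest."""
    return base_iops + iops_per_tb * daily_ingest_tb


# Hypothetical server that ingests 4TB of backup data per day.
print(suggested_db_iops(4))
```

The result is a planning target to compare against the measured IOPS of a candidate disk system, not a guarantee of adequate performance.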

Memory

For the highest performance of a large-scale TSM server using deduplication, additional memory is recommended. The memory is used to optimize the frequent lookup of deduplication chunk information stored in the TSM database. A minimum of 64GB of system memory should be considered for TSM servers using deduplication. If the retained capacity of backup data grows, the memory requirement may need to be as high as 128GB. It is beneficial to monitor memory utilization on a regular basis to determine if additional memory is required.

Hardware requirements for TSM client deduplication

Client-side deduplication (and compression if used with deduplication) requires resources on the client system for processing. Prior to deciding to use client-side deduplication, you should verify that client systems have adequate resources available during the backup window to perform the deduplication processing. A suggested minimum CPU requirement is the equivalent of one 2.2 GHz CPU core per backup process with client-side deduplication. As an example, a system with a single-socket, quad-core, 2.2 GHz processor that is utilized 75% or less during the backup window would be a good candidate to use client-side deduplication.

3 Implementation guidelines

A successful implementation of TSM deduplication requires careful planning in the following areas:

- Implementing an appropriate architecture suitable for using deduplication
- Properly sizing your TSM server hardware and storage
- Configuring TSM following best practices for separating data ingestion and data maintenance tasks

3.1 Deciding between client and server deduplication

After you decide on an architecture using deduplication for your TSM server, you need to decide whether you will perform deduplication on the TSM clients, the TSM server, or using a combination of the two. The TSM deduplication implementation allows storage pools to manage deduplication performed by both clients and the TSM server. The server is optimized to only perform deduplication on data that has not been deduplicated by the TSM clients. Furthermore, duplicate data can be identified across objects regardless of whether the deduplication is performed on the client or server. These benefits allow for hybrid configurations that efficiently apply client-side deduplication to a subset of clients, and use server-side deduplication for the remaining clients.

Typically a combination of both client-side and server-side data deduplication is the most appropriate. Here are some further points to consider:

- Server-side deduplication is a two-step process of duplicate data identification followed by reclamation to remove the duplicate data.
- Client-side deduplication stores the data directly in a deduplicated format, reducing the need for the extra reclamation processing.
- Deduplication on the client can be combined with compression to provide the largest possible storage savings.
- Client-side deduplication processing can increase backup durations. Expect increased backup durations if network bandwidth is not restrictive. A doubling of backup durations is a reasonable estimate when using client-side deduplication in an environment that is not constrained by the network.
- Client-side deduplication can place a significant load on the TSM server in cases where a large number of clients are simultaneously driving deduplication processing. The load is a result of the TSM server processing duplicate chunk queries from the clients. Server-side deduplication, on the other hand, typically has a relatively small number of identification processes running in a controlled fashion.
- Client-side deduplication cannot be combined with LAN-free data movement using the Tivoli Storage Manager for SAN feature. If you are implementing one of TSM's supported LAN-free to disk solutions, then you can still consider using server-side deduplication.

Tips: Perform deduplication at the client in combination with compression in the following circumstances:

1. Your backup network speed is a bottleneck.
2. Increased backup durations can be tolerated, and the maximum storage savings is more important than having the fastest possible backup elapsed times.
3. The client does not typically send objects larger than 500GB in size, or client configuration options can be used to break up large objects into smaller objects. These options are discussed in a later section.

3.2 TSM Deduplication configuration recommendations

Recommendations for deduplicated storage pools

The TSM deduplication feature is turned on at the storage pool level. The TSM server can be configured with more than one deduplicated storage pool, but duplicate data will not be identified across different storage pools. In most cases, using a single large deduplicated storage pool is recommended. The following commands provide an example of setting up a deduplicated storage pool on the TSM server. Some parameters are explained in further detail to give the rationale behind the values used, and later sections build upon those settings.

Device class

A device class is used to define the storage that will be used for sequential file volumes by the deduplicated storage pool. Each of the directories specified should be backed by a separate file system, which corresponds to a distinct logical volume on the disk storage subsystem. By using multiple directories backed by different storage elements on the subsystem, the TSM round-robin implementation for volume allocation is able to achieve more throughput by spreading I/O across a large pool of physical disks.

Here are some considerations for parameters with the DEFINE DEVCLASS command:

- The mountlimit parameter limits the number of volumes that can be simultaneously mounted by all storage pools that use this device class. Typically client sessions sending data to the server use the most mount points, so you will want to set this parameter high enough to handle the expected number of simultaneous client sessions. This parameter needs to be set very high for deduplicated storage pools to avoid having client sessions and server processes waiting for available mount points.
  The setting is influenced by the numopenvolsallowed option, which is discussed in a later section. To estimate the setting of this parameter, use the following formula, where numprocs is the largest number of processes used for a data copy/movement task such as reclamation and migration:

  mountlimit = (numprocs * numopenvolsallowed) + max_backup_sessions + (restore_sessions * numopenvolsallowed) + buffer

- The maxcapacity parameter controls the size of each file volume that will be created for your storage pool. This parameter takes some planning. The goal is to avoid a volume size that is too small, which results in frequent end-of-volume processing and spanning of larger objects across multiple volumes, and also to avoid a volume size that is too large, so that enough writeable volumes are available to handle your expected number of client backup sessions. The following example shows a volume size of 100GB, which has proven to be optimal in many environments.

> define devclass dedupfile devtype=file mountlimit=150 maxcapacity=102400m directory=/tsmdedup1,/tsmdedup2,/tsmdedup3,/tsmdedup4,/tsmdedup5,/tsmdedup6,/tsmdedup7,/tsmdedup8
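The mountlimit formula above can be evaluated with a small helper. This is a sketch only: the function name is hypothetical, and the input values are illustrative assumptions rather than recommendations from this document.

```python
def estimate_mountlimit(numprocs, numopenvolsallowed, max_backup_sessions,
                        restore_sessions, buffer):
    """Evaluate the mountlimit estimation formula for a FILE device class."""
    return ((numprocs * numopenvolsallowed)
            + max_backup_sessions
            + (restore_sessions * numopenvolsallowed)
            + buffer)


# Hypothetical inputs: 4 reclamation processes, 10 open volumes per process,
# 50 concurrent backup sessions, 2 restore sessions, and a buffer of 10.
print(estimate_mountlimit(4, 10, 50, 2, 10))
```

Re-evaluating the formula whenever numopenvolsallowed or the number of maintenance processes changes helps keep the device class mount limit in step with those settings.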

Storage pools

The storage pool is the repository for deduplicated storage and uses the device class previously defined. An example command for defining a deduplicated storage pool is given below, with explanations for parameters that vary from defaults.

There are two methods for allocating volumes in a file-based storage pool. With the first method, volumes are pre-allocated and remain assigned to the same storage pool after they are reclaimed. The second method uses scratch volumes, which are allocated as needed, and return to the scratch pool once they are reclaimed. The examples below set up a storage pool using scratch volumes, as this approach is more convenient and has been shown in testing to distribute the load more efficiently across multiple storage containers within a disk subsystem.

- The deduplicate parameter is required to enable deduplication for the storage pool.
- The maxscratch parameter defines the maximum number of volumes that can be created for the storage pool. This parameter is used with the scratch method of volume allocation, and should otherwise be set to a value of 0 when using pre-allocated volumes. Each volume will have a size determined by the maxcapacity parameter for the device class. In our example, 100 volumes multiplied by 100GB per volume requires that 10TB of free space be available across the eight file systems used by the device class.
- The identifyprocess parameter is set to 0 to prevent duplicate identification processes from starting automatically. This supports scheduling when duplicate identification runs, which is described in more detail in a later section.
- The reclaim parameter is set to 100 to prevent automatic storage pool reclamation from running. This supports the best practice of scheduling when reclamation runs, which is described in more detail in a later section. The actual threshold used for reclamation is defined as part of the scheduled reclamation command, which is defined in a later section.
- The reclaimprocess parameter is set to a value higher than the default of 1, since a deduplicated storage pool requires a large volume of reclamation processing to keep up with the daily ingestion of new backups. The suggested value of 8 is likely to be sufficient for large-scale implementations, but you may need to further increase this setting.

> define stgpool deduppool dedupfile maxscratch=100 deduplicate=yes identifyprocess=0 reclaim=100 reclaimprocess=8

Policy settings

The final configuration step involves defining policy settings on the TSM server that allow data to ingest directly into the deduplicated storage pool that has been created. Policy requirements vary for each customer, but the following example shows a policy that retains extra backup versions for 30 days.

> define domain DEDUPDISK
> define policy DEDUPDISK POLICY1
> define mgmtclass DEDUPDISK POLICY1 STANDARD
> assign defmgmtclass DEDUPDISK POLICY1 STANDARD
> define copygroup DEDUPDISK POLICY1 STANDARD type=backup destination=deduppool VEREXISTS=nolimit VERDELETED=10 RETEXTRA=30 RETONLY=80
> define copygroup DEDUPDISK POLICY1 STANDARD type=archive destination=deduppool RETVER=30
> activate policyset DEDUPDISK POLICY1

3.2.2 Recommended options for deduplication

The server has several tuning options that control deduplication processing. The following table summarizes these options, and provides an explanation for those options for which we recommend overriding the default values.

DedupRequiresBackup
  Allowed values: Yes | No (Default: Yes)
  Recommended value: Default
  Explanation: This option delays the completion of server-side deduplication processing until after a secondary copy of the data has been made with storage pool backup. The use of copy storage pools is optional, so in cases where there will be no secondary copy storage pool, or when node replication will be used for the secondary copy, this option should be set to No. The TSM server offers many levels of protection, including the ability to create a secondary copy of your data. Creating a secondary copy is optional, but is always a best practice for any storage pool regardless of whether it is deduplicated.

ClientDedupTxnLimit
  Allowed values: Min: 32, Max: 1024 (Default: 300)
  Recommended value: Default
  Explanation: Specifies the largest object size in gigabytes that can be processed using client-side deduplication. This can be increased up to 1TB, but this does not guarantee that the TSM server will be able to process objects up to this size in all environments.

ServerDedupTxnLimit
  Allowed values: Min: 32, Max: 2048 (Default: 300)
  Recommended value: Default
  Explanation: Specifies the largest object size in gigabytes that can be processed using server-side deduplication. This can be increased up to 2TB, but this does not guarantee that the TSM server will be able to process objects up to this size in all environments.

DedupTier2FileSize
  Allowed values: Min: 20, Max: 9999 (Default: 100)
  Recommended value: Default
  Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, as changes will prevent matches between previously ingested backups and future backups.

DedupTier3FileSize
  Allowed values: Min: 90, Max: 9999 (Default: 400)
  Recommended value: Default
  Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, as changes will prevent matches between previously ingested backups and future backups.

NumOpenVolsAllowed
  Allowed values: Min: 3, Max: 999 (Default: 10)
  Recommended value: 20
  Explanation: This option controls the number of volumes that a process such as reclamation, or a client session, can hold open at the same time. A small increase to this option is recommended, and some trial and error may be needed. Note: The device class mount limit parameter may need to be increased if this option is increased.

EnableNasDedup
  Allowed values: Yes | No (Default: No)
  Recommended value: Default
  Explanation: If you are using NDMP backup of NetApp file servers in your environment, change this option to Yes.

Best practices for ordering backup ingestion and data maintenance tasks

A successful implementation of deduplication with TSM requires separating the tasks of ingesting client data and performing server data maintenance tasks into separate time windows. Furthermore, the server data maintenance tasks have an optimal ordering, and in some cases need to be performed without overlap to avoid resource contention problems. TSM has the ability to schedule all of these activities to follow these best practices. The recommended ordering is explained below, along with sample commands to implement these tasks through scheduling.

Consider using the following task sequence. Please note that the list focuses on those tasks pertinent to deduplication. Please consult the product documentation for additional commands which you may also need to include in the daily maintenance tasks.

1. Client data ingestion.
2. Create the secondary disaster recovery (DR) copy using the BACKUP STGPOOL command (optional).
3. The following tasks can run in parallel:
   a. Perform server-side duplicate identification by running the IDENTIFY DUPLICATES command. This processes data that was not already deduplicated on the clients.
   b. Create a DR copy of the TSM database by running the BACKUP DATABASE command. Following the completion of the database backup, the DELETE VOLHISTORY command can be used to remove older versions of database backups which are no longer required.
4. Perform node replication to create a secondary copy of the ingested data to another TSM server using the REPLICATE NODE command (optional).
5. Reclaim unused space from storage pool volumes that has been released through deduplication and inventory expiration using the RECLAIM STGPOOL command.
6. Back up the volume history and device configuration using the BACKUP VOLHISTORY and BACKUP DEVCONFIG commands.
7. Remove objects that have exceeded their allowed retention using the EXPIRE INVENTORY command.

Define scripts that run each required maintenance task

The following scripts, once defined, can be called by scheduled administrative commands. Here are a few points to note regarding these scripts:

- The storage pool backup script assumes you have already defined a copy storage pool named copypool, which uses tape storage.
- The database backup script requires a device class that typically also uses tape storage.
- The script for reclamation gives an example of how the parallel command can be used to simultaneously process more than one storage pool.
- The number of processes to use for identifying duplicates should not exceed the number of CPU cores available on your TSM server. This command also does not have a wait=yes parameter, so it is necessary to define a duration limit.
- If you have a large TSM database, you can further optimize the BACKUP DATABASE command by using multiple streams with TSM 6.3 and later.
- A deduplicated storage pool is typically reclaimed to a threshold lower than the default of 60 to allow more of the identified duplicate chunks to be removed. Some experimenting will be needed to find a value that can be completed within the available time. Tip: A reclamation setting of 40 or less is usually sufficient.
define script STGBACKUP "/* Run stg pool backups */"
update script STGBACKUP "backup stgpool DEDUPPOOL copypool maxprocess=10 wait=yes" line=020

define script DEDUP "/* Run identify duplicate processes */"
update script DEDUP "identify duplicates DEDUPPOOL numprocess=6 duration=660" line=010

set dbrecovery TAPEDEVC
define script DBBACKUP "/* Run DB backups */"
update script DBBACKUP "backup db devclass=tapedevc type=full wait=yes" line=010
update script DBBACKUP "backup volhistory" line=020
update script DBBACKUP "backup devconfig" line=030
update script DBBACKUP "delete volhistory type=dbbackup todate=today-7 totime=now" line=040

define script RECLAIM "/* Run stg pool reclamation */"
update script RECLAIM "parallel" line=010
update script RECLAIM "reclaim stgpool DEDUPPOOL threshold=40 wait=yes" line=020
update script RECLAIM "reclaim stgpool COPYPOOL threshold=60 wait=yes" line=030

define script EXPIRE "/* Run expiration processes. */"
update script EXPIRE "expire inventory resources=8 wait=yes" line=

Define schedules to run the data maintenance tasks

The TSM server can schedule commands to run; here, the scheduled action is to run the scripts that were defined in the previous section. The examples below give specific start times that have proven to be successful in environments where backups run from midnight until 07:00 AM on the same day. You will need to change the start times to appropriate values for your environment.

define schedule STGBACKUP type=admin cmd="run STGBACKUP" active=yes \
 desc="run all stg pool backups." startdate=today starttime=08:00:00 \
 duration=15 durunits=minutes period=1 perunits=day

define schedule DEDUP type=admin cmd="run DEDUP" active=no \
 desc="run identify duplicates." startdate=today starttime=11:00:00 \
 duration=15 durunits=minutes period=1 perunits=day

define schedule DBBACKUP type=admin cmd="run DBBACKUP" active=yes \
 desc="run database backup." startdate=today starttime=12:00:00 \
 duration=15 durunits=minutes period=1 perunits=day

define schedule RECLAIM type=admin cmd="run RECLAIM" active=yes \
 desc="reclaim space from storage pools." startdate=today starttime=14:00 \
 duration=15 durunits=minutes period=1 perunits=day

define schedule EXPIRATION type=admin cmd="run expire" active=yes \
 desc="run expiration." startdate=today starttime=18:00:00 \
 duration=15 durunits=minutes period=1 perunits=day
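When the start times are adapted to another environment, it is easy to break the intended ordering by accident. The following sketch checks that each schedule starts after the previous one and after the end of the backup window. It is a planning aid under simple assumptions, not a TSM feature: the helper name is hypothetical, all times are assumed to fall on the same day, and only start times are compared, so it does not confirm that a task has finished before the next one begins.

```python
from datetime import time

# Start times copied from the example schedules above, in their intended order.
SCHEDULE_STARTS = [
    ("STGBACKUP", time(8, 0)),
    ("DEDUP", time(11, 0)),
    ("DBBACKUP", time(12, 0)),
    ("RECLAIM", time(14, 0)),
    ("EXPIRATION", time(18, 0)),
]
BACKUP_WINDOW_END = time(7, 0)  # client backups run from midnight until 07:00


def out_of_order(starts, window_end):
    """Return (earlier, later) name pairs where a task does not start after its predecessor."""
    problems = []
    previous_name, previous_start = "client backups", window_end
    for name, start in starts:
        if start <= previous_start:
            problems.append((previous_name, name))
        previous_name, previous_start = name, start
    return problems


print(out_of_order(SCHEDULE_STARTS, BACKUP_WINDOW_END))
```

An empty list means the start times respect the listed order; any pair returned names the two tasks whose start times should be reviewed.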

4 Estimating deduplication savings

If you ask someone in the data deduplication business to give you an estimate of the amount of savings to expect for your specific data, the answer will often be "it depends". The reality is that TSM, like every other data protection product, cannot guarantee a certain level of deduplication because there are a variety of factors unique to your data that influence the results. Since deduplication requires computational resources, it is important to consider which environments and circumstances can benefit most from deduplication, and when other data reduction techniques may be more appropriate. What we can do is provide an understanding of the factors that influence deduplication effectiveness when using TSM, and provide some examples of observed behaviors for specific types of data, which can be used as a reference for planning purposes.

4.1 Factors that influence the effectiveness of deduplication

The following are factors that have an influence on how effectively TSM reduces the amount of data to be stored using deduplication.

Characteristics of the data

Uniqueness of the data

The first factor to consider is the uniqueness of the data. Much of the deduplication savings comes from repeated backups of the same objects. Some savings, however, result from having data in common with backups of other objects or even within the same object. The uniqueness of the data is the portion of an object that has never been stored by a previous backup. Duplicate data can be found within the same object, across different objects stored by the same client, and from objects stored by different clients.

Response to fingerprinting

The next factor is how data responds to the deduplication fingerprinting processing used by TSM. During deduplication, TSM breaks objects into chunks, which are examined to determine whether they have been previously stored. These chunks are variable in size and are identified using a process called fingerprinting.
The purpose of fingerprinting is to ensure that the same chunk will always be identified regardless of whether it shifts to different positions within the object between successive backups. The TSM fingerprinting implementation uses a probability-based algorithm for identifying chunk boundaries within an object. The algorithm strives to have all of the chunks created for an object average out in terms of size to a target average for all chunks. The actual size of each chunk is variable within the constraints that it must be larger than the minimum chunk size and cannot be larger than the object itself. The fingerprinting implementation results in average chunk sizes that vary for different kinds of data. For data that fingerprints to average chunk sizes significantly larger than the target average, the deduplication efficiency is more sensitive to changes. More details are given in the later section that discusses tiering.

Volatility of the data

The final factor is the volatility of the data. A significant amount of deduplication savings is a result of the fact that similar objects are backed up repeatedly over time. Objects that undergo only minor changes between backups will end up having a significant percentage of chunks that are unchanged since the last backup and hence do not need to be stored again. Likewise, an object can undergo a pattern of change that alters a large percentage of the chunks in the object. In these cases, there is very little savings realized by deduplication. It is important to note that this effect does not necessarily relate to the amount of data being written to an object. Instead, it is a factor of how pervasively the changes are scattered throughout the object. Some change patterns, such as appending new data at the end of an object, have a very favorable response with deduplication.

Examples of workloads that respond well to deduplication

The following are general examples of backup workloads that respond well to deduplication:

- Backup of workstations with multiple copies or versions of the same file.
- Backup of objects with regions that repeat the same chunks of data (for example, regions with zeros).
- Multiple full backups of different versions of the same database.
- Operating system files across multiple systems. For example, Windows systemstate backup is a common source of duplicate data. Another example is virtual machine image backups with TSM for Virtual Environments.
- Backup of workstations with versions or copies of the same application data (for example, documents, presentations, or images).
- Periodic full backups taken of systems using a new nodename for the purposes of creating an out-of-cycle backup with special retention criteria.

Deduplication efficiency of some data types

The following table shows some common data types along with their expected deduplication efficiency.

| Data type | Deduplication efficiency |
|---|---|
| Audio (mp3, wma), Video (mp4), Images (jpeg) | Poor |
| Human generated/consumer data: text documents, source code | Good |
| Office documents: spreadsheets, presentations | Poor |
| Common operating system files | Good |
| Large repeated backups of databases (Oracle, DB2, etc) | Good |
| Objects with embedded control structures | Poor |
| TSM data stored in non-native storage pools (for example, NDMP data) | None |
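To make the effect of fingerprinting and change patterns described in this section more concrete, the following toy sketch splits a byte string at content-defined boundaries and compares how many chunks of a changed object match chunks already stored. It is not the TSM fingerprinting algorithm: hashing a sliding window with SHA-1, the window size, the boundary mask, and the minimum chunk size are all assumptions chosen only to illustrate the idea that boundaries depend on content rather than position.

```python
import hashlib
import random
from typing import List


def split_into_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                      min_size: int = 32) -> List[bytes]:
    """Split data where a hash of the trailing window has its low bits all zero."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < min_size:
            continue  # enforce the minimum chunk size
        digest = hashlib.sha1(data[end - window:end]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # remainder after the last boundary
    return chunks


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of chunks in the new object that already exist in the old object."""
    stored = {hashlib.sha1(c).digest() for c in split_into_chunks(old)}
    new_chunks = split_into_chunks(new)
    return sum(hashlib.sha1(c).digest() in stored for c in new_chunks) / len(new_chunks)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))

# Change pattern 1: new data appended at the end of the object.
appended = original + b"new records appended at the end"

# Change pattern 2: one byte changed in every 50, scattered through the object.
scattered = bytearray(original)
for pos in range(0, len(scattered), 50):
    scattered[pos] ^= 0xFF

appended_share = shared_fraction(original, appended)
scattered_share = shared_fraction(original, bytes(scattered))
print(f"appended: {appended_share:.2f}  scattered: {scattered_share:.2f}")
```

Both modified objects differ from the original by only a small number of bytes, yet the appended object reuses nearly all stored chunks while the scattered changes touch almost every chunk, which is the volatility effect described above.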

4.1.2 Impacts from backup strategy decisions

The gains realized from deduplication are also influenced by two implementation choices in how backups are taken and managed.

Backup model

For TSM, a very common backup model is the use of incremental-forever backups. In this case, each subsequent backup achieves significant storage savings by not having to send unchanged objects. These objects that are not re-sent also do not need to go through deduplication processing, which turns out to be a very efficient method of reducing data. On the other hand, other data types use a backup model that always runs a full backup, or a periodic full backup. In these cases, there will typically be significant reductions in the data to be stored, which is a result of the significant duplication across subsequent backups of similar objects. The following table illustrates some examples of deduplication savings between full and incremental backup models:

Does deduplication offer savings in the case where...

File-level backups are taken using the backup-archive client.
  Full backup: Yes when:
    - There is data in common from other nodes, such as operating system files.
    - Periodic full backups are taken for a system. This is occasionally performed using a different node name for the purpose of establishing a different retention scheme.
  Incremental backup:
    - Yes for files that are being re-sent due to changes (depends on volatility).
    - No for new files that are being sent for the first time (depends on uniqueness).

Database backups are taken using a data protection client.
  Full backup:
    - Yes when subsequent full backups are taken (depends on volatility).
    - No when the first backup is taken. Databases are typically unique.
  Incremental backup: Typically no. The database incremental mechanism is only sending changed regions of the object, which typically have not been stored before.

Virtual machine backups are taken using the Data Protection for VMware product.
  Full backup: Yes. VMware full backups often experience savings with matches from the backups of other virtual machines, as well as from regions from the same virtual disk that are in common.

Retention settings

In general, the more versions you set TSM policy to retain, the more savings you will realize from TSM deduplication as a percentage of the total you would have needed to store without deduplication. Users who desire to retain more versions of objects in TSM storage find this to be more cost effective when using deduplication. Consider the example below, which shows the accumulated storage used over a series of backups using the Data Protection for Oracle product. You can see that ten backup versions are stored with deduplication using less capacity than three backup versions require without deduplication.

[Figure: Cumulative Data Stored (120 GB Oracle backups). Data stored (MB) versus backup number (b1 to b10) for the Dedup and No Dedup cases, annotated with a 75% reduction.]

4.2 Interaction of compression and deduplication

The TSM client provides the ability to compress data, with the potential to provide additional storage savings by combining both compression and deduplication. With TSM deduplication, you will need to decide whether to perform deduplication at the client, server, or in some combination. This section will guide you through the analysis that should happen in making that decision, taking into consideration the fact that combining deduplication and compression is only possible on the clients.

How deduplication and compression interact with TSM

In general, deduplication technologies are not very effective when applied to data that is previously compressed. However, by compressing data after it is already deduplicated, additional savings can be gained. When deduplication and compression are both performed by the TSM client, the operations are sequenced in the desirable order of first applying deduplication, followed by compression. The following list summarizes key points of the TSM implementation, which will help explain other information to follow in this section:

31 The TSM client can perform deduplication combined with compression. The TSM server can perform deduplication, but cannot perform compression. If data is compressed prior to being passed to the TSM client, it is not possible to perform deduplication prior to compression. For example, certain databases provide the ability to compress a backup stream prior to passing the stream to a Tivoli for Data Protection client. In these cases, the data will be compressed prior to TSM performing deduplication. The most significant reduction in data size is typically a result of performing the combination of client-side deduplication and compression. The additional savings provided by compression will vary depending on how well the specific data responds to the TSM client compression mechanism Considerations related to compression when choosing between client-side and server-side deduplication Typically, the decision of whether to use data reduction technologies on the TSM client depends on your backup window requirements, and whether your environment is network-constrained. With constrained networks, using data reduction technologies on the client may actually improve backup elapsed times. Without a constrained network, the use of client-side data reduction technologies will typically result in longer backup elapsed times. The following questions are important to consider when choosing whether to implement client-side data reduction technologies: 1. Is the speed of your backup network limiting backup elapsed times? 2. What is more important to your business: the amount of storage savings you achieve through data reduction technologies, or how quickly backups complete? If the answer to the first question is yes, using data reduction technologies on the client may result in both faster backups and increased storage savings on the TSM server. 
More often, the answer to the first question is no, in which case you need to weigh the trade-off between having the fastest possible backup elapsed times and gaining the maximum amount of storage pool savings.
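The two questions above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not part of TSM: the function name, the `Priority` enum, and the returned strings are hypothetical, and the outcomes simply restate the guidance given in this section.

```python
from enum import Enum

# Hypothetical labels for the two outcomes discussed in this section.
CLIENT_SIDE = "client-side deduplication combined with compression"
SERVER_SIDE = "server-side deduplication"


class Priority(Enum):
    """Answer to question 2: what matters more to the business."""
    FASTEST_BACKUP = "fastest backup"
    MAX_STORAGE_SAVINGS = "maximum storage savings"


def recommend_data_reduction(network_constrained: bool, priority: Priority) -> str:
    """Suggest where to perform data reduction (illustrative sketch only)."""
    if network_constrained:
        # Question 1 is "yes": client-side reduction may give both
        # faster backups and more storage savings.
        return CLIENT_SIDE
    if priority is Priority.MAX_STORAGE_SAVINGS:
        # Fast network, but storage savings matter most.
        return CLIENT_SIDE
    # Fast network and backup speed matters most.
    return SERVER_SIDE


if __name__ == "__main__":
    print(recommend_data_reduction(False, Priority.FASTEST_BACKUP))
```

A real decision would also weigh client CPU capacity and restore requirements, which this sketch deliberately leaves out.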

[Figure: Client Dedup with compression vs Server Dedup with compression (20GB object, 1% change rate). Y-axis: Reduction Savings (0% to 100%); X-axis: Backup number (b1 to b10); series: CliDedup+comp, Dedup only, ServDedup+comp, Comp only.]

Totals after ten backups

                                    Stored      Saved       %reduced   Elapsed Time (seconds)
No Dedup, no comp                   200 GB
Comp only                           [...] GB    94.2 GB     47.1%      665 (1.7x)
Client compression + server dedup   83.5 GB     [...] GB    58.3%      -
Dedup only                          50.0 GB     [...] GB    75.0%      -
Client dedup + compression          27.5 GB     [...] GB    86.2%      618 (1.5x)

The graph above shows a 20GB object going through a series of ten backups. For each of the ten backups, the object in the same state was run through different data reduction mechanisms in TSM to allow comparing the behavior of each. The table summarizes the cumulative totals stored and saved for each of the techniques, along with elapsed times in some cases.

The following observations can be made from these results:

- The most significant storage savings of 86% is seen with the combination of client-side deduplication and compression. There is a cost of a 1.5 times increase in the backup elapsed time versus a backup with no client-side data reduction.
- The addition of compression provides an additional 11% savings beyond the 75% that is possible using deduplication alone.

- With compression alone, there is a savings of 47%. This is a fairly typical savings seen with TSM compression.
- With deduplication alone (either client-side or server-side), there is a savings of 75%.
- There was no savings for the first backup with deduplication alone. This is typical with unique objects such as databases. The additional savings seen on the initial backup is one area in which compression provides substantial savings beyond what deduplication provides.
- Applying server-side deduplication to data that is already compressed by the client results in a savings of 58%, lower than the 75% that can be achieved using server-side deduplication alone.

Caution: Your application may compress data before it is passed to the TSM client. This results in a similarly reduced deduplication savings. In these cases, it is best to either disable the application compression, or send this data to a storage pool that does not use deduplication.

The bottom line:

- For the fastest backups on a fast network, choose server-side deduplication.
- For the largest storage savings, choose client-side deduplication combined with compression.
- Avoid performing client compression in combination with server-side deduplication.

4.3 Understanding the TSM deduplication tiering implementation

The deduplication implementation in TSM uses a tiered model where larger objects are processed with larger average chunk sizes, with the goal of limiting the number of chunks that an object will be split into. The tiering model is used to avoid operational problems that arise when the TSM server needs to operate on objects consisting of very large numbers of chunks, and also to limit the growth of the TSM database. The use of larger average chunk sizes has the trade-off of limiting the amount of savings achieved by deduplication.
The TSM server provides three different tiers that are used for different ranges of object sizes.

Controls for deduplication tiering

There are two options on the TSM server that control the object size thresholds at which objects are processed in tier2 or tier3. All objects with sizes smaller than the tier2 threshold are processed in tier1. By default, objects under 100GB in size are processed in tier1. Objects in the range of 100GB to under 400GB are processed in tier2, and all objects 400GB and larger are processed in tier3.

Avoid making adjustments to the options controlling the deduplication tier thresholds. Changes to the thresholds after data has been stored can prevent newly stored data from matching data stored in previous backups, and can also cause operational problems if the changes cause larger objects to be processed in the lower tiers.

Very large objects can be excluded from deduplication using the options clientdeduptxnlimit and serverdeduptxnlimit. The storage pool parameter maxsize can also be used to prevent large objects from being stored in a deduplicated storage pool.

Option: DedupTier2FileSize
   Allowed values (GB): Minimum: 20, Maximum: 9999, Default: 100
   Implications of the default: Objects that are smaller than 100GB are processed in tier1. Objects of 100GB and up to the tier3 setting are processed in tier2.

Option: DedupTier3FileSize
   Allowed values (GB): Minimum: 90, Maximum: 9999, Default: 400
   Implications of the default: Objects that are 400GB and larger are processed in tier3. Objects that are smaller than 400GB are processed in tier2, down to the tier2 threshold, below which they are processed in tier1.

The impact of tiering to deduplication storage reduction

The chart below gives an example of the impact that tiering has on deduplication savings. For the test below, the same DB2 database was processed through a series of ten sets of backups with a varying change pattern applied after each set of backups. For each set of backups, the object in the same state was tested using the three different deduplication tiers, each being stored in its own storage pool. The table below gives the cumulative savings for each tier across the ten backups.

The following observations can be made:

- Deduplication is always more effective at reducing data in the lower tiers.
- The amount of difference in data reduction between the tiers depends on how the objects change between backups. For data with low volatility, there is less impact to savings from tiering.
- As a general rule of thumb, you can estimate that there will be approximately 17% loss of deduplication savings as you move through each tier.

[Figure: Dedup Savings (120GB DB2). Y-axis: Dedup Savings (0% to 100%); X-axis: Backup number (b1 to b10); series: Tier 1, Tier 2, Tier 3.]
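The default tier thresholds and the rule of thumb above can be expressed as a short calculation. This is a minimal Python sketch, not TSM code: the function names are hypothetical, and it assumes the "approximately 17% loss" is meant in percentage points of savings, which is consistent with the DB2 results reported in this section (77.9%, 61.2%, 44.2%).

```python
# Defaults of the DedupTier2FileSize and DedupTier3FileSize server options (GB).
DEFAULT_TIER2_GB = 100
DEFAULT_TIER3_GB = 400


def dedup_tier(object_size_gb: float,
               tier2_gb: float = DEFAULT_TIER2_GB,
               tier3_gb: float = DEFAULT_TIER3_GB) -> int:
    """Return the tier (1, 2 or 3) an object of the given size falls into."""
    if object_size_gb < 0:
        raise ValueError("object size must not be negative")
    if object_size_gb >= tier3_gb:
        return 3
    if object_size_gb >= tier2_gb:
        return 2
    return 1


def estimated_savings_pct(tier1_savings_pct: float, tier: int,
                          loss_per_tier_points: float = 17.0) -> float:
    """Rough savings estimate for a tier, given the tier1 savings.

    Assumption: the rule-of-thumb loss is applied as percentage points
    per tier step; the result is floored at zero.
    """
    if tier not in (1, 2, 3):
        raise ValueError("tier must be 1, 2 or 3")
    return max(0.0, tier1_savings_pct - loss_per_tier_points * (tier - 1))


if __name__ == "__main__":
    for size in (50, 120, 400):
        print(size, "GB -> tier", dedup_tier(size))
```

The estimate is only a planning aid; as noted above, the actual difference between tiers depends on how the objects change between backups.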

Average chunk size after 10 backups: tier1 70K, tier2 300K, tier3 860K

Totals after ten backups

            No Dedup      Tier1       Tier2       Tier3
Stored      1,236.1 GB    [...] GB    [...] GB    [...] GB
Saved                     [...] GB    [...] GB    [...] GB
%reduced                  77.9%       61.2%       44.2%

Client controls that optimize deduplication efficiency

Controls are available on some TSM client types that prevent objects from becoming too large. This allows large objects to be processed as multiple smaller objects which fall into the tier1 range. There is not a method to accomplish this for every client type, but here are some strategies that have proven effective at keeping objects within the tier1 threshold:

- For Oracle database backups, use the RMAN MAXPIECESIZE option to prevent any individual object from crossing the tier2 size threshold. More recommendations on this topic follow in a later section.
- For Microsoft SQL database backups that use the legacy backup API, the database can be split across multiple streams. Each stream that is used results in a separate object being stored on the TSM server. A 200GB database, for example, can be backed up with four streams, which results in approximately four 50GB objects that will all fit within the default tier1 size threshold.

4.4 What kinds of savings can I expect for different application types

No specific guarantee of TSM deduplication data reduction can be made for specific application types. It is possible to construct an implementation of any of the applications discussed in this section with initial data, and apply changes to that data, in such a way that any deduplication system would show poor results. What this section does provide is examples of how specific implementations of these applications, undergoing reasonable patterns of change, respond to TSM deduplication. This information can be considered a likely outcome of using TSM deduplication in your environment.
More specific results for your environment can only be obtained by testing your real data with TSM over a period of time.

In the sections that follow, sample deduplication savings are given for specific applications that result from taking a series of backups with TSM. Each of these examples shows results from only using deduplication, so improved results are possible by combining deduplication and compression. Comparisons across the three different deduplication tiers are given except for applications where using the higher tiers can be avoided. Client-side deduplication was used for all of the tests.

There are tables in the following sections that include elapsed times. These are given so that you can make relative comparisons and should not be considered indicators of the performance you will see. There are many factors that will influence actual backup elapsed times, including network performance.
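The object-splitting strategies described under "Client controls that optimize deduplication efficiency" come down to simple arithmetic. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, and it assumes a backup is divided into roughly equal parts, which the document only describes as approximate for SQL streams.

```python
import math

# Default of the DedupTier2FileSize server option (GB); parts must stay below it.
DEFAULT_TIER2_GB = 100


def parts_needed(object_size_gb: float, max_part_gb: float) -> int:
    """Smallest number of equal parts so that no part exceeds max_part_gb."""
    if object_size_gb <= 0 or max_part_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(object_size_gb / max_part_gb))


def part_size_gb(object_size_gb: float, parts: int) -> float:
    """Approximate size of each part when the object is split evenly."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    return object_size_gb / parts


def stays_in_tier1(part_gb: float, tier2_gb: float = DEFAULT_TIER2_GB) -> bool:
    """True if a part of this size is processed in tier1."""
    return part_gb < tier2_gb


if __name__ == "__main__":
    # A 200GB SQL database backed up with four streams.
    size = part_size_gb(200, 4)
    print(size, stays_in_tier1(size))
    # A 120GB Oracle backup limited to 10GB pieces.
    print(parts_needed(120, 10))
```

In practice the pieces or streams are rarely exactly equal, so it is worth leaving headroom below the tier2 threshold rather than sizing parts right at the limit.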

4.4.1 IBM DB2

[Figure: Cumulative Data Stored (120 GB DB2 backups). Y-axis: Data Stored (MB); X-axis: Backup number; series: Tier 3, Tier 2, Tier 1, No Dedup.]

Totals after ten backups

            Stored        Saved       %reduced   Elapsed Time (seconds)
No Dedup    1,236.1 GB
Tier 1      [...] GB      [...] GB    77.9%      3541 (2.4x)
Tier 2      [...] GB      [...] GB    61.2%      2955 (2x)
Tier 3      [...] GB      [...] GB    44.2%      2712 (1.9x)

4.4.2 Microsoft SQL

[Figure: Cumulative Data Stored (93 GB MS SQL backups). Y-axis: Data Stored (MB); X-axis: Backup number (b1 to b10); series: Tier 3, Tier 2, Tier 1, No Dedup.]

Totals after ten backups

            Stored      Saved       %reduced   Elapsed time (seconds)
No Dedup    [...] GB
Tier 1      [...] GB    [...] GB    78.7%      6132 (5x)
Tier 2      [...] GB    [...] GB    61.8%      3286 (2.7x)
Tier 3      [...] GB    [...] GB    50.5%      2944 (2.4x)

4.4.3 Oracle

Backups using the Data Protection for Oracle product can achieve similar deduplication storage savings with the proper configuration. The test results summarized in the following charts only give values for tier 1. The other tiers were not tested because the RMAN MAXPIECESIZE option can be used to prevent objects from reaching sizes that require the higher tiers.

The following RMAN settings are recommended when performing deduplicated backups of Oracle databases with TSM:

- Use the maxpiecesize RMAN parameter to keep the objects sent to TSM within the tier 1 size range. Oracle backups can be broken into multiple objects of a specified size. This allows databases of larger sizes to be processed safely with tier1 deduplication processing. The parameter must be set to a value that is less than the TSM server DedupTier2FileSize parameter (defaults to 100GB).

  Recommended value: A maxpiecesize setting of 10GB provides a good balance between keeping each piece at an optimal size for handling by the TSM server and having too many resulting objects.
- Oracle RMAN provides the capability to multiplex the backups of database filesets across multiple channels. Using this feature will typically result in less effective TSM deduplication data reduction. Use the filesperset RMAN parameter to avoid splitting a fileset across multiple channels.
  Recommended value: A filesperset setting of 1 should be used for optimal deduplication data reduction.

Following is a sample RMAN script, which includes the recommended values for use with TSM deduplication:

   run
   {
      # Limit each backup piece to 10GB so that objects stay in tier 1
      allocate channel ch1 type 'SBT_TAPE' maxopenfiles=1 maxpiecesize 10G parms 'ENV=(TDPO_OPTFILE=/home/orc11/tdpo_10g.opt)';
      # One file per backup set to avoid multiplexing
      backup filesperset 1 (tablespace tbsp_dd);
      release channel ch1;
   }

[Figure: Cumulative Data Stored (120 GB Oracle backups). Y-axis: Data Stored (MB); X-axis: Backup number (b1 to b10); series: Dedup, No Dedup; annotation: 75% reduction.]

Totals after ten backups

            Stored      Saved       %reduced   Elapsed time (seconds)
No Dedup    [...] GB
Tier 1      [...] GB    [...] GB    74.5%      [...] (4.1x)

4.4.4 VMware

VMware backup using TSM for Virtual Environments is one area where TSM deduplication is commonly deployed. VMware backups typically show very substantial savings when the combination of client-side deduplication and compression is used. The following factors contribute to the substantial savings that are seen:

- There is often significant data in common across virtual machines. Part of this is the result of the same operating systems being installed and cloned across multiple virtual machines.
- The TSM for Virtual Environments requirement to periodically repeat full backups results in significant duplication across backup versions.
- Some duplicate data exists within the same virtual machine on the initial backup.

An example savings achieved with VMware backup using the combination of client-side deduplication and compression is 8 to 1 (87.5%).
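Savings in this document are quoted both as ratios (8 to 1) and as percentages (87.5%). The following Python sketch shows how the two forms relate and how a "%reduced" figure follows from the stored and unreduced totals. The function names are hypothetical; the sample values are taken from the VMware example above and from the compression comparison in section 4.2.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert a reduction ratio such as 8 (meaning 8 to 1) to percent saved."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1.0 - 1.0 / ratio) * 100.0


def percent_to_ratio(percent: float) -> float:
    """Convert percent saved to a reduction ratio (x to 1)."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100.0 / (100.0 - percent)


def percent_reduced(unreduced_gb: float, stored_gb: float) -> float:
    """Percent reduction given the unreduced total and the amount stored."""
    if unreduced_gb <= 0:
        raise ValueError("unreduced total must be positive")
    return (1.0 - stored_gb / unreduced_gb) * 100.0


if __name__ == "__main__":
    print(ratio_to_percent(8))         # 87.5
    print(percent_to_ratio(75))        # 4.0
    print(percent_reduced(200, 50.0))  # 75.0
```

Small percentage gains at the high end correspond to large ratio changes: moving from 75% to 87.5% doubles the ratio from 4 to 1 to 8 to 1.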

5 How to determine deduplication results

It is useful to evaluate the actual data reduction results from TSM deduplication to determine if the expected storage savings have been achieved. In addition to evaluating the data reduction results, other key operational factors should be checked, such as database utilization, to ensure that they are consistent with expectations.

Deduplication results can be determined by various queries to the TSM server from the administrative command line or Administration Center interface. It is important to recognize the dynamic nature of deduplication and that the benefits of deduplication are not always realized immediately after data is backed up. Also, since the scope of deduplication includes multiple backups across multiple hosts, it will take time to accumulate sufficient data in the TSM storage pool to be effective at eliminating duplicates. Therefore, it is important to sample results at regular intervals, such as weekly, to obtain a valid report of the results.

In addition to checking data reduction results, TSM provides queries that can show pending activity for deduplication processing. These queries can be issued to determine an overall assessment of deduplication processing in the server. A script has been developed to assist administrators with monitoring of deduplication-related processing. The script source is provided in the appendix of this document.

5.1 Simple TSM Server Queries

QUERY STGPOOL

The QUERY STGPOOL command provides a basic and quick method for evaluating deduplication results. However, if the query is run prior to reclamation of the storage pool, the Duplicate Data Not Stored value will be inaccurate and will not reflect the most recent data reduction.

Example command:

   Query stgpool format=detailed

Example output:

   Estimated Capacity: 9,848 G
   Space Trigger Util: 60.7
   Pct Util: 60.7
   Pct Migr: 60.7
   Pct Logical: 98.7
   <...>
   Deduplicate Data?: Yes
   Processes For Identifying Duplicates: 0
   Duplicate Data Not Stored: 28,387 G (87%)
   Auto-copy Mode: Client
   Contains Data Deduplicated by Client?: Yes

The displayed value of Duplicate Data Not Stored shows the actual reduction of data in megabytes or gigabytes, and the percentage of reduction of the storage pool. If reclamation has not yet occurred, the following example shows the pending amount of data that will be removed:

In this example backuppool-file is the name of the deduplicating storage pool.

Other server queries affected by deduplication

QUERY OCCUPANCY

When a filespace is backed up to a deduplicated storage pool, the QUERY OCCUPANCY command will show the logical amount of storage per filespace. The physical space is displayed as 0.00 because this information cannot be determined on an individual filespace basis. An example is shown below:

Early versions of the TSM V6 server incorrectly maintained occupancy records in certain cases, which can result in an incorrect report of the amount of stored data. The following technote provides information on how to repair the occupancy information if necessary:

TSM client reports

When using client-side deduplication, the client summary report will show the data reduction associated with deduplication as well as compression. An example is shown here:

   Total number of objects inspected: 380,194
   Total number of objects backed up: 573
   Total number of objects updated: 0
   Total number of objects rebound: 0
   Total number of objects deleted: 0
   Total number of objects expired: 72
   Total number of objects failed: 0
   Total objects deduplicated: 324
   Total number of bytes inspected: 1.19 TB
   Total number of bytes processed: [...] MB
   Total bytes before deduplication: 1.01 GB
   Total bytes after deduplication: [...] MB
   Total number of bytes transferred: [...] MB
   Data transfer time: [...] sec
   Network data transfer rate: 6,[...] KB/sec
   Aggregate data transfer rate: [...] KB/sec
   Objects compressed by: 0%
   Deduplication reduction: 87.30%
   Total data reduction ratio: 99.99%
   Elapsed processing time: 00:13:[...]

TSM deduplication report script

A script has been developed to provide detailed information on deduplication results for a TSM server. In addition to providing summary information on the effectiveness of TSM deduplication, it can also be used to gather diagnostics if deduplication results are not consistent with expectations. The script and usage instructions can be obtained from the TSM support site:

An example of the summary data provided by this report is shown below:

The report also provides details of dedup-related utilization of the TSM database.

<End of Document>