Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication


Authors: Jason Basler, Dan Wolfe

Document Location

This is a snapshot of an on-line document. Paper copies are valid only on the day they are printed. The document is stored at the following location: Storage Manager/page/Deduplication

Revision History

/17/12: Initial publication
/31/12: Clarification on the deduprequiresbackup option and other minor edits
/10/13: General updates on best practices
/27/13: Added information covering deduplication of Exchange data
/09/13: Major revision to reflect scalability and best practice improvements provided by TSM
/17/15: Revised to include references to TSM Blueprints and Solutions

Disclaimer

The information contained in this document is distributed on an "as is" basis without any warranty either expressed or implied. This document has been made available as part of the IBM developerWorks wiki, and is hereby governed by the terms of use of the wiki as defined at the following location:

Acknowledgements

The authors would like to express their gratitude to the following people for contributions in the form of adding content, editing, and providing insight into TSM technology:

Matt Anglin, Tivoli Storage Manager Server Development
Dave Cannon, Tivoli Storage Manager Architect
Robert Elder, Tivoli Storage Manager Performance Evaluation
Tom Hughes, Executive, WW Storage Software
Kathy Mitton, Tivoli Storage Manager Server Development
Harley Puckett, Tivoli Storage Software Development - Executive Consultant
Michael Sisco, Tivoli Storage Manager Server Development
Richard Spurlock, CEO and Founder, Cobalt Iron

Contents

1 Introduction
  Scope of This Document
  Overview
    Description of deduplication technology
    Data reduction and data deduplication
    Server-side and client-side deduplication
    Pre-requisites for configuring TSM deduplication
    Comparing TSM deduplication and appliance deduplication
  Conditions for effective use of TSM deduplication
    Traditional TSM architectures compared with deduplication architectures
    Examples of appropriate use of TSM deduplication
    Data characteristics for effective deduplication
  When is it not appropriate to use TSM deduplication?
    Primary storage of backup data is on VTL or physical tape
    No flexibility with the backup processing window
    Restore performance considerations
2 Resource requirements for TSM deduplication
  Database and log size requirements
    TSM database capacity estimation
    TSM database log size estimation
  Estimating capacity for deduplicated storage pools
    Estimating storage pool capacity requirements
  Hardware recommendations and requirements
    Database I/O requirements
    CPU
    Memory
    Considerations for the storage pool disk
    Hardware requirements for TSM client deduplication
3 Implementation guidelines
  Deciding between client and server deduplication
  TSM Deduplication configuration recommendations
    Recommendations for deduplicated storage pools
    Recommended options for deduplication
    Best practices for ordering backup ingestion and data maintenance tasks

4 Estimating deduplication savings
  Factors that influence the effectiveness of deduplication
    Characteristics of the data
    Impacts from backup strategy decisions
    Effectiveness of deduplication combined with progressive incremental backup
  Interaction of compression and deduplication
    How deduplication and compression interact with TSM
    Considerations related to compression when choosing between client-side and server-side deduplication
  Understanding the TSM deduplication tiering implementation
    Controls for deduplication tiering
    The impact of tiering to deduplication storage reduction
  Client controls that optimize deduplication efficiency
  What kinds of savings can I expect for different application types
    IBM DB2
    Microsoft Exchange
    Microsoft SQL
    Oracle
    VMware
    SAP Backup using Tivoli Storage Manager for ERP
5 How to determine deduplication results
  Simple TSM Server Queries
    QUERY STGPOOL
    Other server queries affected by deduplication
  TSM client reports
  TSM deduplication report script

1 Introduction

Data deduplication is a technology that removes redundant data to reduce the storage capacity requirement for retaining the data. When deduplication technology is applied to data protection it can provide a highly effective means for reducing the overall cost of a data protection solution. Tivoli Storage Manager introduced deduplication technology beginning with TSM V6.1. This document describes the benefits of deduplication and provides guidance on how to make effective use of the TSM deduplication feature as part of a well-designed data protection solution.

Although the information in this document is focused on TSM Version 7, particularly with regard to scalability and ingest rates, most of the information is still relevant to the older Version 6. Many improvements to deduplication have been made in Version 7 and associated maintenance releases, so upgrading to Version 7 should be seriously considered for TSM deployments that use TSM deduplication. Many of the recommendations throughout the document assume you are running the latest version of the TSM server available at the time of this publication.

Following are key points regarding TSM deduplication:

- TSM deduplication is an effective tool for reducing the overall cost of a backup solution.
- Additional resources (database capacity, CPU, and memory) must be configured for a TSM server that is enabled with TSM deduplication. However, when properly configured, the benefit of storage pool capacity reduction is most likely to result in a significant cost reduction.
- Cost reduction is the result of data reduction. Deduplication is just one of several methods that TSM provides for data reduction (such as progressive incremental backup). The goal is overall data reduction when all of the techniques are combined, rather than just the deduplication ratio.
- TSM deduplication can operate on backup, archive, and HSM data. This includes data that is stored via the TSM API.
- TSM deduplication is an appropriate data reduction method for many situations. It can also be used as a cost-effective option for backing up a subset of an environment that uses a deduplication appliance for the remaining backups.

1.1 Scope of This Document

This document is intended to provide the technical background, general guidelines, and best practices for using TSM deduplication. Detailed information for configuring TSM servers with deduplication is provided in the TSM Blueprints documentation. The Blueprint documentation and scripts should be considered for use when configuring a TSM server. In addition to the Blueprints, guidelines for selecting a TSM architecture are documented in the TSM Solutions. This document does not provide comprehensive instruction and guidance for the administration of TSM, and should be used in addition to the TSM product documentation.

1.2 Overview

Description of deduplication technology

Deduplication technology detects patterns within data that appear multiple times within the scope of a collection of data. For the purposes of this document, the collection of data consists of TSM backup, archive, and HSM data (all of these types of data will be referred to as backup data throughout this document).

The patterns that are detected are represented as a hash value that is much smaller than the original pattern, specifically 20 bytes. Except for the original instance of the pattern, subsequent instances of the chunk are referenced by the hash value. As a result, for a pattern that appears many times throughout a given collection of data, a significant reduction in storage can be achieved.

Unlike compression, deduplication can take advantage of a pattern that occurs multiple times within a collection of data. With compression, a single instance of a pattern is represented by a smaller amount of data that is used to algorithmically recreate the original data pattern. Compression cannot take advantage of common data patterns that recur throughout the collection of data, and this significantly reduces the potential reduction capability. However, compression can be combined with deduplication to take advantage of both techniques and further reduce the required amount of data storage beyond just one technique or the other.

TSM deduplication use compared with other deduplication approaches

Deduplication technology of any sort requires CPU and memory resources to detect and replace duplicate chunks of data, as described throughout this section. Software-based technologies such as TSM deduplication create similar outcomes to hardware-based or appliance technologies. By using a software-based solution, the need to procure specialized, and therefore comparatively expensive, dedicated hardware is removed. This means that with TSM-based deduplication, standard hardware components such as servers and storage can be used. Because TSM has significant data efficiencies compared to other software-based deduplication technologies (described later in this document), there is less duplicate data to detect, process, and remove. Therefore, all other things being equal, TSM requires less of this standard hardware resource to function compared to other software-based deduplication technologies. Care should still be taken in planning and implementing this technology, but under the majority of use cases TSM provides a viable, proven technical platform where available. Where not available, such as when performing backups over the storage area network (SAN), alternate technologies such as a VTL provide an appropriate architectural solution. The diagram below outlines the reference architectures for these use cases and highlights some key considerations.

How does TSM perform deduplication

TSM uses an algorithm to analyze variable-sized, contiguous segments of data, called chunks, for patterns that are likely to be duplicated within the same TSM storage pool. This process is explained in more detail in a later section of this document. As described above, the repeated identical chunks of data are removed and replaced with a smaller pointer. The TSM implementation of deduplication applies only to FILE device class (sequential-access disk) storage pools, and can be used with primary, copy, or active-data pools.

Data reduction and data deduplication

When using data deduplication to substantially reduce storage capacity requirements, it is important to consider the other data reduction techniques that are available. Unlike other backup products, TSM provides a substantial advantage in data reduction through its native capability to back up data only once, rather than creating duplicate data by repeatedly backing up unchanged files and other data. Because TSM combines incremental-forever backup technology with deduplication and compression, the overall data reduction effectiveness of TSM should be considered rather than just the reduction from deduplication alone. This inherent efficiency, combined with deduplication, compression, exclusion of specified objects, and appropriate retention policies, enables TSM to provide highly effective data reduction. If reduction of storage and infrastructure costs is the goal, the focus will be on overall data reduction effectiveness, with data deduplication effectiveness as one component.

The following table provides a summary of the data reduction technologies that TSM offers:

| | Client compression | Incremental forever | Subfile backup | Deduplication |
| How data reduction is achieved | Client compresses files | Client only sends changed files | Client only sends changed regions of a file | Eliminates redundant data chunks |
| Conserves network bandwidth? | Yes | Yes | Yes | When client-side deduplication is used |
| Data supported | Backup, archive, HSM, API | Backup | Backup (Windows only) | Backup, archive, HSM, API (HSM supported only for server-side deduplication) |
| Scope of data reduction | Redundant data within same file on client node | Files that do not change between backups | Unchanged regions within previously backed up files | Redundant data from any data in storage pool |
| Avoids storing identical files renamed, copied, or relocated on client node? | No | No | No | Yes |
| Removes redundant data for files from different client nodes? | No | No | No | Yes |
| Can be used with any type of storage pool configuration? | Yes | Yes | Yes | No |

Server-side and client-side deduplication

TSM provides two options for performing deduplication: client-side and server-side deduplication. Both methods use the same algorithm to identify redundant data; however, the when and where of the deduplication processing is different.

Server-side deduplication

With server-side deduplication, all of the processing of redundant data occurs on the TSM server, after the data has been backed up. Server-side deduplication is also called target-side deduplication. The key characteristics of server-side deduplication are:

- Duplicate data is identified after backup data has been transferred to the storage pool volume.
- The duplicate identification processing must run regularly on the server, and will consume TSM server memory, CPU, and TSM database resources.

- Storage pool data reduction is not realized until data from the deduplicated storage pool is moved to another storage pool volume, usually through a reclamation process, although it can also occur during a TSM MOVE DATA process.

Client-side deduplication

Client-side deduplication processes the redundant data during the backup process on the host system where the source data is located. When used without TSM client compression, the net results of deduplication are virtually the same as with server-side deduplication, except that the storage savings are realized immediately, since only the unique data needs to be sent to the server in its entirety. Data that is duplicated requires only a small signature to be sent to the TSM server. Client-side deduplication is especially effective when it is important to conserve bandwidth between the TSM client and server.

In some cases, client-side deduplication has the potential to be more scalable than server-side deduplication due to the reduced I/O demands that result from removing redundant data before it is sent to the TSM server. A number of conditions must exist for this to be the case:

- Sufficient client CPU resource to perform the duplicate identification processing that occurs in-line during backup.
- The ability to drive parallel client sessions, where the number of client sessions exceeds the number of identify duplicates processes the server is capable of running.
- The combination of the TSM database running on fast disk, and a high-bandwidth, low-latency network between the clients and server.

Using Client Compression with Client-side Deduplication

Further reduction of backup data can be achieved by configuring client compression in addition to client-side deduplication. When client compression is configured, the deduplication process occurs first, and then the data chunk is compressed before being sent to the server. The compression is only applied when data is sent to the server; if a duplicate chunk of data is identified, no data is sent. Although actual results are highly dependent upon the source data, the use of compression can reduce the amount of backup data by an additional 5-15% beyond the reduction from deduplication. Consider the following when using compression with client-side deduplication:

- Compression requires an additional processing step, and therefore requires some additional CPU resource on the client host. This additional amount of resource is typically not an issue unless the CPU resource is already heavily utilized.
- Backup performance will also be affected by this additional step, although this impact is often mitigated by the reduced amount of data transfer required.
- Due to the additional processing step required for uncompressing the data, restore performance can be impacted by the use of client compression.

Client deduplication cache

Although it is necessary for the backup client to check in with the server to determine whether a chunk is unique or a duplicate, the amount of data transfer is small. The client must query the server for each chunk of data that is processed. The overhead associated with this query process can be reduced substantially by configuring a cache on the client, which allows previously discovered chunks (during the backup session) to be identified without a query to the TSM server. For the backup-archive client (including VMware backup), it is recommended to always configure a cache when using client-side deduplication.
For applications that use the TSM API, the deduplication cache should not be used, due to the potential for backup failures caused by the cache being out of sync with the TSM server. If multiple, concurrent TSM client sessions are configured (such as with a TSM for VMware vStorage backup server), a separate cache must be configured for each session. There are also conditions where faster performance is possible when the deduplication cache is disabled: when the network between the clients and server has high bandwidth and low latency, and the TSM server database is on fast storage, deduplication queries made directly to the TSM server can outperform queries to the local cache.
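The options below are a minimal sketch of how the client-side deduplication settings discussed above might appear in a backup-archive client options file (dsm.opt on Windows, or the server stanza of dsm.sys on UNIX and Linux). The cache path, cache size, and exclude pattern are illustrative values only, and the node must also be enabled for client-side deduplication on the server, as described in the prerequisites that follow.

deduplication      yes
* Optional: combine with client compression for maximum data reduction
compression        yes
* Cache duplicate-chunk queries locally; recommended for backup-archive clients,
* but not for applications that use the TSM API
enablededupcache   yes
dedupcachepath     /tsm/dedupcache
dedupcachesize     1024
* Optionally exclude data that does not deduplicate well, for example encrypted files
exclude.dedup      /encrypted/.../*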

Pre-requisites for configuring TSM deduplication

This section provides a general description of the pre-requisites for using TSM deduplication. For a complete list of pre-requisites, refer to the TSM administrator documentation.

Pre-requisites common to client and server-side deduplication

- The destination storage pool must be of type FILE (sequential disk).
- The target storage pool must have the deduplication setting enabled.
- The TSM database must be configured according to best practices for high performance.

Pre-requisites specific to client-side deduplication

When configuring client-side TSM deduplication, the following requirements must be met:

- The client and server must be at version 6.2 or later. The latest maintenance version should always be used.
- The client must have the client-side deduplication option enabled (DEDUPLICATION YES).
- The server must enable the node for client-side deduplication with the DEDUP=CLIENTORSERVER parameter using either the REGISTER NODE or UPDATE NODE command.
- Files must be bound to a management class with the destination parameter pointing to a storage pool that is enabled for deduplication.
- By default, all client files that are at least 2KB and smaller than the value specified by the server clientdeduptxnlimit option are processed with deduplication. The exclude.dedup client option provides a way to selectively exclude certain files from client-side deduplication processing.

The following TSM features are incompatible with TSM client-side deduplication:

- Client encryption
- LAN-free/storage agent
- UNIX HSM client
- Subfile backup
- Simultaneous storage pool write

Comparing TSM deduplication and appliance deduplication

TSM deduplication provides the most cost-effective solution for reducing backup storage costs, since there is no additional software license charge for it, and it does not require special-purpose deduplicating hardware appliances. Deduplication of backup data can also be accomplished by using a deduplicating storage device in the TSM storage pool hierarchy. Deduplication appliances such as IBM's ProtecTIER and EMC's Data Domain provide deduplication capability at the storage device level. NAS devices are also available that provide NFS or CIFS mounted storage that removes redundant data through deduplication.

An optimal balance can be struck between TSM deduplication and storage appliance deduplication. Both techniques can be used in the same environment for separate storage hierarchies or in separate TSM server instances. For example, TSM client-side deduplication is an ideal choice for backing up remote environments, either to a local TSM server or to a central datacenter. TSM node replication can then take advantage of the deduplicated storage pools to reduce data transfer requirements between TSM servers for disaster recovery purposes. Alternatively, within a large datacenter, a separate TSM server may be designated for backing up a critical subset of all hosts using TSM deduplication. The remaining hosts would back up to a separate TSM server instance that uses a deduplicating appliance such as ProtecTIER for its primary storage pool and also supports replication of the deduplicated data.

TSM deduplication should not be used in the same storage hierarchy as a deduplicating appliance. For a deduplicating VTL, the TSM storage pool data would need to be rehydrated before moving to the VTL (as with any tape device), and there would be no data reduction as a result of TSM deduplication; rather, the data would be re-deduplicated by the VTL. For a deduplicating NAS device, a FILE device type could be created on the NAS. However, since the data is already deduplicated by TSM, there would be little to no additional data reduction possible by the NAS device.

Factors to consider when comparing TSM and appliance deduplication

There are three major factors to consider when deciding which deduplication technology to use: scale, scope, and cost.

Scale

The TSM deduplication technology is a scalable solution that uses software technology making heavy use of TSM database transactions. The deduplication processing has an impact on daily server processes such as reclamation and storage pool backup. For a specific TSM server hardware configuration (for example, TSM database disk speed, processor and memory capability, and storage pool device speeds), there is a practical limit to the amount of data that can be backed up using deduplication. The two primary points of scalability to consider are the daily amount of new data that is ingested, as well as the total amount of data that will be protected over time. The practical limits described are not hard limits in the product, and will vary based on the capabilities of the hardware that is used.

The limit on the amount of protected data is presented as a guideline with the purpose of keeping the size of the TSM database below the recommended limit of 4TB. A 4TB database corresponds roughly to 400TB of protected data (original data plus all retained versions). There is no harm in occasionally exceeding the limit for daily ingest, which is prescribed with the goal of allowing enough time each day for the TSM server's maintenance tasks to run efficiently. Regularly exceeding the practical limit on daily ingest for your specific hardware may have an impact on the ability to achieve the maximum possible amount of data reduction, or cause backup durations to run longer than desired. Deduplication appliances have dedicated resources for deduplication processing and do not have a direct impact on TSM server performance and scalability. If it is desired to scale up a single TSM server instance as much as possible, beyond approximately 400TB of protected data (original data plus all retained versions), then appliance deduplication may be considered.
However, a more cost-effective approach is often to scale out with additional TSM server instances. Using additional TSM server instances can provide the ability to manage many multiples of 400TB of protected data. In addition to the scale of data stored, the daily amount of data backed up will also have a practical limit with TSM. The daily ingest limit is established by the capabilities of system resources as well as the inclusion of secondary processes such as replication and storage pool backup. Since deduplicating appliances are single-purpose devices, there is the potential for greater throughput due to the use of dedicated resources. A cost/benefit analysis should be performed to determine the appropriate choice or mix of deduplication technologies.

The following table provides some general guidelines for daily ingest ranges for each TSM server relative to hardware configuration choices.

| Ingest range | Server requirements | Storage requirements |
| Up to 4TB per day | 12 CPU cores, 64 GB RAM | Database and active log on SAS/FC 15K rpm; storage pool on NL-SAS/SATA or SAS |
| 4-8 TB per day | 24 CPU cores, 128 GB RAM | Database and active log on SAS/FC 15K rpm; storage pool on NL-SAS/SATA or SAS |
| 8-20 TB per day (up to 30TB per day with client-side deduplication) | 32 CPU cores, 192 GB RAM | Database and active log on SSD/flash storage; storage pool on SAS |

Scope

The scope of TSM deduplication is limited to a single TSM server instance, and more precisely to a single TSM storage pool. A single, shared deduplication appliance can provide deduplication across multiple TSM servers. When TSM node replication is used in a many-to-one architecture, such as with branch offices, the deduplicated storage pool on the replication target can deduplicate across the data incoming from the multiple source servers.

Cost

TSM deduplication functionality is embedded in the product without an additional software license cost; in fact, TSM software license costs are reduced when capacity-based licensing is in force, because the capacity is calculated after deduplication has occurred. It is important to consider that hardware resources must be appropriately sized and configured, so additional expense should be anticipated when planning a TSM server configuration that will be used with deduplication. However, these additional costs can easily be offset by the savings in disk storage. Deduplication appliances are priced for the performance and capability that they provide, and are generally considered more expensive per GB than the hardware required for TSM native deduplication. A detailed cost comparison should be done to determine the most cost-effective solution.

1.3 Conditions for effective use of TSM deduplication

Although TSM deduplication provides a cost-effective and convenient method for reducing the amount of disk storage required for backups, there are specific conditions under which it provides the most benefit. Conversely, there are conditions where TSM deduplication will not be effective and may in fact reduce the efficiency of a backup operation. Conditions that lead to effective use of TSM deduplication include the following:

- Need for reduction of the disk space required for backup storage.
- Need for remote backups over limited-bandwidth connections.
- Use of TSM node replication for disaster recovery across geographically dispersed locations.
- Total amount of backup data and data backed up per day are within the recommended limits of less than 400TB total and 30TB per day for each TSM server instance.
- Either a disk-to-disk backup should be configured (where the final destination of backup data is on a deduplicating disk storage pool), or data should reside in the FILE storage pool for a significant time (e.g., 30 days) or until expiration. Deduplicated storage pools should not be used as a temporary staging pool before moving data to tape or another non-deduplicating storage pool, since this can be highly inefficient.
- Backup data should be a good candidate for data reduction through deduplication. This topic is covered in greater detail in later sections.
- High-performance disk must be used for the TSM database to provide acceptable TSM deduplication performance.

Traditional TSM architectures compared with deduplication architectures

A traditional TSM architecture ingests data into disk storage pools, and moves this data to tape on a frequent basis to maintain adequate free space on disk for continued ingestion. An architecture that includes deduplication changes this model to store the primary copy of data in a sequential file storage pool for its entire life cycle. Deduplication provides enough storage savings to make keeping the primary copy on disk affordable. Tape storage pools still have a place in this architecture for maintaining a secondary storage pool backup copy for disaster recovery purposes, or for data with very long retention periods, for example 7 years or forever.

Although it is possible to migrate from a TSM deduplicated storage pool to a non-deduplicated storage pool (such as tape, VTL, or non-deduplicated disk), this practice is not advised. Migrating to a non-deduplicated storage pool is inefficient and can result in poor performance. An architecture that uses TSM deduplicated storage pools should be based on leaving the data in that storage pool until it expires.

Examples of appropriate use of TSM deduplication

This section contains examples of TSM architectures that can make the most effective use of TSM deduplication.

Deduplication with a secondary storage pool backup architecture

In this example the primary storage pool is a file-sequential disk storage pool configured for TSM deduplication. The deduplicated storage pool is backed up to a tape library copy storage pool using the storage pool backup capability. The use of a secondary copy storage pool is an optional feature which provides an extra level of protection against disk failure in your primary storage pool. Here are some general considerations when a copy storage pool will be used:

- Having a second copy storage pool on disk (which can also be deduplicated) is another option.
- Server-side deduplication is a two-step process: duplicate identification followed by removal of the excess data during a subsequent data movement process such as reclamation or migration. If desired, the second step can be deferred until after a storage pool backup copy is created. See the description of the deduprequiresbackup option in a later section for additional considerations.
- When using server-side deduplication, schedule the storage pool backup process prior to the reclamation processing to ensure that there is minimal overhead when copying the data (a sketch of such a schedule follows this list). After identify duplicates has run, the data is not yet deduplicated, but it is redefined such that it can be reconstructed and dehydrated during the subsequent data movement operation. Sufficient time must be allotted for the scheduled storage pool backup to complete before the start of the schedule for reclamation.
- When using client-side deduplication, the storage pool backup processing will always occur after data has been deduplicated. This requires deduplicated data to be reconstructed during the copy (if the copy storage pool is not also deduplicated). The reconstruction processing can make storage pool backup slower compared with storage pool backup of data that has not been deduplicated. For planning purposes, estimate that the duration of storage pool backup will be doubled for data which is already deduplicated.
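The administrative schedules below are a minimal sketch of the ordering described above for server-side deduplication: storage pool backup first, then duplicate identification, then reclamation. The pool names (deduppool, copypool), start times, durations, and process counts are illustrative assumptions and must be adjusted so that each task has enough time to finish before the next one starts.

> define schedule STGBACKUP type=administrative cmd="backup stgpool deduppool copypool maxprocess=4" active=yes starttime=06:00 period=1 perunits=days
> define schedule DEDUPID type=administrative cmd="identify duplicates deduppool numprocess=8 duration=240" active=yes starttime=10:00 period=1 perunits=days
> define schedule RECLAIM type=administrative cmd="reclaim stgpool deduppool threshold=40 duration=240" active=yes starttime=14:00 period=1 perunits=days

When the deduprequiresbackup server option is set to yes, duplicate data in a primary storage pool is not removed until it has been backed up to a copy storage pool, which reinforces this ordering.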

Deduplication with node replication copy

Node replication is included with TSM beginning with version 6.3. The node replication feature allows for an alternative architecture where deduplicated data is replicated to a second server in an incremental fashion that takes advantage of deduplication by replicating only unique data that has not previously been replicated. The reconstruction penalty described in a previous section for storage pool backup of deduplicated data is also avoided. The use of a node replication copy corresponds to the Tivoli Storage Manager Solutions multisite disk architecture, shown below.

Disk-to-disk backup

Disk-to-disk backup refers to the scenario where the preferred backup storage device is disk-based, as opposed to tape or a virtual tape library (VTL). Disk-based backup has become more popular as the unit cost of disk storage has fallen. It has also become more common as companies distinguish between backup data, which is kept for a relatively short amount of time, and archive data, which has long-term retention. Disk-to-disk backup still requires a backup of the storage pool data, and the backup or copy destination may be tape or disk. However, with disk-to-disk backup, the primary storage pool data remains on disk until it expires.

A significant reduction of disk storage can be achieved if the primary storage pool is configured for deduplication.

Data characteristics for effective deduplication

When considering the use of TSM deduplication, you should assess whether the characteristics of the backup data are appropriate for deduplication. A more detailed description of data characteristics for deduplication is provided in the section on estimating deduplication efficiency. General types of structured and unstructured data are good candidates for deduplication, but if your backup data consists mostly of unique binary images or encrypted data, you may wish to exclude these data types from a management class that uses a deduplicated storage pool.

1.4 When is it not appropriate to use TSM deduplication?

TSM deduplication can provide significant benefits and cost savings, but it does not apply to all situations. The following situations are not appropriate for using TSM deduplication:

Primary storage of backup data is on VTL or physical tape

Movement to tape requires rehydration of the deduplicated data. This takes extra time and requires processing resources. If regular migration to tape is required, the benefits of using TSM deduplication may be reduced, since the goal is to reduce disk storage as the primary location of the backup data.

No flexibility with the backup processing window

TSM deduplication processing requires additional resources, which can extend backup windows or server processing times for daily backup activities. For example, a duplicate identification process must run for server-side deduplication, and additional reclamation activity is required to remove the duplicate data from a storage pool after the duplicate identification processing completes. For client-side deduplication, the client backup speed will generally be reduced for local clients (remote clients may not be impacted if there is a bandwidth constraint). If the backup window has already reached the limit of service level agreements, TSM deduplication could impact the backup window further unless careful planning is done.

1.4.3 Restore performance considerations

Restore performance from deduplicated storage pools is slower than from a comparable disk storage pool that does not use deduplication. However, restore from a deduplicated storage pool can compare favorably to restore from tape devices for certain workloads. If the fastest possible restore performance from disk is a high priority, then restore performance benchmarking should be done to determine whether the effects of deduplication can be accommodated. The following table compares the restore performance of small and large object workloads across several storage scenarios.

| Storage pool type | Small object workload | Large object workload |
| Tape | Typically slower due to tape mounts and seeks | Typically faster due to the streaming capabilities of modern tape drives |
| Non-deduplicated disk | Typically faster due to absence of tape mounts and quick seek times | Comparable to or slightly slower than tape |
| Deduplicated disk | Faster than tape, slower than non-deduplicated disk | Slowest, since data must be rehydrated, compared to tape which is fast for streaming large objects that are not spread across many tapes |

2 Resource requirements for TSM deduplication

TSM deduplication provides significant benefits as a result of its data reduction technology, particularly when combined with other data reduction techniques available with TSM. However, the use of deduplication in TSM adds requirements for hardware and database/log storage, which are essential for a successful implementation. When configuring TSM to use deduplication, you must ensure that proper resources have been allocated to support the use of the technology. The resources include the hardware necessary to meet the additional processing performed during deduplication, additional storage for the TSM database records used to store the deduplication catalog, and additional storage for the TSM server database logs.

The TSM internal database plays a central role in enabling the deduplication technology. Deduplication requires additional database capacity to be available. In addition, there is a significant increase in the frequency of references to records in the database during many TSM operations, including backup, restore, duplicate identification, reclamation, and expiration. These demands on the database require that the database disk storage be capable of sustaining higher rates of I/O operations than would be required without the use of deduplication. As a result, planning for resources used by the TSM database is critical for a successful deduplication deployment. This section guides you through the estimation of resource requirements to support TSM deduplication.

2.1 Database and log size requirements

TSM database capacity estimation

Use of TSM deduplication significantly increases the capacity requirements of the TSM database. This section provides some guidelines for estimating the capacity requirements of the database. It is important to plan ahead for the database capacity so that an adequate amount of higher-performing disk can be reserved for the database (refer to the next section for performance requirements). The estimation guidelines are approximate, since actual requirements will depend on many factors, including ones that cannot be predicted ahead of time (for example, a change in the data backup rate, the exact amount of backup data, and other factors).

Planning database space requirements

The use of deduplication in TSM requires more storage space in the TSM server database than without deduplication. One important point to note is that when using deduplication, the TSM database grows proportionally to the amount of data that is stored in deduplicated storage pools. This is because each chunk of data that is stored in a deduplicated storage pool is referenced by an entry in the database. Without deduplication, each backed-up object (typically a file) is referenced by a database entry, and the database grows proportionally to the number of objects that are stored. With deduplication, the database grows proportionally to the total amount of data backed up. The following table provides an example to illustrate this point:

| | Number of objects stored | Amount of data being managed | Database storage requirement |
| Without deduplication | 500 million | 200 TB | 475 GB * |
| With deduplication | 500 million | 200 TB | 2000 GB ** |

* Using a rule-of-thumb of 1KB of database space per object stored
** Using a rule-of-thumb of 100GB of database space per 10TB of data managed

The document Determining the impact of deduplication on TSM server database and storage pools provides detailed information for estimating the amount of disk storage that will be required for your TSM database. The document provides formulas for estimating database size based on the volume of data to be stored. As a simplified rule-of-thumb for a rough estimate, you can plan for 100GB of database storage for every 10TB of data that will be protected in deduplicated storage pools.

Database reorganization

The TSM server uses a process called reorganization to remove fragmentation that can accumulate in the database over time. When deduplication is used, the number of database records increases significantly in order to store information about data chunks, and as data is expired on the TSM server, significant amounts of deletion occur within the database, increasing the need for reorganization. Reorganization can be processed on-line while the TSM server is running, or off-line while the server is halted. Depending on your server workloads, you might need to disable both table and index reorganization to maintain server stability and to reliably complete daily server activities. With reorganization disabled, if you experience unacceptable database growth or server performance degradation, you will need to plan offline reorganization for those tables. The TSM database sizing guidelines given in the previous section include additional space to accommodate database fragmentation that can grow the database used space between reorganizations. For additional information on best practices related to reorganization and data deduplication, see the technote titled Database size, database reorganization, and performance considerations for Tivoli Storage Manager V6 and V7 servers.

TSM database log size estimation

The use of deduplication adds additional requirements for the TSM server database, active log, and archive log storage. Properly sizing the storage capacity for these components is essential for a successful implementation of deduplication.

Planning active log space requirements

The database active log stores information about database transactions that are in progress. With deduplication, transactions can run longer, requiring more space to store the active transactions.

Tip: Use the maximum allowed size for the active log, which is 128GB.
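As an illustrative sketch only, the log sizing guidance above corresponds to server options along the following lines in dsmserv.opt. The directory paths are placeholder values, and ACTIVELOGSIZE is specified in MB, so 131072 corresponds to the 128GB maximum:

ACTIVELOGSIZE      131072
ACTIVELOGDIRECTORY /tsm/activelog
ARCHLOGDIRECTORY   /tsm/archlog

The file system backing the archive log directory should be sized as described in the next section, since its space is freed only by full database backups.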

Planning archive log space requirements

The archive log stores older log files for completed transactions until they are cleaned up as part of the TSM server database backup processing. The file system holding the archive log must be given sufficient capacity to avoid running out of space, which can cause the TSM server to be halted. Space is freed in the archive log every time a full backup is performed of the TSM server's database. See the document on Sizing the TSM archive log for detailed information on how to carefully calculate the space requirements for the TSM server archive log.

Tip: A file system with 500GB of free space has proven to be more than adequate for a large-scale TSM server that ingests several terabytes a day of new data into deduplicated storage pools and performs a full TSM database backup once a day. For servers which will ingest more than 4TB of new data each day, an archive log with 1TB of free space is recommended.

2.2 Estimating capacity for deduplicated storage pools

TSM deduplication ratios typically range from 2:1 (50% reduction) to 15:1 (93% reduction), and are data dependent. Lower ratios are associated with backups of unique data (such as progressive incremental data), and higher ratios are associated with backups that are repeated, such as repeated full backups of databases or virtual machine images. Mixtures of unique and repeated data will result in ratios within that range. If you aren't sure what type of data you have and how well it will reduce, use 3:1 for planning purposes when comparing with non-deduplicated TSM storage pool occupancy. This ratio corresponds to an overall data reduction ratio of over 15:1 when factoring in the data reduction benefits of progressive incremental backups.

Estimating storage pool capacity requirements

Delayed release of storage pool data

Due to the latency for deletion of data chunks with multiple references, there is a need for transient storage associated with data chunks that must remain in a storage pool volume even though their associated file or object has been deleted or expired. As a result of this behavior, storage pool capacity sizing must account for some percentage of data that is retained because of references by other objects. This latency can result in the delayed deletion of storage pool volumes.

Delayed effect of post-identification processing

Storage reduction does not always occur immediately with TSM deduplication. In the case of server-side deduplication, sufficient storage pool capacity is required to ingest the full amount of daily backup data. With server-side deduplication, removal of redundant data does not occur until after storage pool reclamation completes, which in turn may not complete until after a storage pool backup is done. If client-side deduplication is used, this delay does not apply. Sufficient storage pool free capacity must be maintained to accommodate continued backup ingestion.

Estimating storage pool capacity requirements

You can roughly estimate storage pool capacity requirements for a deduplicated storage pool using the following technique:

- Estimate the base size of the source data.
- Estimate the daily backup size, using an estimated change and growth rate.

- Determine retention requirements.
- Estimate the total amount of source data by factoring in the base size, daily backup size, and retention requirements.
- Apply the deduplication ratio factor.
- Uplift the estimate to account for transient storage pool usage.

The following example illustrates the estimation method:

| Parameter | Value | Notes |
| Base size of the source data | 40TB | Data from all clients that will be backed up to the deduplicated storage pool |
| Estimated daily change rate | 2% | Includes new and changed data |
| Retention requirement | 30 days | |
| Estimated deduplication ratio | 3:1 | 3:1 assumes compression is used with client-side deduplication |
| Uplift for transient storage pool volumes | 30% | |

Computed values:

| Parameter | Computation | Result |
| Base source data | 40TB | 40TB |
| Estimated daily backup amount | 40TB * 0.02 change rate | 0.8TB |
| Total changed data retained | 30 * 0.8TB daily backup | 24TB |
| Total data retained | 40TB base data + 24TB retained | 64TB |
| Retained data after deduplication (3:1 ratio) | 64TB / 3 | 21.3TB |
| Uplift for delays in chunk deletion (30%) | 21.3TB * 1.3 | 27.69TB |
| Add full daily backup amount | 27.69TB + 0.8TB | 28.49TB |
| Round up: storage pool capacity requirement | | 29TB |

2.3 Hardware recommendations and requirements

The use of deduplication requires additional processing, which increases the TSM server hardware requirements beyond what is required without the use of deduplication. The most critical hardware requirement when using deduplication is the I/O capability of the disk system that is used for the TSM database. You should begin by understanding the base hardware recommendations for the TSM server, which are described in the following documents: AIX, HPUX, Linux x86, Linux on Power, Linux on System z, Solaris, Windows. The most detailed and comprehensive guidance for hardware requirements for a TSM server with deduplication is provided in the TSM Blueprints documentation.

Additional hardware recommendations are made in the TSM Version 6 deployment guide (TSM V6 Deployment Recommendations). The Optimizing Performance guide also provides configuration best practices for the use of deduplication.

Database I/O requirements

For optimal performance, fast disk storage is always recommended for the TSM database, as measured in terms of Input/Output Operations Per Second (IOPS). Due to the random-access I/O patterns of the TSM database, minimizing the latency of operations that access the database volumes is critical for optimizing the performance of the TSM server. The large tables used for storing deduplication information in the TSM database bring about an even more significant demand for disk storage that can handle a large number of IOPS. In general, systems based on solid-state disk technology and SAS/FC provide the best capabilities in terms of increased IOPS. Because the claims of disk manufacturers are not always reliable, we recommend measuring the actual IOPS of a disk system before implementing a new TSM database.

Details about how to configure high-performing disk storage are beyond the scope of this document. The following key points should be considered when configuring disk storage for the TSM database:

- The disk used for the TSM database should be configured according to best practices for a transactional database.
- Low-latency, high-performance disk devices or storage subsystems should be used for the TSM database storage volumes and the active log. Slower disk technology is acceptable for the archive log.
- Disk devices or storage systems capable of a minimum of approximately 3000 IOPS are suggested for the TSM database disk. An additional 1000 IOPS per TB of daily ingested data (pre-deduplication) should be considered. Lower-performing disk devices can be used, but performance may not be optimal. Refer to the Deduplication FAQs for an example configuration.
- Disk I/O should be distributed over as many disk devices and controllers as possible.
- The TSM database and logs should be configured on separate disk volumes (LUNs), and should not share disk volumes with the TSM storage pool or any other application or file system.

Using flash storage for the TSM database

Lab testing has demonstrated a significant benefit to deduplication and node replication scalability when using flash storage for the TSM database. There are many choices available when moving to flash technology. Large-ingest deduplication testing has been performed with the following classes of flash-based storage:

- Flash acceleration using in-server PCIe adapters. For example, the High IOPS MLC (Multi Level Cell) and Enterprise Value Flash adapters for IBM System x servers, available in capacities from 365 GB to 3.2 TB. These adapters appear as block storage in the operating system, and provide persistent, low-latency storage.
- Solid-state drive modules (SSDs) as part of a disk array. For example, the SSD options available with the IBM Storwize family of disk arrays are currently available with capacities of 400GB and 800GB each and can be used to build arrays of larger capacity.

- Flash memory appliances, which provide a solution where flash storage can be shared across more than one TSM server. For example, the IBM FlashSystem family of products is currently available in sizes ranging from 5 TB to 20 TB.

The following are some general guidelines to consider when implementing the TSM database using solid-state storage technologies:

- Solid-state storage provides the most significant benefit for the database containers and active log. Testing has demonstrated a substantial improvement from moving the active log to solid-state storage versus moving only the database containers. There is no substantial benefit to placing the archive log on solid-state storage.
- Although a costly design decision, testing has demonstrated a 5-10% improvement to daily ingest capabilities when using RAID10 for the database container arrays rather than RAID5.
- When using solid-state technology for the database, faster storage pool disk such as SAS 10K may be required to gain the full benefit of the faster database storage. This is particularly true when using server-side deduplication.
- Faster database access from using solid-state technology allows pushing parallelism to the limit with the tuning parameters for tasks such as backup sessions, identify duplicates processes, reclamation processes, and expire inventory resources.

CPU

The use of deduplication requires additional CPU resources on the TSM server, particularly for performing the task of duplicate identification. You should consider using a minimum of at least 8 (2.2GHz or equivalent) processor cores in any TSM server that is configured for deduplication. The following table provides CPU recommendations for different ranges of daily ingest.

| Daily ingest | Recommended CPU cores |
| Up to 4TB | 12 |
| 4TB to 8TB | 16 |
| 8TB to 30TB | |

Memory

For the highest performance of a large-scale TSM server using deduplication, additional memory is recommended. The memory is used to optimize the frequent lookup of deduplication chunk information stored in the TSM database. A minimum of 64GB of system memory should be considered for TSM servers using deduplication. If the retained capacity of backup data grows, the memory requirement may need to be as high as 192GB. It is beneficial to monitor memory utilization on a regular basis to determine if additional memory is required. The following table provides system memory guidance for different ranges of daily ingest.

| Daily ingest | Recommended system memory |
| Up to 4TB | 64GB |
| 4TB to 8TB | 128GB |
| 8TB to 30TB | 192GB |

Considerations for the storage pool disk

The speed of the disk technology used for the deduplicated storage pool also has significant implications for the overall performance of a deduplication solution. In general, using cheaper disk such as SATA is desirable for the storage pool to keep the overall cost down. To prevent the use of slower disk technology from impacting performance, it is important to distribute the storage pool I/O across a very large number of disks. This can be accomplished by:

1. Creating a large number of volumes within the storage array. Although the optimal number of volumes is dependent upon the environment, testing has shown that 32 volumes can provide an effective configuration.
2. Presenting all of these volumes as file systems to the TSM server's device class definition, so that I/O from activities such as backup ingest will be distributed across all of the volumes in parallel.
3. Pushing the parallelism of tasks such as duplicate identification and reclamation to the upper limits, using the options which control the number of processes used by the tasks to drive I/O across all of the disks in the storage pool. More information on this topic follows in later sections.

For systems that will handle very large daily ingests beyond 8TB per day, faster SAS or FC 10K disk is recommended for the storage pool. This is particularly true when using server-side deduplication, to accommodate the additional I/O required for identify duplicates and reclamation processing.

Hardware requirements for TSM client deduplication

Client-side deduplication (and compression, if used with deduplication) requires resources on the client system for processing. Prior to deciding to use client-side deduplication, you should verify that client systems have adequate resources available during the backup window to perform the deduplication processing. A suggested minimum CPU requirement is the equivalent of one 2.2GHz CPU core per backup process with client-side deduplication. As an example, a system with a single-socket, quad-core, 2.2GHz processor that is utilized 75% or less during the backup window would be a good candidate for client-side deduplication. One CPU core should also be planned for each parallel backup stream within a process, for client types that support this such as TSM for Virtual Environments. Testing has demonstrated a similar benefit in lowering CPU usage during client-side deduplication from adding CPU sockets compared to using more CPU cores per socket. There is no significant additional memory requirement for client systems that use client-side deduplication.

3 Implementation guidelines

A successful implementation of TSM deduplication requires careful planning in the following areas:

- Implementing an appropriate architecture suitable for using deduplication
- Properly sizing your TSM server hardware and storage
- Configuring TSM following best practices for separating data ingestion and data maintenance tasks

3.1 Deciding between client and server deduplication

After you decide on an architecture using deduplication for your TSM server, you need to decide whether you will perform deduplication on the TSM clients, on the TSM server, or using a combination of the two. The TSM deduplication implementation allows storage pools to manage deduplication performed by both clients and the TSM server. The server is optimized to perform deduplication only on data that has not been deduplicated by the TSM clients. Furthermore, duplicate data can be identified across objects regardless of whether the deduplication is performed on the client or server. These benefits allow for hybrid configurations that efficiently apply client-side deduplication to a subset of clients and use server-side deduplication for the remaining clients (a sketch of such a per-node assignment follows this list). Typically a combination of both client-side and server-side data deduplication is the most appropriate. Here are some further points to consider:

- Server-side deduplication is a two-step process of duplicate data identification followed by reclamation to remove the duplicate data. Client-side deduplication stores the data directly in a deduplicated format, eliminating the need for the extra reclamation processing.
- Deduplication on the client can be combined with compression to provide the largest possible storage savings.
- Client-side deduplication processing can increase backup durations. Expect increased backup durations if network bandwidth is not restrictive. A doubling of backup durations is a reasonable estimate when using client-side deduplication in an environment that is not constrained by the network. In addition, if you will be creating a secondary copy using storage pool backup where the copy storage pool is not using deduplication, the data movement will take longer due to the extra processing required to reconstruct the deduplicated data.
- Client-side deduplication can outperform server-side deduplication with a high-performing TSM server configuration and a low-latency network connection between the clients and server. In addition, when combining deduplication with node replication, client-side deduplication stores data on the TSM server in a deduplicated state that is ready for immediate replication, which takes advantage of the node replication ability to conserve bandwidth by not sending data chunks that have previously been replicated.
- Client-side deduplication can place a significant load on the TSM server in cases where a large number of clients are simultaneously driving deduplication processing. The load is a result of the TSM server processing duplicate chunk queries from the clients. Server-side deduplication, on the other hand, typically has a relatively small number of identification processes running in a controlled fashion.
- Client-side deduplication cannot be combined with LAN-free data movement using the Tivoli Storage Manager for SAN feature. If you are implementing one of TSM's supported LAN-free to disk solutions, then you can still consider using server-side deduplication.
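As a minimal sketch of such a hybrid assignment (the node names are placeholders, and the parameter accepts the values CLIENTORSERVER or SERVERONLY), the server controls on a per-node basis where deduplication is allowed to occur:

> update node FILESERVER01 deduplication=clientorserver
> update node DBSERVER01 deduplication=serveronly

A node set to clientorserver can deduplicate during backup when its option file enables deduplication, while a node set to serveronly always sends data in its original form and relies on the server's identify duplicates processing.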

Tips: Perform deduplication at the client in combination with compression in the following circumstances:

1. Your backup network speed is a bottleneck.
2. Increased backup durations can be tolerated, and the maximum storage savings is more important than having the fastest possible backup elapsed times.
3. V6 servers only: the client does not typically send objects larger than 500GB in size, or client configuration options can be used to break up large objects into smaller objects. These options are discussed in a later section.

3.2 TSM Deduplication configuration recommendations

NOTE: The TSM Blueprints and scripts should be considered for configuring a TSM server. The latest best practices for TSM deduplication are included in the Blueprints.

This section provides information on some of the specific configuration details that are recommended for TSM deduplication.

3.2.1 Recommendations for deduplicated storage pools

The TSM deduplication feature is turned on at the storage pool level. The TSM server can be configured with more than one deduplicated storage pool, but duplicate data will not be identified across different storage pools. In most cases, using a single large deduplicated storage pool is recommended. The following commands provide an example of setting up a deduplicated storage pool on the TSM server. Some parameters are explained in further detail to give the rationale behind the values used, and later sections build upon those settings.

Device class

A device class is used to define the storage that will be used for sequential file volumes by the deduplicated storage pool. Each of the directories specified should be backed by a separate file system, which corresponds to a distinct logical volume on the disk storage subsystem. By using multiple directories backed by different storage elements on the subsystem, the TSM round-robin implementation for volume allocation is able to achieve more throughput by spreading I/O across a large pool of physical disks. Here are some considerations for parameters of the DEFINE DEVCLASS command:

- The mountlimit parameter limits the number of volumes that can be simultaneously mounted by all storage pools that use this device class. Typically client sessions sending data to the server use the most mount points, so set this parameter high enough to handle the expected number of simultaneous client sessions. This parameter needs to be set very high for deduplicated storage pools to avoid having client sessions and server processes wait for available mount points. The setting is influenced by the numopenvolsallowed option, which is discussed in a later section. To estimate the setting of this option, use the following formula, where numprocs is the largest number of processes used for a data copy/movement task such as reclamation or migration:

  mountlimit = (numprocs * numopenvolsallowed) + max_backup_sessions
               + (restore_sessions * numopenvolsallowed) + buffer

- The maxcapacity parameter controls the size of each file volume that will be created for your storage pool. This parameter takes some planning. The goal is to avoid too small a volume size, which results in frequent end-of-volume processing and spanning of larger objects across multiple volumes, and also to avoid volume sizes that are too large, to ensure that enough writeable volumes

are available to handle your expected number of client backup sessions. The following example shows a volume size of 50GB, which has proven to be optimal in many environments.

> define devclass dedupfile devtype=file mountlimit=4000 maxcapacity=50g directory=/tsmdedup1,/tsmdedup2,/tsmdedup3,/tsmdedup4,...,/tsmdedup32

Storage pools

The storage pool is the repository for deduplicated storage and uses the device class previously defined. An example command for defining a deduplicated storage pool is given below, with explanations for parameters that vary from the defaults. There are two methods for allocating volumes in a file-based storage pool. With the first method, volumes are pre-allocated and remain assigned to the same storage pool after they are reclaimed. The second method uses scratch volumes, which are allocated as needed and return to the scratch pool once they are reclaimed. The examples below set up a storage pool using scratch volumes, as this approach is more convenient and has shown in testing to distribute the load more efficiently across multiple storage containers within a disk subsystem.

- The deduplicate parameter is required to enable deduplication for the storage pool.
- The maxscratch parameter defines the maximum number of volumes that can be created for the storage pool. This parameter is used with the scratch method of volume allocation, and should otherwise be set to a value of 0 when using pre-allocated volumes. Each volume will have a size determined by the maxcapacity parameter of the device class. In our example, 200 volumes multiplied by 50GB per volume requires that 10TB of free space be available across the 32 file systems used by the device class.
- The identifyprocess parameter is set to 0 to prevent duplicate identification processes from starting automatically. This supports scheduling when duplicate identification runs, which is described in more detail in a later section.
- The reclaim parameter is set to 100 to prevent automatic storage pool reclamation from running. This supports the best practice of scheduling when reclamation runs, which is described in more detail in a later section. The actual threshold used for reclamation is defined as part of the scheduled reclamation command, which is defined in a later section.
- The reclaimprocess parameter is set to a value higher than the default of 1, since a deduplicated storage pool requires a large volume of reclamation processing to keep up with the daily ingestion of new backups. As a rule of thumb, allow one process for every file system defined to the device class. The example value of 32 is likely to be sufficient for large-scale implementations, but you may need to tune this setting after monitoring system usage during reclamation.

> define stgpool deduppool dedupfile maxscratch=200 deduplicate=yes identifyprocess=0 reclaim=100 reclaimprocess=32

Policy settings

The final configuration step involves defining policy settings on the TSM server that allow data to ingest directly into the deduplicated storage pool that has been created. Policy requirements vary for each customer, but the following example shows policy that retains extra backup versions for 30 days.

> define domain DEDUPDISK
> define policyset DEDUPDISK POLICY1
> define mgmtclass DEDUPDISK POLICY1 STANDARD
> assign defmgmtclass DEDUPDISK POLICY1 STANDARD
> define copygroup DEDUPDISK POLICY1 STANDARD type=backup destination=deduppool VEREXISTS=nolimit VERDELETED=10 RETEXTRA=30 RETONLY=80

> define copygroup DEDUPDISK POLICY1 STANDARD type=archive destination=deduppool RETVER=30
> activate policyset DEDUPDISK POLICY1
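The definitions above can be sanity-checked before any client data is sent. The following queries are a minimal sketch using the object names from the examples; the items to verify are the device class directory list, the storage pool deduplicate setting, and the copy group destination:

> query devclass dedupfile format=detailed
> query stgpool deduppool format=detailed
> query copygroup DEDUPDISK POLICY1 STANDARD type=backup format=detailed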

3.2.2 Recommended options for deduplication

The server has several tuning options that control deduplication processing. The following list summarizes these options and provides an explanation for those options for which we recommend overriding the default values.

Option: DedupRequiresBackup
  Allowed values: Yes, No (default: Yes)
  Recommended value: Depends on how a secondary copy is created; see the section following this list.
  Explanation: This option delays the completion of server-side deduplication processing until after a secondary copy of the data has been made with storage pool backup. This option does not influence whether client-side deduplication is performed. The TSM server offers many levels of protection, including the ability to create a secondary copy of your data. Creating a secondary copy is optional, but is always a best practice for any storage pool regardless of whether it is deduplicated.

Option: ClientDedupTxnLimit
  Allowed values: Min: 32, Max: 2048 (default: 300)
  Recommended value: Default
  Explanation: Specifies the largest object size, in gigabytes, that can be processed using client-side deduplication. This can be increased up to 2TB, but this does not guarantee that the TSM server will be able to process objects up to this size in all environments.

Option: ServerDedupTxnLimit
  Allowed values: Min: 32, Max: 2048 (default: 300)
  Recommended value: Default
  Explanation: Specifies the largest object size, in gigabytes, that can be processed using server-side deduplication. This can be increased up to 2TB, but this does not guarantee that the TSM server will be able to process objects up to this size in all environments.

Option: DedupTier2FileSize
  Allowed values: Min: 20, Max: 9999 (default: 100)
  Recommended value: Default
  Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, as changes will prevent matches between previously ingested backups and future backups.

Option: DedupTier3FileSize
  Allowed values: Min: 90, Max: 9999 (default: 400)
  Recommended value: Default
  Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, as changes will prevent matches between previously ingested backups and future backups.

Option: NumOpenVolsAllowed
  Allowed values: Min: 3, Max: 999
  Recommended value: A small increase over the default
  Explanation: This option controls the number of volumes that a process such as reclamation, or a client session, can hold open at the same time. A small increase to this option is recommended, and some trial and error may be needed. Note: The device class mountlimit parameter may need to be increased if this option is increased.

Option: EnableNasDedup
  Allowed values: Yes, No (default: No)
  Recommended value: Default
  Explanation: If you are using NDMP backup of NetApp file servers in your environment, change this option to Yes.

Additional information regarding the deduprequiresbackup option

Backup of the TSM primary storage pool is optional, as determined by environment-specific risk mitigation requirements. The list which follows summarizes the appropriate value of the deduprequiresbackup option for different situations. In the case of a non-deduplicated copy storage pool, the storage pool backup should be performed prior to running the reclamation processing. If storage pool backup is performed after the reclamation processing (or with client-side deduplication), the copy process will take longer, since it requires the deduplicated data to be reassembled into full objects.

For each architecture for the secondary copy of backup data, the appropriate setting for deduprequiresbackup is:

- A secondary copy is created using storage pool backup to a non-deduplicated copy pool, such as a copy pool using tape: Yes
- A secondary copy is created using storage pool backup to a deduplicated copy pool: No
- No secondary copy is created: No
- A secondary copy is created on another TSM server using the node replication feature: No
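One way to apply these overrides is to place the options in the server options file (dsmserv.opt), where they take effect at server restart. The following is a minimal sketch only; the values shown assume a deduplicated copy pool, a modest NUMOPENVOLSALLOWED increase to 20, and NDMP backups of NetApp filers, none of which are blanket recommendations:

* dsmserv.opt sketch (illustrative values)
* Secondary copy goes to a deduplicated pool in this example
DEDUPREQUIRESBACKUP   NO
* Small increase over the default; tune by trial and error
NUMOPENVOLSALLOWED    20
* Only if NDMP backups of NetApp filers should be deduplicated
ENABLENASDEDUP        YES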

3.2.3 Best practices for ordering backup ingestion and data maintenance tasks

A successful implementation of deduplication with TSM requires separating the tasks of ingesting client data and performing server data maintenance into separate time windows. Furthermore, the server data maintenance tasks have an optimal ordering, and in some cases need to be performed without overlap to avoid resource contention problems. TSM has the ability to schedule all of these activities to follow these best practices.

There is a variation on the recommended ordering when storage pool backup is used in combination with server-side deduplication: duplicate identification is delayed to allow the fastest possible throughput for backup ingestion on systems that are I/O constrained and cannot handle overlapping backup ingest with duplicate identification. This alternate variation can also be followed any time server-side deduplication is used and the fastest possible backup ingestion is desired. A later section provides the sample commands to implement the ordering with delayed duplicate identification through command scripts and scheduling.

Two suggested task sequences, A and B, are described. Refer to the table below to determine the preferred task sequence.

Type of deduplication used   Node replication used?   Storage pool backup to a non-deduplicated copy pool?   Suggested task sequence
Client side                  Either Yes or No         Either Yes or No                                        A
Server side                  No                       No                                                      A
Server side                  Yes                      No                                                      A
Server side                  No                       Yes                                                     B, if fastest possible backup ingest is required

Please note that the list focuses on those tasks pertinent to deduplication. Please consult the product documentation for additional commands which you may also need to include in the daily maintenance tasks.

Suggested task sequence A

1. The following tasks can run in parallel:
   a. Client data ingestion.
   b. Perform server-side duplicate identification by running the IDENTIFY DUPLICATES command. This processes data that was not already deduplicated on the clients.
2. Optional: Create the secondary disaster recovery (DR) copy using the REPLICATE NODE command or the BACKUP STGPOOL command.
3. Create a DR copy of the TSM database by running the BACKUP DATABASE command. Following the completion of the database backup, the DELETE VOLHISTORY command can be used to remove older versions of database backups which are no longer required.
4. Remove objects that have exceeded their allowed retention using the EXPIRE INVENTORY command.
5. Reclaim unused space from storage pool volumes that has been released through deduplication and inventory expiration using the RECLAIM STGPOOL command.
6. Back up the volume history and device configuration using the BACKUP VOLHISTORY and BACKUP DEVCONFIG commands.

Suggested task sequence B

1. Client data ingestion.
2. Create the secondary disaster recovery (DR) copy using the BACKUP STGPOOL command.
3. Create a DR copy of the TSM database by running the BACKUP DATABASE command. Following the completion of the database backup, the DELETE VOLHISTORY command can be used to remove older versions of database backups which are no longer required.
4. Perform server-side duplicate identification by running the IDENTIFY DUPLICATES command. This processes data that was not already deduplicated on the clients.
5. Remove objects that have exceeded their allowed retention using the EXPIRE INVENTORY command.
6. Reclaim unused space from storage pool volumes that has been released through deduplication and inventory expiration using the RECLAIM STGPOOL command.
7. Back up the volume history and device configuration using the BACKUP VOLHISTORY and BACKUP DEVCONFIG commands.

Define scripts that run each required maintenance task

The following scripts, once defined, can be called by scheduled administrative commands. Here are a few points to note regarding these scripts:

- The storage pool backup script assumes you have already defined a copy storage pool named copypool, which uses tape storage. NOTE: Storage pool backup is optional, as determined by environment-specific risk mitigation requirements.
- The database backup script requires a device class that typically also uses tape storage.
- The script for reclamation gives an example of how the parallel command can be used to simultaneously process more than one storage pool.
- The number of processes used for identifying duplicates should not exceed the number of CPU cores available on your TSM server. This command also does not have a wait=yes parameter, so it is necessary to define a duration limit.
- If you have a large TSM database, you can further optimize the BACKUP DATABASE command by using multiple streams with TSM 6.3 and later.
- A deduplicated storage pool is typically reclaimed to a threshold lower than the default of 60, to allow more of the identified duplicate chunks to be removed. Some experimenting will be needed to find a value that allows reclamation to complete within the available time. Tip: A reclamation setting of 40 or less is usually sufficient.

define script STGBACKUP "/* Run stg pool backups */"
update script STGBACKUP "backup stgpool DEDUPPOOL copypool maxprocess=10 wait=yes" line=020

define script DEDUP "/* Run identify duplicate processes */"
update script DEDUP "identify duplicates DEDUPPOOL numprocess=12 duration=660" line=010

set dbrecovery TAPEDEVC numstreams=3
define script DBBACKUP "/* Run DB backups */"

update script DBBACKUP "backup db devclass=tapedevc type=full numstreams=3 wait=yes" line=010
update script DBBACKUP "if(error) goto done" line=020
update script DBBACKUP "backup volhistory" line=030
update script DBBACKUP "backup devconfig" line=040
update script DBBACKUP "delete volhistory type=dbbackup todate=today-7 totime=now" line=050
update script DBBACKUP "done:exit" line=060

define script RECLAIM "/* Run stg pool reclamation */"
update script RECLAIM "parallel" line=010
update script RECLAIM "reclaim stgpool DEDUPPOOL threshold=40 wait=yes" line=020
update script RECLAIM "reclaim stgpool COPYPOOL threshold=60 wait=yes" line=030

define script EXPIRE "/* Run expiration processes. */"
update script EXPIRE "expire inventory resources=8 wait=yes" line=010

Define schedules to run the data maintenance tasks

The TSM server has the ability to schedule commands to run, where the scheduled action is to run the various scripts defined in the previous section. The examples below give specific start times that have proven to be successful in environments where backups run from midnight until 07:00 AM on the same day. You will need to change the start times to appropriate values for your environment. NOTE: Storage pool backup is optional, as determined by environment-specific risk mitigation requirements.

define schedule STGBACKUP type=admin cmd="run STGBACKUP" active=yes \
  desc="Run all stg pool backups." startdate=today starttime=08:00:00 \
  duration=15 durunits=minutes period=1 perunits=day

define schedule DEDUP type=admin cmd="run DEDUP" active=yes \
  desc="Run identify duplicates." startdate=today starttime=00:00:00 \
  duration=15 durunits=minutes period=1 perunits=day

define schedule EXPIRATION type=admin cmd="run expire" active=yes \
  desc="Run expiration." startdate=today starttime=14:00:00 \
  duration=15 durunits=minutes period=1 perunits=day

define schedule DBBACKUP type=admin cmd="run DBBACKUP" active=yes \
  desc="Run database backup." startdate=today starttime=12:00:00 \
  duration=15 durunits=minutes period=1 perunits=day

define schedule RECLAIM type=admin cmd="run RECLAIM" active=yes \
  desc="Reclaim space from storage pools." startdate=today starttime=16:00 \
  duration=15 durunits=minutes period=1 perunits=day
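Before relying on the schedules, each script can be run once interactively and monitored, which also confirms that the referenced storage pools and device classes exist. A minimal check, using only commands already shown above plus QUERY PROCESS:

> run DBBACKUP
> run RECLAIM
> query process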

4 Estimating deduplication savings

If you ask someone in the data deduplication business to give you an estimate of the amount of savings to expect for your specific data, the answer will often be "it depends." The reality is that TSM, like every other data protection product, cannot guarantee a certain level of deduplication, because there are a variety of factors unique to your data that influence the results. Since deduplication requires computational resources, it is important to consider which environments and circumstances can benefit most from deduplication, and when other data reduction techniques may be more appropriate. What we can do is provide an understanding of the factors that influence deduplication effectiveness when using TSM, and provide some examples of observed behaviors for specific types of data, which can be used as a reference for planning purposes.

4.1 Factors that influence the effectiveness of deduplication

The following factors have an influence on how effectively TSM reduces the amount of data to be stored using deduplication.

4.1.1 Characteristics of the data

Uniqueness of the data

The first factor to consider is the uniqueness of the data. Much of the deduplication savings come from repeated backups of the same objects. Some savings, however, result from having data in common with backups of other objects, or even within the same object. The uniqueness of the data is the portion of an object that has never been stored by a previous backup. Duplicate data can be found within the same object, across different objects stored by the same client, and in objects stored by different clients.

Response to fingerprinting

The next factor is how data responds to the deduplication fingerprinting processing used by TSM. During deduplication, TSM breaks objects into chunks, which are examined to determine whether they have been previously stored. These chunks are variable in size and are identified using a process called fingerprinting. The purpose of fingerprinting is to ensure that the same chunk will always be identified regardless of whether it shifts to different positions within the object between successive backups. The TSM fingerprinting implementation uses a probability-based algorithm for identifying chunk boundaries within an object. The algorithm strives to have all of the chunks created for an object average out, in terms of size, to a target average for all chunks. The actual size of each chunk is variable, within the constraints that it must be larger than the minimum chunk size and cannot be larger than the object itself. The fingerprinting implementation results in average chunk sizes that vary for different kinds of data. For data that fingerprints to average chunk sizes significantly larger than the target average, the deduplication efficiency is more sensitive to changes. More details are given in the later section that discusses tiering.

Volatility of the data

The final factor is the volatility of the data. A significant amount of deduplication savings is a result of the fact that similar objects are backed up repeatedly over time. Objects that undergo only minor changes between backups will end up having a significant percentage of chunks that are unchanged since the last backup and hence do not need to be stored again. Likewise, an object can undergo a pattern of change that alters a large percentage of the chunks in the object. In these cases, there is very little savings realized by deduplication.
It is important to note that this effect does not necessarily relate to the amount of data being written to an object. Instead, it is a factor of how pervasively the changes are scattered throughout the

object. Some change patterns, such as appending new data at the end of an object, have a very favorable response with deduplication.

Examples of workloads that respond well to deduplication

The following are general examples of backup workloads that respond well to deduplication:

- Backup of workstations with multiple copies or versions of the same file.
- Backup of objects with regions that repeat the same chunks of data (for example, regions with zeros).
- Multiple full backups of different versions of the same database.
- Operating system files across multiple systems. For example, Windows system state backup is a common source of duplicate data. Another example is virtual machine image backups with TSM for Virtual Environments.
- Backup of workstations with versions or copies of the same application data (for example, documents, presentations, or images).
- Periodic full backups taken of systems using a new node name for the purpose of creating an out-of-cycle backup with special retention criteria.

Deduplication efficiency of some data types

The following table shows some common data types along with their expected deduplication efficiency.

Data type                                                               Deduplication efficiency
Audio (mp3, wma), video (mp4), images (jpeg)                            Poor
Human generated/consumer data: text documents, source code             Good
Office documents: spreadsheets, presentations                           Poor
Common operating system files                                           Good
Large repeated backups of databases (Oracle, DB2, etc.)                 Good
Objects with embedded control structures                                Poor
TSM data stored in non-native storage pools (for example, NDMP data)    None

4.1.2 Impacts from backup strategy decisions

The gains realized from deduplication are also influenced by two different implementation choices in how backups are taken and managed.

Backup model

For TSM, a very common backup model is the use of incremental-forever backups. In this case, each subsequent backup achieves significant storage savings by not having to send unchanged objects. These objects that are not re-sent also do not need to go through deduplication processing, which turns out to be a very efficient method of reducing data. On the other hand, other data types use a backup model that always runs a full backup, or a periodic full backup. In these cases, there will typically be significant reductions in the data to be stored, which is a result of the significant duplication across subsequent backups of similar objects. The following summarizes some examples of deduplication savings for full and incremental backup models.

Does deduplication offer savings when file-level backups are taken using the backup-archive client?
- Full backup: Yes, when there is data in common from other nodes, such as operating system files, or when periodic full backups are taken for a system (this is occasionally performed using a different node name for the purpose of establishing a different retention scheme).
- Incremental backup: Yes, for files that are being re-sent due to changes (depends on volatility). No, for new files that are being sent for the first time (depends on uniqueness).

Does deduplication offer savings when database backups are taken using a data protection client?
- Full backup: Yes, when subsequent full backups are taken (depends on volatility). No, when the first backup is taken; databases are typically unique.
- Incremental backup: Typically no. The database incremental mechanism only sends changed regions of the object, which typically have not been stored before.

Does deduplication offer savings when virtual machine backups are taken using the Data Protection for VMware product?
- Yes. VMware full backups often experience savings with matches from the backups of other virtual machines, as well as from regions of the same virtual disk that are in common.

Retention settings

In general, the more versions you set TSM policy to retain, the more savings you will realize from TSM deduplication as a percentage of the total you would have needed to store without deduplication. Users who desire to retain more versions of objects in TSM storage find this to be more cost effective when using deduplication. Consider the example below, which shows the accumulated storage used over a series of backups using the Data Protection for Oracle product. You can see that ten backup versions are stored with deduplication using less capacity than three backup versions require without deduplication.

[Chart: accumulated storage used across a series of Data Protection for Oracle backups, comparing capacity with and without deduplication]

4.2 Effectiveness of deduplication combined with progressive incremental backup

The progressive incremental backup technology in TSM provides a very effective method of reducing the amount of data processed in each backup. This technology can also be effectively combined with deduplication. When used in combination, data is initially reduced by the incremental processing, which is able to skip unchanged objects without applying deduplication processing against them. For those objects which do require a backup, deduplication is applied.

A very typical pattern seen with incremental backup is the existence of certain files which change continuously. For these objects, incremental backup is not able to provide any savings, as they always require backup. This provides a significant source of reduction for deduplication since, although these objects change frequently, the changes can be minimal in terms of volatility from a deduplication perspective. The following example shows how this works for one common file type which changes continuously. The following chart shows a Lotus Notes mail replica file which undergoes a series of ten daily backups. In this case, deduplication is able to provide a cumulative savings of 78% after the series of backups.

4.3 Interaction of compression and deduplication

The TSM client provides the ability to compress data, with the potential to provide additional storage savings by combining both compression and deduplication. With TSM deduplication, you will need to decide whether to perform deduplication at the client, at the server, or in some combination. This section will guide you through the analysis that should happen in making that decision, taking into consideration the fact that combining deduplication and compression is only possible on the clients.

4.3.1 How deduplication and compression interact with TSM

In general, deduplication technologies are not very effective when applied to data that has previously been compressed. However, by compressing data after it is already deduplicated, additional savings can be gained. When deduplication and compression are both performed by the TSM client, the operations are sequenced in the desirable order of first applying deduplication, followed by compression. The following list summarizes key points of the TSM implementation, which will help explain other information to follow in this section:

- The TSM client can perform deduplication combined with compression.
- The TSM server can perform deduplication, but cannot perform compression.
- If data is compressed before it is passed to the TSM client, it is not possible to perform deduplication prior to compression. For example, certain databases provide the ability to compress a backup stream before passing the stream to a Tivoli Storage Manager Data Protection client. In these cases, the data will be compressed before TSM performs deduplication.
- The most significant reduction in data size is typically a result of performing the combination of client-side deduplication and compression. The additional savings provided by compression will vary depending on how well the specific data responds to the TSM client compression mechanism.

4.3.2 Considerations related to compression when choosing between client-side and server-side deduplication

Typically, the decision of whether to use data reduction technologies on the TSM client depends on your backup window requirements, and on whether your environment is network-constrained. With constrained networks, using data reduction technologies on the client may actually improve backup elapsed times. Without a constrained network, the use of client-side data reduction technologies will typically result in longer backup elapsed times. The following questions are important to consider when choosing whether to implement client-side data reduction technologies:

1. Is the speed of your backup network limiting backup elapsed times?
2. What is more important to your business: the amount of storage savings you achieve through data reduction technologies, or how quickly backups complete?

If the answer to the first question is yes, using data reduction technologies on the client may result in both faster backups and increased storage savings on the TSM server. More often, the answer to this question is no, in which case you need to weigh the trade-offs between having the fastest possible backup elapsed times and gaining the maximum amount of storage pool savings.
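Before looking at the measured comparison discussed next, here is a minimal sketch of the client options file settings that produce the client-side deduplication plus compression combination described above. The option names are standard backup-archive client options; nothing else about the environment is implied:

* dsm.opt sketch: client-side deduplication followed by client compression
DEDUPLICATION   YES
COMPRESSION     YES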

The graph above shows a 20GB object going through a series of ten backups. For each of the ten backups, the object in the same state was run through different data reduction mechanisms in TSM to allow a comparison of the behavior of each. The table summarizes the cumulative totals stored and saved for each of the techniques, along with elapsed times in some cases. The following observations can be made from these results:

- The most significant storage savings of 86% is seen with the combination of client-side deduplication and compression. There is a cost of a 1.5 times increase in the backup elapsed time versus a backup with no client-side data reduction.
- The addition of compression provides an additional 11% savings beyond the 75% that is possible using deduplication alone.

- With compression alone, there is a savings of 47%. This is a fairly typical savings seen with TSM compression.
- With deduplication alone (either client-side or server-side), there is a savings of 75%.
- There was no savings for the first backup with deduplication alone. This is typical with unique objects such as databases. The additional savings seen on the initial backup is one area in which compression provides substantial savings beyond what deduplication provides.
- Applying server-side deduplication to data that is already compressed by the client results in a lower savings of 58%, compared with the 75% that can be achieved using server-side deduplication alone.

Caution: Your application may compress data before it is passed to the TSM client. This will result in similarly less-efficient deduplication savings. In these cases, it is best to either disable the application compression, or send this data to a storage pool that does not use deduplication.

The bottom line: For the fastest backups on a fast network, choose server-side deduplication. For the largest storage savings, choose client-side deduplication combined with compression. Avoid performing client compression in combination with server-side deduplication.

4.4 Understanding the TSM deduplication tiering implementation

The deduplication implementation in TSM uses a tiered model where larger objects are processed with larger average chunk sizes, with the goal of limiting the number of chunks that an object will be split into. The tiering model is used to avoid operational problems that arise when the TSM server needs to operate on objects consisting of very large numbers of chunks, and also to limit the growth of the TSM database. The use of larger average chunk sizes has the trade-off of limiting the amount of savings achieved by deduplication. The TSM server provides three different tiers that are used for different ranges of object sizes.

4.4.1 Controls for deduplication tiering

There are two options on the TSM server that control the object size thresholds at which objects are processed in tier2 or tier3. All objects with sizes smaller than the tier2 threshold are processed in tier1. By default, objects under 100GB in size are processed in tier1, objects in the range of 100GB to under 400GB are processed in tier2, and all objects 400GB and larger are processed in tier3.

Avoid making adjustments to the options controlling the deduplication tier thresholds. Changes to the thresholds after data has been stored can prevent newly stored data from matching data stored in previous backups, and can also cause operational problems if the changes cause larger objects to be processed in the lower tiers. Very large objects can be excluded from deduplication using the options clientdeduptxnlimit and serverdeduptxnlimit. The storage pool parameter maxsize can also be used to prevent large objects from being stored in a deduplicated storage pool.

Beginning with TSM version 7.1, a new feature has been added where the TSM server transparently segments large objects into fragments of 10GB. Each fragment is processed with deduplication independently, as a separate transaction, which avoids operational and performance problems previously experienced with large objects. The client is unaware of this fragmentation, and the server reports on the objects normally, as if they were one large object. This capability is available for both client-side deduplication and server-side deduplication, and is enabled by default.
The capability can be disabled selectively for specific nodes by updating a node's SPLITLARGEOBJECTS setting.
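A minimal sketch of that node-level control follows; the node name is an assumption, and the parameter name is taken from the setting referenced above:

> update node BIGFILESRV splitlargeobjects=no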

Option: DedupTier2FileSize
  Allowed values (GB): Min: 20, Max: 9999, Default: 100
  Implications of the default: Objects that are smaller than 100GB are processed in tier1. Objects 100GB and larger, up to the tier3 setting, are processed in tier2.

Option: DedupTier3FileSize
  Allowed values (GB): Min: 90, Max: 9999, Default: 400
  Implications of the default: Objects that are 400GB and larger are processed in tier3. Objects that are smaller than 400GB are processed in tier2, down to the tier2 threshold, below which they are processed in tier1.

4.4.2 The impact of tiering to deduplication storage reduction

The chart below gives an example of the impact that tiering has on deduplication savings. For the test below, the same DB2 database was processed through a series of ten sets of backups, with a varying change pattern applied after each set of backups. For each set of backups, the object in the same state was tested using the three different deduplication tiers, each being stored in its own storage pool. The table below gives the cumulative savings for each tier across the ten backups. The following observations can be made:

- Deduplication is always more effective at reducing data in the lower tiers.
- The amount of difference in data reduction between the tiers depends on how the objects change between backups. For data with low volatility, there is less impact to savings from tiering.
- As a general rule of thumb, you can estimate that there will be approximately a 17% loss of deduplication savings as you move through each tier.

4.4.3 Client controls that optimize deduplication efficiency

Controls are available on some TSM client types that prevent objects from becoming too large. This allows large objects to be processed as multiple smaller objects which fall into the tier1 range. There is not a method to accomplish this for every client type, but here are some strategies that have proven effective at keeping objects within the tier1 threshold:

- For Oracle database backups, use the RMAN MAXPIECESIZE option to prevent any individual object from crossing the tier2 size threshold. More recommendations on this topic follow in a later section.
- For Microsoft SQL database backups that use the legacy backup API, the database can be split across multiple streams. Each stream that is used results in a separate object being stored on the TSM server. A 200GB database, for example, can be backed up with four streams, which results in approximately four 50GB objects that will all fit within the default tier1 size threshold (see the sketch below).
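As a hedged illustration of the multi-stream approach for Microsoft SQL, the Data Protection for SQL legacy command-line interface accepts a stripes parameter; the database name and stripe count below are assumptions, and the exact invocation should be confirmed against the Data Protection for SQL documentation for your release:

tdpsqlc backup SALESDB full /stripes=4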

4.5 What kinds of savings can I expect for different application types

No specific guarantee of TSM deduplication data reduction can be made for specific application types. It is possible to construct an implementation of any of the applications discussed in this section with initial data and apply changes to that data in such a way that any deduplication system would show poor results. What we can do, and what is covered in this section, is provide some examples of how specific implementations of these applications that undergo reasonable patterns of change respond to TSM deduplication. This information can be considered a likely outcome of using TSM deduplication in your environment. More specific results for your environment can only be obtained by testing your real data with TSM over a period of time.

In the sections that follow, sample deduplication savings are given for specific applications, resulting from taking a series of backups with TSM. Each of these examples shows results from using deduplication only, so improved results are possible by combining deduplication and compression. Comparisons across the three different deduplication tiers are given, except for applications where using the higher tiers can be avoided. Client-side deduplication was used for all of the tests. There are tables in the following sections that include elapsed times. These are given so that you can make relative comparisons, and should not be considered indicators of the performance you will see. There are many factors that will influence actual backup elapsed times, including network performance.

4.5.1 IBM DB2


4.5.2 Microsoft Exchange

[Chart: cumulative data stored, in MB, across ten full VSS backups (b1 through b10) of a 122 GB Microsoft Exchange environment, comparing Tier 1 deduplication with no deduplication]

The test environment consisted of a Microsoft Exchange Server 2010 with five databases, each with a starting size of approximately 25GB and containing 305 users per database. All backups were full VSS backups performed using TSM Data Protection for Microsoft Exchange 6.4. Between backups, change activity was driven against the databases using the Microsoft Load Generator. The load profile was set to generate 83 tasks per user per day, consisting of receive, send, delete, and open/read activity.

4.5.3 Microsoft SQL

4.5.4 Oracle

Backups using the Data Protection for Oracle product can achieve similar deduplication storage savings with the proper configuration. The test results summarized in the following charts only give values for tier 1. The other tiers were not tested, because the RMAN MAXPIECESIZE option can be used to prevent objects from reaching sizes that require the higher tiers. The following RMAN settings are recommended when performing deduplicated backups of Oracle databases with TSM:

- Use the maxpiecesize RMAN parameter to keep the objects sent to TSM within the tier 1 size range. Oracle backups can be broken into multiple objects of a specified size. This allows databases of larger sizes to be processed safely with tier1 deduplication processing. The parameter must be set to a value that is less than the TSM server DedupTier2FileSize parameter (which defaults to 100GB).

  Recommended value: A maxpiecesize setting of 10GB provides a good balance between keeping each piece at an optimal size for handling by the TSM server and not producing too many resulting objects.

- Oracle RMAN provides the capability to multiplex the backups of database filesets across multiple channels. Using this feature will typically result in less effective TSM deduplication data reduction. Use the filesperset RMAN parameter to avoid splitting a fileset across multiple channels.

  Recommended value: A filesperset setting of 1 should be used for optimal deduplication data reduction.

Following is a sample RMAN script, which includes the recommended values for use with TSM deduplication:

run
{
  allocate channel ch1 type 'SBT_TAPE' maxopenfiles=1 maxpiecesize 10G
    parms 'ENV=(TDPO_OPTFILE=/home/orc11/tdpo_10g.opt)';
  backup filesperset 1 (tablespace tbsp_dd);
  release channel ch1;
}

4.5.5 VMware

VMware backup using TSM for Virtual Environments is one area that is commonly being deployed using TSM deduplication. VMware backups typically show very substantial savings when the combination of client-side deduplication and compression is used. The following factors contribute to the substantial savings that are seen:

- There is often significant data in common across virtual machines. Part of this is the result of the same operating systems being installed and cloned across multiple virtual machines.
- Although the initial full backup results in the highest reduction in data, subsequent incremental backups (with incremental forever) can still benefit from reduction from deduplication.
- Some duplicate data exists within the same virtual machine on the initial backup.

An example savings achieved with VMware backup using the combination of incremental forever, client-side deduplication, and client compression is 25:1.

4.5.6 SAP Backup using Tivoli Storage Manager for ERP

TSM deduplication can provide backup storage savings when backing up SAP with the TSM for ERP product. TSM for ERP supports SAP environments that use both the Oracle and DB2 databases. Some parameters may be specific to the database in use, but the general recommendations to improve deduplication results are similar. The following factors should be considered to improve backup deduplication results when using TSM for ERP:

- Disable multiplexing by setting MULTIPLEXING=1. Although multiplexing can improve backup performance, it may reduce the effectiveness of deduplication.
- Disable TSM for ERP compression by setting RL_COMPRESSION=NO. Performing compression on data prior to processing by TSM deduplication will significantly reduce the effectiveness of deduplication.
- If TSM server-side deduplication is being used, TSM client compression should be disabled by setting COMPRESSION=NO.

Additional guidance for backing up SAP with a DB2 database with deduplication can be found at the following link:
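The MULTIPLEXING and RL_COMPRESSION parameters named above are normally set in the TSM for ERP profile (the .utl file referenced by the SAP backup tools), while COMPRESSION is a TSM API client option set in the client options file. The snippet below is a sketch only; the profile file name and exact layout are assumptions and should be checked against the TSM for ERP documentation for your release:

# initSID.utl sketch (TSM for ERP profile; file name assumed)
MULTIPLEXING    1
RL_COMPRESSION  NO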

5 How to determine deduplication results

It is useful to evaluate the actual data reduction results from TSM deduplication to determine whether the expected storage savings have been achieved. In addition to evaluating the data reduction results, other key operational factors should be checked, such as database utilization, to ensure that they are consistent with expectations. Deduplication results can be determined through various queries to the TSM server from the administrative command line or the Operations Center interface.

It is important to recognize the dynamic nature of deduplication and that the benefits of deduplication are not always realized immediately after data is backed up. Also, since the scope of deduplication includes multiple backups across multiple hosts, it will take time to accumulate sufficient data in the TSM storage pool to be effective at eliminating duplicates. Therefore, it is important to sample results at regular intervals, such as weekly, to obtain a valid report of the results. In addition to checking data reduction results, TSM provides queries that can show pending activity for deduplication processing. These queries can be issued to determine an overall assessment of deduplication processing in the server. A script has been developed to assist administrators with monitoring of deduplication-related processing. The script source is provided in the appendix of this document.

5.1 Simple TSM Server Queries

5.1.1 QUERY STGPOOL

The QUERY STGPOOL command provides a basic and quick method for evaluating deduplication results. However, if the query is run before reclamation of the storage pool, the "Duplicate Data Not Stored" value will be inaccurate and will not reflect the most recent data reduction.

Example command:

> query stgpool format=detailed

Example output:

   Estimated Capacity: 9,848 G
   Space Trigger Util: 60.7
   Pct Util: 60.7
   Pct Migr: 60.7
   Pct Logical: 98.7
   <...>
   Deduplicate Data?: Yes
   Processes For Identifying Duplicates: 0
   Duplicate Data Not Stored: 28,387 G (87%)
   Auto-copy Mode: Client
   Contains Data Deduplicated by Client?: Yes

The displayed value of "Duplicate Data Not Stored" shows the actual reduction of data in megabytes or gigabytes, and the percentage reduction for the storage pool. If reclamation has not yet occurred, the following example shows the pending amount of data that will be removed. In this example, backuppool-file is the name of the deduplicating storage pool.

5.1.2 Other server queries affected by deduplication

QUERY OCCUPANCY

When a filespace is backed up to a deduplicated storage pool, the QUERY OCCUPANCY command shows the logical amount of storage per filespace. The physical space is displayed as 0.00, because this information cannot be determined on an individual filespace basis. An example is shown below:

Early versions of the TSM V6 server incorrectly maintained occupancy records in certain cases, which can result in an incorrect report of the amount of stored data. The following technote provides information on how to repair the occupancy information if necessary:

5.2 TSM client reports

When using client-side deduplication, the client summary report shows the data reduction associated with deduplication as well as compression. An example is shown here:

Total number of objects inspected:   380,194
Total number of objects backed up:   573
Total number of objects updated:     0
Total number of objects rebound:     0
Total number of objects deleted:     0
Total number of objects expired:     72
Total number of objects failed:      0
Total objects deduplicated:          324
Total number of bytes inspected:     1.19 TB
Total number of bytes processed:     MB
Total bytes before deduplication:    1.01 GB
Total bytes after deduplication:     MB
Total number of bytes transferred:   MB
Data transfer time:                  sec
Network data transfer rate:          6, KB/sec
Aggregate data transfer rate:        KB/sec
Objects compressed by:               0%
Deduplication reduction:             87.30%
Total data reduction ratio:          99.99%
Elapsed processing time:             00:13:40

5.3 TSM deduplication report script

A script has been developed to provide detailed information on deduplication results for a TSM server. In addition to providing summary information on the effectiveness of TSM deduplication, it can also be used to gather diagnostics if deduplication results are not consistent with expectations. The script and usage instructions can be obtained from the TSM support site:

An example of the summary data provided by this report is shown below:

The report also provides details of deduplication-related utilization of the TSM database.

< End of Document>

Protecting enterprise servers with StoreOnce and CommVault Simpana

Protecting enterprise servers with StoreOnce and CommVault Simpana Technical white paper Protecting enterprise servers with StoreOnce and CommVault Simpana HP StoreOnce Backup systems Table of contents Introduction 2 Technology overview 2 HP StoreOnce Backup systems key

More information

Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University

Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University 1 Backup & Recovery Goals and Challenges Traditional/EMU s Old Environment Avamar Key Features EMU

More information

HP StoreOnce: reinventing data deduplication

HP StoreOnce: reinventing data deduplication HP : reinventing data deduplication Reduce the impact of explosive data growth with HP StorageWorks D2D Backup Systems Technical white paper Table of contents Executive summary... 2 Introduction to data

More information

Get Success in Passing Your Certification Exam at first attempt!

Get Success in Passing Your Certification Exam at first attempt! Get Success in Passing Your Certification Exam at first attempt! Exam : E22-290 Title : EMC Data Domain Deduplication, Backup and Recovery Exam Version : DEMO 1.A customer has a Data Domain system with

More information

EMC XtremSF: Delivering Next Generation Performance for Oracle Database

EMC XtremSF: Delivering Next Generation Performance for Oracle Database White Paper EMC XtremSF: Delivering Next Generation Performance for Oracle Database Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything BlueArc unified network storage systems 7th TF-Storage Meeting Scale Bigger, Store Smarter, Accelerate Everything BlueArc s Heritage Private Company, founded in 1998 Headquarters in San Jose, CA Highest

More information

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May 2014. Copyright 2014 Permabit Technology Corporation

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May 2014. Copyright 2014 Permabit Technology Corporation Top Ten Questions to Ask Your Primary Storage Provider About Their Data Efficiency May 2014 Copyright 2014 Permabit Technology Corporation Introduction The value of data efficiency technologies, namely

More information

Efficient Backup with Data Deduplication Which Strategy is Right for You?

Efficient Backup with Data Deduplication Which Strategy is Right for You? Efficient Backup with Data Deduplication Which Strategy is Right for You? Rob Emsley Senior Director, Product Marketing CPU Utilization CPU Utilization Exabytes Why So Much Interest in Data Deduplication?

More information

Best Practices for Deploying Citrix XenDesktop on NexentaStor Open Storage

Best Practices for Deploying Citrix XenDesktop on NexentaStor Open Storage Best Practices for Deploying Citrix XenDesktop on NexentaStor Open Storage White Paper July, 2011 Deploying Citrix XenDesktop on NexentaStor Open Storage Table of Contents The Challenges of VDI Storage

More information

Cost Effective Backup with Deduplication. Copyright 2009 EMC Corporation. All rights reserved.

Cost Effective Backup with Deduplication. Copyright 2009 EMC Corporation. All rights reserved. Cost Effective Backup with Deduplication Agenda Today s Backup Challenges Benefits of Deduplication Source and Target Deduplication Introduction to EMC Backup Solutions Avamar, Disk Library, and NetWorker

More information

Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication

Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication PRODUCT BRIEF Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication NOTICE This Product Brief contains proprietary information protected by copyright. Information in this

More information

Veritas Backup Exec 15: Deduplication Option

Veritas Backup Exec 15: Deduplication Option Veritas Backup Exec 15: Deduplication Option Who should read this paper Technical White Papers are designed to introduce IT professionals to key technologies and technical concepts that are associated

More information

A Survey of Shared File Systems

A Survey of Shared File Systems Technical Paper A Survey of Shared File Systems Determining the Best Choice for your Distributed Applications A Survey of Shared File Systems A Survey of Shared File Systems Table of Contents Introduction...

More information

Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem

Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem Advanced Storage Products Group Table of Contents 1 - Introduction 2 Data Deduplication 3

More information

Amazon Cloud Storage Options

Amazon Cloud Storage Options Amazon Cloud Storage Options Table of Contents 1. Overview of AWS Storage Options 02 2. Why you should use the AWS Storage 02 3. How to get Data into the AWS.03 4. Types of AWS Storage Options.03 5. Object

More information

A Dell Technical White Paper Dell Compellent

A Dell Technical White Paper Dell Compellent The Architectural Advantages of Dell Compellent Automated Tiered Storage A Dell Technical White Paper Dell Compellent THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

More information

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server White Paper EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

Data Protection with IBM TotalStorage NAS and NSI Double- Take Data Replication Software

Data Protection with IBM TotalStorage NAS and NSI Double- Take Data Replication Software Data Protection with IBM TotalStorage NAS and NSI Double- Take Data Replication September 2002 IBM Storage Products Division Raleigh, NC http://www.storage.ibm.com Table of contents Introduction... 3 Key

More information

Sales Tool. Summary DXi Sales Messages November 2009 6 NOVEMBER 2009. ST00431-v06

Sales Tool. Summary DXi Sales Messages November 2009 6 NOVEMBER 2009. ST00431-v06 Summary DXi Sales Messages November 2009 6 NOVEMBER 2009 ST00431-v06 Notice This Sales Tool contains proprietary information protected by copyright. Information in this Sales Tool is subject to change

More information

New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN

New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN Updated: May 19, 2015 Contents Introduction... 1 Cloud Integration... 1 OpenStack Support... 1 Expanded

More information

Protect SAP HANA Based on SUSE Linux Enterprise Server with SEP sesam

Protect SAP HANA Based on SUSE Linux Enterprise Server with SEP sesam Protect SAP HANA Based on SUSE Linux Enterprise Server with SEP sesam Many companies of different sizes and from all sectors of industry already use SAP s inmemory appliance, HANA benefiting from quicker

More information

CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR

CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR PERFORMANCE BRIEF CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR INTRODUCTION Enterprise organizations face numerous challenges when delivering applications and protecting critical

More information

The Power of Deduplication-Enabled Per-VM Data Protection SimpliVity s OmniCube Aligns VM and Data Management

The Power of Deduplication-Enabled Per-VM Data Protection SimpliVity s OmniCube Aligns VM and Data Management The Power of Deduplication-Enabled Per-VM Data Protection SimpliVity s OmniCube Aligns VM and Data Management Prepared for SimpliVity Contents The Bottom Line 1 Introduction 2 Per LUN Problems 2 Can t

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM EMC DATA DOMAIN OPERATING SYSTEM Powering EMC Protection Storage ESSENTIALS High-Speed, Scalable Deduplication Up to 58.7 TB/hr performance Reduces requirements for backup storage by 10 to 30x and archive

More information

June 2009. Blade.org 2009 ALL RIGHTS RESERVED

June 2009. Blade.org 2009 ALL RIGHTS RESERVED Contributions for this vendor neutral technology paper have been provided by Blade.org members including NetApp, BLADE Network Technologies, and Double-Take Software. June 2009 Blade.org 2009 ALL RIGHTS

More information

Data Deduplication: An Essential Component of your Data Protection Strategy

Data Deduplication: An Essential Component of your Data Protection Strategy WHITE PAPER: THE EVOLUTION OF DATA DEDUPLICATION Data Deduplication: An Essential Component of your Data Protection Strategy JULY 2010 Andy Brewerton CA TECHNOLOGIES RECOVERY MANAGEMENT AND DATA MODELLING

More information

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM ESSENTIALS HIGH-SPEED, SCALABLE DEDUPLICATION Up to 58.7 TB/hr performance Reduces protection storage requirements by 10 to 30x CPU-centric scalability DATA INVULNERABILITY ARCHITECTURE Inline write/read

More information

IBM Tivoli Storage Manager Suite for Unified Recovery

IBM Tivoli Storage Manager Suite for Unified Recovery IBM Tivoli Storage Manager Suite for Unified Recovery Comprehensive data protection software with a broad choice of licensing plans Highlights Optimize data protection for virtual servers, core applications

More information

Backup and Recovery Redesign with Deduplication

Backup and Recovery Redesign with Deduplication Backup and Recovery Redesign with Deduplication Why the move is on September 9, 2010 1 Major trends driving the transformation of backup environments UNABATED DATA GROWTH Backup = 4 to 30 times production

More information

Business Benefits of Data Footprint Reduction

Business Benefits of Data Footprint Reduction Business Benefits of Data Footprint Reduction Why and how reducing your data footprint provides a positive benefit to your business and application service objectives By Greg Schulz Founder and Senior

More information

EMC NETWORKER AND DATADOMAIN

EMC NETWORKER AND DATADOMAIN EMC NETWORKER AND DATADOMAIN Capabilities, options and news Madis Pärn Senior Technology Consultant EMC madis.parn@emc.com 1 IT Pressures 2009 0.8 Zettabytes 2020 35.2 Zettabytes DATA DELUGE BUDGET DILEMMA

More information

NetApp FAS Hybrid Array Flash Efficiency. Silverton Consulting, Inc. StorInt Briefing

NetApp FAS Hybrid Array Flash Efficiency. Silverton Consulting, Inc. StorInt Briefing NetApp FAS Hybrid Array Flash Efficiency Silverton Consulting, Inc. StorInt Briefing PAGE 2 OF 7 Introduction Hybrid storage arrays (storage systems with both disk and flash capacity) have become commonplace

More information

WHY DO I NEED FALCONSTOR OPTIMIZED BACKUP & DEDUPLICATION?

WHY DO I NEED FALCONSTOR OPTIMIZED BACKUP & DEDUPLICATION? WHAT IS FALCONSTOR? FalconStor Optimized Backup and Deduplication is the industry s market-leading virtual tape and LAN-based deduplication solution, unmatched in performance and scalability. With virtual

More information

Technology Insight Series

Technology Insight Series Evaluating Storage Technologies for Virtual Server Environments Russ Fellows June, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved Executive Summary

More information

Addendum No. 1 to Packet No. 28-13 Enterprise Data Storage Solution and Strategy for the Ingham County MIS Department

Addendum No. 1 to Packet No. 28-13 Enterprise Data Storage Solution and Strategy for the Ingham County MIS Department Addendum No. 1 to Packet No. 28-13 Enterprise Data Storage Solution and Strategy for the Ingham County MIS Department The following clarifications, modifications and/or revisions to the above project shall

More information

EMC AVAMAR. Deduplication backup software and system. Copyright 2012 EMC Corporation. All rights reserved.

EMC AVAMAR. Deduplication backup software and system. Copyright 2012 EMC Corporation. All rights reserved. EMC AVAMAR Deduplication backup software and system 1 IT Pressures 2009 2020 0.8 zettabytes 35.2 zettabytes DATA DELUGE BUDGET DILEMMA Transformation INFRASTRUCTURE SHIFT COMPLIANCE and DISCOVERY 2 EMC

More information

Whitepaper: Back Up SAP HANA and SUSE Linux Enterprise Server with SEP sesam. info@sepusa.com www.sepusa.com Copyright 2014 SEP

Whitepaper: Back Up SAP HANA and SUSE Linux Enterprise Server with SEP sesam. info@sepusa.com www.sepusa.com Copyright 2014 SEP Whitepaper: Back Up SAP HANA and SUSE Linux Enterprise Server with SEP sesam info@sepusa.com www.sepusa.com Table of Contents INTRODUCTION AND OVERVIEW... 3 SOLUTION COMPONENTS... 4-5 SAP HANA... 6 SEP

More information

OPTIMIZING EXCHANGE SERVER IN A TIERED STORAGE ENVIRONMENT WHITE PAPER NOVEMBER 2006

OPTIMIZING EXCHANGE SERVER IN A TIERED STORAGE ENVIRONMENT WHITE PAPER NOVEMBER 2006 OPTIMIZING EXCHANGE SERVER IN A TIERED STORAGE ENVIRONMENT WHITE PAPER NOVEMBER 2006 EXECUTIVE SUMMARY Microsoft Exchange Server is a disk-intensive application that requires high speed storage to deliver

More information

3Gen Data Deduplication Technical

3Gen Data Deduplication Technical 3Gen Data Deduplication Technical Discussion NOTICE: This White Paper may contain proprietary information protected by copyright. Information in this White Paper is subject to change without notice and

More information

Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments

Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments Applied Technology Abstract This white paper introduces EMC s latest groundbreaking technologies,

More information

Symantec NetBackup 5220

Symantec NetBackup 5220 A single-vendor enterprise backup appliance that installs in minutes Data Sheet: Data Protection Overview is a single-vendor enterprise backup appliance that installs in minutes, with expandable storage

More information

Tiered Data Protection Strategy Data Deduplication. Thomas Störr Sales Director Central Europe November 8, 2007

Tiered Data Protection Strategy Data Deduplication. Thomas Störr Sales Director Central Europe November 8, 2007 Tiered Data Protection Strategy Data Deduplication Thomas Störr Sales Director Central Europe November 8, 2007 Overland Storage Tiered Data Protection = Good = Better = Best! NEO / ARCvault REO w/ expansion

More information

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON UNDERSTANDING DATA DEDUPLICATION Thomas Rivera SEPATON SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individual members may use this material

More information

How To Protect Data On Network Attached Storage (Nas) From Disaster

How To Protect Data On Network Attached Storage (Nas) From Disaster White Paper EMC FOR NETWORK ATTACHED STORAGE (NAS) BACKUP AND RECOVERY Abstract This white paper provides an overview of EMC s industry leading backup and recovery solutions for NAS systems. It also explains

More information

Long term retention and archiving the challenges and the solution

Long term retention and archiving the challenges and the solution Long term retention and archiving the challenges and the solution NAME: Yoel Ben-Ari TITLE: VP Business Development, GH Israel 1 Archive Before Backup EMC recommended practice 2 1 Backup/recovery process

More information

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Table of Contents Introduction... 3 Shortest Possible Backup Window... 3 Instant

More information

Deploying Affordable, High Performance Hybrid Flash Storage for Clustered SQL Server

Deploying Affordable, High Performance Hybrid Flash Storage for Clustered SQL Server Deploying Affordable, High Performance Hybrid Flash Storage for Clustered SQL Server Flash storage adoption has increased in recent years, as organizations have deployed it to support business applications.

More information

Choosing an Enterprise-Class Deduplication Technology

Choosing an Enterprise-Class Deduplication Technology WHITE PAPER Choosing an Enterprise-Class Deduplication Technology 10 Key Questions to Ask Your Deduplication Vendor 400 Nickerson Road, Marlborough, MA 01752 P: 866.Sepaton or 508.490.7900 F: 508.490.7908

More information

Unitrends Recovery-Series: Addressing Enterprise-Class Data Protection

Unitrends Recovery-Series: Addressing Enterprise-Class Data Protection Solution Brief Unitrends Recovery-Series: Addressing Enterprise-Class Data Protection 2 Unitrends has leveraged over 20 years of experience in understanding ever-changing data protection challenges in

More information

DeltaStor Data Deduplication: A Technical Review

DeltaStor Data Deduplication: A Technical Review White Paper DeltaStor Data Deduplication: A Technical Review DeltaStor software is a next-generation data deduplication application for the SEPATON S2100 -ES2 virtual tape library that enables enterprises

More information

IS IN-MEMORY COMPUTING MAKING THE MOVE TO PRIME TIME?

IS IN-MEMORY COMPUTING MAKING THE MOVE TO PRIME TIME? IS IN-MEMORY COMPUTING MAKING THE MOVE TO PRIME TIME? EMC and Intel work with multiple in-memory solutions to make your databases fly Thanks to cheaper random access memory (RAM) and improved technology,

More information

Maximize Your Virtual Environment Investment with EMC Avamar. Rob Emsley Senior Director, Product Marketing

Maximize Your Virtual Environment Investment with EMC Avamar. Rob Emsley Senior Director, Product Marketing 1 Maximize Your Virtual Environment Investment with EMC Avamar Rob Emsley Senior Director, Product Marketing 2 Private Cloud is the Vision Virtualized Data Center Internal Cloud Trusted Flexible Control

More information

WHITE PAPER 1 WWW.FUSIONIO.COM

WHITE PAPER 1 WWW.FUSIONIO.COM 1 WWW.FUSIONIO.COM WHITE PAPER WHITE PAPER Executive Summary Fusion iovdi is the first desktop- aware solution to virtual desktop infrastructure. Its software- defined approach uniquely combines the economics

More information

IBM Tivoli Storage Manager

IBM Tivoli Storage Manager Help maintain business continuity through efficient and effective storage management IBM Tivoli Storage Manager Highlights Increase business continuity by shortening backup and recovery times and maximizing

More information

EMC Data Domain Boost for Oracle Recovery Manager (RMAN)

EMC Data Domain Boost for Oracle Recovery Manager (RMAN) White Paper EMC Data Domain Boost for Oracle Recovery Manager (RMAN) Abstract EMC delivers Database Administrators (DBAs) complete control of Oracle backup, recovery, and offsite disaster recovery with

More information

Vodacom Managed Hosted Backups

Vodacom Managed Hosted Backups Vodacom Managed Hosted Backups Robust Data Protection for your Business Critical Data Enterprise class Backup and Recovery and Data Management on Diverse Platforms Vodacom s Managed Hosted Backup offers

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Disk-to-Disk-to-Offsite Backups for SMBs with Retrospect

Disk-to-Disk-to-Offsite Backups for SMBs with Retrospect Disk-to-Disk-to-Offsite Backups for SMBs with Retrospect Abstract Retrospect backup and recovery software provides a quick, reliable, easy-to-manage disk-to-disk-to-offsite backup solution for SMBs. Use

More information

EMC Backup and Recovery for Microsoft SQL Server

EMC Backup and Recovery for Microsoft SQL Server EMC Backup and Recovery for Microsoft SQL Server Enabled by Quest LiteSpeed Copyright 2010 EMC Corporation. All rights reserved. Published February, 2010 EMC believes the information in this publication

More information

STORAGE SOURCE DATA DEDUPLICATION PRODUCTS. Buying Guide: inside

STORAGE SOURCE DATA DEDUPLICATION PRODUCTS. Buying Guide: inside Managing the information that drives the enterprise STORAGE Buying Guide: inside 2 Key features of source data deduplication products 5 Special considerations Source dedupe products can efficiently protect

More information

Enterprise Backup and Restore technology and solutions

Enterprise Backup and Restore technology and solutions Enterprise Backup and Restore technology and solutions LESSON VII Veselin Petrunov Backup and Restore team / Deep Technical Support HP Bulgaria Global Delivery Hub Global Operations Center November, 2013

More information

We look beyond IT. Cloud Offerings

We look beyond IT. Cloud Offerings Cloud Offerings cstor Cloud Offerings As today s fast-moving businesses deal with increasing demands for IT services and decreasing IT budgets, the onset of cloud-ready solutions has provided a forward-thinking

More information

Tivoli Storage Manager Explained

Tivoli Storage Manager Explained IBM Software Group Dave Cannon IBM Tivoli Storage Management Development Oxford University TSM Symposium 2003 Presentation Objectives Explain TSM behavior for selected operations Describe design goals

More information

Future-Proofed Backup For A Virtualized World!

Future-Proofed Backup For A Virtualized World! ! Future-Proofed Backup For A Virtualized World! Prepared by: Colm Keegan, Senior Analyst! Prepared: January 2014 Future-Proofed Backup For A Virtualized World Like death and taxes, growing backup windows

More information

Symantec NetBackup deduplication general deployment guidelines

Symantec NetBackup deduplication general deployment guidelines TECHNICAL BRIEF: SYMANTEC NETBACKUP DEDUPLICATION GENERAL......... DEPLOYMENT............. GUIDELINES.................. Symantec NetBackup deduplication general deployment guidelines Who should read this

More information

Mayur Dewaikar Sr. Product Manager Information Management Group Symantec Corporation

Mayur Dewaikar Sr. Product Manager Information Management Group Symantec Corporation Next Generation Data Protection with Symantec NetBackup 7 Mayur Dewaikar Sr. Product Manager Information Management Group Symantec Corporation White Paper: Next Generation Data Protection with NetBackup

More information