Type of Submission: Article
Title: DB2's Integrated Support for Data Deduplication Devices
Subtitle:
Keywords: DB2, Backup, Deduplication
Prefix:
Given: Dale
Middle: M.
Family: McInnis
Suffix:
Job Title: STSM, DB2 LUW Availability Architect
Email: dmcinnis@ca.ibm.com
Bio: Dale McInnis is a Senior Technical Staff Member (STSM) at the IBM Toronto Canada lab. He has a B.Sc. (CS) from the University of New Brunswick and a Master of Engineering from the University of Toronto. Dale joined IBM in 1988 and has been working on the DB2 development team since 1992. His area of expertise is DB2 for Linux, UNIX and Windows kernel development, where he led the teams that designed the current backup and recovery architecture and other key high availability and disaster recovery technologies. His expertise in the DB2 availability area is well known in the information technology industry. Dale currently fills the role of DB2 Availability Architect at the IBM Toronto Canada Lab.
Company: IBM Canada Ltd.
Photo filename:
Abstract: This article provides an overview of data deduplication and explains how the DB2 backup utility was modified to support deduplication devices. It then examines the compatibility of DB2 compression features with data deduplication devices. Finally, some best practices and tuning recommendations are presented.

Introduction

With the exponential growth in data comes the corresponding need to store and archive that data. For organizations this is not just hoarding bytes for their own sake; it stems from the requirement to keep data backups. The trick is to find the most efficient way to back up that data, and one of the best solutions is to determine which data is duplicated so that you can exclude it from your backup. This is known as data deduplication, a data compression technique that eliminates redundant data, thereby improving storage utilization. Beginning in DB2 for Linux, UNIX, and Windows Version 9.7 Fix Pack 4, DB2 backups have been optimized for deduplication devices, and backup operations that use such devices as a target have been simplified.

How data deduplication works

Data deduplication (often called "intelligent compression" or "single-instance storage") is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape; redundant data is replaced with a pointer to the unique copy. For example, suppose an e-mail system contains 100 instances of the same 4 megabyte (MB) attachment. If this e-mail system is backed up without deduplication, all 100 instances of the attachment are saved, requiring 400 MB of storage. However, if the same e-mail system is backed up to a deduplication device, only one instance of the attachment is actually stored; each subsequent instance merely references the copy that was saved. Thus, the 400 MB of storage needed to back up the system is reduced to 4 MB plus some nominal overhead for references to the deduplicated data.

Most deduplication devices work by comparing relatively large chunks of data, such as entire files or large portions of files. Each chunk examined is assigned an identifier, which is typically calculated using a cryptographic hash function. In many implementations, the assumption is made that if two identifiers are identical, the corresponding data is identical; other implementations forego this assumption, preferring instead to do a byte-by-byte comparison to verify that data with the same identifier is indeed the same. Regardless, if it is determined that a particular chunk of data already exists in the deduplication namespace, that chunk is replaced with a link to the data that has already been stored. Later, when the deduplicated data is accessed and a link is encountered, it is replaced with the data the link refers to. Of course, this whole process is transparent to end users and applications.
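To make the chunk-and-hash mechanism described above concrete, Listing 1 is a minimal, illustrative sketch of a deduplicating store. It is not DB2 code or any vendor's implementation; it simply splits incoming data into fixed-size chunks, identifies each chunk by its SHA-256 digest, and physically stores only chunks it has not seen before, recording duplicates as references to the existing copy.

Listing 1: A toy deduplicating store (illustrative sketch only)

import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real devices often use larger or variable-size chunks

class DedupStore:
    """Toy deduplicating store: keeps one physical copy of each unique chunk."""
    def __init__(self):
        self.chunks = {}        # digest -> chunk bytes (the single stored instance)
        self.bytes_written = 0  # logical bytes ingested
        self.bytes_stored = 0   # physical bytes actually kept

    def ingest(self, data: bytes) -> list:
        """Ingest a data stream; return the list of chunk digests (the stream's 'recipe')."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:        # first time this chunk is seen: store it
                self.chunks[digest] = chunk
                self.bytes_stored += len(chunk)
            recipe.append(digest)                # duplicates become references only
            self.bytes_written += len(chunk)
        return recipe

if __name__ == "__main__":
    store = DedupStore()
    attachment = b"A" * (4 * 1024 * 1024)        # one 4 MB attachment...
    for _ in range(100):                         # ...appearing in 100 mailboxes
        store.ingest(attachment)
    print("logical MB:", store.bytes_written // 2**20)   # 400
    print("stored  MB:", store.bytes_stored // 2**20)    # 4

Restoring a stream simply walks its recipe and looks up each digest in the store, which is the pointer-chasing step described above.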
Typically, deduplication is performed using one of two methods: "in-line" or "post-process." With in-line deduplication, hash calculations and lookups are performed before data is written to disk. Consequently, in-line deduplication significantly reduces the raw disk capacity needed, because not-yet-deduplicated data is never written to disk. For this reason, in-line deduplication is often considered the most efficient and economical deduplication method available. However, because it takes time to perform hash calculations and lookups, in-line deduplication lengthens the time needed for the backup to complete, although certain in-line deduplication vendors have been able to achieve performance comparable to that of post-process deduplication. With post-process deduplication, all data is written to storage before the deduplication process is initiated. The advantage of this approach is that there is no need to wait for hash calculations and lookups to complete before data is stored. The drawback is that more storage is needed initially, since duplicate data must be written to storage for a brief period of time. This method also increases the lag time before deduplication is complete.

Data deduplication offers other benefits as well. Lower storage space requirements save money on disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) over a longer window and reduces the need for tape backups. Data deduplication also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.

How a standard DB2 backup operation works

When a DB2 backup operation begins, one or more buffer manipulator (db2bm) threads are started; these threads are responsible for reading data from the database and streaming it into one or more backup buffers. Likewise, one or more media controller (db2med) threads are started; these threads are responsible for writing the data residing in the backup buffers to files on the target backup device. (The number of db2bm threads used is controlled by the PARALLELISM option of the BACKUP DATABASE command; the number of db2med threads used is controlled by the OPEN n SESSIONS option or by the number of target devices.) Finally, a DB2 agent (db2agent) thread is assigned the responsibility of directing communication between the buffer manipulator threads and the media controller threads. This process is shown in Figure 1.

Figure 1: DB2's backup process model

Normally, data retrieved by the db2bm threads is placed in backup buffers in shared memory. The db2med threads then pull the backup buffers from shared memory on a first-in, first-out (FIFO) basis, in whatever order the buffers happen to be filled, so the data ends up multiplexed across all of the output streams; there is no correlation or deterministic pattern between table space data and the output streams. (This behavior is illustrated in Figure 2.) As a result, when the output streams are directed to a deduplication device, the device thrashes in an attempt to identify chunks of data that have already been backed up.

Figure 2: Default database backup behavior. (Note that the metadata for a table space appears in an output stream before any of its data, and that empty extents are never placed in an output stream.)
How DB2 was modified to support data deduplication devices

To optimize the backup format for data deduplication, the backup utility needs to ensure that data is sent to the target devices in a predictable manner. To that end, the DEDUP_DEVICE option was added to the backup utility so that the user can indicate that the target is a deduplication-enabled device and so that the data sequences sent to that device are predictable. When this option is used with the BACKUP DATABASE command, data retrieved by the db2bm threads is no longer multiplexed across the output streams being used by the db2med threads. Instead, as data is read from a particular table space, all of that table space's data is sent to one, and only one, output stream. Furthermore, data for a particular table space is always written in order, from lowest to highest page. As a result, a predictable and deterministic pattern of data emerges in each output stream, making it easy for a deduplication device to identify chunks of data that have been backed up previously. Figure 3 illustrates this change in backup behavior when the DEDUP_DEVICE option of the BACKUP DATABASE command is used.

Figure 3: Database backup behavior when the DEDUP_DEVICE option is specified

This relatively simple change in behavior yielded some impressive gains for data deduplication. One of the first customers to use the DEDUP_DEVICE option on DB2 backup experienced both faster backups and vastly improved deduplication. The customer's backups of 4 TB were exceeding 6.5 hours and were achieving poor deduplication results of 2:1 or 3:1. (The deduplication ratio indicates the aggregate reduction in data stored; in other words, data deduplication was reducing the backup's size to 1/2 or 1/3.) With this change, the backup elapsed time decreased to 5.5 hours, and the deduplication results were between 11:1 and 15:1. Naturally, individual results depend on the volatility of the data: the less the data changes, the higher the data deduplication ratio will be.
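The effect of the ordering change can be illustrated with a small simulation, shown in Listing 2. This is a sketch only: it does not model DB2's actual buffer management, and it uses a seeded shuffle as a stand-in for the nondeterministic, scheduling-driven interleaving of the default behavior. The same unchanged table spaces are "backed up" twice, once with pages interleaved arbitrarily across the streams (default, throughput-optimized behavior) and once with each table space written to a single stream in page order (DEDUP_DEVICE behavior); hashing fixed-size chunks of each stream shows how many chunks of the second backup a deduplication device would recognize.

Listing 2: Simulating the effect of output-stream ordering on deduplication (illustrative only)

import hashlib
import random

PAGE_SIZE = 4096          # bytes per simulated page
CHUNK_SIZE = 64 * 1024    # dedup chunk size used by the simulated device

def make_tablespaces(n_ts=4, pages_per_ts=256):
    """Unchanging table spaces: page content depends only on (table space, page number)."""
    def page(ts, p):
        seed = hashlib.sha256(f"{ts}:{p}".encode()).digest()   # 32 bytes
        return seed * (PAGE_SIZE // len(seed))
    return {ts: [page(ts, p) for p in range(pages_per_ts)] for ts in range(n_ts)}

def multiplexed_backup(tablespaces, n_streams, seed):
    """Default behavior: pages from all table spaces interleaved nondeterministically."""
    pages = [pg for pgs in tablespaces.values() for pg in pgs]
    random.Random(seed).shuffle(pages)            # stand-in for scheduling nondeterminism
    streams = [bytearray() for _ in range(n_streams)]
    for i, pg in enumerate(pages):
        streams[i % n_streams].extend(pg)
    return streams

def dedup_device_backup(tablespaces, n_streams):
    """DEDUP_DEVICE behavior: one table space per stream, pages in page order."""
    streams = [bytearray() for _ in range(n_streams)]
    for ts, pgs in tablespaces.items():
        for pg in pgs:
            streams[ts % n_streams].extend(pg)
    return streams

def chunk_digests(streams):
    digests = []
    for s in streams:
        for off in range(0, len(s), CHUNK_SIZE):
            digests.append(hashlib.sha256(bytes(s[off:off + CHUNK_SIZE])).hexdigest())
    return digests

def duplicate_fraction(first, second):
    seen = set(first)
    return sum(d in seen for d in second) / len(second)

if __name__ == "__main__":
    ts = make_tablespaces()
    mux1, mux2 = multiplexed_backup(ts, 4, seed=1), multiplexed_backup(ts, 4, seed=2)
    ded1, ded2 = dedup_device_backup(ts, 4), dedup_device_backup(ts, 4)
    print("multiplexed : %.0f%% of chunks already seen" %
          (100 * duplicate_fraction(chunk_digests(mux1), chunk_digests(mux2))))
    print("dedup_device: %.0f%% of chunks already seen" %
          (100 * duplicate_fraction(chunk_digests(ded1), chunk_digests(ded2))))

With the data unchanged, virtually every chunk of the second DEDUP_DEVICE-style backup is one the device has already stored, whereas the interleaved streams present chunk sequences the device has effectively never seen, which is the thrashing behavior described above.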
How DB2 incremental backups compare to data deduplicated backups

A DB2 incremental backup reads all of the pages in a table space but sends only the changed pages to the backup image. All of the large object (LOB) and long field data that exists in the table space is added to the backup image in its entirety, due to the lack of a fixed page format for that data. As a result, a DB2 incremental backup produces a backup object that is very similar in size to a data deduplicated backup image; essentially, only the new pages consume space. One advantage of a data deduplicated backup over an incremental backup is the way LOBs are handled: as previously mentioned, an incremental backup always includes LOB data in its entirety. One disadvantage of a data deduplicated backup is that it sends the entire table space's contents over the LAN/SAN to the deduplication device, consuming bandwidth that a DB2 incremental backup does not.
Compatibility of compression with data deduplication

There are several forms of compression available for DB2 DBAs to explore, namely:
- Row compression (also known as table compression)
- Adaptive compression (also known as page compression)
- DB2 backup compression
- TSM client compression

The previous rule of thumb was that any form of compression is incompatible with data deduplication. Testing has revealed that this assumption is false and that there are circumstances in which compression and data deduplication are completely compatible. The key factor to determine is this: if the data remains unchanged, does the physical binary representation of the data change between backups when compression is used?

For the first two items on the list above, row and adaptive compression, the answer is no. After the data is compressed on disk, the binary format of the data does not change between backups unless the data has been modified. This is referred to as static compression: as long as the data does not change, its representation remains the same. This type of compression is compatible with data deduplication, because the deduplication device can easily detect the repeating pattern.

For the other two forms of compression on the list, DB2 backup compression and TSM client compression, the answer is yes. These forms of compression are referred to as dynamic compression. Each time the database is backed up, the binary representation of the data may change depending on where in the data stream the data falls. Both compression techniques use a sliding window to detect patterns, and if the alignment of the window is not identical between backups, the pattern detection produces different compressed output, lowering the likelihood that the deduplication device can find a pattern match.
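The sliding-window effect can be demonstrated with any stream compressor. Listing 3 is an illustrative sketch using zlib (DB2 backup compression and TSM client compression are not zlib, but they behave similarly in this respect): the same block of data is compressed twice, once at its original position in a stream and once shifted by a single byte. The compressed bytes for the unchanged block differ, so a downstream deduplication device sees them as new data, whereas the uncompressed (or statically compressed, in-place) representation of the unchanged block is byte-identical in both streams.

Listing 3: How stream compression disturbs deduplication of unchanged data (illustrative only)

import zlib

def common_prefix_len(a, b):
    """Number of leading bytes that are identical in a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# An unchanged block of data (think: pages of a table space that were not modified).
block = b"".join(b"row %08d: some repeating customer data\n" % i for i in range(4096))

stream1 = block                  # first backup: block starts at offset 0 of the stream
stream2 = b"\x00" + block        # second backup: one extra byte shifts everything after it

# Dynamic (stream) compression: the same data compresses to different bytes when its
# position in the stream changes, so a deduplication device sees "new" chunks.
comp1, comp2 = zlib.compress(stream1), zlib.compress(stream2)
print("identical leading bytes in the two compressed streams:",
      common_prefix_len(comp1, comp2), "of", len(comp1))

# Without stream compression (or with static, in-place compression such as row or
# adaptive compression), the unchanged block is byte-identical in both backups,
# so its chunks still deduplicate.
print("unchanged block identical in both uncompressed streams:",
      stream1[-len(block):] == stream2[-len(block):])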
How to tune DB2 backups for data deduplication devices

The tuning parameters used to make a DB2 backup perform optimally to a data deduplication device are somewhat different from those used for a backup to a non-deduplication device. Specifically, deduplication devices perform better with larger buffer sizes, for example 8192 or 16384, and with more target sessions. The additional target sessions are needed because, with DEDUP_DEVICE, the DB2 backup no longer multiplexes data across the target devices but instead sends each target device the data from a single table space. (The default behavior of DB2 backup is optimized for throughput: it multiplexes the data from all table spaces across all sessions to TSM, which can result in a poor factoring ratio on the deduplication device.) To obtain the optimal deduplication ratio, lower the number of sessions and the parallelism; however, this comes at the cost of a longer elapsed time for the DB2 backup to complete.

Other basic rules of thumb are:
- Change logarchmeth1 to ensure that archived logs are not stored on a data deduplication device.
- Increase utilheapsz to at least 50000 (the backup buffers are allocated from the utility heap).

Here is an example DB2 backup invocation using some of these recommendations:

db2 backup db databasename use tsm open 10 sessions dedup_device buffer 16384

Note: This example operation requires roughly 1.3 GB of memory (the BUFFER value is expressed in 4 KB pages, so each buffer is 64 MB). If that is too much, use buffer 8192 instead of buffer 16384.

Conclusion

Data deduplication is invaluable in the quest to better manage and store backups because of its ability to reduce redundant data. As of DB2 LUW Version 9.7 Fix Pack 4, DB2 backups have been optimized for deduplication devices. Users who are considering data deduplication as part of their backup strategy will find it well integrated with the DB2 backup utility, and users who are already using deduplication devices should see a shorter backup window and improved deduplication results when they exploit DB2's integrated support for data deduplication devices.

Acknowledgements

I would like to personally thank both Roger Sanders (EMC) and Robert Causley (IBM) for their assistance in creating this document.