Type of Submission: Article
Title: DB2's Integrated Support for Data Deduplication Devices
Subtitle:
Keywords: DB2, Backup, Deduplication
Prefix:
Given: Dale
Middle: M.
Family: McInnis
Suffix:
Job Title: STSM, DB2 LUW Availability Architect
Email: dmcinnis@ca.ibm.com
Bio: Dale McInnis is a Senior Technical Staff Member (STSM) at the IBM Toronto Canada lab. He has a B.Sc. (CS) from the University of New Brunswick and a Master of Engineering from the University of Toronto. Dale joined IBM in 1988 and has been working on the DB2 development team since 1992. His area of expertise is DB2 for Linux, UNIX and Windows kernel development, where he led the teams that designed the current backup and recovery architecture and other key high availability and disaster recovery technologies. His expertise in the DB2 availability area is well known in the information technology industry. Dale currently fills the role of DB2 Availability Architect at the IBM Toronto Canada Lab.
Company: IBM Canada Ltd.
Photo filename:
Abstract: This article provides an overview of data deduplication and explains how the DB2 backup utility was modified to support deduplication devices. It then examines the compatibility of DB2 compression features with data deduplication devices. Finally, some best practices and tuning recommendations are presented.

Introduction

With the exponential growth in data comes the corresponding need to store and archive that data. For organizations this is not just hoarding bytes for their own sake; it stems from the requirement to keep data backups. The trick is to find the most efficient way to back up that data, and one of the best solutions is to determine which data is duplicated so that you can exclude it from your backup. This is known as data deduplication, a data compression technique that eliminates redundant data, thereby improving storage utilization. Beginning in DB2 for Linux, UNIX, and Windows Version 9.7 Fix Pack 4, DB2 backups have been optimized for deduplication devices, and backup operations that use such devices as a target have been simplified.

How data deduplication works

Data deduplication (often called "intelligent compression" or "single-instance storage") is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape; redundant data is replaced with a pointer to the unique copy. For example, suppose an e-mail system contains 100 instances of the same 4 megabyte (MB) attachment. If this e-mail system is backed up without deduplication, all 100 instances of the attachment are saved, requiring 400 MB of storage. However, if the same e-mail system is backed up to a deduplication device, only one instance of the attachment is actually stored; each subsequent instance merely references the copy that was saved. Thus, the 400 MB of storage needed to back up the system is reduced to 4 MB plus some nominal overhead for references to the deduplicated data.

Most deduplication devices work by comparing relatively large chunks of data, such as entire files or large portions of files. Each chunk examined is assigned an identifier, which is typically calculated using a cryptographic hash function. In many implementations, the assumption is made that if two identifiers are identical, the corresponding data is identical; other implementations forego this assumption, preferring instead to do a byte-by-byte comparison to verify that data with the same identifier is indeed the same. Regardless, if it is determined that a particular chunk of data already exists in the deduplication namespace, that chunk is replaced with a link to the data that has already been stored. Later, when the deduplicated data is accessed and a link is encountered, it is replaced with the data the link refers to. Of course, this whole process is transparent to end users and applications.
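To make the chunk-and-hash mechanism described above concrete, Listing 1 is a minimal, illustrative sketch of a deduplicating store. It is not DB2 code or any vendor's implementation; it simply splits incoming data into fixed-size chunks, identifies each chunk by its SHA-256 digest, and physically stores only chunks it has not seen before, recording duplicates as references to the existing copy.

Listing 1: A toy deduplicating store (illustrative sketch only)

import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real devices often use larger or variable-size chunks

class DedupStore:
    """Toy deduplicating store: keeps one physical copy of each unique chunk."""
    def __init__(self):
        self.chunks = {}        # digest -> chunk bytes (the single stored instance)
        self.bytes_written = 0  # logical bytes ingested
        self.bytes_stored = 0   # physical bytes actually kept

    def ingest(self, data: bytes) -> list:
        """Ingest a data stream; return the list of chunk digests (the stream's 'recipe')."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:        # first time this chunk is seen: store it
                self.chunks[digest] = chunk
                self.bytes_stored += len(chunk)
            recipe.append(digest)                # duplicates become references only
            self.bytes_written += len(chunk)
        return recipe

if __name__ == "__main__":
    store = DedupStore()
    attachment = b"A" * (4 * 1024 * 1024)        # one 4 MB attachment...
    for _ in range(100):                         # ...appearing in 100 mailboxes
        store.ingest(attachment)
    print("logical MB:", store.bytes_written // 2**20)   # 400
    print("stored  MB:", store.bytes_stored // 2**20)    # 4

Restoring a stream simply walks its recipe and looks up each digest in the store, which is the pointer-chasing step described above.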
Typically, deduplication is performed using one of two methods: "in-line" or "post-process." With in-line deduplication, hash calculations and lookups are performed before data is written to disk. Consequently, in-line deduplication significantly reduces the raw disk capacity needed, because not-yet-deduplicated data is never written to disk. For this reason, in-line deduplication is often considered the most efficient and economical deduplication method available. However, because it takes time to perform hash calculations and lookups, in-line deduplication lengthens the time needed for the backup to complete, although certain in-line deduplication vendors have been able to achieve performance comparable to that of post-process deduplication. With post-process deduplication, all data is written to storage before the deduplication process is initiated. The advantage of this approach is that there is no need to wait for hash calculations and lookups to complete before data is stored. The drawback is that more storage is needed initially, since duplicate data must be written to storage for a brief period of time. This method also increases the lag time before deduplication is complete.

Data deduplication offers other benefits as well. Lower storage space requirements save money on disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) over a longer window and reduces the need for tape backups. Data deduplication also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.

How a standard DB2 backup operation works

When a DB2 backup operation begins, one or more buffer manipulator (db2bm) threads are started; these threads are responsible for reading data from the database and streaming it into one or more backup buffers. Likewise, one or more media controller (db2med) threads are started; these threads are responsible for writing the data residing in the backup buffers to files on the target backup device. (The number of db2bm threads used is controlled by the PARALLELISM option of the BACKUP DATABASE command; the number of db2med threads used is controlled by the OPEN n SESSIONS option or by the number of target devices.) Finally, a DB2 agent (db2agent) thread is assigned the responsibility of directing communication between the buffer manipulator threads and the media controller threads. This process is shown in Figure 1.

Figure 1: DB2's backup process model

Normally, data retrieved by the db2bm threads is placed in backup buffers in shared memory. The db2med threads then pull the backup buffers from shared memory on a first-in, first-out (FIFO) basis, in whatever order the buffers happen to be filled, so the data ends up multiplexed across all of the output streams; there is no correlation or deterministic pattern between table space data and the output streams. (This behavior is illustrated in Figure 2.) As a result, when the output streams are directed to a deduplication device, the device thrashes in an attempt to identify chunks of data that have already been backed up.

Figure 2: Default database backup behavior. (Note that the metadata for a table space appears in an output stream before any of its data, and that empty extents are never placed in an output stream.)
How DB2 was modified to support data deduplication devices

To optimize the backup format for data deduplication, the backup utility needs to ensure that data is sent to the target devices in a predictable manner. To that end, the DEDUP_DEVICE option was added to the backup utility so that the user can indicate that the target is a deduplication-enabled device and so that the data sequences sent to that device are predictable. When this option is used with the BACKUP DATABASE command, data retrieved by the db2bm threads is no longer multiplexed across the output streams being used by the db2med threads. Instead, as data is read from a particular table space, all of that table space's data is sent to one, and only one, output stream. Furthermore, data for a particular table space is always written in order, from lowest to highest page. As a result, a predictable and deterministic pattern of data emerges in each output stream, making it easy for a deduplication device to identify chunks of data that have been backed up previously. Figure 3 illustrates this change in backup behavior when the DEDUP_DEVICE option of the BACKUP DATABASE command is used.

Figure 3: Database backup behavior when the DEDUP_DEVICE option is specified

This relatively simple change in behavior yielded some impressive gains for data deduplication. One of the first customers to use the DEDUP_DEVICE option on DB2 backup experienced both faster backups and vastly improved deduplication. The customer's backups of 4 TB were exceeding 6.5 hours and were achieving poor deduplication results of 2:1 or 3:1. (The deduplication ratio indicates the aggregate reduction in data stored; in other words, data deduplication was reducing the backup's size to 1/2 or 1/3.) With this change, the backup elapsed time decreased to 5.5 hours, and the deduplication results were between 11:1 and 15:1. Naturally, individual results depend on the volatility of the data: the less the data changes, the higher the data deduplication ratio will be.
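The effect of the ordering change can be illustrated with a small simulation, shown in Listing 2. This is a sketch only: it does not model DB2's actual buffer management, and it uses a seeded shuffle as a stand-in for the nondeterministic, scheduling-driven interleaving of the default behavior. The same unchanged table spaces are "backed up" twice, once with pages interleaved arbitrarily across the streams (default, throughput-optimized behavior) and once with each table space written to a single stream in page order (DEDUP_DEVICE behavior); hashing fixed-size chunks of each stream shows how many chunks of the second backup a deduplication device would recognize.

Listing 2: Simulating the effect of output-stream ordering on deduplication (illustrative only)

import hashlib
import random

PAGE_SIZE = 4096          # bytes per simulated page
CHUNK_SIZE = 64 * 1024    # dedup chunk size used by the simulated device

def make_tablespaces(n_ts=4, pages_per_ts=256):
    """Unchanging table spaces: page content depends only on (table space, page number)."""
    def page(ts, p):
        seed = hashlib.sha256(f"{ts}:{p}".encode()).digest()   # 32 bytes
        return seed * (PAGE_SIZE // len(seed))
    return {ts: [page(ts, p) for p in range(pages_per_ts)] for ts in range(n_ts)}

def multiplexed_backup(tablespaces, n_streams, seed):
    """Default behavior: pages from all table spaces interleaved nondeterministically."""
    pages = [pg for pgs in tablespaces.values() for pg in pgs]
    random.Random(seed).shuffle(pages)            # stand-in for scheduling nondeterminism
    streams = [bytearray() for _ in range(n_streams)]
    for i, pg in enumerate(pages):
        streams[i % n_streams].extend(pg)
    return streams

def dedup_device_backup(tablespaces, n_streams):
    """DEDUP_DEVICE behavior: one table space per stream, pages in page order."""
    streams = [bytearray() for _ in range(n_streams)]
    for ts, pgs in tablespaces.items():
        for pg in pgs:
            streams[ts % n_streams].extend(pg)
    return streams

def chunk_digests(streams):
    digests = []
    for s in streams:
        for off in range(0, len(s), CHUNK_SIZE):
            digests.append(hashlib.sha256(bytes(s[off:off + CHUNK_SIZE])).hexdigest())
    return digests

def duplicate_fraction(first, second):
    seen = set(first)
    return sum(d in seen for d in second) / len(second)

if __name__ == "__main__":
    ts = make_tablespaces()
    mux1, mux2 = multiplexed_backup(ts, 4, seed=1), multiplexed_backup(ts, 4, seed=2)
    ded1, ded2 = dedup_device_backup(ts, 4), dedup_device_backup(ts, 4)
    print("multiplexed : %.0f%% of chunks already seen" %
          (100 * duplicate_fraction(chunk_digests(mux1), chunk_digests(mux2))))
    print("dedup_device: %.0f%% of chunks already seen" %
          (100 * duplicate_fraction(chunk_digests(ded1), chunk_digests(ded2))))

With the data unchanged, virtually every chunk of the second DEDUP_DEVICE-style backup is one the device has already stored, whereas the interleaved streams present chunk sequences the device has effectively never seen, which is the thrashing behavior described above.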
How DB2 incremental backups compare to data deduplicated backups

A DB2 incremental backup reads all of the pages in a table space but sends only the changed pages to the backup image. All of the large object (LOB) and long field data that exists in the table space is added to the backup image in its entirety, due to the lack of a fixed page format for that data. As a result, a DB2 incremental backup produces a backup object that is very similar in size to a data deduplicated backup image; essentially, only the new pages consume space. One advantage of a data deduplicated backup over an incremental backup is the way LOBs are handled: as previously mentioned, an incremental backup always includes LOB data in its entirety. One disadvantage of a data deduplicated backup is that it sends the entire table space's contents over the LAN/SAN to the deduplication device, consuming bandwidth that a DB2 incremental backup does not.
Compatibility of compression with data deduplication

There are several forms of compression available for DB2 DBAs to explore, namely:
- Row compression (also known as table compression)
- Adaptive compression (also known as page compression)
- DB2 backup compression
- TSM client compression

The previous rule of thumb was that any form of compression is incompatible with data deduplication. Testing has revealed that this assumption is false and that there are circumstances in which compression and data deduplication are completely compatible. The key factor to determine is this: if the data remains unchanged, does the physical binary representation of the data change between backups when compression is used?

For the first two items on the list above, row and adaptive compression, the answer is no. After the data is compressed on disk, the binary format of the data does not change between backups unless the data has been modified. This is referred to as static compression: as long as the data does not change, its representation remains the same. This type of compression is compatible with data deduplication, because the deduplication device can easily detect the repeating pattern.

For the other two forms of compression on the list, DB2 backup compression and TSM client compression, the answer is yes. These forms of compression are referred to as dynamic compression. Each time the database is backed up, the binary representation of the data may change depending on where in the data stream the data falls. Both compression techniques use a sliding window to detect patterns, and if the alignment of the window is not identical between backups, the pattern detection produces different compressed output, lowering the likelihood that the deduplication device can find a pattern match.
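The sliding-window effect can be demonstrated with any stream compressor. Listing 3 is an illustrative sketch using zlib (DB2 backup compression and TSM client compression are not zlib, but they behave similarly in this respect): the same block of data is compressed twice, once at its original position in a stream and once shifted by a single byte. The compressed bytes for the unchanged block differ, so a downstream deduplication device sees them as new data, whereas the uncompressed (or statically compressed, in-place) representation of the unchanged block is byte-identical in both streams.

Listing 3: How stream compression disturbs deduplication of unchanged data (illustrative only)

import zlib

def common_prefix_len(a, b):
    """Number of leading bytes that are identical in a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# An unchanged block of data (think: pages of a table space that were not modified).
block = b"".join(b"row %08d: some repeating customer data\n" % i for i in range(4096))

stream1 = block                  # first backup: block starts at offset 0 of the stream
stream2 = b"\x00" + block        # second backup: one extra byte shifts everything after it

# Dynamic (stream) compression: the same data compresses to different bytes when its
# position in the stream changes, so a deduplication device sees "new" chunks.
comp1, comp2 = zlib.compress(stream1), zlib.compress(stream2)
print("identical leading bytes in the two compressed streams:",
      common_prefix_len(comp1, comp2), "of", len(comp1))

# Without stream compression (or with static, in-place compression such as row or
# adaptive compression), the unchanged block is byte-identical in both backups,
# so its chunks still deduplicate.
print("unchanged block identical in both uncompressed streams:",
      stream1[-len(block):] == stream2[-len(block):])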
How to tune DB2 backups for data deduplication devices

The tuning parameters used to make a DB2 backup perform optimally to a data deduplication device are somewhat different from those used for a backup to a non-deduplication device. Specifically, deduplication devices perform better with larger buffer sizes, for example 8192 or 16384, and with more target sessions. The additional target sessions are needed because, with DEDUP_DEVICE, the DB2 backup no longer multiplexes data across the target devices but instead sends each target device the data from a single table space. (The default behavior of DB2 backup is optimized for throughput: it multiplexes the data from all table spaces across all sessions to TSM, which can result in a poor factoring ratio on the deduplication device.) To obtain the optimal deduplication ratio, lower the number of sessions and the parallelism; however, this comes at the cost of a longer elapsed time for the DB2 backup to complete.

Other basic rules of thumb are:
- Change logarchmeth1 to ensure that archived logs are not stored on a data deduplication device.
- Increase utilheapsz to at least 50000 (the backup buffers are allocated from the utility heap).

Here is an example DB2 backup invocation using some of these recommendations:

db2 backup db databasename use tsm open 10 sessions dedup_device buffer 16384

Note: This example operation requires roughly 1.3 GB of memory (the BUFFER value is expressed in 4 KB pages, so each buffer is 64 MB). If that is too much, use buffer 8192 instead of buffer 16384.

Conclusion

Data deduplication is invaluable in the quest to better manage and store backups because of its ability to reduce redundant data. As of DB2 LUW Version 9.7 Fix Pack 4, DB2 backups have been optimized for deduplication devices. Users who are considering data deduplication as part of their backup strategy will find it well integrated with the DB2 backup utility, and users who are already using deduplication devices should see a shorter backup window and improved deduplication results when they exploit DB2's integrated support for data deduplication devices.

Acknowledgements

I would like to personally thank both Roger Sanders (EMC) and Robert Causley (IBM) for their assistance in creating this document.