Data Deduplication in a Virtual Tape Library Environment
Mathias Defiebre, IBM Lab Services (mathias.defiebre@de.ibm.com)
STG Technical Conferences 2010
Agenda
- Data Deduplication Overview
- Data Deduplication Theory
- Data Deduplication Approaches in Practice
- Data Deduplication Considerations and Value Proposition
- TS7650 ProtecTIER Deduplication Gateway
- TS7650 ProtecTIER Deduplication Appliance Series
- A Look into the Future
- Links
Data Deduplication Overview
Data Deduplication Overview
- With data deduplication, repeated instances of identical data are identified and stored only once
  - Identical data is referenced to a single instance
  - Saves storage capacity and network bandwidth
- Data deduplication is a feature of a storage device or an application
  - VTL, NAS box, backup application
- Data deduplication requires an I/O protocol
  - FCP, iSCSI, CIFS, NFS, API, tape library emulation
- Data deduplication does not always make sense
  - Not all data can be deduplicated well
  - May interfere or work together with other technologies such as compression and encryption, or with data security requirements
- Data deduplication is transparent
  - To end users and applications
Data Deduplication Theory
Data Deduplication Process (simplified)
- Data object / stream: a data object or stream is subject to deduplication
- (1) Data chunking: the data object is split into chunks (fixed or variable size), e.g. A B C D A E F F D
- (2) Identity determination: for each chunk an identity characteristic is determined
- (3) Determining duplicates
  - (3a) Identical chunks are referenced (pointer, reference)
  - (3b) Non-identical chunks (single instances) are stored once, e.g. A B C D E F
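Sketch (illustrative only): the three steps above applied to the chunk sequence from the slide. The function name `deduplicate` and the pluggable `identity` parameter are assumptions made for this example; here the identity characteristic is simply the chunk itself.

```python
def deduplicate(chunks, identity=lambda chunk: chunk):
    """Return (unique chunks, one reference per incoming chunk)."""
    store = []   # (3b) single instances, in order of first appearance
    index = {}   # identity characteristic -> position in store
    refs = []    # (3a) references to stored chunks
    for chunk in chunks:
        key = identity(chunk)          # (2) identity determination
        if key not in index:
            index[key] = len(store)
            store.append(chunk)
        refs.append(index[key])
    return store, refs


# (1) the data stream, already split into chunks
store, refs = deduplicate(list("ABCDAEFFD"))
print(store)  # ['A', 'B', 'C', 'D', 'E', 'F']
print(refs)   # [0, 1, 2, 3, 0, 4, 5, 5, 3]
```

Nine incoming chunks are reduced to six stored chunks; the original stream can be rebuilt from `refs`.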
Methods for Data Chunking
1. File based
   - One chunk is one file; most appropriate for file systems
2. Block based
   - Data object is chunked into blocks of fixed or variable size
   - Used by block storage devices
3. Format aware (content aware)
   - Understands explicit data formats and chunks data objects according to the format
   - Example: breaking a PowerPoint presentation into separate slides
4. Format agnostic (content agnostic)
   - Chunking is based on an algorithm that looks for logical breaks or similar elements within a data object/stream
The chunking method influences the dedupe ratio
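Sketch (illustrative only): why the chunking method influences the dedupe ratio. Fixed-size blocks are compared with a simple format-agnostic chunker that cuts wherever a checksum over the last few bytes hits a pattern. The checksum, window size and mask are arbitrary assumptions for this example and are not taken from any product.

```python
import random


def fixed_chunks(data, size=64):
    """Block-based chunking with a fixed block size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=8, mask=0x1F):
    """Cut after every position where the sum of the last `window`
    bytes matches a bit pattern, so boundaries follow the content."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if sum(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(chunker, a, b):
    """Number of distinct chunks that both streams have in common."""
    return len(set(chunker(a)) & set(chunker(b)))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + original   # same data with one byte inserted in front

print("fixed size:     ", shared(fixed_chunks, original, shifted))
print("content defined:", shared(content_defined_chunks, original, shifted))
```

A single inserted byte shifts every fixed-size block, while content-defined boundaries realign after the insertion, so most chunks are still recognized as duplicates.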
Methods for Determining Duplicates
[Figure: chunk stream A B C D A E F F D; unique chunks A B C D E F]
1. Hashing
   - Calculate a hash (MD5, SHA-256) for each data chunk
   - Compare the hash with the hashes of existing data
   - An identical hash means most likely identical data
   - Hash collision: identical hash but non-identical data
     - Must be caught through a secondary comparison (additional metadata, second hash method, additional binary comparison)
2. Binary Comparison
   - Compare all bits of similar chunks
3. Delta Differencing
   - Computes a delta between two similar chunks of data, where one chunk is the baseline and the second is the delta
   - Since each delta is unique, there is no possibility of collision
   - To reconstruct the original chunk, the delta(s) have to be re-applied to the baseline chunk
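Sketch (illustrative only): hashing combined with a secondary binary comparison. To make collisions actually happen, the example uses a deliberately weak 1-byte hash; `weak_hash` and `CollisionSafeStore` are hypothetical names for this illustration.

```python
import hashlib


def weak_hash(chunk):
    """Deliberately weak hash (first byte of SHA-256): only 256 values,
    so different chunks are certain to collide in this demo."""
    return hashlib.sha256(chunk).digest()[0]


class CollisionSafeStore:
    def __init__(self):
        self.buckets = {}   # hash value -> list of stored chunks

    def add(self, chunk):
        """Return True if the chunk is a duplicate, False if stored as new."""
        candidates = self.buckets.setdefault(weak_hash(chunk), [])
        for existing in candidates:
            if existing == chunk:      # secondary binary comparison
                return True
        candidates.append(chunk)       # same hash, different data: keep both
        return False


store = CollisionSafeStore()
chunks = [str(i).encode() for i in range(1000)]
first_pass = [store.add(c) for c in chunks]    # every chunk is new
second_pass = [store.add(c) for c in chunks]   # every chunk is a duplicate
print(any(first_pass), all(second_pass), len(store.buckets))
```

With 1000 distinct chunks and at most 256 hash values, many chunks share a hash, yet none is mistaken for a duplicate because the bytes are compared before a reference is made.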
Data Deduplication Architectures
[Figure: Client, LAN, Server, LAN or SAN, Storage Device]
- Client-side
  + Reduces load on server
  + Reduces bandwidth on LAN
  - Adds load to client
  - No cross-correlation among multiple clients
- Server-side
  + Allows cross-correlation among multiple clients
  - Adds load to server
- Storage-side
  + Transparent to clients and servers
  + Reduces load on server and clients
  - Adds load to storage device
Data Deduplication Processing Time
- In-line: data is deduplicated before it is actually stored
  + Requires less storage capacity
  - Potential decrease of I/O performance
- Post-processing: data is first stored and deduplicated later in the background
  + Better performance expected
  - Requires more storage capacity to temporarily store the data
  - Data is written, read and written again, thus more I/O intensive
  - Deduplication window must be coordinated with the backup window
- Combination of in-line and post-processing
  - In-line as long as performance can be satisfied, then switch to post-processing
Data Deduplication Approaches in Practice
Practical Approaches Overview
- Practical approaches combine
  - a chunking method (fixed/variable block size, format aware, format agnostic)
  - a method for determining/checking identity (hashing, delta diff, binary diff)
- Common practical approaches
  - Hash based: fixed/variable block size chunking with hashing
  - Content aware: format-aware chunking with delta diff
  - HyperFactor: format-agnostic chunking with binary diff
Hash Based Approach
1. Slice data into chunks (fixed or variable): A B C D E
2. Generate a hash per chunk: Ah Bh Ch Dh Eh
3. Compare the hashes with the hash table (hash value, storage locations, object references)
4. For identical hashes store a reference; otherwise store the chunk and update the hash table
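Sketch (illustrative only): the four steps above as a tiny in-memory store with a hash table, storage locations and object references. The class `HashDedupStore`, the 8-byte chunk size and the choice of SHA-1 are assumptions for this example; note that it trusts the hash and omits the collision handling a real system needs.

```python
import hashlib


class HashDedupStore:
    def __init__(self, chunk_size=8):
        self.chunk_size = chunk_size
        self.chunks = []       # storage locations: index -> chunk bytes
        self.hash_table = {}   # hash value -> storage location
        self.objects = {}      # object name -> list of storage locations

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), self.chunk_size):      # step 1
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha1(chunk).digest()           # step 2
            location = self.hash_table.get(digest)          # step 3
            if location is None:                            # step 4: new chunk
                location = len(self.chunks)
                self.chunks.append(chunk)
                self.hash_table[digest] = location
            refs.append(location)                           # step 4: reference
        self.objects[name] = refs

    def read(self, name):
        return b"".join(self.chunks[loc] for loc in self.objects[name])


store = HashDedupStore()
monday = b"AAAAAAAABBBBBBBBCCCCCCCC"
tuesday = b"AAAAAAAABBBBBBBBDDDDDDDD"
store.write("monday", monday)
store.write("tuesday", tuesday)
print(len(store.chunks), store.objects)
```

Six chunks are written but only four are stored; both objects can still be read back in full.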
Assessment for Hash Based Approach
- Hash collisions must be handled
  - More overhead, especially for in-line deduplication
- Requires a hash table to store hashes for all chunks
  - Hash table grows with data volume
  - Hash table must be quickly searchable and accessible
  - A growing hash table may become a performance bottleneck (doesn't fit into RAM)
  - Scalability issues
- Hash table must be protected
  - One copy might not be sufficient
- Example: chunk size of 8 KB, each hash is 20 bytes long
  - With a 1 TB repository:
    - The repository has ~134,000,000 chunks of 8 KB each
    - A pointer scheme is needed to reference inside 1 TB
    - Hash table requires ~2.5 GB of memory: no issue
  - With a 100 TB repository:
    - Hash table requires ~250 GB of memory: performance!!!
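Sketch: the sizing arithmetic from the example above. It assumes binary units (1 TB = 2^40 bytes, 8 KB = 8192 bytes), which reproduces the slide's ~134,000,000 chunks, and it counts only the 20-byte hashes, not the pointers. The function name is hypothetical.

```python
def hash_table_size(repository_bytes, chunk_bytes=8 * 1024, hash_bytes=20):
    """Return (number of chunks, bytes needed for the hashes alone)."""
    chunks = repository_bytes // chunk_bytes
    return chunks, chunks * hash_bytes


TB = 1024 ** 4
GB = 1024 ** 3

for size_tb in (1, 100):
    chunks, table_bytes = hash_table_size(size_tb * TB)
    print(f"{size_tb:>3} TB repository: {chunks:,} chunks, "
          f"hash table {table_bytes / GB:.1f} GB")
```

For 1 TB this gives 134,217,728 chunks and 2.5 GB; for 100 TB the table grows to 250 GB, which no longer fits into the memory of a typical server.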
HyperFactor Approach
- HyperFactor has two indexes
  - HyperFactor Index
  - Restore Index
- HyperFactor Index: used for backup
  - Used to filter out similar elements from the incoming data stream
  - Fixed size of 4 GB, memory resident, synced to disk (repository) periodically
  - Can be restored from the repository if lost
  - References up to 1 PB of physical data elements stored in the repository
- Restore Index: used for restore
  - Includes references to physical data elements
  - Dynamic index, growing
  - Stored on disk (repository)
HyperFactor Approach
1. Look through the data stream for similarity and filter similar elements
   - Using the HyperFactor Index (fixed size 4 GB)
2. Read the elements that are most similar from storage
   - Using the Restore Index
3. Binary compare each element in the stream with the element(s) read from storage
4. Identical data is referenced by a new additional entry in the Restore Index; unique data is stored in the repository
[Figure: new data stream split into Element A, Element B, Element C]
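Sketch (illustrative only, NOT the HyperFactor algorithm): the slides do not say how similarity is detected, so this toy substitutes a generic "smallest window hashes" signature for the similarity index and Python's `difflib.SequenceMatcher` for the binary comparison. All names (`signature`, `SimilarityStore`) and parameters are assumptions; the point is only the order of the steps: find a similar element first, then compare bytes.

```python
import hashlib
from difflib import SequenceMatcher


def signature(element, width=8, keep=16):
    """Hypothetical similarity signature: the `keep` smallest hashes
    of all `width`-byte windows of the element."""
    hashes = {hashlib.sha1(element[i:i + width]).digest()[:8]
              for i in range(len(element) - width + 1)}
    return sorted(hashes)[:keep]


class SimilarityStore:
    def __init__(self):
        self.similarity_index = {}   # stand-in for the memory-resident index
        self.repository = []         # stand-in for elements stored on disk

    def most_similar(self, element):
        """Step 1: vote for the stored element sharing most signature hashes."""
        votes = {}
        for h in signature(element):
            if h in self.similarity_index:
                element_id = self.similarity_index[h]
                votes[element_id] = votes.get(element_id, 0) + 1
        return max(votes, key=votes.get) if votes else None

    def add(self, element):
        """Return how many bytes of `element` were not found in the
        most similar stored element (all of them if none is similar)."""
        unmatched = len(element)
        base_id = self.most_similar(element)
        if base_id is not None:
            base = self.repository[base_id]                     # step 2
            matcher = SequenceMatcher(None, base, element, autojunk=False)
            unmatched -= sum(m.size for m in matcher.get_matching_blocks())  # step 3
        # Toy simplification: the whole element is kept; a real system
        # would store only the unmatched bytes plus references (step 4).
        self.repository.append(element)
        for h in signature(element):
            self.similarity_index.setdefault(h, len(self.repository) - 1)
        return unmatched


store = SimilarityStore()
first = bytes(range(200))
second = first[:100] + b"CHANGED!" + first[100:]
print(store.add(first))    # nothing similar stored yet
print(store.add(second))   # only the inserted bytes remain unmatched
```

Unlike the hash-based store, the index here holds a small fixed number of entries per element rather than one hash per chunk, and duplicates are confirmed by comparing bytes, so there is no collision case to handle.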
Assessment for HyperFactor
- No hash table required
  - No scalability issues
  - 4 GB index references up to 1 PB of physical data elements
- No dependency on data format and application
  - Very flexible, no ongoing development effort due to format changes
- HyperFactor index always fits into memory
  - Enables enterprise-class, high-performance in-line deduplication
- Eliminates the phenomenon of missed factoring opportunities
  - Looks for similarity between data, not for exact chunk matches
Data Deduplication Considerations and Value Proposition
Not all Data Dedupe well
- High dedupe ratio expected for...
  - Structured data
  - Database files
  - E-mails
- Low dedupe ratio expected for...
  - Unstructured data
  - Images
  - Videos
  - Voice data
  - Seismic data
  - Large collections of small files
- Some technologies influence the dedupe ratio
Technologies influencing Data Deduplication
- Compression
  - Archives
    - *.zip (Phil Katz zip: pkzip, pkunzip)
    - *.gz (GNU zip: gzip, gzip -d)
- Compaction
  - Lotus Notes database
- Multiplexing
  - Multiple backup streams to a single tape drive
    - Veritas Backup Exec
    - Computer Associates ARCserve
    - Oracle RMAN multiplexing of backup sets
- Encryption
The above technologies change the data stream, making identical data non-identical!
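Sketch (illustrative only): how compression before deduplication hides redundancy. Two made-up backups differ in a single record; their raw 64-byte chunks are compared with the chunks of their `zlib`-compressed forms. The record layout and chunk size are assumptions for this example.

```python
import zlib


def chunk_set(data, size=64):
    """Distinct fixed-size chunks of a byte stream."""
    return {data[i:i + size] for i in range(0, len(data), size)}


def shared_chunks(a, b):
    return len(chunk_set(a) & chunk_set(b))


# Two backups of 2000 records (24 bytes each); only the first record differs.
backup_1 = b"".join(f"record {i:06d} status=OK\n".encode() for i in range(2000))
backup_2 = b"RECORD 999999 status=NO\n" + backup_1[24:]

raw_shared = shared_chunks(backup_1, backup_2)
compressed_shared = shared_chunks(zlib.compress(backup_1), zlib.compress(backup_2))

print("chunks per backup:           ", len(chunk_set(backup_1)))
print("shared chunks, raw:          ", raw_shared)
print("shared chunks, compressed:   ", compressed_shared)
```

On the raw data all chunks except the first are duplicates. After compression the one changed record alters the encoded stream, so far fewer (typically no) chunks match.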
Example: Data Deduplication and Encryption
Data encryption prior to deduplication processing can subvert data reduction
[Figure: three data sources hold the same "Important text". Data source 1 uses no encryption and sends "Important text"; data source 2 encrypts with key 1 and sends "txpt tnatroemi"; data source 3 encrypts with key 2 and sends "te tarpixtntom". After data deduplication the data store holds all three versions. Label: "Compression possible".]
1. Three data sources have the same text file
2. After encryption, the text files do not match
3. Deduplication processing does not detect the redundancy
4. Text files are stored without data reduction
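Sketch (illustrative only): the example above in code. `toy_encrypt` is a made-up XOR stream cipher used purely for demonstration; it is not secure and not what any product uses. The identity check is a SHA-256 hash per object.

```python
import hashlib


def toy_encrypt(plaintext, key):
    """Toy XOR stream cipher for illustration only; NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))


def stored_instances(objects):
    """Number of objects a hash-based deduplicating store would keep."""
    return len({hashlib.sha256(obj).digest() for obj in objects})


text = b"Important text"

unencrypted = [text, text, text]
as_on_slide = [
    text,                          # data source 1: no encryption
    toy_encrypt(text, b"key 1"),   # data source 2: encryption key 1
    toy_encrypt(text, b"key 2"),   # data source 3: encryption key 2
]

print(stored_instances(unencrypted))   # 1
print(stored_instances(as_on_slide))   # 3
```

The same text is stored once when it arrives unencrypted, but three times when two of the sources encrypt it with different keys first.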
Dedupe Value Proposition & Potential Drawbacks
- Data Deduplication Value Proposition
  - Disk storage savings
  - Network bandwidth savings
  - Energy savings (Green IT)
  - Better utilization of existing floor and rack space
  - Increased scalability
- Data Deduplication Potential Drawbacks
  - Loss of one single data chunk may cause loss of multiple files
  - Repository or index required to store metadata
    - must be protected
    - requires additional storage capacity
    - may slow down performance
  - Loss of the entire index means loss of all data
TS7650 ProtecTIER Gateway
ProtecTIER Architecture Overview
[Figure: a backup server connects over FC to the ProtecTIER server (ProtecTIER application), which presents a virtual tape library ("It's a tape library and drives") and writes to a disk storage system]
- Linux server-based application running on a System x server
- Emulates a tape library unit, including drives, cartridges, and robotics
- Uses a Fibre Channel (FC) attached disk storage system as the backup medium
- Has a built-in deduplication engine (HyperFactor)
[Figure: backup servers send a new data stream through an FC switch to the ProtecTIER server (virtual tape emulation). HyperFactor filters out similar elements using the memory-resident index (4 GB, may contain predefined elements), reads similar elements of the existing data from the disk arrays and compares them, references identical elements in the restore index, and stores unique elements (the filtered data) on storage.]
Dedupe Ratio depends on...
- Data change rate: the percentage of data in the incoming backup data stream that is new to ProtecTIER and not already physically stored in the repository
- Backup policies
  - Number of full backups
  - Number of incremental backups
  - Backup frequency
  - Data retention period
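Sketch (a simplified model, not an IBM sizing formula): how change rate and retention interact. It assumes only full backups of equal size are retained, that the first one is stored completely, and that each later one adds just its changed fraction. The function name and the sample values are assumptions.

```python
def dedupe_ratio(retained_fulls, change_rate):
    """Nominal (represented) capacity divided by physical capacity,
    measured in units of one full backup."""
    nominal = retained_fulls
    physical = 1 + (retained_fulls - 1) * change_rate
    return nominal / physical


for change_rate in (0.02, 0.05, 0.20):
    for retained in (7, 30):
        ratio = dedupe_ratio(retained, change_rate)
        print(f"change rate {change_rate:4.0%}, {retained:2d} fulls retained: "
              f"{ratio:4.1f}:1")
```

Under these assumptions, 30 retained fulls at a 5% change rate give roughly 12:1, while a 20% change rate drops the same policy to about 4:1; a longer retention period raises the ratio, a higher change rate lowers it.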
ProtecTIER Native Replication
- Key new feature (R2.3)
- IP replication with significant bandwidth reduction
[Figure: at the primary site, backup servers write to ProtecTIER gateways (represented capacity vs. physical capacity); the data is replicated over IP to the secondary site, which has its own backup server and ProtecTIER gateway (represented capacity vs. physical capacity)]
TS7650 ProtecTIER Appliance Series
TS7650 Appliance Series
- Standalone, DS4700 with 32 spindles, 450 GB (2 drawers): 7 TB, 100 MB/sec
- Standalone, DS4700 with 64 spindles, 450 GB (4 drawers): 18 TB, 250 MB/sec
- Standalone, DS4700 with 128 spindles, 450 GB (8 drawers): 36 TB, 450 MB/sec
- Clustered, DS4700 with 128 spindles, 450 GB (8 drawers): 36 TB, 450 MB/sec
- Further values shown on the slide: 500 MB/sec; 6.3 TB, 15.8 TB, 31.5 TB, 31.5 TB
- Appliances can be upgraded one step forward
[Figure: rack layouts for each configuration in an F05 base frame: x3850 M2 server (3 x 6-core, 24 GB RAM), DS4700 with EXP810 expansion drawers, TSSC or 1U empty space, Ethernet switches (1U) and WTI switch in the clustered configuration, power: base or FC1903]
A Look into the Future...
A Look into the Future
Some observations from the VTL and dedupe market
- Vendors converge to a common point
  - Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, library emulation)
- Replication is becoming more and more of a commodity
  - Replication benefits from deduped data
- Intelligent storage devices will be tightly integrated with 3rd party backup applications
  - e.g. controlling & monitoring replication from a backup application
Links
Links I
- TS7650G ProtecTIER Deduplication Gateway
  http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html
- TS7650 ProtecTIER Deduplication Appliance
  http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html
- Whitepaper: IBM Data Deduplication Strategy and Operations
  http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/ibm+tivoli+storage+manager+v6.1+data+deduplication+strategy+and+operations
- Redbook: The IBM System Storage TS7650G and TS7650 ProtecTIER Servers
  http://w3.itso.ibm.com/redpieces/abstracts/sg247652.html?open
Links II
TS7650G ProtecTIER Implementation Workshops
- IBMer:
  https://w3-01.sso.ibm.com/learning/lms/saba/web/main/goto/learningactivity?coursenum=ss92e1de&deeplinkredirect=false
- Business Partner:
  http://www-304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType=course_description&includenotscheduled=y&coursecode=ss92e1DE
Storage Competence at the Mainz Location
IBM Germany's fourth largest location offers you a broad portfolio of IBM System Storage Services
- IBM Dynamic Infrastructure Leadership Center for Information Infrastructure
  - Business, Channel & Skill Enablement & Training
  - DI Education & Briefings
  - Demos & Showcases
  - IT Transformation Roadmaps & Workshops
  - BP Certification
- IBM European Storage Competence Center & Systems Lab Europe
  - Business, Channel & Skill Enablement & Training
  - End-to-end client support
  - Workshops
  - Solution Design
  - Lab Services
  - Customer Relationship Management
- IBM Executive Briefing Center & TMCC
  - Business, Channel & Skill Enablement & Training
  - Customer and Group Briefings
  - Product & SW Demos
  - Integrated Solution Demos
  - Exhibition Support & Organization
- IBM STG Europe Storage Software Development
  - Software Development
  - Storage & Tape
  - Linux
  - Mainframe
  - File Systems
IBM System Storage Solutions Center of Excellence
We offer technical support from the planning phase through well after installation
- Our Services
  - Client Briefings & Education
  - Systems Lab Services & Training
  - Customized Workshops
  - System Storage Demos
  - Advanced Technical Support
  - Solution Design
  - Proof of Concepts
  - Benchmarks
  - Product Field Engineering
- Our Expertise
  - Skilled technical storage experts covering the whole IBM System Storage portfolio
  - Information Infrastructure: Compliance, Availability, Retention, Security
  - HW / SW & Performance
- Our Systems Lab Europe
  - 1500 sqm lab space
  - IBM & heterogeneous hardware
Disclaimer I
Copyright 2009 by International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.
The performance data contained herein were obtained in a controlled, isolated environment. Results obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. These values do not constitute a guarantee of performance. The use of this information or the implementation of any of the techniques discussed herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into their operating environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.
Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This information could include technical inaccuracies or typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe IBM's intellectual property rights may be used instead. It is the user's responsibility to evaluate and verify the operation of any non-IBM product, program or service.
Disclaimer II
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT.
IBM shall have no responsibility to update this information. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785, U.S.A.
Trademarks
The following terms are trademarks or registered trademarks of the IBM Corporation in either the United States, other countries or both: IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000, z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli, iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON, ProtecTIER, XIV.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Other company, product, and service names mentioned may be trademarks or registered trademarks of their respective companies.