Data Deduplication in a Virtual Tape Library Environment




Data Deduplication in a Virtual Tape Library Environment
Mathias Defiebre, IBM Lab Services
mathias.defiebre@de.ibm.com
STG Technical Conferences 2010

Agenda
- Data Deduplication Overview
- Data Deduplication Theory
- Data Deduplication Approaches in Practice
- Data Deduplication Considerations and Value Proposition
- TS7650 ProtecTIER Deduplication Gateway
- TS7650 ProtecTIER Deduplication Appliance Series
- A Look into the Future
- Links

Data Deduplication Overview

Data Deduplication Overview
- With data deduplication, repeated instances of identical data are identified and stored only once
  - Identical data is referenced to a single instance
  - Saves storage capacity and network bandwidth
- Data deduplication is a feature of a storage device or an application
  - VTL, NAS box, backup application
- Data deduplication requires an I/O protocol
  - FCP, iSCSI, CIFS, NFS, API, tape library emulation
- Data deduplication does not always make sense
  - Not all data can be deduplicated well
  - May interfere or work together with other technologies such as compression or encryption, or with data security requirements
- Data deduplication is transparent
  - To end users and applications

Data Deduplication Theory

Data Deduplication Process (simplified)
- Data object / stream: a data object or stream is subject to deduplication
- (1) Data chunking: the data object is split into chunks of fixed or variable size
  - Example chunk sequence: A B C D A E F F D
- (2) Identity determination: for each chunk an identity characteristic is determined
- (3) Determining duplicates
  - (3a) Identical chunks are referenced (pointer, reference)
  - (3b) Non-identical chunks (single instances) are stored once
  - Stored chunks in the example: A B C D E F
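The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: the 4-byte fixed chunk size, the use of SHA-256 as the identity characteristic, and the names `deduplicate` and `reconstruct` are assumptions made for the example.

```python
import hashlib

def split_fixed(data: bytes, size: int) -> list:
    """Step 1: split the stream into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def deduplicate(data: bytes, chunk_size: int = 4):
    """Return (store, recipe): unique chunks plus the ordered list of references."""
    store = {}    # identity -> chunk, each unique chunk stored once (step 3b)
    recipe = []   # ordered references that describe the original stream (step 3a)
    for chunk in split_fixed(data, chunk_size):
        identity = hashlib.sha256(chunk).hexdigest()   # step 2: identity characteristic
        if identity not in store:
            store[identity] = chunk
        recipe.append(identity)
    return store, recipe

def reconstruct(store: dict, recipe: list) -> bytes:
    """Rebuild the original stream by following the references."""
    return b"".join(store[identity] for identity in recipe)

# The slide's chunk sequence A B C D A E F F D, with 4 bytes per chunk
stream = b"AAAABBBBCCCCDDDDAAAAEEEEFFFFFFFFDDDD"
store, recipe = deduplicate(stream)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

Nine chunks are referenced but only six are stored, matching the A B C D E F result on the slide.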

Methods for Data Chunking
1. File based
   - One chunk is one file; most appropriate for file systems
2. Block based
   - The data object is chunked into blocks of fixed or variable size
   - Used by block storage devices
3. Format aware (content aware)
   - Understands explicit data formats and chunks data objects according to the format
   - Example: breaking a PowerPoint presentation into separate slides
4. Format agnostic (content agnostic)
   - Chunking is based on an algorithm that looks for logical breaks or similar elements within a data object/stream
The chunking method influences the dedupe ratio
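To see why the chunking method influences the dedupe ratio, the sketch below compares fixed-size blocks with one possible format-agnostic scheme, content-defined chunking, after a single byte is inserted at the front of a stream. The window size, the boundary mask, and the use of CRC-32 as the window hash are assumptions chosen for brevity; real products use their own algorithms.

```python
import random
import zlib

WINDOW = 16   # bytes inspected when deciding on a boundary (assumed value)
MASK = 0x3F   # cut when the low 6 bits of the window hash are zero (~64-byte average)

def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Block based: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes) -> list:
    """Format agnostic: cut wherever the hash of the preceding window matches a pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])   # remainder after the last boundary
    return chunks

def shared(a: list, b: list) -> int:
    """Number of distinct chunks that occur in both chunk lists."""
    return len(set(a) & set(b))

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"X" + original   # one byte inserted at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
cdc_original = content_defined_chunks(original)
cdc_shared = shared(cdc_original, content_defined_chunks(shifted))
print("fixed-size chunks shared:     ", fixed_shared)
print("content-defined chunks shared:", cdc_shared, "of", len(cdc_original))
```

With fixed-size blocks the insertion shifts every boundary, so the chunks no longer line up. The content-defined boundaries move with the data, so only the chunks at the start are affected.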

Methods for Determining Duplicates
Example chunk sequence: A B C D A E F F D (unique chunks: A B C D E F)
1. Hashing
   - Calculate a hash (MD5, SHA-256) for each data chunk
   - Compare the hash with the hashes of existing data
   - Identical hash means most likely identical data
   - Hash collision: identical hash but non-identical data
   - Data corruption caused by a collision must be prevented through a secondary comparison (additional metadata, second hash method, additional binary comparison)
2. Binary comparison
   - Compare all bits of similar chunks
3. Delta differencing
   - Computes a delta between two similar chunks of data, where one chunk is the baseline and the second is the delta
   - Since each delta is unique there is no possibility of collision
   - To reconstruct the original chunk, the delta(s) have to be re-applied to the baseline chunk
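Delta differencing can be illustrated with Python's standard `difflib` module. The sketch below is an assumption-laden toy: the delta format (a list of `copy` and `data` operations) and the function names `make_delta` and `apply_delta` are invented for the example and do not describe any vendor's on-disk format.

```python
from difflib import SequenceMatcher

def make_delta(baseline: bytes, target: bytes) -> list:
    """Describe `target` as copies from `baseline` plus literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))            # reuse baseline[i1:i2]
        elif j2 > j1:
            delta.append(("data", target[j1:j2]))     # bytes not found in the baseline
    return delta

def apply_delta(baseline: bytes, delta: list) -> bytes:
    """Reconstruct the target chunk by re-applying the delta to the baseline."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += baseline[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

baseline = b"The quick brown fox jumps over the lazy dog"
target = b"The quick brown fox jumps over the sleepy dog"
delta = make_delta(baseline, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print("bytes stored for the delta:", literal_bytes, "of", len(target))
```

Only the bytes that differ from the baseline need to be stored, and reconstruction requires the baseline chunk to remain available.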

Data Deduplication Architectures
Data path: Client -> (LAN) -> Server -> (LAN or SAN) -> Storage Device

Client-side
  + Reduces load on the server
  + Reduces bandwidth on the LAN
  - Adds load to the client
  - No cross-correlation among multiple clients
Server-side
  + Allows cross-correlation among multiple clients
  - Adds load to the server
Storage-side
  + Transparent to clients and servers
  + Reduces load on server and clients
  - Adds load to the storage device

Data Deduplication Processing Time
In-line: data is deduplicated before it is actually stored
  + Requires less storage capacity
  - Potential decrease of I/O performance
Post-processing: data is first stored and deduplicated later in the background
  + Better performance expected
  - Requires more storage capacity to temporarily store the data
  - Data is written, read and written again, and is thus more I/O intensive
  - Deduplication window must be coordinated with the backup window
Combination of in-line and post-processing
  In-line as long as performance can be satisfied, then switch to post-processing

Data Deduplication Approaches in Practice

Practical Approaches Overview
Practical approaches combine
- a chunking method
- a method for determining/checking identity

Common practical approaches

Chunking \ Identity check    Hashing       Delta Diff      Binary Diff
Fixed/Variable Block Size    Hash based
Format Aware                               Content Aware
Format Agnostic                                            HyperFactor

Hash Based Approach
1. Slice data into chunks (fixed or variable): A B C D E
2. Generate a hash per chunk: Ah Bh Ch Dh Eh
3. Compare the hashes with the hash table (hash value, storage locations, object references)
4. For identical hashes store a reference; otherwise store the chunk and update the hash table
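A hash table with the three columns named on the slide (hash value, storage location, object references) could look roughly like the following. The class and field names are hypothetical, SHA-1 is chosen only because it yields a 20-byte hash, and the byte comparison on a hash match is one of the secondary checks a real system might use.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Entry:
    location: int                                     # index of the chunk in the chunk store
    references: list = field(default_factory=list)    # (object name, chunk number) pairs

class HashDedupStore:
    """Toy hash-based deduplication store (illustrative names, not a product API)."""

    def __init__(self):
        self.chunks = []   # stands in for the storage device
        self.table = {}    # hash value -> Entry

    def write(self, name: str, data: bytes, chunk_size: int = 4) -> None:
        for number, start in enumerate(range(0, len(data), chunk_size)):
            chunk = data[start:start + chunk_size]      # step 1: slice
            digest = hashlib.sha1(chunk).digest()       # step 2: 20-byte hash
            entry = self.table.get(digest)              # step 3: look up the hash table
            if entry is None:                           # step 4: new chunk, store and index it
                self.chunks.append(chunk)
                entry = self.table[digest] = Entry(location=len(self.chunks) - 1)
            elif self.chunks[entry.location] != chunk:
                # secondary binary comparison guards against a hash collision
                raise RuntimeError("hash collision detected")
            entry.references.append((name, number))     # step 4: record the reference

store = HashDedupStore()
store.write("backup1", b"AAAABBBBCCCC")
store.write("backup2", b"AAAABBBBDDDD")
print(len(store.chunks), "chunks stored for 6 chunks written")
```

The second backup adds only one new chunk; the shared chunks gain a second object reference instead.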

Assessment for Hash Based Approach
- Hash collisions must be handled
  - More overhead, especially for in-line deduplication
- Requires a hash table to store hashes for all chunks
  - The hash table grows with the data volume
- The hash table must be quickly searchable and accessible
  - A growing hash table may become a performance bottleneck (doesn't fit into RAM)
  - Scalability issues
- The hash table must be protected
  - One copy might not be sufficient
- Example: chunk size of 8 KB, each hash is 20 bytes long
  - With a 1 TB repository:
    - The repository has ~134,000,000 chunks of 8 KB each
    - A pointer scheme is needed to reference inside 1 TB
    - The hash table requires ~2.5 GB of memory: no issue
  - With a 100 TB repository:
    - The hash table requires ~250 GB of memory: a performance problem
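The sizing example can be checked with a short calculation. The sketch assumes binary units (1 TB taken as 2^40 bytes, 8 KB as 8192 bytes), which is what reproduces the slide's ~134 million chunks, and it counts only the hashes themselves, not pointers or other metadata.

```python
CHUNK_SIZE = 8 * 1024   # 8 KB chunks
HASH_SIZE = 20          # bytes per hash
TIB = 1024 ** 4
GIB = 1024 ** 3

def hash_table_bytes(repository_bytes: int):
    """Return (number of chunks, bytes needed for one hash per chunk)."""
    chunks = repository_bytes // CHUNK_SIZE
    return chunks, chunks * HASH_SIZE

for size_tib in (1, 100):
    chunks, table = hash_table_bytes(size_tib * TIB)
    print(f"{size_tib:>3} TiB repository: {chunks:,} chunks, hash table {table / GIB:.1f} GiB")
```

This gives 134,217,728 chunks and 2.5 GiB of hashes for 1 TiB, and 250 GiB of hashes for 100 TiB, in line with the figures above.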

HyperFactor Approach
HyperFactor has two indexes: the HyperFactor Index and the Restore Index
- HyperFactor Index: used for backup
  - Used to filter out similar elements from the incoming data stream
  - Fixed size of 4 GB, memory resident, synced to disk (repository) periodically
  - Can be restored from the repository if lost
  - References up to 1 PB of physical data elements stored in the repository
- Restore Index: used for restore
  - Includes references to physical data elements
  - Dynamic index, growing
  - Stored on disk (repository)

HyperFactor Approach
1. Look through the new data stream for similarity and filter similar elements
   - Using the HyperFactor Index (fixed size 4 GB)
2. Read the elements that are most similar from storage
   - Using the Restore Index
3. Binary compare each element in the stream (Element A, Element B, Element C) with the element(s) read from storage
4. Identical data is referenced by a new additional entry in the Restore Index; unique data is stored in the repository

Assessment for HyperFactor
- No hash table required
  - No scalability issues
  - The 4 GB index references up to 1 PB of physical data elements
- No dependency on data format and application
  - Very flexible, no ongoing development effort due to format changes
- The HyperFactor index always fits into memory
  - Enables enterprise-class high-performance in-line deduplication
- Eliminates the phenomenon of missed factoring opportunities
  - Looks for similarity between data, not for exact chunk matches

Data Deduplication Considerations and Value Proposition

Not All Data Dedupes Well
- High dedupe ratio expected for...
  - Structured data
    - Database files
    - E-mails
- Low dedupe ratio expected for...
  - Unstructured data
    - Images
    - Videos
    - Voice data
    - Seismic data
  - Large collections of small files
- Some technologies influence the dedupe ratio

Technologies Influencing Data Deduplication
- Compression
  - Archives
    - *.zip (Phil Katz zip: pkzip, pkunzip)
    - *.gz (GNU zip: gzip, gzip -d)
- Compaction
  - Lotus Notes database
- Multiplexing
  - Multiple backup streams to a single tape drive
    - Veritas Backup Exec
    - Computer Associates ARCserve
    - Oracle RMAN multiplexing of backup sets
- Encryption
The above technologies change the data stream, making identical data non-identical!
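The effect of compression ahead of deduplication can be demonstrated with Python's `zlib` module. The sketch below changes a single byte in a synthetic data set and compares how many position-aligned 64-byte chunks remain identical before and after compression. The data set, the chunk size, and the helper name `shared_fraction` are assumptions for the demonstration.

```python
import zlib

def shared_fraction(a: bytes, b: bytes, size: int = 64) -> float:
    """Fraction of position-aligned fixed-size chunks that are identical in a and b."""
    chunks_a = [a[i:i + size] for i in range(0, len(a), size)]
    chunks_b = [b[i:i + size] for i in range(0, len(b), size)]
    same = sum(x == y for x, y in zip(chunks_a, chunks_b))
    return same / max(len(chunks_a), len(chunks_b))

# Two versions of the same data that differ in a single byte
original = b"".join(f"record {i}: value {i * 7}\n".encode() for i in range(5000))
modified = b"X" + original[1:]

raw = shared_fraction(original, modified)
packed = shared_fraction(zlib.compress(original), zlib.compress(modified))
print(f"identical chunks, uncompressed: {raw:.1%}")
print(f"identical chunks, compressed:   {packed:.1%}")
```

Uncompressed, only the chunk containing the changed byte differs. Once each version is compressed separately, the change can alter the encoded stream well beyond that position, so fewer chunks match.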

Example: Data Deduplication and Encryption
Data encryption prior to deduplication processing can subvert data reduction

- Data source 1: "Important text", no encryption -> stored as "Important text"
- Data source 2: "Important text", encryption key 1 -> stored as "txpt tnatroemi"
- Data source 3: "Important text", encryption key 2 -> stored as "te tarpixtntom"
- Data store after data deduplication: all three objects (compression possible)

1. Three data sources have the same text file
2. After encryption, the text files do not match
3. Deduplication processing does not detect the redundancy
4. The text files are stored without data reduction
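The same scenario can be reproduced in a few lines. The cipher below is a deliberately simple XOR keystream built from SHA-256; it is a toy written for this illustration, is not secure, and stands in for whatever real encryption a data source might apply. The key values are likewise made up.

```python
import hashlib

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream (toy cipher, for illustration only)."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

text = b"Important text"
stored = [
    text,                          # data source 1: no encryption
    toy_encrypt(text, b"key 1"),   # data source 2: encryption key 1
    toy_encrypt(text, b"key 2"),   # data source 3: encryption key 2
]

# Identity check as a hash-based deduplication engine would perform it
unique_without_encryption = len({hashlib.sha256(text).digest() for _ in range(3)})
unique_with_encryption = len({hashlib.sha256(item).digest() for item in stored})
print("objects stored without encryption:", unique_without_encryption)
print("objects stored with encryption:   ", unique_with_encryption)
```

Three unencrypted copies collapse to a single stored object, while the mixed set of plain and encrypted copies is stored three times.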

Dedupe Value Proposition & Potential Drawbacks
- Data deduplication value proposition
  - Disk storage savings
  - Network bandwidth savings
  - Energy savings (Green IT)
  - Better utilization of existing floor and rack space
  - Increased scalability
- Data deduplication potential drawbacks
  - Loss of one single data chunk may cause loss of multiple files
  - The repository or index required to store metadata
    - must be protected
    - requires additional storage capacity
    - may slow down performance
  - Loss of the entire index means loss of all data

TS7650 ProtecTIER Gateway

ProtecTIER Architecture Overview
[Figure: a backup server attached via FC to the ProtecTIER server (ProtecTIER application, virtual tape library), which is attached to a disk storage system. To the backup server, "it's a tape library and drives".]
- Linux server-based application running on a System x server
- Emulates a tape library unit, including drives, cartridges, and robotics
- Uses a Fibre Channel (FC) attached disk storage system as the backup medium
- Has a built-in deduplication engine (HyperFactor)

[Figure: HyperFactor data storage. Backup servers send a new data stream through the virtual tape emulation to the ProtecTIER server, which is connected via an FC switch to the disk arrays holding the existing data. Processing steps shown: filter out similar elements using the memory-resident index (4 GB, may contain predefined elements); read similar elements from storage and compare; reference identical elements in the restore index; store unique elements (the filtered data) on storage.]

Dedupe Ratio Depends On...
- Data change rate
  - The percentage of data in the incoming backup data stream that is new for ProtecTIER and not already stored physically in the repository
- Backup policies
  - Number of full backups
  - Number of incremental backups
  - Backup frequency
  - Data retention period
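A rough feel for how change rate and retention interact can be had from a deliberately simplified model. The sketch assumes full backups only, a constant change rate between consecutive backups, and no compression; the function name and the sample values are hypothetical and do not come from a ProtecTIER sizing tool.

```python
def dedupe_ratio(full_size_tb: float, change_rate: float, backups_retained: int) -> float:
    """Nominal (represented) capacity divided by physical capacity.

    Simplified model: the first full backup is stored completely, and every
    later full backup adds only the changed fraction of its data.
    """
    nominal = full_size_tb * backups_retained
    physical = full_size_tb + full_size_tb * change_rate * (backups_retained - 1)
    return nominal / physical

# Hypothetical policy: 10 TB full backups, 30 backups retained
for rate in (0.01, 0.05, 0.20):
    print(f"change rate {rate:.0%}: ratio {dedupe_ratio(10, rate, 30):.1f}:1")
```

In this model the ratio falls quickly as the change rate rises, and it grows with the number of backups retained, which is why the same system can show very different ratios under different backup policies.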

ProtecTIER Native Replication
- Key new feature in R2.3
- IP replication with significant bandwidth reduction
[Figure: a primary site (backup server, ProtecTIER Gateway) replicates over IP to a secondary site (backup server, ProtecTIER Gateway); each ProtecTIER system is shown with its represented capacity and its physical capacity.]

TS7650 ProtecTIER Appliance Series

TS7650 Appliance Series
- Standalone, DS4700, 32 spindles 450 GB (2 drawers): 7 TB, 100 MB/sec
- Standalone, DS4700, 64 spindles 450 GB (4 drawers): 18 TB, 250 MB/sec
- Standalone, DS4700, 128 spindles 450 GB (8 drawers): 36 TB, 450 MB/sec
- Clustered, DS4700, 128 spindles 450 GB (8 drawers): 36 TB, 450 MB/sec
- Appliances can be upgraded one step forward
[Figure: rack layouts of the configurations in F05 base frames, each with a DS4700, EXP810 expansion units, X3850 M2 servers (3 x 6core, 24GB RAM), power (Base, FC1903), and a TSSC or 1U empty space; the clustered layout adds Ethernet switches (1U) and a WTI switch. Further values shown in the figure: 500MB/sec; 6.3TB, 15.8TB, 31.5TB, 31.5TB.]

A Look into the Future...

A Look into the Future
Some observations from the VTL and dedupe market
- Vendors converge on a common point
  - Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, library emulation)
- Replication is becoming more and more of a commodity
  - Replication benefits from deduped data
- Intelligent storage devices will be tightly integrated with third-party backup applications
  - e.g. controlling and monitoring replication from a backup application

Links

Links I
- TS7650G ProtecTIER Deduplication Gateway
  http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html
- TS7650 ProtecTIER Deduplication Appliance
  http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html
- Whitepaper: IBM Data Deduplication Strategy and Operations
  http://www.ibm.com/developerworks/wikis/display/tivolistoragemana ger/ibm+tivoli+storage+manager+v6.1+data+deduplication+strate gy+and+operations
- Redbook: The IBM System Storage TS7650G and TS7650 ProtecTIER Servers
  http://w3.itso.ibm.com/redpieces/abstracts/sg247652.html?open

Links II
TS7650G ProtecTIER Implementation Workshops
- IBMer:
  https://w3-01.sso.ibm.com/learning/lms/saba/web/main/goto/learningactivity?c oursenum=ss92e1de&deeplinkredirect=false
- Business Partner:
  http://www- 304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType= course_description&includenotscheduled=y&coursecode=ss92e1 DE

Storage Competence at the Mainz Location
IBM Germany's fourth largest location offers you a broad portfolio of IBM System Storage Services
- IBM Dynamic Infrastructure Leadership Center for Information Infrastructure
  - Business, Channel & Skill Enablement & Training
  - DI Education & Briefings
  - Demos & Showcases
  - IT Transformation Roadmaps & Workshops
  - BP Certification
- IBM European Storage Competence Center & Systems Lab Europe
  - Business, Channel & Skill Enablement & Training
  - End-to-end client support
  - Workshops
  - Solution Design
  - Lab Services
  - Customer Relationship Management
- IBM Executive Briefing Center & TMCC
  - Business, Channel & Skill Enablement & Training
  - Customer and Group Briefings
  - Product & SW Demos
  - Integrated Solution Demos
  - Exhibition Support & Organization
- IBM STG Europe Storage Software Development
  - Software Development
  - Storage & Tape
  - Linux
  - Mainframe
  - File Systems

IBM System Storage Solutions Center of Excellence
We offer technical support from the planning phase through well after installation
- Our Services
  - Client Briefings & Education
  - Systems Lab Services & Training
  - Customized Workshops
  - System Storage Demos
  - Advanced Technical Support
  - Solution Design
  - Proof of Concepts
  - Benchmarks
  - Product Field Engineering
- Our Expertise
  - Skilled technical storage experts covering the whole IBM System Storage portfolio
  - Information Infrastructure: Compliance, Availability, Retention, Security
  - HW / SW & Performance
- Our Systems Lab Europe
  - 1500 sqm lab space
  - IBM & heterogeneous hardware

Thank You

Disclaimer I
Copyright 2009 by International Business Machines Corporation.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

The performance data contained herein were obtained in a controlled, isolated environment. Results obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. These values do not constitute a guarantee of performance. The use of this information or the implementation of any of the techniques discussed herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into their operating environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This information could include technical inaccuracies or typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe IBM's intellectual property rights may be used instead. It is the user's responsibility to evaluate and verify the operation of any non-IBM product, program or service.

Disclaimer II
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT.

IBM shall have no responsibility to update this information. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

Trademarks
The following terms are trademarks or registered trademarks of the IBM Corporation in either the United States, other countries or both:
IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000, z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli, iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON, ProtecTIER, XIV

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other company, product, and service names mentioned may be trademarks or registered trademarks of their respective companies.