UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

SNIA Legal Notice
- The material contained in this tutorial is copyrighted by the SNIA.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification.
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA Education Committee.
- Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

About the SNIA DMF
- This tutorial has been developed, reviewed and approved by members of the Data Management Forum (DMF).
- The DMF is an industry resource to those responsible for the accessibility and integrity of their organization's information.
- The DMF focuses on the technologies and trends related to Data Protection, ILM and long-term digital information retention.
- DMF Workgroups:
  - Data Protection Initiative (DPI): defining best practices for data protection and recovery technologies such as Backup, CDP, Data deduplication and VTL
  - Information Lifecycle Management Initiative (ILMI): developing, educating and promoting ILM practices, implementation methods, and benefits
  - Long-term Archive and Compliance Storage Initiative (LTACSI): addressing the challenges of retaining, securing, and preserving digital information for the long term

Abstract
Data deduplication is a capacity optimization technology that is being used to dramatically improve storage efficiency. This technical session will:
- Review various data deduplication methodologies
- Identify the factors that influence space savings
- Provide scenarios where data deduplication is used

Agenda
- Overview
- How Data Deduplication Works
- Scenarios
- Q & A

Space Reduction Terminology
- Data Deduplication is the replacement of multiple copies of data, at variable levels of granularity, with references to a shared copy in order to save storage space and/or bandwidth.
- Subfile Data Deduplication is a form of data deduplication that operates at a finer granularity than an entire file or data object.
- Single Instance Storage is a form of data deduplication that operates at a granularity of an entire file or data object.
- Compression is the encoding of data to reduce its storage requirement. Deduplicated data can also be compressed.
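The difference between the terms above can be illustrated with a small sketch. It is not taken from the tutorial: the 4 KB segment size, the SHA-256 fingerprints, and the function names are assumptions chosen for the example. It compares the bytes that would be kept for three files under single instance storage (whole-file granularity) and under subfile deduplication, and shows that compression is a separate, per-object encoding.

```python
import hashlib
import zlib

SEGMENT_SIZE = 4096  # assumed segment size for the subfile case


def single_instance_bytes(files):
    """Bytes kept when identical whole files are stored only once."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def subfile_bytes(files):
    """Bytes kept when identical fixed-size segments are stored only once."""
    unique = {}
    for f in files:
        for off in range(0, len(f), SEGMENT_SIZE):
            seg = f[off:off + SEGMENT_SIZE]
            unique[hashlib.sha256(seg).digest()] = len(seg)
    return sum(unique.values())


# Three files: an original, an exact copy, and a copy with its last segment changed.
base = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(4))
copy = base
edited = base[:-SEGMENT_SIZE] + bytes([9]) * SEGMENT_SIZE
files = [base, copy, edited]

print("logical bytes:        ", sum(len(f) for f in files))
print("single instance bytes:", single_instance_bytes(files))
print("subfile dedup bytes:  ", subfile_bytes(files))
print("zlib size of one file:", len(zlib.compress(base)))
```

In this toy data set the exact copy is eliminated by both methods, while the edited file only benefits from subfile deduplication, because all but one of its segments already exist.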

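Space Reduction Ratio & Percent
[Figure: Bytes In -> Capacity Optimization -> Bytes Out]

\[ \text{Ratio} = \frac{\text{Bytes In}}{\text{Bytes Out}} \qquad \% = \frac{\text{Bytes In} - \text{Bytes Out}}{\text{Bytes In}} \]

\[ \text{Ratio} = \frac{1}{1 - \%} \qquad \% = 1 - \frac{1}{\text{Ratio}} \]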

Space Reduction Ratio & Percent

Space Reduction Ratio    Space Reduction Percent
2:1                      1/2 = 50%
5:1                      4/5 = 80%
10:1                     9/10 = 90%
20:1                     19/20 = 95%
100:1                    99/100 = 99%
500:1                    499/500 = 99.8%

- Ratios can meaningfully be compared only under the same set of assumptions
- Relatively low space reduction ratios provide significant space savings
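The ratio and percent relationships can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical and percentages are expressed on a 0 to 100 scale.

```python
def ratio_to_percent(ratio):
    """Space reduction percent for a ratio such as 10 (meaning 10:1)."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1.0 - 1.0 / ratio) * 100.0


def percent_to_ratio(percent):
    """Space reduction ratio for a percent in the range [0, 100)."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


def space_reduction(bytes_in, bytes_out):
    """Return (ratio, percent) for measured input and output byte counts."""
    ratio = bytes_in / bytes_out
    percent = (bytes_in - bytes_out) / bytes_in * 100.0
    return ratio, percent


for r in (2, 5, 10, 20, 100, 500):
    print(f"{r}:1 -> {ratio_to_percent(r):.1f}%")
```

Going from 2:1 to 10:1 raises the savings from 50% to 90%, while going from 100:1 to 500:1 adds less than one percentage point, which is the point made by the second note above.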

Data Deduplication: How it Works
1. Evaluate Data
2. Identify Redundancy
3. Create or Update Reference Information
4. Store and/or Transmit Unique Data Once
5. Read and/or Reproduce Data
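The steps above can be sketched as a toy in-memory store. This is an illustration under stated assumptions, not the design of any product covered by the tutorial: it assumes fixed 4 KB segments and SHA-256 fingerprints, and the class and method names are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment size


class DedupStore:
    """Toy deduplicating store: unique segments plus per-object reference lists."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes (unique data, stored once)
        self.recipes = {}   # object name -> ordered list of fingerprints (references)

    def write(self, name, data):
        recipe = []
        for off in range(0, len(data), SEGMENT_SIZE):
            seg = data[off:off + SEGMENT_SIZE]        # evaluate data
            fp = hashlib.sha256(seg).hexdigest()
            if fp not in self.segments:               # identify redundancy
                self.segments[fp] = seg               # store unique data once
            recipe.append(fp)                         # create or update references
        self.recipes[name] = recipe

    def read(self, name):
        # Reproduce the original data by following the references in order.
        return b"".join(self.segments[fp] for fp in self.recipes[name])

    def physical_bytes(self):
        return sum(len(seg) for seg in self.segments.values())


store = DedupStore()
dump1 = b"A" * SEGMENT_SIZE + b"B" * SEGMENT_SIZE
dump2 = dump1 + b"C" * SEGMENT_SIZE
store.write("dump1", dump1)
store.write("dump2", dump2)

logical = len(dump1) + len(dump2)
print("logical bytes:", logical, "physical bytes:", store.physical_bytes())
```

The second dump adds only one new segment to the store; its first two segments are recorded as references to data written by the first dump.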

Data Deduplication Simplified
[Figure, built up over three slides: four successive dumps (Dump #1, Dump #2, Dump #3, Dump #4). Legend: new unique data; repeat data; pointer to unique data segment.]

Reading Deduplicated Data
[Figure: read path. Labels: Read Request; Read Fulfilled; Application/CIFS/NFS/VTL Interface; Deduplication Engine; Reconstitution/Verification; Metadata References; Deduplicated Data.]
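A read of deduplicated data has to follow the metadata references, reassemble the segments, and confirm that the result is intact. The sketch below shows one way this could look. The use of SHA-256 for both segment fingerprints and a whole-object digest is an assumption, and `reconstitute` is a hypothetical name.

```python
import hashlib


def fingerprint(data):
    return hashlib.sha256(data).hexdigest()


def reconstitute(recipe, segments, expected_digest):
    """Rebuild an object from its references and verify it before returning it."""
    parts = []
    for fp in recipe:
        seg = segments[fp]               # follow the metadata reference
        if fingerprint(seg) != fp:       # verify each stored segment
            raise ValueError("segment does not match its fingerprint")
        parts.append(seg)
    data = b"".join(parts)
    if fingerprint(data) != expected_digest:  # verify the whole object
        raise ValueError("reconstituted object does not match its digest")
    return data


seg_a, seg_b = b"hello ", b"world "
fp_a, fp_b = fingerprint(seg_a), fingerprint(seg_b)
segments = {fp_a: seg_a, fp_b: seg_b}    # each unique segment stored once
recipe = [fp_a, fp_b, fp_a]              # the object repeats segment A
digest = fingerprint(seg_a + seg_b + seg_a)

restored = reconstitute(recipe, segments, digest)
print(restored)
```

Verifying at both levels is a design choice for the example: the per-segment check localizes corruption, and the whole-object check also catches a wrong or reordered reference list.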

Design Approach
- Component: hardware (e.g., chip or card) integrated into a larger system
- Gateway: a dedicated data deduplication engine that must be combined with a storage system
- Appliance: a dedicated deduplication engine integrated with a storage system
- Storage System: a general purpose storage system with data deduplication capabilities
- Grid Storage: a storage system that can scale independently without constraints to physical attributes
- Software: includes application agents, virtual appliances, or storage software

Design Approach
[Figure: deployment examples. Labels: Application-specific protocol; CIFS, NFS, FC, iSCSI, VTL; WAN; Integrated Replication; Agents; Gateway; Virtual Appliance; Appliance; Storage System; Grid Storage.]
- Multiple deployment examples are illustrated
- Specific deployments are selected based on the customer situation

Source or Target
- Source Deduplication
  - Identifies duplicate data at the client
  - Transfers unique segments to a central repository
  - Separate client and server components
- Target Deduplication
  - Identifies duplicate data where the data is being stored
  - Stores unique segments
  - Standalone system
- Considerations
  - Neither approach enables a greater or lesser space savings
  - Scope of data deduplication may vary by implementation
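The sketch below contrasts the two placements for two nightly backups of the same client. It is a simplified model with assumed values (4 KB segments, 32-byte SHA-256 fingerprints, one fingerprint sent per segment) and hypothetical function names, and it ignores protocol overhead.

```python
import hashlib

SEGMENT_SIZE = 4096
FINGERPRINT_BYTES = 32  # size of a SHA-256 digest


def split(data):
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]


def target_side_backup(data, repository):
    """Client sends everything; duplicates are identified where data is stored."""
    for seg in split(data):
        repository.setdefault(hashlib.sha256(seg).digest(), seg)
    return len(data)  # bytes sent over the network


def source_side_backup(data, repository):
    """Client sends a fingerprint per segment and only the segments not yet stored."""
    sent = 0
    for seg in split(data):
        fp = hashlib.sha256(seg).digest()
        sent += FINGERPRINT_BYTES
        if fp not in repository:
            repository[fp] = seg
            sent += len(seg)
    return sent


day1 = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(8))
day2 = day1[:-SEGMENT_SIZE] + bytes([99]) * SEGMENT_SIZE  # one segment changed

target_repo, source_repo = {}, {}
target_sent = [target_side_backup(d, target_repo) for d in (day1, day2)]
source_sent = [source_side_backup(d, source_repo) for d in (day1, day2)]

print("target dedup, bytes sent per day:", target_sent)
print("source dedup, bytes sent per day:", source_sent)
print("unique segments stored:", len(target_repo), len(source_repo))
```

Both repositories end up holding the same unique segments, consistent with the consideration that neither approach by itself yields greater space savings. What differs in this model is the amount of data that crosses the network.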

Inline or Post-Process
- Inline Deduplication
  - Data deduplication performed before writing the deduplicated data
- Post-Process Deduplication
  - Data deduplication performed after the data to be deduplicated has been initially stored
- Considerations
  - A product may implement both methods
  - A product may provide methods to control when particular data is deduplicated
  - May impact replication, usable capacity, scalability, etc.
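One way the choice can affect usable capacity is sketched below. The model is an assumption made for illustration: the post-process path keeps the complete raw copy in a landing area until its deduplication pass has finished, and real products may release space differently. Function names are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4096


def _dedupe_into(data, store):
    for off in range(0, len(data), SEGMENT_SIZE):
        seg = data[off:off + SEGMENT_SIZE]
        store.setdefault(hashlib.sha256(seg).digest(), seg)


def _stored_bytes(store):
    return sum(len(seg) for seg in store.values())


def inline_ingest(data, store):
    """Deduplicate before writing: only unique segments ever reach the store."""
    _dedupe_into(data, store)
    final = _stored_bytes(store)
    return final, final  # (peak bytes, final bytes)


def post_process_ingest(data, store):
    """Land the raw data first, then deduplicate it in a later pass."""
    landing = bytes(data)                # raw copy held until the pass completes
    _dedupe_into(landing, store)
    final = _stored_bytes(store)
    peak = final + len(landing)          # assumed: landing area freed only at the end
    return peak, final


backup = b"A" * SEGMENT_SIZE * 3 + b"B" * SEGMENT_SIZE

print("inline       (peak, final):", inline_ingest(backup, {}))
print("post-process (peak, final):", post_process_ingest(backup, {}))
```

Under this model both methods converge on the same final footprint, but the post-process path needs extra capacity while the raw copy exists.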

Fixed or Variable Size Segment
- Fixed Length Segment Deduplication
  - Evaluation of data includes a fixed reference window used to look at segments of data during the deduplication process
  - Provides fixed granularity, e.g. 4KB, or 8KB, or 128KB
- Variable Length Segment Deduplication
  - Evaluation of data uses a variable length window to find duplicate data in the stream or volume of data processed
  - Provides variable granularity, e.g. average 4KB or 32KB
- Method Chosen May Affect Deduplication Results
  - Effects observed will vary by method
  - Segmentation may not apply to all deduplication
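A common way to see why the segmentation method matters is to insert a single byte at the front of a data stream. The sketch below compares fixed-length segmentation with a simple content-defined (variable-length) chunker. The chunker is a Gear-style rolling hash chosen for brevity, not a method described in the tutorial, and all sizes, the mask, and the random table are assumptions.

```python
import hashlib
import random

# Hypothetical demo parameters.
FIXED_SIZE = 4096
MIN_SIZE, MAX_SIZE = 256, 8192
BOUNDARY_MASK = 0xFFC00000  # cut when the top 10 bits of the rolling hash are zero

_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte value


def fixed_chunks(data):
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]


def variable_chunks(data):
    """Cut where the content itself satisfies the boundary condition."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The 32-bit hash depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & BOUNDARY_MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_segments(a, b):
    """Number of distinct segment fingerprints present in both lists."""
    fps_a = {hashlib.sha256(c).digest() for c in a}
    fps_b = {hashlib.sha256(c).digest() for c in b}
    return len(fps_a & fps_b)


original = bytes(_rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"X" + original  # one byte inserted at the front

fixed_shared = shared_segments(fixed_chunks(original), fixed_chunks(edited))
variable_shared = shared_segments(variable_chunks(original), variable_chunks(edited))
print("segments shared after the edit, fixed-length:   ", fixed_shared)
print("segments shared after the edit, variable-length:", variable_shared)
```

With fixed-length segments the insertion shifts every later boundary, so the edited stream no longer lines up with the stored segments. The content-defined chunker places its boundaries relative to the data, so the two streams fall back into step after the edit.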

Data Deduplication Benefits
Data Deduplication can help organizations:
- Satisfy ROI/TCO requirements
- Manage data growth
- Increase efficiency of storage and backup
- Reduce overall cost of storage
- Reduce network bandwidth
- Reduce operational costs, including:
  - Infrastructure costs: required space, power and cooling
  - Movement toward a greener data center
- Reduce administrative costs

Data Deduplication Scope
[Figure: three configurations, each with application servers writing to deduplication repositories: Multiple Repositories Per Controller; Single Repository Per Controller; Single Repository Shared by Multiple Controllers.]
- System capacity varies independently from the scope
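The effect of scope can be sketched by deduplicating the same two servers either into separate repositories or into one shared repository. The data set is invented for the example (two servers that share ten identical 4 KB segments and each hold one private segment), and `unique_bytes` is a hypothetical helper.

```python
import hashlib

SEGMENT_SIZE = 4096


def unique_bytes(streams):
    """Physical bytes needed when all given streams share one repository."""
    seen = {}
    for segments in streams:
        for seg in segments:
            seen.setdefault(hashlib.sha256(seg).digest(), len(seg))
    return sum(seen.values())


common = [bytes([i]) * SEGMENT_SIZE for i in range(10)]  # data both servers hold
server_a = common + [b"a" * SEGMENT_SIZE]
server_b = common + [b"b" * SEGMENT_SIZE]

separate = unique_bytes([server_a]) + unique_bytes([server_b])
shared = unique_bytes([server_a, server_b])

print("two repositories:     ", separate, "bytes")
print("one shared repository:", shared, "bytes")
```

A wider scope lets the common segments be stored once for both servers. The scope says nothing about how much capacity the system has, which matches the note above.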

Applications for Deduplication
- Backup and Recovery
  - Backup to disk efficiently with long retention recoverability
  - Replication for offsite data movement
- Archive Repository
  - Long-term retention and preservation
- Primary Storage
  - Lower physical capacity required for storage of active data

Backup: What to Consider
Factors that will impact your results:
- Different applications or data types
- Bandwidth and latency
- Policies and methodologies
- Data protection overhead
- Compression and encryption
- Deduplication scope
- Deduplicated data resiliency
- Scalability
- Capacity
- Performance

Backup: Factors Impacting Space Savings

Factors associated with higher                 Factors associated with lower
data deduplication ratios                      data deduplication ratios
Data created by users                          Data captured from mother nature
Low change rates                               High change rates
Reference data and inactive data               Active data, encrypted data, compressed data
Applications with lower data transfer rates    Applications with higher data transfer rates
Use of full backups                            Use of incremental backups
Longer retention of deduplicated data          Shorter retention of deduplicated data
Wider scope of data deduplication              Narrower scope of data deduplication
Continuous business process improvement        Business as usual operational procedures
Smaller segment size                           Larger segment size
Variable-length segment size                   Fixed-length segment size
Format awareness                               No format awareness
Temporal data deduplication                    Spatial data deduplication

www.snia.org/forums/dmf/knowledge

Backup: Influence of Backup Methodology

Backup: Capacity Savings Over Time
[Figure: amount of data stored plotted against time. Labels: Initial Data Storage Event; Subsequent Data Storage Events; Initial Storage Requirement; Ongoing Storage Requirement; Actual Physical Storage Consumed.]
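The shape of such a curve can be approximated with a deliberately simple model. The assumptions are not from the tutorial: every storage event is a full backup of the same size, the first event stores everything, and each later event adds only a fixed fraction of new unique data. The function name and the numbers are hypothetical.

```python
def capacity_over_time(full_backup_gb, change_rate, events):
    """Return (event, logical GB, physical GB, ratio) rows under the toy model."""
    rows = []
    logical = physical = 0.0
    for n in range(1, events + 1):
        logical += full_backup_gb
        if n == 1:
            physical += full_backup_gb                # initial data storage event
        else:
            physical += full_backup_gb * change_rate  # only new unique data is stored
        rows.append((n, logical, physical, logical / physical))
    return rows


rows = capacity_over_time(full_backup_gb=1000, change_rate=0.02, events=10)
for n, logical, physical, ratio in rows:
    print(f"event {n:2d}: logical {logical:7.0f} GB, physical {physical:6.0f} GB, {ratio:4.1f}:1")
```

In this model the ratio starts at 1:1 and grows with every retained backup, which is one reason longer retention and lower change rates are associated with higher deduplication ratios.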

Backup: Deduplication for Data Movement
- Disaster Recovery
  - Replicate all data after deduplication for bandwidth efficiency
  - Meet offsite requirements without physical transport
- Bandwidth Optimization
  - Increasing WAN efficiency
  - Transfer more information per pipe
- Support Remote Office Protection
  - Enable backup centralization
  - Consolidate physical tape creation

Backup: Removable Media Integration
- Create long-term removable media storage for compliance and archive
- Different data path approaches:
  - (#1) Path through backup server
  - (#2) Path direct from deduplication system to removable media storage
[Figure labels: Backup Server; Deduplication System; Removable Media; paths #1 and #2.]

Deduplication within Secondary Storage: Use Cases

Deduplication for Backup and Recovery
[Figure: backup-to-disk (B2D) deployment options. Labels: Deduplicating Agents; Backup Clients; Primary systems; B2D; Deduplicating Backup Server; Backup Server; Appliance; Storage System; Deduplicating VTL or NAS.]

Backup Remote Office: Source Deduplication
[Figure labels: Source Deduplication, Large Remote Site (Agents, Primary systems, Appliance); Source Deduplication, Small Remote Site (Agents Only, Primary systems); WAN; Optimized Replication; Encryption; Data Center (Agents, Primary systems, Appliance); Remote Recovery Site (Agents, Primary systems, Appliance); Tape Vault.]

Deduplicated Archive Repositories
- Policy-based software moves inactive data on primary storage to a deduplicated storage archive
- The local archive may be replicated offsite and copied to tape for long-term retention
[Figure labels: Application Servers; Policy-based software; Primary Storage; Deduplicated Local Archive; Optimized Replication; Remote Site; Deduplicated Offsite Archive; Tape Vault.]

Deduplication of Primary Storage: Use Cases

Primary Storage with Deduplication
- Bring the benefits of deduplication to primary storage
- Supports networked storage (SAN & LAN)
- Storage efficiency supports replicating more data
- Space/performance tradeoff
[Figure labels: LAN; Application Server; SAN; Primary Storage.]

Virtual Machine Storage
- Balance the tradeoff between savings and performance impact
- Examples of active data:
  - Unstructured data
  - Structured data
  - Virtual machines
[Figure: VM clone copies, each containing APP, DATA and OS, compared with the VMs after deduplication, where duplicate data is eliminated.]

Q&A / Feedback
Please send any questions or comments on this presentation to SNIA: trackdatamgmt@snia.org

Many thanks to the following individuals for their contributions to this tutorial. - SNIA Education Committee

Data Deduplication and Space Reduction (DDSR) Special Interest Group
Matthew Brisse, Daniel Budiansky, Mike Dutch, Michael Fishman, Larry Freeman, Devin Hamilton, Jason Iehl, Shane Jackson, Gene Nagle, Thomas Rivera, Tom Sas, Gideon Senderov

It's easy to get involved with the DMF!
- Find a passion
- Join a committee
- Gain knowledge & influence
- Make a difference
www.snia.org/forums/dmf

Approaches to VM Backup: Details
- Backup client within each VM
- Backup client on proxy server
- Backup services as virtual appliances
- Mount VM storage snapshots
The specific backup application determines where deduplication is performed.

Deduplicated Primary Storage
Space/performance tradeoff. Candidate data:
- File services
- Home directories
- Tech pubs
- Email
- Fixed content
- Replicated databases
- Test and application development
- Source code version control system
- Virtualized environments
[Figure labels: Deduplication-enabled file system; Candidate Files; Deduplicated Data.]

Storage Tiering with Deduplication
[Figure labels: Application Servers; Tier 1: Primary Storage; Tier 2: Deduplicated Storage.]

File Tiering with Deduplication
[Figure: a NAS client accesses files and stubs over NFS/CIFS on primary storage; shared files (labeled A to E) are migrated to secondary storage.]

Backup Remote Office: Target Deduplication
[Figure labels: Remote Site with Target Deduplication (Primary systems, Appliance); WAN; Optimized Replication; Encryption; Data Center (Primary systems, Appliance); Remote Recovery Site (Primary systems, Appliance); Tape Vault.]

Deduplication within the Backup Cloud
[Figure: two configurations. (1) Application and data on customer premises send deduplicated data to backup as a service within the cloud; deduplication is performed by the backup client. (2) Application and data on customer premises send data to a deduplication target within the cloud.]