Building a High Performance Deduplication System
Fanglu Guo and Petros Efstathopoulos
Symantec Research Labs
Symantec FY 2013 (4/1/2012 to 3/31/2013) revenue: $6.9 billion

Segment                       | Revenue | Example Business
Consumer                      | 30%     | Norton security, Norton Zone
Security and compliance       | 30%     | Symantec Endpoint Protection, Mail/Web security, Server protection
Storage and server management | 36%     | Backup and recovery, Information availability
Services                      | 4%      | Managed security service
Symantec Research Labs
Leading experts in security, availability, and systems doing innovative research across all of Symantec's businesses. Our mission is to ensure Symantec's long-term leadership by fostering innovation, generating new ideas, and developing next-generation technologies across all of our businesses.
- A global organization: Mountain View, CA; Culver City, CA; Herndon, VA; Waltham, MA; Sophia Antipolis, FR
- Ongoing collaboration with other researchers, government agencies, and universities, among numerous others
Symantec Research Labs
- Charter: foster innovation and develop key technologies
- Operational model: work closely with existing and emerging businesses; identify and work on relevant hard technical challenges
- Success factors: transfer technology; develop new intellectual property; forge research collaborations and publish innovative work
Symantec Research Labs (SRL)
[Figure: a spectrum from Product Development to Advanced Development to SRL to Academic Research; proportion of time spent = f(risk, time_horizon, speculation)]
Building a High Performance Deduplication System
Deduplication is Maturing
- Deduplication (dedupe) is becoming a standard feature of backup/archival systems, file systems, cloud storage, network traffic optimization, and more
- Many approaches, algorithms, and techniques: inline/offline, file/block level, fixed/variable block size, global/local dedupe
- Near-optimal duplicate detection and usage of available raw storage
- However, in practice, scalability is still an issue:
  - Single-node capacity < a few hundred TB
  - Scalability achieved (mostly) using multi-node systems
  - Capacity still relatively low (usually single-digit PB)
  - High acquisition/maintenance cost
  - Complex design, performance impact
Hypothesis and Goal
We advocate that improving single-node performance is critical.
- Why?
  - Better building blocks mean fewer nodes are necessary
  - Potential for single-node deployments for small/medium businesses
  - Lower acquisition/management/energy cost
  - Hosting/relocation flexibility
- How?
  - Single-node performance = high scalability and throughput, good dedupe efficiency
  - Making the best of a node's resources to address the challenges
  - Built and tested a complete deduplication system from scratch, including client, server, and network components
Core Deduplication Functionality
- Each file is identified by a file fingerprint (FFP)
- Each file consists of data segments; for each segment a segment fingerprint (SFP) is calculated (MD5, SHA-1/2, etc.)
- Identical files map to the same FFP; an updated file yields a new FFP, and only its new segments are stored
- A dedupe metadata service tracks file metadata and FFPs; a dedupe storage service stores the segments, with clients connecting over the WAN
[Figure: two users' copies of Budget_2011.xls deduplicate to a single FFP1; an updated copy produces FFP2 and shares its unchanged segments with the original]
A segmentation/fingerprinting sketch follows.
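As a concrete illustration of the flow above, here is a minimal C++ sketch of segmenting a file and deriving SFPs and the FFP. It is not the paper's implementation: std::hash stands in for the cryptographic digests named above (MD5, SHA-1/2), and the fixed 4 KB segment size, names, and types are illustrative assumptions.

```cpp
// Minimal sketch of the segmentation/fingerprinting flow described above.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using SFP = uint64_t;  // segment fingerprint (a real system stores MD5/SHA digests)
using FFP = uint64_t;  // file fingerprint

constexpr size_t kSegmentSize = 4 * 1024;  // 4 KB segments, as in the slides

// Fingerprint one data segment. std::hash is a placeholder for MD5/SHA-1.
SFP segment_fingerprint(const std::string& segment) {
    return std::hash<std::string>{}(segment);
}

// Split file content into fixed-size segments and fingerprint each one.
std::vector<SFP> fingerprint_segments(const std::string& content) {
    std::vector<SFP> sfps;
    for (size_t off = 0; off < content.size(); off += kSegmentSize)
        sfps.push_back(segment_fingerprint(content.substr(off, kSegmentSize)));
    return sfps;
}

// The file fingerprint is a hash over the segment fingerprints, so two
// files with identical content map to the same FFP.
FFP file_fingerprint(const std::vector<SFP>& sfps) {
    std::string concat(reinterpret_cast<const char*>(sfps.data()),
                       sfps.size() * sizeof(SFP));
    return std::hash<std::string>{}(concat);
}
```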
Challenges Addressed
[Figure: original data passes through dedupe, duplicates removed; Files 1-3 share segments]
- Segment indexing challenges: have I seen segment X before? (and if yes, where is it stored?) The index maps SFPs to disk locations
  - System capacity is bound by indexing capacity; increasing the segment size is not a good solution
  - System performance is bound by index speed
- Reference management challenges: need to keep track of who is using what
  - Capacity is bound by the ability to track references and manage/reclaim segments
  - Reference update performance may also become a bottleneck
- Client and network performance challenges

Indexing at scale (example):

Item                   | Scale            | Remarks
Capacity               | C = 1000 TB      |
Segment size           | S = 4 KB         |
Number of segments     | N = 250 * 10^9   | N = C/S
Metadata per segment   | E = 22 B         |
Total segment metadata | I = 5500 GB      | I = N * E
Disk speed             | Z = 400 MB/sec   |
Lookup speed goal      | 100 Kops/sec     | Z/S
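The table above is simple arithmetic; the following self-contained snippet reproduces the slide's numbers (N = C/S, I = N*E, and the Z/S lookup goal) so they can be checked or re-run with other parameters. The unit conventions (TB = 10^12 bytes, etc.) are an assumption consistent with the slide's round numbers.

```cpp
// Back-of-envelope check of the indexing-scale numbers in the table above.
#include <cstdio>

int main() {
    const double C = 1000e12;   // capacity: 1000 TB
    const double S = 4e3;       // segment size: 4 KB
    const double E = 22;        // metadata bytes per segment
    const double Z = 400e6;     // disk speed: 400 MB/s

    const double N = C / S;     // number of segments: 250 * 10^9
    const double I = N * E;     // total segment metadata: 5500 GB
    const double L = Z / S;     // lookup-speed goal: 100 Kops/s

    std::printf("N = %.0f * 10^9 segments, I = %.0f GB, goal = %.0f Kops/s\n",
                N / 1e9, I / 1e9, L / 1e3);
    return 0;
}
```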
Performance Goals
- Single-node scalability goals: 100+ billion objects per node, with high throughput
- Single-node performance goals:
  - Near-raw-disk throughput for backup, restore, and deletion (i.e., space reclamation)
  - Reasonable deduplication efficiency
  - Client-server interface capable of delivering the desired throughput
- Assumptions:
  - The opportunity for improving duplicate detection is becoming scarce
  - Aiming for perfect duplicate detection may limit scalability
  - Willing to trade some deduplication efficiency for high scalability
Outline
1. Architecture Overview
2. Progressive Sampled Indexing
3. Grouped Mark-and-Sweep
4. Evaluation
Architecture Overview: Design
- Client-server architecture; the client component may reside on the server
- Client: initiates backup creation/restore; performs client I/O; performs segmentation and fingerprint calculation; issues fingerprint lookups to the server; initiates data transfers only for new segments (a client-loop sketch follows)
- Server File Manager: hierarchical list of backup and file metadata; central concept: a backup represents a list of files; organizes backups into (roughly equal-size) backup groups
- Server Segment Manager: manages storage units, called containers, and implements the storage logic; hosts the fingerprint index responsible for duplicate detection; performs reference management operations
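A minimal sketch of the client-side backup loop described above: the client segments and fingerprints locally, asks the server which SFPs it already stores, and ships only the new segments. The Server interface, batching, and stub bodies are assumptions, not the paper's actual API.

```cpp
// Sketch of the client backup loop: lookup first, transfer new data only.
#include <cstdint>
#include <string>
#include <vector>

using SFP = uint64_t;

struct Server {  // assumed RPC surface, not the paper's API
    // Returns, for each SFP, whether the server already stores that segment.
    std::vector<bool> lookup(const std::vector<SFP>& sfps) {
        return std::vector<bool>(sfps.size(), false);  // stub: nothing known
    }
    void store(SFP, const std::string&) {}             // stub: would send data
};

void backup_file(Server& server, const std::vector<std::string>& segments,
                 const std::vector<SFP>& sfps) {
    std::vector<bool> known = server.lookup(sfps);     // fingerprint lookups
    for (size_t i = 0; i < segments.size(); ++i)
        if (!known[i])
            server.store(sfps[i], segments[i]);        // new segments only
}
```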
Architecture Overview: Main Concepts
- Containers are represented by a unique container ID (CID); container contents: raw data segments + a catalogue of segment SFPs
- A file is represented by a list of <SFP, CID> pairs and is identified by its file fingerprint FFP = hash(SFP1, SFP2, ..., SFP4)
  File = ( <SFP1, CID1>, <SFP2, CID1>, <SFP3, CID2>, <SFP4, CID3> )
- Notice that the file's segment locations are stored inline: data segments are directly locatable from file metadata, with no need for location services by the index (e.g., at restore time); see the sketch below
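A minimal sketch of these metadata structures, under assumed names and fingerprint widths (the real system would store full MD5/SHA digests rather than 64-bit values):

```cpp
// Sketch of the container and file metadata layout described above.
#include <cstdint>
#include <vector>

using SFP = uint64_t;  // segment fingerprint
using CID = uint32_t;  // container ID
using FFP = uint64_t;  // file fingerprint

// A file is a list of <SFP, CID> pairs: the container ID is stored inline,
// so restore can locate every segment without consulting the index.
struct SegmentRef {
    SFP sfp;
    CID cid;
};

struct FileRecipe {
    FFP ffp;                         // hash over (SFP1, ..., SFPn)
    std::vector<SegmentRef> segments;
};

// A container holds raw segment data plus a catalogue of the SFPs it stores;
// the catalogue is what fingerprint caching pre-fetches on an index hit.
struct Container {
    CID cid;
    std::vector<SFP> catalogue;      // segment fingerprints in this container
    std::vector<char> data;          // raw segment data
};
```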
Sampled Indexing
- Directly locatable objects + relaxed dedupe goals: no requirement for a complete index, which gives us the freedom to sample
- Sampled indexing: keep 1 out of T SFPs (e.g., modulo-T sampling)
- Sampling rate R ("fullness level") = 1/T = function(RAM, storage capacity, desired segment size)
Progressive Sampled Indexing
- Estimate of sampling rate: R = (M/E) / (C/S)
  - M = memory (GB), E = bytes/entry; M/E = total index entries
  - C = storage (TB), S = segment size (KB); C/S = total segments
- Example: 32 bytes/entry, 4 KB segments, and 32 GB of RAM
  - 4 TB storage: no sampling (R = 1)
  - 400 TB storage: R ~= 1/100, i.e., keep 1 out of 100 SFPs
- Progressive sampling: used storage vs. available storage
  - Sampling rate = function(used storage, available RAM)
  - Start with no sampling (R = 1); progressively decrease R, down to R_min = (M/E) / (C/S)
A code sketch of this rate logic follows.
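The sampling-rate formula and the modulo-T rule translate directly into code. This is a sketch under the slide's definitions; the function names, the byte-based units, and T = round(1/R) are assumptions. With the slide's example (32 GB RAM, 32 B/entry, 4 KB segments, 400 TB) it yields R of roughly 1/100.

```cpp
// Sketch of progressive sampled indexing: R = (M/E) / (C/S), applied to
// used storage rather than raw capacity, plus modulo-T fingerprint sampling.
#include <algorithm>
#include <cstdint>

// Minimum sampling rate for a full system: index entries that fit in RAM
// divided by the total number of segments (slide formula R_min = (M/E)/(C/S)).
double min_sampling_rate(double ram_bytes, double entry_bytes,
                         double capacity_bytes, double segment_bytes) {
    const double index_entries  = ram_bytes / entry_bytes;          // M / E
    const double total_segments = capacity_bytes / segment_bytes;   // C / S
    return std::min(1.0, index_entries / total_segments);
}

// Progressive variant: the rate depends on *used* storage, so an emptier
// system samples less aggressively (R starts at 1 and only falls as the
// system fills, bottoming out at R_min for full capacity).
double progressive_sampling_rate(double ram_bytes, double entry_bytes,
                                 double used_bytes, double segment_bytes) {
    return min_sampling_rate(ram_bytes, entry_bytes, used_bytes, segment_bytes);
}

// Modulo-T sampling: keep 1 out of T fingerprints, e.g. with T ~= 1/R.
bool keep_fingerprint(uint64_t sfp, uint64_t T) {
    return sfp % T == 0;
}
```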
Fingerprint Caching
- Straight sampling means poor deduplication efficiency: only 1 out of T segments is deduplicatable
- Solution: take advantage of spatial locality. On an index hit <SFP, CID>, use the CID to pre-fetch the SFPs in the container catalogue
- Part of the index memory serves as an SFP cache
- Lower bound for the sampling rate: at least one sampled entry per container
A lookup-path sketch follows.
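A sketch of the lookup path with catalogue pre-fetching, assuming plain in-memory hash maps for both the sampled index and the SFP cache (the paper's index is more elaborate, e.g., pointer-free with eviction policies, per the evaluation slide), and a stubbed catalogue read:

```cpp
// Sketch of duplicate lookup with spatial-locality pre-fetching: an index hit
// pulls the whole container catalogue into the SFP cache, so the following
// (likely sequential) segments of the same stream hit the cache.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using SFP = uint64_t;
using CID = uint32_t;

struct DedupeIndex {
    std::unordered_map<SFP, CID> sampled_index;  // 1-out-of-T sampled SFPs
    std::unordered_map<SFP, CID> sfp_cache;      // pre-fetched catalogue entries

    // Stub: the real system reads the container catalogue from disk.
    std::vector<SFP> read_catalogue(CID /*cid*/) { return {}; }

    std::optional<CID> lookup(SFP sfp) {
        if (auto it = sfp_cache.find(sfp); it != sfp_cache.end())
            return it->second;                   // cache hit: duplicate found
        if (auto it = sampled_index.find(sfp); it != sampled_index.end()) {
            // Index hit: pre-fetch the container's neighbouring SFPs.
            for (SFP neighbour : read_catalogue(it->second))
                sfp_cache.emplace(neighbour, it->second);
            return it->second;
        }
        return std::nullopt;                     // treated as a new segment
    }
};
```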
Reference Management: Challenges and Goals
- Problem: is it safe to delete a segment? Is anyone using it?
- Challenge: do it correctly and efficiently
  - Mistakes cannot be tolerated, e.g., after a crash
  - Potentially used in a distributed environment
  - Performance requirements: > 100 Kops/sec
- Reference counting: efficient, but hard to guarantee correctness; repeated and/or lost updates lead to irrecoverable errors
- Reference lists: repeated updates solved, but lost updates are still a problem; very inefficient (list maintenance, serialization to disk, etc.)
- Mark-and-sweep (aka garbage collection): workload proportional to system capacity
  - E.g., in a 400 TB system with 4 KB segments and 22-byte SFPs, ~2.2 TB must be read from disk during the mark phase; with a 10x dedupe factor, ~22 TB
Grouped Mark-and-Sweep (GMS)
- Mark: divide and save
  - Track changes to backup groups; only re-mark changed groups
  - Mark results are saved and reused
- Sweep: track affected containers; only sweep containers that may have deletions
- Result: workload = function(amount of change), instead of system capacity
[Figure: Backup Group 1 (Backups 1-2, some files deleted), Group 2 (Backups 3-4, no changes), Group 3 (Backup 5, some files added); per-group mark results stored with Containers 1-5 determine which containers to sweep]
A code sketch of the GMS idea follows.
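A rough sketch of GMS in code, with per-group saved mark results and a candidate-container sweep. All types, the re-scan step, and the reclaim call are assumptions for illustration, not the paper's actual data structures.

```cpp
// Sketch of grouped mark-and-sweep: mark results are kept per backup group
// and recomputed only for groups that changed; only containers touched by
// changed groups are swept, so work scales with the amount of change.
#include <cstdint>
#include <set>
#include <unordered_map>
#include <unordered_set>

using CID = uint32_t;
using GroupID = uint32_t;
// For each container, the set of segments a group still references.
using MarkResult = std::unordered_map<CID, std::unordered_set<uint64_t>>;

struct GroupState {
    bool changed = false;    // files added/deleted since the last mark
    MarkResult marks;        // saved mark results, reused if unchanged
};

void grouped_mark_and_sweep(std::unordered_map<GroupID, GroupState>& groups) {
    // Mark phase: re-mark only the groups that changed.
    std::set<CID> candidates;  // containers that may contain deletions
    for (auto& [gid, g] : groups) {
        if (!g.changed) continue;               // reuse saved mark results
        MarkResult fresh;                       // would re-scan the group's backups
        for (auto& [cid, segs] : g.marks) candidates.insert(cid);  // old usage
        for (auto& [cid, segs] : fresh)   candidates.insert(cid);  // new usage
        g.marks = std::move(fresh);
        g.changed = false;
    }
    // Sweep phase: only candidate containers are checked. A segment is live
    // if any group's saved mark result still references it.
    for (CID cid : candidates) {
        std::unordered_set<uint64_t> live;
        for (auto& [gid, g] : groups)
            if (auto it = g.marks.find(cid); it != g.marks.end())
                live.insert(it->second.begin(), it->second.end());
        // reclaim(cid, live);  // free segments of `cid` absent from `live`
    }
}
```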
Deduplication Server Evaluation
- Evaluation metrics: single-node throughput, scalability, and dedupe efficiency
- Implementation in C++, using Pthreads; portable (Linux and Windows)
- Index implementation: pointer-free, with three cache eviction policies
- Hardware: 8-core server, 32 GB RAM, 24 TB disk array, 128 GB PCI-X SSD, 64-bit Linux
- Baseline disk array throughput: ~1,000 MB/sec (both read and write)
- Two types of data-sets:
  - Synthetic data-set: multiple 3 GB files, 100% unique segments
  - Virtual machine image files: 4 versions of a golden image
- Evaluation configuration: R = 1/101, 25 GB index; 200 TB per 10 GB of RAM (500 TB in our case)
Backup Throughput: Synthetic Data
- Multiple backup streams of synthetic data
- Concurrency levels: 1, 16, 32, 64, 128, 256 backup streams
Backup Scalability
- Populate the system to 95% capacity (480 out of 500 TB): all containers created, container catalogues and metadata stored, but no raw data
- Repeated throughput tests expose the indexing overhead
Grouped Mark-and-Sweep: Throughput and Scalability
- Reference update performance is critical: it happens daily!
- Tests: backup add and delete (reference addition and deletion), both when the system is empty and when it is near-full
- Chart annotations: slope = update throughput, ~3.2 GB/sec and ~3.1 GB/sec for the two cases
Deduplication Efficiency
- Synthetic data-set: multiple backups, no less than 97%
- VM images: a very common and important dedupe workload
- Data-set:
  - VM0: golden image, corporate WinXP SP2 + some utilities
  - VM1: VM0 + all Microsoft Service Packs, patches, etc.
  - VM2: VM1 + large anti-virus suite
  - VM3: VM2 + more utilities (PDF readers, compression tools, etc.)
- Calculated ground truth for all image segments (unique/duplicates)

VM  | Unique Segments | Globally Unique | Ideal Disk Usage (MB) | Actual Disk Usage (MB) | Success Rate
VM0 | 518,326         | 518,326         | 2,123                 | 2,211                  | 96%
VM1 | 733,267         | 921,522         | 3,775                 | 3,938                  | 96%
VM2 | 904,579         | 1,189,230       | 4,871                 | 5,085                  | 96%
VM3 | 1,145,029       | 1,616,585       | 6,621                 | 6,860                  | 97%
Conclusions
- We have built and tested a complete deduplication system
  - Scale: hundreds of TBs per node
  - Backup throughput: ~1,000 MB/sec (unique data), ~6,000 MB/sec (duplicates)
  - Reference management throughput: more than 3 GB/sec
  - Deduplication efficiency: ~97% of duplicates detected
- We introduced new mechanisms for scalability and performance: sampled (SSD) indexing, grouped mark-and-sweep, pipelined client architecture
- Our results demonstrate that high single-node deduplication server performance is possible
Status
- Most of the technologies are in use in the product. Make an impact!
- Symantec is increasing R&D investment. We are hiring!
- Fellowship
Thank you! Questions?
Copyright 2010 Symantec Corporation. All rights reserved. Symantec and the Symantec Logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners.