Linux Plumbers Conference 2009: Data De-duplication. Mingming Cao, IBM Linux Technology Center, cmm@us.ibm.com, 2009-09-25
Current storage challenges: Our world is facing a data explosion; data is growing at an amazing rate. Just storing the data we have today is a severe challenge; imagine how much more difficult and expensive it will be to store six times more data tomorrow. Eliminating redundant data is very important!
Existing technology: hard links, copy-on-write, file clones, compression. All of these work at the file level, so there is more room to save space. Imagine this: lpc dedup.ods (5M) is copied to another file, lpc dedup 1.ods (5M), slightly modified and saved as yet another file, lpc dedup 2.ods (5M), and then backed up as lpc dedup 3.ods (5M); four nearly identical 5M copies end up on disk.
Definition: Data de-duplication is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape; redundant data is replaced with a pointer to the unique copy. Data de-duplication is an extension of compression that is more efficient at removing redundant data. It can be done at the file level, at the sub-file (block) level, or even at the bit level.
De-duplication, before: dedup.ods and file2.ods each store the full block sequence A B A C D C A E on disk.
De-duplication, after: both files still present the sequence A B A C D C A E, but only the unique blocks A B C D E are kept on disk; the remaining blocks are references to those copies.
De-duplication for VMs, before: VM1 stores OS, app1, app2, data1, data2 (plus free space); VM2 stores OS, app1, app3, data3, data4; VM3 stores OS, app3, app2, data5, data6; VM4 stores OS, app1, app2, data1, data2; every image carries its own copy of the OS and of shared applications and data.
De-duplication for VMs, after: VM1 through VM4 share a single stored copy of the OS, of each application (app1, app2, app3), and of each data block (data1 through data6), leaving far more free space.
De-duplication benefits: Two major savings: storage footprint (6x for healthcare, 3x for VMs, 20x for backup) and the network bandwidth needed to transfer data across a WAN. Disks are cheap, but there is more to it than just space: save energy (power) and cooling, make disaster recovery manageable, and save the resources needed to manage the same amount of data. Typical workloads: backup, archives, healthcare, virtualization, NAS, remote offices, etc.
De-duplication concerns: Large CPU and memory resources are required for de-duplication processing. Potentially more fragmented files and filesystems. Potentially increased risk of data loss. Might not work with encryption. Hash collisions are still possible.
De-duplication ratios: Indicate how much reduction de-duplication achieves: 50TB before and 10TB after is a ratio of 5:1. The ratio can vary from 2:1 to 10:1, depending on the type of data, the change rate of the data, the amount of redundant data, the type of backup performed (full, incremental, or differential), and the de-duplication method.
De-duplication process: Each chunk of data generates a unique number (key). Has this key been seen in the index before? If no, insert the key into the index tree and store the new data to disk. If yes, the data is a duplicate: add a reference to the original data and exit. A sketch of this loop follows.
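As a rough illustration of that flowchart, here is a minimal user-space sketch. Everything in it is an assumption made only for illustration: 4 KiB chunks, a fixed-size linear-probing index, FNV-1a as a weak stand-in for a real fingerprint such as SHA-256 or MD5, and invented names like dedup_chunk().

/* dedup_sketch.c: a minimal user-space sketch of the lookup loop above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE   4096
#define INDEX_SLOTS  (1u << 20)              /* ~1M slots, demo-sized */

struct slot {
    uint64_t key;                            /* chunk fingerprint, 0 = empty */
    uint64_t block_nr;                       /* where the unique copy lives */
};

static struct slot index_tree[INDEX_SLOTS];  /* stand-in for the index tree */
static uint64_t next_free_block;             /* next unused "disk" block */

/* "Data generates a unique number (key)": FNV-1a over one chunk. */
static uint64_t chunk_key(const unsigned char *data, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h ? h : 1;                        /* keep 0 reserved for "empty" */
}

/* Returns the block number that backs this chunk; *dup reports whether
 * the chunk was a duplicate (in which case nothing new is written). */
static uint64_t dedup_chunk(const unsigned char *data, size_t len, int *dup)
{
    uint64_t key = chunk_key(data, len);
    size_t i = key % INDEX_SLOTS;

    /* Linear probing: stop at a matching key or the first empty slot. */
    while (index_tree[i].key && index_tree[i].key != key)
        i = (i + 1) % INDEX_SLOTS;

    if (index_tree[i].key == key) {          /* seen before: reference it */
        *dup = 1;
        return index_tree[i].block_nr;
    }

    *dup = 0;                                /* new key: insert and "write" */
    index_tree[i].key = key;
    index_tree[i].block_nr = next_free_block++;
    return index_tree[i].block_nr;
}

int main(void)
{
    unsigned char a[CHUNK_SIZE], b[CHUNK_SIZE];
    int dup;

    memset(a, 'A', sizeof(a));
    memset(b, 'B', sizeof(b));

    dedup_chunk(a, sizeof(a), &dup);
    printf("chunk A: %s\n", dup ? "duplicate" : "new");
    dedup_chunk(b, sizeof(b), &dup);
    printf("chunk B: %s\n", dup ? "duplicate" : "new");
    dedup_chunk(a, sizeof(a), &dup);         /* same bytes as the first call */
    printf("chunk A again: %s\n", dup ? "duplicate" : "new");
    return 0;
}

Running it prints "new" for the first two chunks and "duplicate" for the repeated one, which is exactly the yes/no branch on the slide.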
Where: source vs. target
Source: performed at the data source, before transfer to the target location. Advantages: reduces network bandwidth; awareness of data usage and format may allow more effective de-duplication. Disadvantages: de-duplication consumes CPU cycles on the file/application server; may not de-duplicate files across different sources.
Target: performed at the target (e.g. by backup software or a storage appliance). Advantages: applies to any type of filesystem; no impact on data ingestion; parallel de-duplication is possible. Disadvantages: de-duplication consumes CPU cycles on the target server or storage device.
When: in-line vs. post-process
In-line: de-duplication occurs in the primary data path; no data is written to disk until the de-duplication process is complete. Advantages: immediate data reduction; uses the least disk space; no post-processing. Disadvantages: performance concerns; high CPU and memory cost.
Post-process: de-duplication occurs on secondary storage; data is first stored on disk and then de-duplicated. Advantages: easy to implement; no impact on data ingestion; parallel de-duplication is possible. Disadvantages: data is processed twice; extra space is needed for de-duplication; races with concurrent writes.
In-line de-dup in btrfs: how to detect the redundancy? It makes sense: btrfs already creates a checksum for every filesystem block and stores it on disk, so the hash value can be reused as the duplication key. To speed up lookups, a separate checksum index tree is needed, indexed by checksum rather than by logical offset. Duplicate screening can happen at data writeout time, after the data is compressed but before delayed allocation allocates space and flushes the data out. If a hash collision occurs, do a byte-to-byte compare to ensure no data is lost; a small sketch of that check follows.
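To make the collision case concrete, here is a self-contained toy: a deliberately weak 16-bit checksum stands in for the per-block checksum so a collision is easy to trigger, and the in-memory "disk", the index array, and names such as dedup_write() are invented for illustration; none of this is btrfs code.

/* collision_sketch.c: byte-to-byte fallback on a checksum hit. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 16            /* tiny blocks keep the demo readable */
#define MAX_BLOCKS 64

static unsigned char disk[MAX_BLOCKS][BLOCK_SIZE];
static unsigned nr_blocks;

/* index: checksum -> block number + 1 (0 means "no entry") */
static unsigned csum_index[1 << 16];

/* Weak additive checksum: collisions are easy (e.g. "ab.." vs "ba.."). */
static uint16_t weak_csum(const unsigned char *d)
{
    uint32_t s = 0;
    for (int i = 0; i < BLOCK_SIZE; i++)
        s += d[i];
    return (uint16_t)s;
}

/* Returns the block that backs 'data'; a new block is written only if
 * no existing block has both the same checksum and the same bytes. */
static unsigned dedup_write(const unsigned char *data)
{
    uint16_t csum = weak_csum(data);
    unsigned hit = csum_index[csum];

    if (hit && memcmp(disk[hit - 1], data, BLOCK_SIZE) == 0)
        return hit - 1;                   /* true duplicate: share it */

    /* No entry, or a hash collision (same csum, different bytes):
     * store the data as a new unique block.  For simplicity this
     * sketch keeps only the first block per checksum; a real index
     * would chain colliding entries. */
    memcpy(disk[nr_blocks], data, BLOCK_SIZE);
    if (!hit)
        csum_index[csum] = nr_blocks + 1;
    return nr_blocks++;
}

int main(void)
{
    unsigned char x[BLOCK_SIZE] = "abxxxxxxxxxxxxx";
    unsigned char y[BLOCK_SIZE] = "baxxxxxxxxxxxxx";  /* same weak csum */

    printf("x -> block %u\n", dedup_write(x));
    printf("y -> block %u (collision, stored separately)\n", dedup_write(y));
    printf("x -> block %u (duplicate, shared)\n", dedup_write(x));
    return 0;
}

The point of the memcmp is the slide's guarantee: even when two different blocks hash to the same key, no data is lost, because sharing only happens after the bytes are confirmed identical.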
In-line de-dup in btrfs: how to lower the cost? Memory usage is the key to dedup performance: the dedup hash tree needs to be held in memory, and a 1TB filesystem needs about 8GB of RAM for SHA-256 keys, or 4GB for MD5 (see the arithmetic sketch below). Make dedup optional: a filesystem mount option, or enable/disable dedup per file or subvolume, etc. Fragmentation: apply defragmentation policies that group shared files close to each other; seek time is reduced because frequently and recently shared blocks are likely already pinned in memory; and it might be less of an issue with SSDs.
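The 8GB/4GB figures follow from simple arithmetic, shown below under the assumption of 4 KiB filesystem blocks and counting only the raw digest bytes (any real index or tree structure adds overhead on top).

/* index_size.c: back-of-the-envelope check of the figures above. */
#include <stdio.h>

int main(void)
{
    unsigned long long fs_bytes   = 1ULL << 40;   /* 1 TiB filesystem */
    unsigned long long block_size = 4096;         /* 4 KiB blocks */
    unsigned long long blocks     = fs_bytes / block_size;

    printf("blocks: %llu\n", blocks);
    printf("SHA-256 (32-byte keys): %llu GiB\n", blocks * 32 / (1ULL << 30));
    printf("MD5     (16-byte keys): %llu GiB\n", blocks * 16 / (1ULL << 30));
    return 0;
}
/* Prints 268435456 blocks, 8 GiB for SHA-256 and 4 GiB for MD5,
 * matching the slide's 8G / 4G estimates. */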
In-line dedup in btrfs: keep the impact low. Dedup could impact running applications: gather some latency statistics and enable/disable dedup when the cost to data ingestion is high. A background scrub thread could dedup files that were not deduplicated in line before writeout to disk. Add a flag to each btrfs extent indicating whether it has been deduped, to avoid deduplicating the same extent twice; that check is sketched below.
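A possible shape for that per-extent flag, sketched with invented names (EXTENT_FLAG_DEDUPED, struct extent_info, scrub_needs_dedup()); these are not btrfs data structures, just an illustration of how the scrub thread could skip extents already handled in line.

#include <stdbool.h>

#define EXTENT_FLAG_DEDUPED  (1u << 0)

struct extent_info {
    unsigned long long disk_bytenr;   /* on-disk location of the extent */
    unsigned int       flags;         /* EXTENT_FLAG_* bits */
};

/* The scrub thread only hashes and indexes extents that were never
 * screened in line, then marks them so they are not processed again. */
static bool scrub_needs_dedup(struct extent_info *ei)
{
    if (ei->flags & EXTENT_FLAG_DEDUPED)
        return false;                 /* already handled at writeout */
    ei->flags |= EXTENT_FLAG_DEDUPED; /* caller dedups this extent now */
    return true;
}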
User-space de-duplication? User applications do the job instead of the kernel. This avoids any impact on data ingestion and could apply to any filesystem (ext4, btrfs, xfs, etc.); the checksum index is maintained in user space. Introduce a VFS API that lets applications poll whether a chunk of data has been modified before merging: inode ctime/mtime or the inode version could be used, or better, a new system call that reports whether a file range (offset, size, transaction ID) has changed since then.
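As a rough sketch of the coarse option available today (inode mtime at whole-file granularity, rather than the proposed range-level system call), here is a tiny user-space check; unchanged_since() and the hashed_at bookkeeping are invented for illustration.

/* mtime_check.c: "has this changed since I hashed it?" via stat(). */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Returns 1 if the file looks unmodified since 'hashed_at', so cached
 * checksums (and a pending merge) are still trustworthy. */
static int unchanged_since(const char *path, time_t hashed_at)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;                      /* treat errors as "changed" */
    return st.st_mtime <= hashed_at;   /* coarse: whole-file granularity */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    /* Pretend we hashed the file "now"; a real tool would persist the
     * timestamp alongside its checksum database. */
    time_t hashed_at = time(NULL);
    printf("%s: %s\n", argv[1],
           unchanged_since(argv[1], hashed_at) ? "safe to merge"
                                               : "re-hash first");
    return 0;
}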
Summary: Linux needs data de-duplication technology to be able to control the data explosion... One size won't fit all; perhaps both in-kernel in-line dedup and user-space dedup?