YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud
Quanqing Xu
Quanqing.Xu@nicta.com.au
Outline
- Motivation
- YuruBackup's Architecture
  - Backup Client: file scan, data de-duplication and data transmission
  - Metadata Server: communication with clients, global fingerprint lookup and store, and a highly scalable cluster of metadata servers
- Demo
- Preliminary experimental results
- Development status
Motivation
- Yuruware needs incremental backup in the cloud
- Cloud storage providers
  - High reliability and scalability at low cost
  - Ultra large-scale storage space: 905 billion objects in Amazon S3 as of Q1/2012 [1]
- Customers
  - Back up and restore progressive data within a short time
  - Back up to petabytes of data in total
- To build a large-scale cloud backup system
  - System scalability
  - Storage efficiency
  - Backup and restoration performance
[1] http://aws.typepad.com/aws/2012/04/amazon-s3-905-billion-objects-and-650000-requestssecond.html
NICTA Copyright 2010
The Architecture of YuruBackup
- Goals
  - Increase scalability to accommodate PB-scale data
  - Improve space efficiency to reduce costs
  - Save bandwidth to adapt to the low bandwidth of WANs
- Components (architecture diagram)
  - Backup Agent with source-side de-duplication
  - Metadata Agent
  - A cluster of metadata servers (a write master and read slaves) holding the metadata of PB-scale data, with target-side de-duplication
  - Cloud storage providing PB-scale space for snapshots
- Techniques: RPC, parallel transmission, data/metadata separation
Storage Hierarchy
- Snapshot: a virtual file
- A snapshot consists of collections, a collection consists of blocks, and a block consists of chunks
- (Figure: two snapshots, A and B, sharing collections, blocks and chunks)
Mapping Blocks from Memory to Disk
- A block is identified by the tuple <collectionuuid, blockno, checksum, start, length>
- Components: Memory Block, Block Proxy and TAR Store
  - In memory: memory blocks, fronted by the Block Proxy
  - On disk: the TAR store, organized as collections
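The mapping above can be sketched in a few lines. The names `BlockKey` and `TarStoreReader` are illustrative stand-ins, not YuruBackup's actual classes; the point is that the 5-tuple alone is enough for a proxy to locate a block's bytes inside an on-disk collection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockKey:
    """The 5-tuple from the slide that maps an in-memory block
    to its on-disk location inside a collection's TAR store."""
    collection_uuid: str
    block_no: int
    checksum: str
    start: int    # byte offset of the block inside the collection
    length: int   # block length in bytes

class TarStoreReader:
    """Toy stand-in for the Block Proxy: resolves a BlockKey to bytes
    by slicing the collection blob at [start, start + length)."""
    def __init__(self, collections):
        self.collections = collections  # collection_uuid -> bytes

    def read(self, key):
        blob = self.collections[key.collection_uuid]
        return blob[key.start:key.start + key.length]
```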
The Flow Chart of the Backup Process
1. Create a DB connection to the metadata catalog
2. Initialize the TAR store T
3. Initialize the Metadata Manager
4. Scan a directory to get a file list
5. While the file list is not empty:
   - Remove a file from the list and write its incremental backup into T
   - If T's size >= a given size, write T to disk and clear it
6. Release the Metadata Manager
7. Release the TAR store
8. Close the DB connection to the metadata catalog
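The drain-and-flush loop at the heart of the flow chart can be sketched as follows. This is a minimal in-memory model, not YuruBackup's implementation: the `TarStore` class, the flush threshold and the `(name, payload)` file representation are all assumptions for illustration, and the catalog/metadata steps are omitted.

```python
import io
import tarfile

class TarStore:
    """Toy stand-in for the TAR store T: buffers files in an in-memory
    tar archive and 'flushes' completed archives to a list."""
    def __init__(self, flush_size):
        self.flush_size = flush_size
        self.archives = []   # stands in for archives written to disk
        self._reset()

    def _reset(self):
        self.buf = io.BytesIO()
        self.tar = tarfile.open(fileobj=self.buf, mode="w")
        self.count = 0

    def add(self, name, payload):
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        self.tar.addfile(info, io.BytesIO(payload))
        self.count += 1

    def size(self):
        return self.buf.tell()

    def flush(self):
        if self.count == 0:
            return
        self.tar.close()
        self.archives.append(self.buf.getvalue())
        self._reset()

def backup(files, store):
    """Drain the file list as in the flow chart."""
    files = list(files)
    while files:                    # "The file list is empty?" -> No
        name, payload = files.pop() # remove a file, back it up into T
        store.add(name, payload)
        if store.size() >= store.flush_size:
            store.flush()           # write T into disk and clear it
    store.flush()                   # final partial archive, if any
```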
Backup Client
It provides a functional interface to users: backup and restoration.
- Read/write buffer to reduce I/O requests
- Compressed Bloom filter and Berkeley DB to locate items
- Source-side de-duplication with content-defined chunking
- Transmission: batched RPC and parallel uploading
Source-side De-duplication
- Rabin's fingerprinting
  - Given a string A = a_m a_{m-1} ... a_1, a k-bit Rabin fingerprint is computed as follows:
  - Let A(t) = a_m t^{m-1} + a_{m-1} t^{m-2} + ... + a_1
  - Choose an irreducible polynomial P(t) = p_k t^k + p_{k-1} t^{k-1} + ... + p_0
  - Compute Rabin's fingerprint f(A) = A(t) mod P(t)
- Content-defined chunking (SOSP '01)
  - Slide a window of width w over the data; declare a chunk boundary wherever low_order(f, k) = c, yielding chunks C_1, C_2, C_3, ...
[1] Muthitacharoen A., Chen B., Mazières D. A low-bandwidth network file system. In: Proc. of the 18th ACM Symp. on Operating Systems Principles (SOSP 2001). New York: ACM Press, 2001. 174-187.
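The chunking rule can be sketched as below. This is an illustrative model, not YuruBackup's code: a simple multiplicative rolling hash stands in for a true irreducible-polynomial Rabin fingerprint, and the window, mask and size limits are assumed parameters. Because boundaries depend only on local content, inserting bytes early in a file shifts chunk boundaries only locally, which is what makes this chunking de-duplication-friendly.

```python
def cdc_chunks(data, window=48, mask_bits=11, min_size=256, max_size=4096):
    """Content-defined chunking sketch (after LBFS, SOSP '01): declare a
    chunk boundary wherever the low-order `mask_bits` bits of a rolling
    hash over the last `window` bytes equal a fixed constant."""
    MOD = (1 << 61) - 1          # large prime modulus
    BASE = 257                   # polynomial base of the rolling hash
    pow_w = pow(BASE, window, MOD)
    mask = (1 << mask_bits) - 1  # expected chunk size ~ 2**mask_bits

    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD                      # shift new byte in
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD  # drop oldest byte
        size = i - start + 1
        at_boundary = (h & mask) == mask              # low_order(f, k) == c
        if size >= max_size or (size >= min_size and at_boundary):
            chunks.append(bytes(data[start:i + 1]))
            start = i + 1
    if start < len(data):
        chunks.append(bytes(data[start:]))
    return chunks
```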
Duplication Detection Based on a Bloom Filter
- Observations
  - Most files are never changed after their creation (ATC '04)
  - Over 2/3 of files have not been modified (FAST '07)
- Index summary based on a compressed Bloom filter (ACM '70, PODC '01)
  - Approximate set membership problem
  - Trade-off between space and false positive probability
- Three functions
  1) Initialize(initElementCount, desiredfpp)
  2) Insert(fingerprint)
  3) Lookup(fingerprint)
[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 1970.
[2] Mitzenmacher M. Compressed Bloom filters. In: Twentieth ACM Symposium on Principles of Distributed Computing (PODC 2001), August 2001.
Metadata Server
- Communication with clients
  - A single, batched and asynchronous lookup RPC for n fingerprints
  - The callback function enqueues the updated request
- Global fingerprint lookup and store
  - Global index summary
  - Global target-side de-duplication
  - FP Lookup and FP Store
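How the client-side Bloom filter and the batched server lookup fit together can be sketched as one round; plain `set` objects stand in for the compressed Bloom filter and the server's fingerprint store, and the function is an assumed simplification of the asynchronous RPC.

```python
def dedup_round(local_bf, server_store, fingerprints):
    """One de-duplication round, sketched. Fingerprints that hit the
    client's Bloom filter may still be new (false positives), so the
    client batches all candidates into a single lookup RPC; the server
    replies which ones it genuinely stores (FP Lookup), records the new
    ones (FP Store), and the client uploads only the truly new chunks."""
    maybe_dup = [fp for fp in fingerprints if fp in local_bf]
    definitely_new = [fp for fp in fingerprints if fp not in local_bf]
    # One batched RPC covers all n candidate duplicates at once.
    confirmed = {fp for fp in maybe_dup if fp in server_store}
    to_upload = definitely_new + [fp for fp in maybe_dup
                                  if fp not in confirmed]
    server_store.update(to_upload)  # FP Store for newly seen fingerprints
    return to_upload
```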
Highly Scalable Cluster of MDSs
- Data replication (MySQL replication): makes reads scalable; provides failover
- Data partitioning (MySQL Cluster): makes writes scalable
- Load balancing: aware of which nodes are readable and which are writable
- Topology (diagram): YuruBackup clients -> load balancer -> master SQL nodes with NDB (writes) and slave SQL nodes with NDB+InnoDB (reads), replicating through the data nodes
Demo of YuruBackup
- Chunk partition
- Duplication detection
An Example of a Snapshot (5 New Blocks)
(Figure: blocks B_1, B_3, B_5, B_7 and B_12 are new in this snapshot)
An Example of Incremental Backup
(Figure: incremental backup from emacs-23.2a to emacs-23.3a)
Comparison

ReducedRatio = (#BytesSentByRsync - #BytesOfData - #BytesOfMetadata) / #BytesSentByRsync

Table 1. rsync vs. YuruBackup. The parenthesized value is the total size of the newer dataset version; YuruBackup's transfer is split into data and metadata.

Dataset    | Non-overlap   | rsync transferred | YuruBackup transferred |        |         | ReducedRatio
           | data (MB)     | data (MB)         | Data (MB) | Metadata (MB) | # old chunks | # new chunks | (%)
Emacs      | 140.2  | 155.7 (155.9) | 60.4  | 1.6  | 15,731 | 11,484 | 61.23
Eclipse    | 234.4  | 233.0 (234.9) | 220.3 | 1.1  | 277    | 84,317 | 5.53
GCC        | 107.8  | 94.7 (428.6)  | 37.8  | 23.8 | 12,386 | 9,659  | 60.05
Hadoop-src | 93.0   | 210.8 (214.1) | 57.5  | 2.4  | 5,365  | 15,420 | 72.73
Hadoop-bin | 37.2   | 110.1 (110.5) | 27.8  | 0.2  | 656    | 10,489 | 74.71
Lucene-src | 17.1   | 14.8 (64.8)   | 6.2   | 1.2  | 296    | 1,590  | 58.02
Lucene-bin | 143.1  | 153.9 (156.4) | 132.6 | 2.8  | 2,191  | 26,200 | 13.79
Hive-src   | 94.9   | 94.1 (144.0)  | 48.0  | 3.0  | 11,072 | 7,885  | 48.94
Hive-bin   | 5.7    | 7.2 (21.7)    | 5.5   | 0.1  | 0      | 1,660  | 23.62
HBase      | 48.0   | 68.7 (97.5)   | 28.0  | 1.3  | 3,462  | 4,375  | 59.25
Average    | 92.14  | 114.3 (162.8) | 62.41 | 3.75 | 5,144  | 17,308 | 47.79
Others
- YuruBackup is deployed atop Amazon S3
  - Metadata servers are running in EC2
  - Will be deployed on other cloud platforms
- Performance evaluation
  - De-duplication efficiency
  - De-duplication overhead
  - Scalability
  - Backup window
  - Fine-granularity restoration, etc.
Current Development Status
Program directories (~12,000 LOC in total) include:
- include: header files, ~1,200 LOC
- src: source files, ~5,200 LOC
Thank you! Q&A
Dataset

OverlapRatio = OverlapDataSize / TransferredDataSize

Dataset    | Version    | # Files | Data size (MB) | # Overlap files | Overlap data size (MB) (%)
Emacs      | 23.2a      | 4,321  | 155.4 |        |
           | 23.3a      | 4,331  | 155.9 | 957    | 15.7 (10.09)
Eclipse    | galileo    | 2,587  | 225.9 |        |
           | Helios-SR2 | 2,754  | 234.9 | 33     | 0.5 (0.21)
GCC        | 4.6.0      | 71,103 | 427.2 |        |
           | 4.6.1      | 71,376 | 428.6 | 70,545 | 320.8 (74.86)
Hadoop-src | 0.20.204.0 | 5,811  | 208.0 |        |
           | 0.20.205.0 | 6,004  | 214.1 | 3,246  | 121.1 (56.56)
Hadoop-bin | 0.20.204.0 | 507    | 105.0 |        |
           | 0.20.205.0 | 538    | 110.5 | 429    | 73.3 (66.36)
Dataset (continued)

Dataset    | Version | # Files | Data size (MB) | # Overlap files | Overlap data size (MB) (%)
Lucene-src | 3.3.0  | 2,644 | 62.4  |       |
           | 3.4.0  | 2,956 | 64.8  | 2,226 | 47.7 (73.58)
Lucene-bin | 3.3.0  | 6,520 | 136.9 |       |
           | 3.4.0  | 7,150 | 156.4 | 208   | 13.3 (8.51)
Hive-src   | 0.7.0  | 7,934 | 143.7 |       |
           | 0.7.1  | 7,976 | 144.0 | 3,720 | 49.1 (34.10)
Hive-bin   | 0.7.0  | 280   | 21.6  |       |
           | 0.7.1  | 295   | 21.7  | 257   | 16.0 (73.88)
HBase      | 0.90.3 | 3,428 | 97.2  |       |
           | 0.90.4 | 3,444 | 97.5  | 1,477 | 49.6 (50.81)

Overlap measured with the Linux shell command: diff -urNas v1 v2
The rsync Algorithm
A holds f.old; B holds f.new.
1. A computes the checksum of each block S_i in file f.old.
2. A sends the checksums to B.
3. B searches the file f.new and finds the differences between f.old and f.new.
4. B tells A how to construct file f.new from f.old and the literal data.
Each checksum consists of a rolling 32-bit checksum (an Adler-32-style checksum) and a 128-bit MD4 checksum.
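The rolling property of the weak checksum is what makes step 3 efficient: sliding the block window one byte is an O(1) update rather than a full recomputation. A sketch of the Adler-style weak checksum follows (the strong MD4 checksum is omitted; function names are illustrative).

```python
M = 1 << 16  # both halves of the checksum are kept mod 2^16

def weak_checksum(block):
    """rsync-style weak checksum: a = sum of bytes, b = sum of running
    partial sums, packed as (b << 16) | a."""
    L = len(block)
    a = sum(block) % M
    b = sum((L - i) * x for i, x in enumerate(block)) % M
    return (b << 16) | a

def roll(checksum, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1):
    a' = a - out + in,  b' = b - L*out + a'."""
    a = checksum & 0xFFFF
    b = checksum >> 16
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return (b << 16) | a
```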