Two-Level Metadata Management for Data Deduplication System



Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Young Woong Ko 1

1 Dept. of Computer Engineering, Hallym University, Chuncheon, Korea {kongjs, yuko}@hallym.ac.kr
2 Dept. of Computer Science and Engineering, Korea University, Seoul, Korea mjfeel@korea.ac.kr
3 Dept. of Computer Science, Dongduk Women's University, Seoul, Korea wanlee@dongduk.ac.kr

Abstract. Data deduplication is an essential technique for reducing storage space requirements. Chunking-based deduplication is particularly effective for backup workloads, which tend to consist of files that evolve slowly, mainly through small changes and additions. In this paper, we introduce a data deduplication scheme that can be used efficiently and quickly over low-bandwidth networks. Its key points are tree-map-based searching and the classification of metadata into global and local metadata; these are the main factors behind the fast performance of the proposed deduplication system.

Keywords: File, Synchronization, Chunking, Hash

1 Introduction

Data deduplication reduces storage space by eliminating redundant data so that only a single instance of each piece of data is stored on the storage medium. Deduplication has also drawn attention as a means of dealing with large volumes of data and is regarded as an enabling technology. Currently, there are several well-known file deduplication schemes, including Rsync [1], Venti [2] and multi-mode deduplication [3].

Deduplication is also used in cloud storage. For example, Dropbox adopts VLC (Variable-Length Chunking) for its data deduplication, which reduces the network bandwidth needed for data transfer between client and server. One of the key drawbacks of the Dropbox approach is its fixed chunk size: Dropbox uses fixed-size chunks for deduplication, so for very large files the metadata also grows very large.

In this paper, we propose two-level metadata management for efficient data deduplication. The proposed system is designed for fast performance. If a file is smaller than a threshold value, its metadata belongs to the local metadata. The server creates a private metadata folder for each user, and only the owner has permission to access the corresponding folder. When a user backs up a file whose metadata is local, the server loads the metadata from that private folder to carry out the deduplication process. Conversely, if a file is larger than the threshold value, its metadata belongs to the global metadata. The server does not need to create a per-user folder for it, and every user can access it. When a file classified as global is backed up or restored, the server loads the metadata from the global folder to execute the process. These are the key factors behind the speed and low bandwidth consumption of the proposed system.

Acknowledgments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0009358), and by the MKE, Korea and NHN under the IT/SW Creative Research Program supervised by the NIPA (NIPA-2012).

2 System Design and Implementation

Figure 1 shows the overall architecture of the proposed system. We adopt a VLC approach for block chunking and a source-based data deduplication approach. The server manages metadata by preserving the hashes received from the client and comparing them against the hashes it has stored.

Figure 1. Tree map hierarchy of the chunk manager.

The system is divided into several parts: the chunk manager, the chunker and the protocol interface. These modules reduce network bandwidth and speed up backup on the client side. The main reason the system performs faster than other deduplication systems is the separation of metadata into per-user local metadata and global metadata.

The chunk manager uses a tree map that is synchronized internally. Hash values are expressed and managed as 20-byte values in hexadecimal form. The tree map keeps these keys in sorted order together with the associated chunk data. Because the tree map organizes its entries as a red-black tree, lookups are very fast compared with other data structures. Server-side chunks are stored in the tree map sorted by their hash values. The chunk manager provides methods for finding overlapping chunks between the client and server metadata (compareMeta), loading the metadata from disk into memory (startManager) and saving it from memory back to disk (stopManager).
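The paper does not include source code for the chunk manager; the following is only a minimal sketch of the described design, assuming a Java TreeMap wrapped for synchronization as the internally synchronized tree map, hexadecimal strings of a 20-byte hash (e.g. SHA-1) as keys, and Java object serialization as the on-disk format. The method names compareMeta, startManager and stopManager follow the paper; ChunkInfo and the other details are illustrative assumptions.

```java
import java.io.*;
import java.util.*;

/** Minimal sketch of a chunk manager backed by a sorted, synchronized tree map. */
public class ChunkManager {
    // Key: hexadecimal string of the 20-byte chunk hash, kept sorted by the red-black tree.
    // Value: where the chunk lives on the server (hypothetical ChunkInfo record).
    private final SortedMap<String, ChunkInfo> index =
            Collections.synchronizedSortedMap(new TreeMap<>());
    private final File store;   // on-disk metadata file (assumed format)

    public ChunkManager(File store) { this.store = store; }

    /** startManager: load the chunk metadata from disk into the in-memory tree map. */
    @SuppressWarnings("unchecked")
    public void startManager() throws IOException, ClassNotFoundException {
        if (!store.exists()) return;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(store))) {
            index.putAll((Map<String, ChunkInfo>) in.readObject());
        }
    }

    /** stopManager: persist the in-memory tree map back to disk. */
    public void stopManager() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(store))) {
            synchronized (index) { out.writeObject(new TreeMap<>(index)); }
        }
    }

    /** compareMeta: return the client hashes that the server does not yet have. */
    public List<String> compareMeta(List<String> clientHashes) {
        List<String> missing = new ArrayList<>();
        for (String h : clientHashes) {
            if (!index.containsKey(h)) missing.add(h);
        }
        return missing;
    }

    /** Register a newly received chunk in the index. */
    public void addChunk(String hexHash, ChunkInfo info) { index.put(hexHash, info); }
}

/** Hypothetical per-chunk metadata: offset and length inside the server-side chunk store. */
class ChunkInfo implements Serializable {
    final long offset;
    final int length;
    ChunkInfo(long offset, int length) { this.offset = offset; this.length = length; }
}
```

Because the map keeps its keys sorted, lookups remain O(log n), which is consistent with the paper's claim that the red-black-tree-backed tree map makes chunk comparison fast.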

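The chunker module mentioned above splits each file with variable-length chunking and reports (hash value, offset, chunk size) tuples to the server. The paper does not give its chunking algorithm or parameters, so the following is only a generic content-defined chunking sketch; the toy rolling hash, the SHA-1 digest and the minimum/maximum/average chunk sizes are all assumptions.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of variable-length (content-defined) chunking with a simple rolling hash. */
public class Chunker {
    // Illustrative parameters; the paper does not specify its chunking configuration.
    private static final int MIN_CHUNK = 2 * 1024;
    private static final int MAX_CHUNK = 64 * 1024;
    private static final int BOUNDARY_MASK = 0x1FFF;   // roughly 8 KB average chunk size

    /** A chunk's metadata: hash value, offset and chunk size, as sent to the server. */
    public record ChunkMeta(String hexHash, long offset, int size) {}

    public static List<ChunkMeta> chunk(byte[] data) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");   // 20-byte hash assumed
        List<ChunkMeta> chunks = new ArrayList<>();
        int start = 0;
        int rolling = 0;
        for (int i = 0; i < data.length; i++) {
            rolling = (rolling << 1) + (data[i] & 0xFF);            // toy rolling hash
            int len = i - start + 1;
            boolean boundary = len >= MIN_CHUNK && (rolling & BOUNDARY_MASK) == 0;
            if (boundary || len >= MAX_CHUNK || i == data.length - 1) {
                byte[] digest = sha1.digest(Arrays.copyOfRange(data, start, i + 1));
                chunks.add(new ChunkMeta(toHex(digest), start, len));
                start = i + 1;
                rolling = 0;
            }
        }
        return chunks;
    }

    private static String toHex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }
}
```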
If a file is larger than the predefined value, the server stores its metadata in the public area. We call this global metadata; because every user can access this folder, access to it must be synchronized. The global chunk manager is responsible for managing the global metadata.

The server loads the corresponding user's local metadata into memory when that user connects to the system. The loaded local metadata describes the user's files whose size is below the predefined value, and it is managed per user. The local chunk manager saves and loads this metadata, so individual files can be updated quickly and efficiently because the metadata is kept separately for each user.
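As a rough illustration of this two-level separation, the sketch below routes a file's chunk metadata either to the shared global chunk manager or to the owner's private local chunk manager, depending on a size threshold. The threshold value, folder layout and class names are assumptions (the paper only says a predefined value is used), and the sketch reuses the hypothetical ChunkManager from the earlier example.

```java
import java.io.File;

/** Sketch: route metadata to the global or the per-user local chunk manager by file size. */
public class MetadataRouter {
    // Hypothetical threshold; the paper does not state the predefined value.
    private static final long SIZE_THRESHOLD = 64L * 1024 * 1024;   // e.g. 64 MB

    private final ChunkManager globalManager;   // shared by all users, access synchronized
    private final File localRoot;               // root of the per-user private metadata folders

    public MetadataRouter(ChunkManager globalManager, File localRoot) {
        this.globalManager = globalManager;
        this.localRoot = localRoot;
    }

    /** Return the chunk manager that should hold the metadata of the given file. */
    public ChunkManager managerFor(String userId, long fileSize) throws Exception {
        if (fileSize > SIZE_THRESHOLD) {
            return globalManager;                      // global metadata, visible to every user
        }
        // Local metadata: one private folder per user, only the owner may access it.
        File userDir = new File(localRoot, userId);
        userDir.mkdirs();
        ChunkManager local = new ChunkManager(new File(userDir, "chunks.meta"));
        local.startManager();                          // load this user's metadata into memory
        return local;
    }
}
```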

Figure 2 shows the data deduplication processing flow of the proposed system. The numbers in the figure indicate the sequence of deduplication steps between the server and the client.

Figure 2. Flow of data deduplication.

1. The server loads the global metadata into memory through the global chunk manager when the system starts.
2. The global metadata can be accessed by all users and is managed by the global chunk manager.
3. The client connects to the server on port 2121; the login process is completed after the user enters a user ID and password.
4. The server and the client exchange response messages and commands through the protocol interface.
5. When the client sends a backup request to the server, it passes the selected file to the chunker. The chunker returns the metadata (hash value, offset and chunk size) of the file to be transmitted.
6. The metadata of the client file is transmitted by the data transfer process.
7. The transmitted data is deduplicated and backed up through the global chunk manager and the local chunk manager, using the metadata comparison process.

3 Experiment Result

In this work, we evaluate the data deduplication system using dynamic chunking. The server and the client platforms each consist of a 3 GHz Pentium 4 processor, a WD-1600JS hard disk and a 100 Mbps network. The software is implemented on Linux kernel version 2.6.18 (Fedora Core 9). We created the experimental data set by modifying a file in a random manner: we modified a data file using the lseek() function on Linux with randomly generated file offsets and applied a patch to produce the test data files. The input data of the experiments are patched at 40% and 80%.
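The paper generated its test data with a small Linux tool based on lseek(), which is not listed. As an assumed equivalent, the sketch below overwrites part of a copied file at random offsets so that roughly a chosen fraction of the original data remains duplicated; the patch granularity, ratio handling and file names are illustrative only.

```java
import java.io.RandomAccessFile;
import java.util.Random;

/** Sketch: create a test file that keeps roughly a given fraction of the original data. */
public class TestDataGenerator {
    /**
     * Overwrites roughly (1 - duplicateRatio) of the file with random bytes at random
     * offsets, e.g. duplicateRatio = 0.4 leaves about 40% of the original data intact.
     */
    public static void patchRandomly(String path, double duplicateRatio, long seed) throws Exception {
        Random rnd = new Random(seed);
        byte[] block = new byte[4096];                       // assumed patch granularity
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            long length = f.length();
            long bytesToPatch = (long) (length * (1.0 - duplicateRatio));
            for (long patched = 0; patched < bytesToPatch; patched += block.length) {
                long offset = Math.floorMod(rnd.nextLong(), Math.max(1, length - block.length));
                rnd.nextBytes(block);
                f.seek(offset);                              // analogous to lseek() in the paper
                f.write(block);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: testfile_40.bin is assumed to be a copy of the original file;
        // keep about 40% of its data intact.
        patchRandomly("testfile_40.bin", 0.4, 1L);
    }
}
```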

Figure 3. Evaluation result of network bandwidth.

Figure 3 shows the network usage measured when data is transferred using the global metadata. The data set consists of a 2 GB original file and copies of it with 40% and 60% duplicated data. User1, User2 and User3 transfer the original, the 40% duplicated and the 60% duplicated version of the file, respectively, between client and server. The FTP scheme transfers the full 2 GB file, whereas the proposed system consumes much less network bandwidth when transferring the files using global metadata.

4 Conclusion

In this paper, we proposed two-level metadata management for efficient data deduplication. The main idea is to separate metadata management according to file size. If a file is smaller than the threshold value, its metadata belongs to the local metadata; on the contrary, if it is larger than the threshold value, its metadata belongs to the global metadata. For local metadata, the server creates a private metadata folder for each user, and only the owner has permission to access the corresponding folder; when a user backs up such a file, the server loads the metadata from that private folder to carry out the deduplication process. When a file classified as global is backed up or restored, however, the server loads the metadata from the global folder to execute the process. Our approach shows fast, low-bandwidth performance compared with the FTP approach.

References

1. Tridgell, A.: Efficient Algorithms for Sorting and Synchronization. PhD thesis, The Australian National University (1999)
2. Quinlan, S., Dorward, S.: Venti: A New Approach to Archival Storage. In: Proceedings of the FAST 2002 Conference on File and Storage Technologies, Vol. 4 (2002)
3. Jung, H., Park, W., Lee, W., Lee, J., Ko, Y.: Data Deduplication System for Supporting Multi-mode. In: Proceedings of the Third International Conference on Intelligent Information and Database Systems, Part I. Springer-Verlag (2011)