Two-Level Metadata Management for Data Deduplication System

Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Young Woong Ko 1

1 Dept. of Computer Engineering, Hallym University, Chuncheon, Korea
{kongjs, yuko}@hallym.ac.kr
2 Dept. of Computer Science and Engineering, Korea University, Seoul, Korea
mjfeel@korea.ac.kr
3 Dept. of Computer Science, Dongduk Women's University, Seoul, Korea
wanlee@dongduk.ac.kr

Abstract. Data deduplication is an essential technique for reducing storage space requirements. Chunking-based deduplication is especially effective for backup workloads, which tend to consist of files that evolve slowly, mainly through small changes and additions. In this paper, we introduce a data deduplication scheme that runs quickly and works efficiently over low-bandwidth networks. Its key elements are tree-map-based metadata searching and the classification of metadata as either global or local; these are the main factors behind the fast performance of the proposed deduplication system.

Keywords: File, Synchronization, Chunking, Hash

1 Introduction

Data deduplication reduces storage space by eliminating redundant data so that only a single instance of each piece of data is stored on the storage medium. It has also drawn attention as a means of dealing with very large data sets and is regarded as an enabling technology. Several well-known file deduplication schemes exist, including Rsync [1], Venti [2], and multi-mode deduplication [3]. Deduplication is also used in cloud storage: Dropbox, for example, applies chunking-based deduplication so that it can reduce network bandwidth when data is transferred between client and server. One key drawback of the Dropbox approach, however, is its fixed chunk size. Because files are split into fixed-size chunks, the metadata of a very large file also grows considerably.

1 This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0009358), and by the MKE, Korea and NHN under the IT/SW Creative Research Program supervised by the NIPA (NIPA-2012).
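For concreteness, the sketch below illustrates the idea behind variable-length (content-defined) chunking, in which chunk boundaries are derived from the data itself rather than placed at fixed offsets, so a small edit only disturbs the chunks around it. The hash formula, cut-point mask, and size bounds here are illustrative assumptions, not the exact chunker used in this system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of variable-length (content-defined) chunking.
 *  Constants and the hash are illustrative only; production chunkers
 *  typically use a Rabin-style rolling hash over a sliding window. */
public class SimpleChunker {
    static final int MIN_CHUNK = 2 * 1024;    // lower bound on chunk size
    static final int MAX_CHUNK = 64 * 1024;   // upper bound on chunk size
    static final int MASK      = 0x1FFF;      // roughly 8 KB average chunk size

    /** Returns the end offsets of the variable-length chunks of a file. */
    public static List<Long> chunkBoundaries(String path) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(path));
        List<Long> boundaries = new ArrayList<>();
        int hash = 0, chunkLen = 0;
        for (int i = 0; i < data.length; i++) {
            // Toy content-dependent hash accumulated since the last cut point.
            hash = (hash * 31 + (data[i] & 0xFF)) & 0xFFFFFF;
            chunkLen++;
            boolean cut = chunkLen >= MIN_CHUNK && (hash & MASK) == 0;
            if (cut || chunkLen >= MAX_CHUNK) {
                boundaries.add((long) (i + 1));  // chunk ends after byte i
                hash = 0;
                chunkLen = 0;
            }
        }
        if (chunkLen > 0) boundaries.add((long) data.length); // trailing chunk
        return boundaries;
    }
}
```

Because the cut points depend on the data itself, an insertion near the beginning of a file leaves most later chunks, and therefore most of their hashes, unchanged, which is what makes chunk-level deduplication effective for slowly evolving backup files.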
In this paper, we propose two-level metadata management for efficient data deduplication. The proposed system has several characteristics that give it fast performance. If the size of a file is smaller than a threshold value, the metadata of the file belongs to the local metadata. The server creates a private metadata folder for each user, and only the owner has permission to access the corresponding folder. When a user backs up a file covered by local metadata, the server loads the metadata from that user's private folder to carry out the deduplication process. Conversely, if the file size is larger than the threshold value, the file metadata belongs to the global metadata. The server does not need a per-user folder for such files, and every user can access the global metadata. When a file classified as global is backed up or restored, the server loads the metadata from the global folder to execute the process. These two points are the key factors behind the fast, low-bandwidth operation of the proposed system.

2 System Design and Implementation

Figure 1 shows the overall architecture of the proposed system. We adopt a VLC (variable-length chunking) approach for block chunking and a source-based data deduplication approach: the server manages metadata by storing and comparing the hashes sent from the client.

Figure 1. Tree map hierarchy of the chunk manager

The system is divided into several parts: the chunk manager, the chunker, and the protocol interface. These modules reduce network bandwidth and speed up backups on the client side. The main source of the system's speed advantage over other deduplication systems is the separation of metadata into per-user local metadata and global metadata.

The chunk manager uses a tree map that is synchronized internally. Each hash value is a 20-byte digest handled in hexadecimal form, and the tree map keeps these keys in sorted order together with the associated chunk information. Since the tree map organizes its entries as a red-black tree, lookups are fast compared with other data structures, and the server's chunks are stored in the tree map sorted by hash value. The chunk manager provides methods for finding the metadata chunks that overlap between client and server (comparemeta), for loading metadata from disk into memory (startmanager), and for saving metadata from memory back to disk (stopmanager).
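The description above maps naturally onto a sorted map keyed by the hexadecimal hash string. The sketch below shows one way the chunk manager could be structured around Java's TreeMap (a red-black tree) wrapped with Collections.synchronizedSortedMap; the class, method, and persistence-format choices are illustrative assumptions rather than the authors' actual implementation.

```java
import java.io.*;
import java.util.*;

/** Sketch of a chunk manager backed by a synchronized TreeMap (red-black
 *  tree) keyed by the 40-character hex form of a 20-byte SHA-1 chunk hash.
 *  Names and the serialization format are illustrative assumptions. */
public class ChunkManager {

    /** Per-chunk metadata: where the chunk lives in the backup store. */
    public static class ChunkInfo implements Serializable {
        public final long offset;
        public final int size;
        public ChunkInfo(long offset, int size) { this.offset = offset; this.size = size; }
    }

    // Sorted, internally synchronized index of known chunks.
    private final SortedMap<String, ChunkInfo> index =
            Collections.synchronizedSortedMap(new TreeMap<>());

    private final File metadataFile;

    public ChunkManager(File metadataFile) { this.metadataFile = metadataFile; }

    /** startmanager: load the on-disk metadata into memory. */
    @SuppressWarnings("unchecked")
    public void startManager() throws IOException, ClassNotFoundException {
        if (!metadataFile.exists()) return;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(metadataFile))) {
            index.putAll((Map<String, ChunkInfo>) in.readObject());
        }
    }

    /** stopmanager: persist the in-memory metadata back to disk. */
    public void stopManager() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(metadataFile))) {
            synchronized (index) {                // snapshot under the map's lock
                out.writeObject(new TreeMap<>(index));
            }
        }
    }

    /** comparemeta: return the client hashes the server does not yet hold,
     *  i.e. the chunks the client must actually transmit. */
    public List<String> compareMeta(List<String> clientHashes) {
        List<String> missing = new ArrayList<>();
        for (String h : clientHashes) {
            if (!index.containsKey(h)) missing.add(h);
        }
        return missing;
    }

    /** Record a newly stored chunk under its hex hash key. */
    public void addChunk(String hexHash, long offset, int size) {
        index.put(hexHash, new ChunkInfo(offset, size));
    }
}
```

Keeping the index sorted by hash makes ordered persistence and range scans straightforward, while the synchronized wrapper covers the concurrent access that the shared global metadata requires.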
If a file is larger than the predefined size, the server stores its metadata in the public area. We call this global data, and synchronization is required because every user can access this folder. The global chunk manager is responsible for managing the global data.

When a user connects to the system, the server loads that user's local metadata into memory. The local metadata describes files whose size is below the predefined value and is kept separately for each user. Because the local chunk manager saves and loads this per-user metadata independently, individual files can be updated quickly and efficiently.

Figure 2 shows the data deduplication processing flow of the proposed system. The numbers in the figure illustrate the sequence of deduplication steps between the server and the client.

Figure 2. Flow of data deduplication

1. When the system starts, the server loads the global metadata into memory through the global chunk manager.
2. The global metadata can be accessed by all users and is managed by the global chunk manager.
3. The client connects to the server on port 2121, and the login process completes after the user enters a user ID and password.
4. The server and the client then exchange response messages and commands through the protocol interface.
5. When the client sends a backup request to the server, it passes the selected file to the chunker, which returns the metadata (hash value, offset, and chunk size) of the file to be transmitted.
6. The metadata of the client's file is transmitted by the data transfer process.
7. The transmitted data is deduplicated and backed up through the global and local chunk managers after the metadata comparison step; a minimal server-side sketch of this step follows the list.
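Reusing the ChunkManager sketch above, the following fragment shows how the server-side comparison in steps 6 and 7 could look: the file's size decides whether its metadata is checked against the shared global index or the connected user's local index, and only the hashes the server does not already hold are requested back from the client. The threshold value and all names are illustrative assumptions.

```java
import java.util.List;

/** Sketch of the server-side deduplication step: route the file's metadata
 *  to the global or the per-user local chunk manager by a size threshold,
 *  then ask the client only for the chunks the server is missing. */
public class DedupServer {
    static final long SIZE_THRESHOLD = 64L * 1024 * 1024;  // assumed split point

    private final ChunkManager globalManager;  // shared by all users
    private final ChunkManager localManager;   // private to the connected user

    public DedupServer(ChunkManager globalManager, ChunkManager localManager) {
        this.globalManager = globalManager;
        this.localManager = localManager;
    }

    /** Decide which chunks must actually be transmitted for a backup request. */
    public List<String> chunksToRequest(long fileSize, List<String> clientHashes) {
        // Two-level rule: large files are deduplicated against the shared global
        // metadata, small files against the user's own local metadata.
        ChunkManager manager = (fileSize > SIZE_THRESHOLD) ? globalManager : localManager;
        return manager.compareMeta(clientHashes);
    }
}
```

Under this scheme only the missing chunks cross the network, which is where the bandwidth savings over a plain FTP transfer come from.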
3 Experiment Result

In this work, we evaluate the proposed data deduplication system, which uses dynamic (variable-length) chunking. The server and the client each run on a 3 GHz Pentium 4 processor with a WD-1600JS hard disk and a 100 Mbps network, and the software is implemented on Linux kernel 2.6.18 (Fedora Core 9). The experimental data set was produced by modifying a file in a random manner: using the lseek() function on Linux, we moved to randomly generated file offsets and applied patches to create the test data files (a sketch of this patching procedure is given at the end of this section). The input files for the experiments were patched so that they are 40% and 60% duplicates of the original.

Figure 3. Evaluation result of network bandwidth

Figure 3 shows the network usage measured when data is transferred using the global metadata. The data set consists of a 2 GB original file and copies of it that are 40% and 60% duplicated. User1, User2, and User3 transfer the original, the 40% duplicated, and the 60% duplicated versions of the file, respectively, between client and server. Plain FTP transfers the full 2 GB file in every case, whereas the proposed system consumes considerably less network bandwidth by transferring the files with the help of the global metadata.
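The test files above were produced by seeking to random offsets and overwriting data there. The sketch below reproduces that procedure; the original experiment called lseek() directly on Linux, whereas this version uses Java's RandomAccessFile.seek() as the analogous call, and the patch block size is an assumption.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/** Sketch of test data generation: overwrite random regions of a copy of the
 *  original file so that roughly (1 - modifyRatio) of its content stays
 *  identical to the original. Block size and method names are assumptions. */
public class TestDataPatcher {
    static final int PATCH_BLOCK = 4 * 1024;   // size of each overwritten region

    public static void patchFile(String path, double modifyRatio, long seed) throws IOException {
        Random rnd = new Random(seed);
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            long length = f.length();
            if (length <= PATCH_BLOCK) return;             // nothing sensible to patch
            long bytesToModify = (long) (length * modifyRatio);
            byte[] junk = new byte[PATCH_BLOCK];
            for (long done = 0; done < bytesToModify; done += PATCH_BLOCK) {
                rnd.nextBytes(junk);                       // random replacement data
                long offset = (long) (rnd.nextDouble() * (length - PATCH_BLOCK));
                f.seek(offset);                            // analogue of lseek()
                f.write(junk);
            }
        }
    }
}
```

For example, calling patchFile("copy.dat", 0.4, 42) on a copy of the original would, under these assumptions, leave roughly 60% of the content duplicated with respect to the original file.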
4 Conclusion

In this paper, we proposed two-level metadata management for efficient data deduplication. The main idea is to separate metadata management according to file size. If the size of a file is smaller than a threshold value, the metadata of the file belongs to the local metadata; the server creates a private metadata folder for each user, and only the owner has permission to access that folder. When a user backs up a file covered by local metadata, the server loads the metadata from the private folder to carry out the deduplication process. Conversely, if the file size is larger than the threshold value, the file metadata belongs to the global metadata, and when such a file is restored or backed up, the server loads the metadata from the global folder to execute the process. Our approach achieves fast, low-bandwidth operation compared with a plain FTP approach.

References

1. Tridgell, A.: Efficient algorithms for sorting and synchronization. PhD thesis, The Australian National University (1999)
2. Quinlan, S., Dorward, S.: Venti: a new approach to archival storage. In: Proceedings of the FAST 2002 Conference on File and Storage Technologies, Vol. 4 (2002)
3. Jung, H., Park, W., Lee, W., Lee, J., Ko, Y.: Data deduplication system for supporting multi-mode. In: Proceedings of the Third International Conference on Intelligent Information and Database Systems, Part I, Springer-Verlag (2011)