Security Ensured Redundant Data Management under Cloud Environment

K. Malathi 1, M. Saratha 2
1 PG Scholar, Dept. of CSE, Vivekanandha College of Technology for Women, Namakkal.
2 Assistant Professor, Dept. of CSE, Vivekanandha College of Technology for Women, Namakkal.

Abstract: Cloud backup services provide offsite storage for users with disaster recovery support. Deduplication methods are used to control the high data redundancy in backup datasets. Data deduplication is a data compression approach applied in communication and storage environments, and the deduplication process must respect the limited resource levels and I/O overhead of client devices. Source deduplication strategies fall into two categories: local source deduplication (LSD) and global source deduplication (GSD). LSD detects redundancy only in backup datasets from the same device at the client side and sends only the unique data chunks to cloud storage. GSD performs the duplicate check across backup datasets from all clients on the cloud side before data transfer over the WAN. Redundant data management is achieved using an Application-aware Local-Global source Deduplication scheme. A file size filter separates out the small files. An application-aware chunking strategy is used in the intelligent chunker to break up the backup data streams. The application-aware deduplicator deduplicates data chunks from the same type of files. A hash engine generates chunk fingerprints. The data redundancy check is carried out in application-aware indices in both the local client and the remote cloud. File metadata is updated with redundant chunk location details. Segments and the corresponding fingerprints are stored in the cloud data center using a self-describing data structure. The deduplication scheme is enhanced with encrypted data handling features: an encrypted cloud storage model secures personal data values. The scheme is also adapted to control data redundancy in the smart phone environment, and a file-level deduplication scheme is designed for the global-level deduplication process.

I. INTRODUCTION

Cloud computing is a novel paradigm that provides infrastructure, platform and software as a service. A cloud platform can be either virtualized or not. Virtualizing the cloud platform increases resource availability and the flexibility of resource management; it also reduces cost through hardware multiplexing and helps save energy. Virtualization is thus a key enabling technology of cloud computing. System virtualization refers to the software and hardware techniques that allow partitioning one physical machine into multiple virtual instances that run concurrently and share the underlying physical resources and devices.

The recent introduction of digital TV, digital camcorders and other communication technologies has rapidly accelerated the amount of data being maintained in digital form. In 2007, for the first time ever, the total volume of digital content exceeded the global storage capacity, and it was estimated that by 2011 only half of the digital information produced could be stored. Further, the volume of automatically generated information exceeds the volume of human-generated digital information. Compounding the problem of storage space, digitized information has a more fundamental problem: it is more vulnerable to error than information in legacy media, e.g., paper, books and film. When data is stored in a computer storage system, a single storage error or power failure can put a large amount of information in danger.

To protect against such problems, a number of technologies have been used to strengthen the availability and reliability of digital data, including mirroring, replication and adding parity information. In the application layer, the administrator replicates the data onto additional copies called backups so that the original information can be restored in case of data loss. Due to the exponential growth in the volume of digital data, the backup operation is no longer routine, and exploiting commonalities within a file or among a set of files when storing and transmitting content is no longer optional. By properly removing information redundancy in a file system, the amount of information to manage is effectively reduced, significantly lowering the time and space requirements of managing it.

The deduplication module partitions a file into chunks, generates the respective summary information, which we call a fingerprint, and looks up the Fingerprint Table to determine whether the respective chunk already exists. If it does not, the fingerprint value is inserted into the Fingerprint Table. Chunking and fingerprint management are the key technical constituents that govern overall deduplication performance. There are a number of ways of chunking, e.g., variable-size chunking, fixed-size chunking, or a mixture of both, and a number of ways of managing fingerprints. Legacy index structures, e.g., B+trees and hashing, do not fit the deduplication workload: a sequence of fingerprints generated from a single file does not yield any spatial locality in the Fingerprint Table. By the same token, a sequence of fingerprint lookup operations can result in random reads on the Fingerprint Table, so each fingerprint lookup can result in a disk access. Given that most deduplication operations need to be performed online, it is critical that fingerprint lookup and insertion are performed with minimal disk access.
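
A minimal sketch of this lookup-or-insert cycle, assuming fixed-size chunking, SHA-1 fingerprints and an in-memory Fingerprint Table; the names and the 8 KB chunk size are illustrative choices, not taken from the paper:

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size

def fingerprint(chunk: bytes) -> str:
    """Summary information for a chunk (SHA-1, common in dedupe systems)."""
    return hashlib.sha1(chunk).hexdigest()

def deduplicate(path: str, fingerprint_table: dict) -> list:
    """Return the file as a list of fingerprints; store only unique chunks."""
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            fp = fingerprint(chunk)
            if fp not in fingerprint_table:      # lookup in Fingerprint Table
                fingerprint_table[fp] = chunk    # insert the unique chunk
            recipe.append(fp)                    # duplicates become references
    return recipe
```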

II. RELATED WORK

Chunk-based deduplication is the most widely used deduplication method for secondary storage. Such a system breaks a data file or stream into contiguous chunks and eliminates duplicate copies by recording references to previous, identical chunks. Numerous studies have investigated content-addressable storage using whole files, fixed-size blocks, content-defined chunks and combinations or comparisons of these approaches [3]; generally, these have found that using content-defined chunks improves deduplication rates when small file modifications are stored. Once the data are divided into chunks, each chunk is represented by a secure fingerprint used for deduplication.

A technique to decrease the in-memory index requirements is presented in Sparse Indexing, which uses a sampling technique to reduce the size of the fingerprint index. The backup set is broken into relatively large regions in a content-defined manner similar to our super-chunks, each containing thousands of chunks. Regions are then deduplicated against a few of the most similar previously stored regions using a sparse, in-memory index, with only a small loss of deduplication. While Sparse Indexing is used in a single system to reduce its memory footprint, the notion of sampling within a region of chunks to identify other chunks against which new data may be deduplicated is similar to our sampling approach in stateful routing. We use those matches to direct data to a specific node, while they use matches to load a cache for deduplication.

Several other deduplication clusters have been presented in the literature. Bhagwat et al. [2] describe a distributed deduplication system based on Extreme Binning: data are forwarded and stored on a file basis, and the representative chunk ID is used to determine the destination. An incoming file is only deduplicated against a file with a matching representative chunk ID rather than against all data in the system. Note that Extreme Binning is intended for operations on individual files, not aggregates of all files being backed up together. In the latter case, this approach limits deduplication when inter-file locality is poor, suffers from increased cache misses and data skew, and requires multiple passes over the data when these aggregates are too big to fit in memory. DEBAR [10] also deduplicates individual files written to its cluster. Unlike our system, DEBAR deduplicates files partially as they are written to disk and completes deduplication during post-processing by sharing fingerprints between nodes. HYDRAstor [8] is a cluster deduplication storage system that creates chunks from a backup stream and routes them to storage nodes, and HydraFS [5] is a file system built on top of the underlying HYDRAstor architecture. Throughput of hundreds of MB/s is achieved on 4-12 storage nodes while using 64 KB chunks. Individual chunks are routed by evenly partitioning the fingerprint space across storage nodes, which is similar to the routing techniques used by Avamar [11] and PureDisk [7]. In comparison, our system uses larger super-chunks for routing to maximize cache locality and throughput, but smaller chunks for deduplication to achieve higher deduplication. Choosing the right chunking granularity presents a tradeoff between deduplication and system capacity and throughput even in a single-node system.

Bimodal chunking [9] is based on the observation that using large chunks reduces metadata overhead and improves throughput, but large chunks fail to recover some deduplication opportunities when they straddle the point where new data are added to the stream. Bimodal chunking tries to identify such points and uses a smaller chunk size around them for better deduplication.

III. DATA REDUNDANCY CONTROL SCHEMES FOR CLOUD SERVERS

Data deduplication is an effective data compression approach that exploits data redundancy: it partitions large data objects into smaller parts, called chunks, represents these chunks by their fingerprints, replaces duplicate chunks with their fingerprints after a chunk fingerprint index lookup, and only transfers or stores the unique chunks for the purpose of communication or storage efficiency. Source deduplication, which eliminates redundant data at the client site, is clearly preferred to target deduplication due to its ability to significantly reduce the amount of data transferred over a wide area network (WAN) with low communication bandwidth [1]. For a dataset with logical size L and physical size P, source deduplication can reduce the data transfer time to P/L of that of traditional cloud backup, as illustrated below. Data deduplication is, however, a resource-intensive process, entailing CPU-intensive hash calculations for chunking and fingerprinting and I/O-intensive operations for identifying and eliminating duplicate data. Such resources are limited in a typical personal computing device. Therefore, it is desirable to achieve a tradeoff between deduplication effectiveness and system overhead for personal computing devices with limited system resources.
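
As a worked illustration of the P/L claim (the sizes below are invented for the example, not taken from the paper's dataset):

```latex
\[
  T_{\mathrm{dedup}} \approx \frac{P}{L}\, T_{\mathrm{full}},
  \qquad \text{e.g., } L = 100~\mathrm{GB},\; P = 25~\mathrm{GB}
  \;\Rightarrow\; T_{\mathrm{dedup}} \approx 0.25\, T_{\mathrm{full}},
\]
```

i.e., a backup window roughly four times shorter at the same WAN bandwidth.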

In the traditional storage stack comprising applications, file systems and storage hardware, each layer holds different kinds of information about the data it manages, and such information in one layer is typically not available to the other layers. Co-design of storage and application makes it possible to optimize a deduplication-based storage system when the lower-level storage layer has extensive knowledge about the data structures and their access characteristics in the higher-level application layer. ADMAD improves redundancy detection by application-specific chunking methods that exploit knowledge about concrete file formats. ViDeDup [4] is a framework for video deduplication based on an application-level view of redundancy at the content level rather than at the byte level. But all this prior work focuses only on the effectiveness of deduplication in removing more redundancy, without considering the system overheads that determine the efficiency of the deduplication process.

In this paper, we propose ALG-Dedupe, an Application-aware Local-Global source deduplication scheme that not only exploits application awareness but also combines local and global duplicate detection, achieving high deduplication efficiency by reducing the deduplication latency to as low as that of application-aware local deduplication while saving as much cloud storage cost as application-aware global deduplication [6]. Our application-aware deduplication design is motivated by a systematic deduplication analysis of personal storage. We observe that there are significant differences among application types in the personal computing environment in terms of data redundancy, sensitivity to different chunking methods and independence in the deduplication process. Thus, the basic idea of ALG-Dedupe is to exploit these differences by treating different types of applications independently and adaptively during the local and global duplicate check processes, significantly improving deduplication efficiency and reducing system overhead.

We make several contributions in this paper. We propose a new metric, bytes saved per second, to measure the efficiency of different deduplication schemes on the same platform. We design an application-aware deduplication scheme that employs an intelligent data chunking method and an adaptive use of hash functions to minimize computational overhead and maximize deduplication effectiveness by exploiting application awareness. We combine local deduplication and global deduplication to balance the effectiveness and latency of deduplication. To relieve the disk index lookup bottleneck, we provide an application-aware index structure that suppresses redundancy independently and in parallel by dividing a central index into many independent small indices to optimize lookup performance. We also propose a data aggregation strategy at the client side that improves data transfer efficiency by grouping many small data packets into a single larger one for cloud storage. Our prototype implementation and real-dataset-driven evaluations show that ALG-Dedupe outperforms existing state-of-the-art source deduplication schemes in terms of backup window, energy efficiency and cost saving, owing to its high deduplication efficiency and low system overhead.

IV. APPLICATION-AWARE DEDUPLICATION PROCESS

ALG-Dedupe is designed to meet the requirement of deduplication efficiency with high deduplication effectiveness and low system overhead. The main idea of ALG-Dedupe is 1) to reduce computational overhead by exploiting both low-overhead local resources and high-overhead cloud resources, employing an intelligent data chunking scheme and an adaptive use of hash functions based on application awareness, and 2) to mitigate the on-disk index lookup bottleneck by dividing the full index into small, independent and application-specific indices in an application-aware index structure.
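
A sketch of that index-partitioning idea, assuming a simple extension-based application classifier; the class names and extension lists are invented for illustration:

```python
from collections import defaultdict

def app_class(filename: str) -> str:
    """Map a file to an application type by extension (simplified)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in ("zip", "jpg", "mp3", "mp4"):
        return "compressed"
    if ext in ("exe", "dll", "iso"):
        return "static-uncompressed"
    return "dynamic-uncompressed"   # documents, mail, source code, ...

class AppAwareIndex:
    """One small independent index per application type, so duplicate
    checks for different file types can proceed independently and in
    parallel instead of contending on a single central index."""

    def __init__(self):
        self._indices = defaultdict(dict)   # app type -> small index

    def is_duplicate(self, filename: str, fp: str) -> bool:
        return fp in self._indices[app_class(filename)]

    def insert(self, filename: str, fp: str, location) -> None:
        self._indices[app_class(filename)][fp] = location
```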

It combines local-global source deduplication with application awareness to improve deduplication effectiveness with low system overhead on the client side.

A. File Size Analysis

Most of the files in the PC dataset are tiny files of less than 10 KB, which account for a negligibly small percentage of the storage capacity. As our statistical evidence shows, about 60.3 percent of all files are tiny files, yet they account for only 1.7 percent of the total storage capacity of the dataset. To reduce metadata overhead, ALG-Dedupe filters out these tiny files in the file size filter before the deduplication process and, in the segment store, groups data from many tiny files together into larger units of about 1 MB each to increase data transfer efficiency over the WAN.

B. Data Chunking Process

The deduplication efficiency of a data chunking scheme differs greatly among applications. Depending on whether the file type is compressed and on whether SC can outperform CDC in deduplication efficiency, we divide files into three main categories: compressed files, static uncompressed files and dynamic uncompressed files. Dynamic files are routinely editable, while static files are generally uneditable. To strike a better tradeoff between duplicate elimination ratio and deduplication overhead, we deduplicate compressed files with whole file chunking (WFC), separate static uncompressed files into fixed-size chunks by static chunking (SC) with an ideal chunk size, and break dynamic uncompressed files into variable-sized chunks with an optimal average chunk size using content-defined chunking (CDC) based on Rabin fingerprinting to identify chunk boundaries; a CDC sketch follows below.
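
The sketch below shows CDC boundary detection. A real implementation would use a polynomial Rabin fingerprint; a simple additive rolling hash stands in for it here, and the window, mask and size limits are illustrative choices:

```python
WINDOW = 48                    # bytes in the sliding window
MASK = (1 << 13) - 1           # boundary pattern, ~8 KB average chunk
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def cdc_chunks(data: bytes):
    """Yield variable-sized chunks whose boundaries depend on content,
    so an insertion only shifts boundaries near the edited region."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # slide the window forward
        size = i - start + 1
        if (size >= MIN_CHUNK and (rolling & MASK) == MASK) or size >= MAX_CHUNK:
            yield data[start : i + 1]     # content-defined boundary
            start = i + 1
    if start < len(data):
        yield data[start:]                # trailing partial chunk
```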

C. Application-Aware Deduplicator

After data chunking in the intelligent chunker module, data chunks are deduplicated in the application-aware deduplicator by generating chunk fingerprints in the hash engine and detecting duplicate chunks in both the local client and the remote cloud. ALG-Dedupe strikes a good balance between alleviating computation overhead on the client side and avoiding hash collisions to preserve data integrity. For compressed files chunked with WFC, we employ an extended 12-byte Rabin hash value as the chunk fingerprint for local duplicate detection and an MD5 value for global duplicate detection. In both local and global detection, a SHA-1 value of the chunk serves as the chunk fingerprint of SC in static uncompressed files, while an MD5 value is used as the chunk fingerprint of dynamic uncompressed files, since chunk length provides another dimension for duplicate detection in CDC-based deduplication. To achieve high deduplication efficiency, the application-aware deduplicator first detects duplicate data in the application-aware local index corresponding to the local dataset, with low deduplication latency, at the PC client, and then compares locally deduplicated data chunks with all data stored in the cloud by looking up fingerprints in the application-aware global index on the cloud side for a high data reduction ratio. Only the data chunks that remain unique after global duplicate detection are stored in cloud storage with parallel container management.

D. Modified AES Algorithm

The Advanced Encryption Standard (AES) is an encryption standard adopted by the U.S. government. The standard comprises three block ciphers, AES-128, AES-192 and AES-256, adopted from a larger collection originally published as Rijndael. Each AES cipher has a 128-bit block size, with key sizes of 128, 192 and 256 bits, respectively. The AES ciphers have been analyzed extensively and are now used worldwide, as was the case with their predecessor, the Data Encryption Standard (DES); AES has now replaced DES as the preferred encryption standard. AES is a cryptographically secure encryption algorithm: a brute-force attack requires 2^128 trials for the 128-bit key size, and the structure of the algorithm and its round functions ensure high immunity to linear and differential cryptanalysis. No practical attacks against AES have succeeded to date, and it remains the current encryption standard. The AES design can be used in any application that requires protection of data during transmission through a communication network, including electronic commerce transactions, ATM machines and wireless communication. To increase the robustness of the AES algorithm, we use longer encryption keys and a larger data matrix; to keep the processing time low, we leave the complexity of the AES algorithm unchanged. The modified AES algorithm (MAES) works on data matrices. The encryption key has an equivalent length of about 384, 512, 768 or 1024 bits, and the modified algorithm is denoted accordingly: MAES-384, MAES-512, MAES-768 and MAES-1024.
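
The MAES construction (384- to 1024-bit keys over larger data matrices) is not specified in enough detail here to implement, so the sketch below uses standard AES-256-GCM from the Python cryptography package as a stand-in, showing only where encryption sits in the pipeline: chunks are encrypted after duplicate detection, so only unique data is ever encrypted and uploaded.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_chunk(key: bytes, chunk: bytes) -> bytes:
    """Encrypt one deduplicated chunk; the 12-byte nonce is prepended."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, chunk, None)

def decrypt_chunk(key: bytes, blob: bytes) -> bytes:
    """Recover a chunk during restore from the cloud backup."""
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)   # client-held key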

V. ISSUES ON REDUNDANT DATA CONTROL SCHEMES

The Application-aware Local-Global source Deduplication (ALG-Dedupe) scheme is used to control redundancy in cloud backups. A file size filter separates out the small files. An application-aware chunking strategy is used in the intelligent chunker to break up the backup data streams. The application-aware deduplicator deduplicates data chunks from the same type of files. The hash engine generates chunk fingerprints. The data redundancy check is carried out in application-aware indices in both the local client and the remote cloud. File metadata is updated with redundant chunk location details. Segments and the corresponding fingerprints are stored in the cloud data center using a self-describing data structure (container). The following problems are identified in the existing system: resource-constrained mobile devices are not supported, data security is not considered, deduplication is not applied to small files, and backup window size selection is not optimized.

VI. SECURITY ENSURED REDUNDANT DATA MANAGEMENT

The deduplication system is adapted for computer and smart phone clients. The system provides security for the backup data values, and small files are also included in the deduplication process. The system is divided into six major modules: Cloud Backup Server, Chunking Process, Block Level Deduplication, File Level Deduplication, Security Process and Deduplication in Smart Phones. The chunking process module splits each file into blocks; block signature generation and deduplication operations are carried out in the block level deduplication module; the file level deduplication module performs deduplication at the file level; the data security module protects the backup data values; and deduplication on mobile phones is handled by the Deduplication in Smart Phones module.

A. Cloud Backup Server

The cloud backup server module is designed to maintain the backup data for the clients.

Figure 1: Security Ensured Redundant Data Management Scheme (PC clients perform local deduplication and a security process before uploading chunks; the cloud server performs global deduplication across clients).

B. Chunking Process

The file size filter is used to separate out the tiny files, and the intelligent chunker breaks up the large files into chunks. Backup files are divided into three categories: compressed files, static uncompressed files and dynamic uncompressed files. Static files are uneditable and dynamic files are editable. Compressed files are chunked with the whole file chunking (WFC) mechanism. Static uncompressed files are partitioned into fixed-size chunks by static chunking (SC). Dynamic uncompressed files are broken into variable-sized chunks by content-defined chunking (CDC).

C. Block Level Deduplication

Chunk fingerprints are generated in the hash engine. A 12-byte Rabin hash value is used as the chunk fingerprint for local duplicate detection of compressed files, while the Message Digest (MD5) algorithm is used for their global deduplication. The Secure Hash Algorithm (SHA-1) is used for deduplication of uncompressed static files, and dynamic uncompressed files are hashed using MD5. Duplicate detection is carried out in the local client and the remote cloud: fingerprints are indexed at the local and global levels, and deduplication is performed by verifying the fingerprint index values, as the sketch below illustrates.
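
A sketch of this two-level check; the index objects and the upload call are placeholders rather than a real cloud API:

```python
def backup_chunk(fp: str, chunk: bytes, local_index, global_index, cloud):
    """First consult the client's local index; only locally unique chunks
    are checked against the cloud-side global index before upload."""
    if fp in local_index:                 # local duplicate: nothing to send
        return "local-duplicate"
    local_index[fp] = True
    if fp in global_index:                # another client already stored it
        return "global-duplicate"
    global_index[fp] = True
    cloud.upload(fp, chunk)               # truly unique: transfer over WAN
    return "unique"
```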

D. File Level Deduplication

Tiny files are maintained in the segment store. File level deduplication is performed on files smaller than 10 KB. File level fingerprints are generated using the Rabin hash function, and deduplication is performed by verifying the file level fingerprint index.

E. Security Process

The backup data values are maintained in encrypted form. The Modified Advanced Encryption Standard (MAES) algorithm is used in the encryption/decryption process. Encryption is performed after the deduplication process, and local and global keys are used for the data security process.

F. Deduplication in Smart Phones

The deduplication process is tuned for the smart phone environment, with smart phones acting as clients for cloud backup services. File level and block level deduplication tasks are supported by the system, and data security is also provided in the smart phone environment.

VII. CONCLUSION

Cloud storage media are used to manage public and private data values, and source deduplication methods are applied to limit the storage and communication requirements. The Application-aware Local-Global source Deduplication (ALG-Dedupe) mechanism performs redundancy filtering in both same-client and all-client environments. The Security ensured Application aware Local-Global source Deduplication (SALG-Dedupe) scheme is designed with security and mobile device support features. Deduplication and power efficiency are improved in the computer and smart device environments, the system reduces the cost of cloud backup services, and the data access rate is increased. The system achieves intra-client and inter-client redundancy elimination with high deduplication effectiveness.

REFERENCES

[1] P. Shilane, M. Huang, G. Wallace and W. Hsu, "WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression," in Proc. 10th USENIX Conf. on File and Storage Technologies (FAST), Feb. 2012.
[2] D. Bhagwat, K. Eshghi, D. D. Long and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup," in Proc. 17th IEEE Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Sept. 2009.
[3] D. T. Meyer and W. J. Bolosky, "A Study of Practical Deduplication," in Proc. 9th USENIX Conf. on File and Storage Technologies (FAST), Feb. 2011.
[4] A. Katiyar and J. Weissman, "ViDeDup: An Application-Aware Framework for Video De-Duplication," in Proc. 3rd USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2011.
[5] C. Ungureanu, A. Aranya and A. Bohra, "HydraFS: A High-Throughput File System for the HYDRAstor Content-Addressable Storage System," in Proc. 8th USENIX Conf. on File and Storage Technologies (FAST), Feb. 2010.
[6] C.-K. Chu, S. S. M. Chow and R. H. Deng, "Key-Aggregate Cryptosystem for Scalable Data Sharing in Cloud Storage," IEEE Trans. on Parallel and Distributed Systems, vol. 25, no. 2, Feb. 2014.
[7] M. Dewaikar, "Symantec NetBackup PureDisk: Optimizing Backups with Deduplication for Remote Offices, Data Center and Virtual Machines," Symantec white paper, tbackup_puredisk_wp-en-us.pdf, September.
[8] C. Dubnicki, L. Gryz, L. Heldt, C. Ungureanu and M. Welnicki, "HYDRAstor: A Scalable Secondary Storage," in Proc. 7th USENIX Conf. on File and Storage Technologies (FAST), Feb. 2009.
[9] E. Kruus, C. Ungureanu and C. Dubnicki, "Bimodal Content Defined Chunking for Backup Streams," in Proc. 8th USENIX Conf. on File and Storage Technologies (FAST), Feb. 2010.
[10] T. Yang, D. Feng, Z. Niu, K. Zhou and Y. Wan, "DEBAR: A Scalable High-Performance Deduplication Storage System for Backup and Archiving," in Proc. IEEE Int. Symp. on Parallel & Distributed Processing (IPDPS), 2010.
[11] EMC Corporation, "Efficient Data Protection with EMC Avamar Global Deduplication Software," white paper, collateral/software/white-papers/h2681-efdta-prot-avamar.pdf, July.
