Storage Optimization in Cloud Environment using Compression Algorithm




K. Govinda 1, Yuvaraj Kumar 2
1 School of Computing Science and Engineering, VIT University, Vellore, India. kgovinda@vit.ac.in
2 School of Computing Science and Engineering, VIT University, Vellore, India. yuva.murak@gmail.com

Abstract: Cloud storage provides users with storage space and a user-friendly, timely way to acquire data, and it is the foundation of all kinds of cloud applications. However, there has been little deep study of how to optimize cloud storage to improve data access performance over the cloud. In this environment, consumers are billed for what they use, which is generally called pay-as-you-go: if you use a service for an hour, you pay for that hour, and each service has its own cost. Currently there are various cloud computing service providers, such as Amazon, Google and IBM. Many professional companies are shifting to cloud computing environments because of the space, speed and resource availability they offer, and because consumers pay only for the services they have consumed. Reducing OPEX for the cloud consumer through optimization is a challenging task. In this paper we propose a storage optimization mechanism to reduce the storage space used over the cloud.

Keywords: Optimization, Storage, OPEX, LZW, LZ78.

1. INTRODUCTION

Cloud computing is a new form of distributed computing, following grid computing and pervasive computing. Its aim is to build a virtual infrastructure providing users with remote computing and storage capacity [1-3]. Since 2006 there have been several successful cloud facilities, such as Amazon's Elastic Compute Cloud [3], IBM's Blue Cloud [5], Nimbus [6], OpenNebula [7] and Google App Engine [8]. Cloud storage is a kind of cloud computing: it provides space for data storage and a user-friendly, timely way for users to access it, for example the Simple Storage Service (S3) built on Amazon EC2, and the Google File System [9]. The greatest advantage of cloud storage is that it enables users to access their data at any time. In a cloud system, the storage management system automatically analyses user requirements and locates and transforms data, which greatly facilitates users. But this places high demands on the cloud management system itself. For example, a service failure occurred in the Simple Storage Service (S3) in July 2008 and lasted for eight hours, causing great losses to online companies relying on S3. The cause of the failure was that the S3 system could not effectively route user requests to the appropriate physical storage servers. Therefore, cloud storage must be optimized to ensure data storage and access efficiency.

Figure 1 Cloud Storage Scenario

The rest of the paper is organized as follows. Section 2 describes different data compression techniques, Section 3 describes the proposed LZW method, and Section 4 describes the implementation, followed by the conclusion.

2 LITERATURE REVIEW

2.1 Huffman Coding

Huffman coding [11] is an entropy encoding algorithm used for lossless data compression. It uses a specific method for choosing the representation of each symbol, resulting in a prefix-free code that expresses the most common characters with shorter bit strings than those used for less common source symbols. Huffman coding is optimal when the probability of each input symbol is a (negative) power of two. Prefix-free codes tend to be slightly inefficient on small alphabets, where probabilities often fall between these optimal points. "Blocking", or expanding the alphabet size by coalescing multiple symbols into "words" of fixed or variable length before Huffman coding, usually helps, especially when adjacent symbols are correlated.

Prediction by Partial Matching (PPM) [12, 13] is an adaptive statistical data compression technique based on context modeling and prediction. In general, PPM predicts the probability of a given character based on a given number of characters that immediately precede it. Predictions are usually reduced to symbol rankings. The number of previous symbols, n, determines the order of the PPM model, denoted PPM(n). Unbounded variants, where the context has no length limit, also exist and are denoted PPM*. If no prediction can be made from all n context symbols, a prediction is attempted with just n-1 symbols. This process is repeated until a match is found or no symbols remain in the context, at which point a fixed prediction is made. PPM is conceptually simple but often computationally expensive. Much of the work in optimizing a PPM model is handling symbols that have not already occurred in the input stream [13]. The obvious way to handle them is to create a "never-seen" symbol which triggers an escape sequence. But what probability should be assigned to a symbol that has never been seen? This is called the zero-frequency problem.
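The escape-and-fall-back behaviour just described can be sketched as a toy order-n frequency model. This is not a full PPM coder (the arithmetic-coding stage and probability estimates are omitted), and the class and method names are our own, not from any cited implementation:

```java
import java.util.*;

// Toy PPM-style context model: predict the next symbol from the longest
// matching context, escaping to shorter contexts on a miss.
class PpmSketch {
    private final int order;
    // context string -> (next symbol -> count)
    private final Map<String, Map<Character, Integer>> counts = new HashMap<>();

    PpmSketch(int order) { this.order = order; }

    // Update counts for every context length 0..order, as PPM does.
    void update(String history, char next) {
        int start = Math.max(0, history.length() - order);
        for (int i = start; i <= history.length(); i++) {
            counts.computeIfAbsent(history.substring(i), k -> new HashMap<>())
                  .merge(next, 1, Integer::sum);
        }
    }

    // Try contexts of length n, n-1, ..., 0; on total failure make the
    // "fixed prediction" the text mentions (here just '?').
    char predict(String history) {
        int start = Math.max(0, history.length() - order);
        for (int i = start; i <= history.length(); i++) {
            Map<Character, Integer> m = counts.get(history.substring(i));
            if (m != null && !m.isEmpty()) {
                return Collections.max(m.entrySet(),
                        Map.Entry.comparingByValue()).getKey();
            }
        }
        return '?';
    }

    // Train on a string symbol by symbol.
    static PpmSketch train(String text, int order) {
        PpmSketch m = new PpmSketch(order);
        for (int i = 0; i < text.length(); i++) {
            m.update(text.substring(0, i), text.charAt(i));
        }
        return m;
    }
}
```

After training on "abcabc" with order 2, the context "ab" predicts 'c' directly, while an unseen context such as "ca" still resolves through a shorter suffix; a real PPM coder would emit an explicit escape symbol at each fallback step instead of silently recursing.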
PPM compression implementations vary greatly in other details. The actual symbol selection is usually recorded using arithmetic coding, though it is also possible to use Huffman encoding or even some type of dictionary coding technique. The underlying model used in most PPM algorithms can also be extended to predict multiple symbols. The symbol size is usually static, typically a single byte, which makes generic handling of any file format easy.

2.2 LZ77

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data appearing earlier in the uncompressed input stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream". To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 KB, 4 KB, or 32 KB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding-window compression. The encoder needs this data to look for matches, and the decoder needs it to interpret the matches the encoder refers to. The larger the sliding window, the further back the encoder may search for references. It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command this is puzzling: "go back four characters and copy ten characters from that position into the current position" [10]. How can ten characters be copied when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over it may be fed again as input to the copy command. When the copy-from position reaches the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from region. The operation is thus equivalent to the statement "copy the data you were given and repeatedly paste it until it fits".

2.3 LZ78

LZ78 algorithms achieve compression by replacing repeated occurrences of data with references to a dictionary that is built from the input data stream. Each dictionary entry has the form dictionary[...] = {index, character}, where index refers to a previous dictionary entry and character is appended to the string represented by dictionary[index]. For example, "abc" would be stored (in reverse order) as follows: dictionary[k] = {j, 'c'}, dictionary[j] = {i, 'b'}, dictionary[i] = {0, 'a'}, where an index of 0 marks the end of a string. The algorithm initializes last matching index = 0 and next available index = 1. For each character of the input stream, the dictionary is searched for a match: {last matching index, character}. If a match is found, last matching index is set to the index of the matching entry and nothing is output. If a match is not found, a new dictionary entry is created: dictionary[next available index] = {last matching index, character}; the algorithm outputs last matching index followed by character, then resets last matching index = 0 and increments next available index. Once the dictionary is full, no more entries are added. When the end of the input stream is reached, the algorithm outputs last matching index. It is very important to note that the strings stored in the dictionary are in reversed order [14-16]. LZW is an LZ78-based algorithm that uses a dictionary pre-initialized with all possible symbols.
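Before turning to LZW, the plain LZ78 scheme described above can be sketched in a few lines of Java. The class and token layout are our own (the decoder, which rebuilds each reversed string by chasing indices back to 0, is omitted):

```java
import java.util.*;

// Sketch of LZ78 as described above: the dictionary stores {index, char}
// pairs, and the output is a list of (last matching index, character)
// tokens plus a final trailing index.
class Lz78Sketch {

    // Each token is int[]{last matching index, character code};
    // the final token uses -1 as the character to mean "end of stream".
    static List<int[]> compress(String input) {
        Map<String, Integer> dict = new HashMap<>(); // "index,char" -> entry index
        List<int[]> out = new ArrayList<>();
        int last = 0, next = 1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Integer hit = dict.get(last + "," + c);
            if (hit != null) {
                last = hit;                       // match found: output nothing
            } else {
                dict.put(last + "," + c, next++); // new entry {last, c}
                out.add(new int[]{last, c});
                last = 0;
            }
        }
        out.add(new int[]{last, -1});             // end: output last matching index
        return out;
    }
}
```

For the input "abab" this produces the tokens (0,'a'), (0,'b'), (1,'b') and a final index 0: the second "ab" matches entry 1 ('a'), so only the mismatching 'b' forces an output.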
The main improvement of LZW is that when a match is not found, the current input-stream character is assumed to be the first character of an existing string in the dictionary (since the dictionary is initialized with all possible characters), so only the last matching index is output (which may be the pre-initialized dictionary index corresponding to the previous character).

3 PROPOSED METHOD

LZW compression generally replaces a string of characters with a single code. The compression method never analyses the input text; instead, LZW builds a string translation table from the text being compressed, which assigns fixed-length codes to strings. The translation table is initialized with all single-character strings. Each time a previously-encountered string is read from the input, the longest such string is identified, and the code for that string concatenated with the extension character is stored in the table [17]. The code for the longest previously-encountered string is output, and the extension character is used as the beginning of the next word. Compression occurs because the translation table lets a whole string of characters be emitted as a single output code, as shown in Table 1. Although LZW is often explained in the context of compressing text files, it can be used on any type of file; however, it generally performs best on files with repeated substrings, such as text files.

Table 1 The string table after the compression phase over the string /BAT/BE/BAR/BATS (Input String = /BAT/BE/BAR/BATS)

Character Input | Code Output | New Code Value | New String
/B              | /           | 256            | /B
A               | B           | 257            | BA
T               | A           | 258            | AT
/               | T           | 259            | T/
BE              | 256         | 260            | /BE
/               | E           | 261            | E/
BA              | 256         | 262            | /BA
R               | A           | 263            | AR
/               | R           | 264            | R/
BAT             | 262         | 265            | /BAT
S               | T           | 266            | TS
EOF             | S           |                |

3.1 Algorithm - LZW_COMPRESS
1. STRING = get input character
2. WHILE there are still input characters DO
3.   CHAR = get input character
4.   IF STRING+CHAR is in the string table THEN
5.     STRING = STRING + CHAR
6.   ELSE
7.     output the code for STRING
8.     add STRING+CHAR to the string table
9.     STRING = CHAR
10.  END of IF
11. END of WHILE
12. Output the code for STRING

3.2 Algorithm - LZW_DECOMPRESS
1. Read O_CODE
2. output O_CODE
3. WHILE there are still input characters DO
4.   Read N_CODE
5.   STRING = get translation of N_CODE
6.   output STRING
7.   CHAR = first character in STRING
8.   add O_CODE + CHAR to the translation table
9.   O_CODE = N_CODE
10. END of WHILE

3.3 Data Flow Diagram (DFD)

Figure 2 Data Flow Diagram

4 IMPLEMENTATION

We implemented the LZW algorithm in Java and achieved around 50% compression.
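The LZW_COMPRESS and LZW_DECOMPRESS routines can be sketched directly in Java (the paper reports a Java implementation, but its source is not given; the class and method names below are ours). The sketch assumes single-byte symbols and a table pre-initialized with codes 0-255:

```java
import java.util.*;

// Sketch of the LZW_COMPRESS / LZW_DECOMPRESS pseudocode; input is assumed
// non-empty and limited to characters with codes below 256.
class LzwSketch {

    static List<Integer> compress(String input) {
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < 256; i++) table.put("" + (char) i, i); // pre-initialize
        int next = 256;
        List<Integer> out = new ArrayList<>();
        String s = "" + input.charAt(0);              // 1. STRING = first char
        for (int i = 1; i < input.length(); i++) {    // 2. WHILE input remains
            char c = input.charAt(i);                 // 3. CHAR = next char
            if (table.containsKey(s + c)) {           // 4. STRING+CHAR in table?
                s = s + c;                            // 5. extend STRING
            } else {
                out.add(table.get(s));                // 7. output code for STRING
                table.put(s + c, next++);             // 8. add STRING+CHAR
                s = "" + c;                           // 9. STRING = CHAR
            }
        }
        out.add(table.get(s));                        // 12. flush the last STRING
        return out;
    }

    static String decompress(List<Integer> codes) {
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < 256; i++) table.put(i, "" + (char) i);
        int next = 256;
        String old = table.get(codes.get(0));         // 1-2. read and emit O_CODE
        StringBuilder out = new StringBuilder(old);
        for (int i = 1; i < codes.size(); i++) {      // 3. WHILE codes remain
            int code = codes.get(i);                  // 4. read N_CODE
            // 5. translate N_CODE; the code may name the entry being built,
            //    in which case it must be old + first char of old
            String entry = table.containsKey(code)
                    ? table.get(code)
                    : old + old.charAt(0);
            out.append(entry);                        // 6. output STRING
            table.put(next++, old + entry.charAt(0)); // 8. add O_CODE + CHAR
            old = entry;                              // 9. O_CODE = N_CODE
        }
        return out.toString();
    }
}
```

Running compress on "/BAT/BE/BAR/BATS" reproduces the code column of Table 1 (the literal codes 47, 66, 65, 84, 69, 82, 83 are the ASCII values of '/', 'B', 'A', 'T', 'E', 'R', 'S', interleaved with the dictionary codes 256 and 262), and decompress recovers the original string without any key being transmitted.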

Fig. 3 shows the size of the file before compression, Fig. 4 shows the size of the file after compression, and Fig. 5 shows the size of the file after uncompression.

Figure 3 Before compression

Figure 4 After compression

Figure 5 After uncompression

5 CONCLUSION

We conclude that LZW can be used in a cloud environment to compress data during storage, so that data transfer time and storage space are reduced accordingly. Compression and decompression speed depends on the compiler and processor; as technology advances and new processors arrive, this speed keeps increasing. From the implementation we can also conclude that LZW has no overhead of sending a key, because the decompression table is predefined with the ASCII values. LZW can therefore be successfully integrated with the cloud.

References
[1] Weiss. Computing in the Clouds. netWorker, 11(4):16-25, 2007.
[2] R. Buyya, C. S. Yeo, S. Venugopal, et al. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25:599-616, 2009.
[3] Twenty experts define cloud computing. http://cloudcomputing.sys-con.com/read/612375_p.htm (18.07.08).
[4] Amazon Inc. Amazon Web Services EC2 site. http://aws.amazon.com/ec2, 2008.
[5] IBM Blue Cloud project. http://www-03.ibm.com/press/us/en/pressrelease/22613.wss/, accessed June 2008.
[6] Nimbus Project. http://workspace.globus.org/clouds/nimbus.html/, 2008.
[7] OpenNebula Project. http://www.opennebula.org/, accessed Apr. 2008.
[8] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In Proceedings of the 19th ACM

Symposium on Operating Systems Principles, pages 29-43, 2003.
[9] Google App Engine. http://appengine.google.com/, accessed June 2008.
[10] LZSS (LZ77) Discussion and Implementation. (Offline copy: C:\Documents and Settings\DELL\Local Settings\temp\IM\LZSS (LZ77) Discussion and Implementation.mht)
[11] D. A. Huffman. A method for the construction of minimum redundancy codes. Proceedings of the I.R.E., pp. 1098-1102, September 1952.
[12] T. Bell, J. Cleary, and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402, 1984.
[13] A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11):1917-1921, November 1990.
[14] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[15] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, September 1978.
[16] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Digital Systems Research Center, Research Report 124, May 1994.
[17] T. A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8-19, 1984.

Author's Profile

Mr. K. Govinda, Ph.D Scholar and A.P (SG) in the School of Computing Science and Engineering, VIT University, Vellore, Tamil Nadu. He has more than X years of teaching experience, and his areas of interest are Databases, Distributed Databases, Data Warehousing & Mining, and Cloud Computing.

Yuvaraj Kumar received the M.Sc. degree in Computer Science from VIT University in 2012.