Storage Optimization in Cloud Environment using Compression Algorithm

K. Govinda 1, Yuvaraj Kumar 2
1 School of Computing Science and Engineering, VIT University, Vellore, India, kgovinda@vit.ac.in
2 School of Computing Science and Engineering, VIT University, Vellore, India, yuva.murak@gmail.com

Abstract: Cloud storage provides users with storage space and user-friendly, timely access to data, and it is the foundation of all kinds of cloud applications. However, there has been little in-depth study of how to optimize cloud storage to improve data access performance. In this environment, consumers are billed for what they use, a model generally called pay-as-you-go: if you use a service for an hour, you pay for that hour, and each service has its own cost. There are currently various cloud computing service providers, such as Amazon, Google and IBM. Many companies are shifting to cloud computing architectures because of the space, speed and resource availability they offer, and because customers pay only for the services they consume. Optimizing storage to reduce the OPEX of the cloud consumer is a challenging task. In this paper we propose a storage optimization mechanism, based on compression, to reduce the storage space used over the cloud.

Keywords: Optimization, Storage, OPEX, LZW and LZ78.

1. INTRODUCTION

Cloud computing is a new form of distributed computing that follows grid computing and pervasive computing. Its aim is to build a virtual infrastructure providing users with remote computing and storage capacity [1-3]. Since 2006 a number of successful cloud facilities have appeared, such as Amazon's Elastic Compute Cloud [4], IBM's Blue Cloud [5], Nimbus [6], OpenNebula [7] and Google's App Engine [9]. Cloud storage is a kind of cloud computing: it provides space for data storage together with a user-friendly and timely access path, as in the Simple Storage Service (S3) built on Amazon EC2 and the Google File System [8]. The greatest advantage of cloud storage is that it lets users access their data at any time. In a cloud system, the storage management system automatically analyses the user's requirements and locates and transforms data, which greatly facilitates the user. But this places high demands on the cloud management system itself. For example, a service failure occurred in S3 in July 2008 and lasted for eight hours, causing online companies relying on S3 to suffer great losses. The cause of the failure was that S3 could not effectively route users' requests to the appropriate physical storage servers. Cloud storage must therefore be optimized to ensure data storage and access efficiency.

The rest of the paper is organized as follows. Section 2 describes different data compression techniques, Section 3 describes the proposed LZW method, and Section 4 describes the implementation, followed by the conclusion.

Figure 1 Cloud Storage Scenario
2 LITERATURE REVIEW

2.1 Huffman Coding

Huffman coding [11] is an entropy encoding algorithm used for lossless data compression. It uses a specific method for choosing the representation of each symbol, resulting in a prefix-free code that expresses the most common characters with shorter bit strings than those used for less common symbols. Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix-free codes tend to be slightly inefficient on small alphabets, where symbol probabilities often fall between these optimal points. "Blocking", that is, expanding the alphabet by coalescing multiple symbols into fixed- or variable-length "words" before Huffman coding, usually helps, especially when adjacent symbols are correlated.

Prediction by Partial Matching (PPM) [12, 13] is an adaptive statistical data compression technique based on context modeling and prediction. PPM predicts the probability of a character from a given number of characters that immediately precede it; predictions are usually reduced to symbol rankings. The number of previous symbols, n, determines the order of the PPM model, denoted PPM(n). Unbounded variants, where the context has no length limit, also exist and are denoted PPM*. If no prediction can be made from all n context symbols, a prediction is attempted with just n-1 symbols. This process is repeated until a match is found or no more symbols remain in the context, at which point a fixed prediction is made. PPM is conceptually simple but often computationally expensive. Much of the work in optimizing a PPM model goes into handling symbols that have not already occurred in the input stream [13]. The obvious approach is to create a "never-seen" symbol that triggers an escape sequence, but what probability should be assigned to a symbol that has never been seen? This is called the zero-frequency problem.
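To make the greedy construction behind Huffman coding (Section 2.1) concrete, the following minimal Java sketch builds a prefix-free code from symbol frequencies with a min-heap. It is our own illustration, not code from the paper; class and method names are hypothetical, and it assumes at least two distinct symbols.

```java
import java.util.*;

// Illustrative Huffman sketch: repeatedly merge the two rarest subtrees,
// then read off each leaf's root-to-leaf path as its bit string.
public class HuffmanSketch {
    static final class Node {
        final char sym; final long freq; final Node left, right;
        Node(char sym, long freq, Node left, Node right) {
            this.sym = sym; this.freq = freq; this.left = left; this.right = right;
        }
        boolean leaf() { return left == null; }
    }

    // Returns symbol -> bit string; assumes freqs has at least two entries.
    static Map<Character, String> buildCodes(Map<Character, Long> freqs) {
        PriorityQueue<Node> heap =
            new PriorityQueue<>(Comparator.comparingLong((Node n) -> n.freq));
        for (Map.Entry<Character, Long> e : freqs.entrySet())
            heap.add(new Node(e.getKey(), e.getValue(), null, null));
        while (heap.size() > 1) {              // merge two least frequent trees
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node('\0', a.freq + b.freq, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        walk(heap.poll(), "", codes);
        return codes;
    }

    private static void walk(Node n, String prefix, Map<Character, String> out) {
        if (n.leaf()) { out.put(n.sym, prefix); return; }
        walk(n.left, prefix + "0", out);       // left edge emits a 0 bit
        walk(n.right, prefix + "1", out);      // right edge emits a 1 bit
    }

    public static void main(String[] args) {
        Map<Character, Long> freqs =
            Map.of('a', 45L, 'b', 13L, 'c', 12L, 'd', 16L, 'e', 9L, 'f', 5L);
        System.out.println(buildCodes(freqs)); // frequent 'a' gets the shortest code
    }
}
```

With these frequencies the most common symbol 'a' receives a one-bit code while the rarest symbol 'f' receives four bits, illustrating the shorter-codes-for-common-symbols property described above.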
PPM compression implementations vary greatly in other details. The actual symbol selection is usually recorded with arithmetic coding, though it is also possible to use Huffman coding or even some type of dictionary coding technique. The underlying model used in most PPM algorithms can also be extended to predict multiple symbols. The symbol size is usually static, typically a single byte, which makes generic handling of any file format easy.

2.2 LZ77

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data appearing earlier in the uncompressed input stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream". To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 KB, 4 KB or 32 KB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding-window compression. The encoder keeps this data to look for matches, and the decoder keeps it to interpret the matches the encoder refers to; the larger the sliding window, the farther back the encoder may search when creating references. It is not only acceptable but frequently useful to allow a length-distance pair to specify a length that actually exceeds the distance. As a copy command this is puzzling: "go back four characters and copy ten characters from that position into the current position" [10]. How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as each byte is copied over, it may be fed again as input to the copy command.
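The byte-at-a-time copy just described can be sketched in a few lines of Java (our own illustration, not the paper's code): because each appended byte immediately becomes part of the buffer, a length larger than the distance simply repeats the available bytes.

```java
// Sketch of an overlapping LZ77 length-distance copy: bytes written earlier
// in this same copy become valid sources for the bytes that follow.
public class Lz77Copy {
    // Appends `length` characters, each taken `distance` positions back in `out`.
    static void copy(StringBuilder out, int distance, int length) {
        int start = out.length() - distance;
        for (int i = 0; i < length; i++)
            out.append(out.charAt(start + i)); // may read a char we just wrote
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder("abcd");
        copy(out, 4, 10);          // "go back four characters and copy ten"
        System.out.println(out);   // prints "abcdabcdabcdab"
    }
}
```

The ten copied characters are just "abcdabcdab": the four buffered characters pasted repeatedly until the requested length is filled.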
When the copy-from position reaches the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from region. The operation is thus equivalent to the statement "copy the data you were given and repeatedly paste it until it fits".

2.3 LZ78

LZ78 algorithms achieve compression by replacing repeated occurrences of data with references to a dictionary that is built from the input data stream. Each dictionary entry is of the form dictionary[...] = {index, character}, where index refers to a previous dictionary entry and character is appended to the string represented by dictionary[index]. For example, "abc" would be stored (in reverse order) as follows: dictionary[k] = {j, 'c'}, dictionary[j] = {i, 'b'}, dictionary[i] = {0, 'a'}, where an index of 0 marks the end of a string. The algorithm initializes last matching index = 0 and next available index = 1. For each character of the input stream, the dictionary is searched for a match {last matching index, character}. If a match is found, last matching index is set to the index of the matching entry, and nothing is output. If no match is found, a new dictionary entry dictionary[next available index] = {last matching index, character} is created; the algorithm outputs last matching index followed by character, then resets last matching index = 0 and increments next available index. Once the dictionary is full, no more entries are added. When the end of the input stream is reached, the algorithm outputs last matching index. It is important to note that the strings stored in the dictionary are in reversed order [14-16]. LZW is an LZ78-based algorithm that uses a dictionary pre-initialized with all possible symbols.
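The LZ78 procedure just described can be rendered as a short Java sketch. This is our own minimal rendering under the stated scheme (identifier names such as Lz78Sketch are ours, and dictionary capacity limits are omitted): each output token is the index of the longest previous match followed by the mismatching character.

```java
import java.util.*;

// Illustrative LZ78 compressor: the dictionary maps
// {last matching index, character} pairs to new entry indices.
public class Lz78Sketch {
    // Returns tokens as {index, character}; a trailing {index, 0} flushes
    // an unfinished match at end of input.
    static List<int[]> compress(String input) {
        Map<String, Integer> dict = new HashMap<>(); // key "index:char" -> entry index
        List<int[]> out = new ArrayList<>();
        int next = 1, last = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Integer hit = dict.get(last + ":" + c);
            if (hit != null) {
                last = hit;                    // match found: extend, output nothing
            } else {
                dict.put(last + ":" + c, next++); // new entry {last, c}
                out.add(new int[]{last, c});   // output match index + character
                last = 0;
            }
        }
        if (last != 0) out.add(new int[]{last, 0}); // output final matching index
        return out;
    }

    public static void main(String[] args) {
        for (int[] t : compress("abab"))
            System.out.println(t[0] + "," + (char) t[1]);
    }
}
```

For the input "abab" this emits (0,'a'), (0,'b'), (1,'b'): the final token reuses dictionary entry 1 ("a") and extends it with 'b'.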
The main improvement of LZW is that when a match is not found, the current input character is assumed to be the first character of an existing string in the dictionary (since the dictionary is initialized with all
possible characters), so only the last matching index is output (which may be the pre-initialized dictionary index corresponding to the previous symbol).

3 PROPOSED METHOD

LZW compression generally replaces a set of characters with a single code. The compression method never interprets the input text; instead, LZW builds a table known as the string translation table from the text being compressed, which assigns fixed-length codes to strings. The translation table is initialized with all single-character strings. Each time a previously encountered string is read from the input, the longest such string is identified, and the code for that string concatenated with the extension character is stored in the table [17]. The code for the longest previously encountered string is output, and the extension character is used as the beginning of the next word. Compression occurs because the translation table lets a whole string of characters be output as a single code, as shown in Table 1. Although LZW is often explained in the context of compressing text files, it can be used on any type of file; however, it generally performs best on files with repeated substrings, such as text files.

Table 1 The string table after the compression phase over the string /BAT/BE/BAR/BATS (input string = /BAT/BE/BAR/BATS)

Character Input | Code Output | New Code Value | New String
/B              | /           | 256            | /B
A               | B           | 257            | BA
T               | A           | 258            | AT
/               | T           | 259            | T/
BE              | 256         | 260            | /BE
/               | E           | 261            | E/
BA              | 256         | 262            | /BA
R               | A           | 263            | AR
/               | R           | 264            | R/
BAT             | 262         | 265            | /BAT
S               | T           | 266            | TS
EOF             | S           |                |

3.1 Algorithm - LZW_COMPRESS
1. STRING = get input character
2. WHILE there are still input characters DO
3.   CHAR = get input character
4.   IF STRING+CHAR is in the string table THEN
5.     STRING = STRING + CHAR
6.   ELSE
7.     output the code for STRING
8.     add STRING+CHAR to the string table
9.     STRING = CHAR
10.  END of IF
11. END of WHILE
12. Output the code for STRING

3.2 Algorithm - LZW_DECOMPRESS
1. Read O_CODE
2. Output O_CODE
3. WHILE there are still input codes DO
4.   Read N_CODE
5.   STRING = get translation of N_CODE
6.   Output STRING
7.   CHAR = first character in STRING
8.   Add O_CODE + CHAR to the translation table
9.   O_CODE = N_CODE
10. END of WHILE

3.3 Data Flow Diagram (DFD)

Figure 2 Data Flow Diagram

4 IMPLEMENTATION

We implemented the LZW algorithm in Java and achieved around 50% compression, as shown in
Fig. 3, Fig. 4 and Fig. 5. Fig. 3 shows the size of the file before compression, Fig. 4 shows the size of the file after compression, and Fig. 5 shows the size of the file after decompression.

Figure 3 Before compression
Figure 4 After compression
Figure 5 After decompression

5 CONCLUSION

We conclude that LZW can be used in a cloud environment to compress data during storage, so that data transfer time and storage space are reduced. Compression and decompression speed depends on the processor and compiler, and both keep improving as technology advances. Our implementation also shows that LZW incurs no overhead for transmitting a code table, because the decompressor starts from a dictionary predefined with the ASCII values. LZW can therefore be successfully integrated with the cloud.

References
[1] A. Weiss. Computing in the Clouds. netWorker, 11(4):16-25, 2007.
[2] R. Buyya, C. S. Yeo, S. Venugopal, et al. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25:599-616, 2009.
[3] Twenty experts define cloud computing. http://cloudcomputing.sys-con.com/read/612375_p.htm, accessed 18 July 2008.
[4] Amazon Inc. Amazon Web Services EC2 site. http://aws.amazon.com/ec2, 2008.
[5] IBM Blue Cloud project. http://www-03.ibm.com/press/us/en/pressrelease/22613.wss/, accessed June 2008.
[6] Nimbus project. http://workspace.globus.org/clouds/nimbus.html/, 2008.
[7] OpenNebula project. http://www.opennebula.org/, accessed April 2008.
[8] S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. In Proceedings of the 19th ACM
Symposium on Operating Systems Principles, pages 29-43, 2003.
[9] Google App Engine. http://appengine.google.com/, accessed June 2008.
[10] LZSS (LZ77) Discussion and Implementation. Online article.
[11] D. A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the I.R.E., pp. 1098-1102, September 1952.
[12] T. Bell, J. Cleary, and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402, 1984.
[13] A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11):1917-1921, November 1990.
[14] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[15] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, September 1978.
[16] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Digital Systems Research Center, Research Report 124, May 1994.
[17] T. A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8-19, 1984.

Author's Profile
Mr. K. Govinda is a Ph.D. scholar and Assistant Professor (SG) in the School of Computing Science and Engineering, VIT University, Vellore, Tamil Nadu. He has more than X years of teaching experience, and his areas of interest are databases, distributed databases, data warehousing and mining, and cloud computing.
Yuvaraj Kumar received the M.Sc. degree in Computer Science from VIT University in 2012.