International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 3 (2013), pp. 139-146
International Research Publications House http://www.irphouse.com/ijict.htm

Analysis and Comparison of Algorithms for Lossless Data Compression

Hyderabad, INDIA.

Abstract

Data compression is a technique used to reduce the size of a file. Its goal is to eliminate the redundancy in a file's code in order to reduce the file's size. It is useful for reducing data storage space and the time needed to transmit data. Data compression can be either lossless or lossy. Lossless data compression recreates the exact original data from the compressed data, while lossy data compression cannot regenerate the original data perfectly. Lossy methods are mainly used for compressing sound, images, and video. Many data compression algorithms are available for files of different formats. This paper discusses and compares a selected set of lossless data compression algorithms.

Keywords: Data compression, Lossless Compression, Lossy Compression, Huffman Coding, Arithmetic Coding, Run Length Encoding.

1. Introduction

Data compression is the art of representing information in compact form. It reduces file size, which in turn reduces the required storage space and makes the transmission of data quicker. Compression techniques try to find redundant data and remove these redundancies. Data compression can be divided into two broad classes: lossless data compression and lossy data compression. In lossless compression, the exact original data can be recovered from the compressed data. It is used when any difference between the original data and the decompressed data cannot be tolerated. Medical images, text needed for legal purposes, and computer executable files are compressed using lossless
compression techniques. Lossy compression, as the name suggests, involves loss of information. It is used in applications where imperfect reconstruction is not an issue. Video and audio are compressed using lossy compression. The extremely fast growth of data that must be stored and transferred has increased the demand for better transmission and storage techniques. Various lossless data compression algorithms have been proposed and used; Huffman Coding, Arithmetic Coding, the Shannon-Fano algorithm, and Run Length Encoding are some of the techniques in use. This paper examines Huffman Coding, Arithmetic Coding, and Run Length Encoding.

2. Run Length Encoding

Run Length Encoding (RLE) is the simplest of the data compression algorithms. It replaces a run of two or more occurrences of the same character with a number representing the length of the run, followed by the character itself. Single characters are coded as runs of length 1. The main task of this algorithm is to identify the runs in the source file and to record the symbol and length of each run. The algorithm uses those runs to compress the original source file, while non-run symbols are kept unchanged.

Example of RLE:
Input: AAABBCCCCD
Output: 3A2B4C1D

3. Huffman Coding

The Huffman coding algorithm was developed by David Huffman in 1951. Huffman coding is an entropy encoding algorithm used for lossless data compression. In this algorithm, fixed-length codes are replaced by variable-length codes. When using variable-length code words, it is desirable to create a prefix code, avoiding the need for a separator to determine codeword boundaries. Huffman Coding uses such a prefix code. The Huffman procedure works as follows:
1. Symbols with a high frequency are expressed using shorter encodings than symbols that occur less frequently.
2. The two symbols that occur least frequently are assigned codewords of the same length.
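The run-length scheme of Section 2 can be sketched as follows; this is a minimal illustration, and the function names `rle_encode` and `rle_decode` are our own, not from the paper:

```python
import re

def rle_encode(text):
    """Replace each run of repeated characters with <count><char>."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                        # extend the current run
        out.append(f"{j - i}{text[i]}")   # record run length, then the symbol
        i = j
    return "".join(out)

def rle_decode(encoded):
    """Invert rle_encode: expand every <count><char> pair."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))
```

For instance, `rle_encode("AAABBCCCCD")` returns `"3A2B4C1D"`, matching the example in Section 2.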
The Huffman algorithm uses a greedy approach, i.e., at each step the algorithm chooses the best available option. A binary tree is built from the bottom up. To see how Huffman Coding works, let's take an example. Assume that the characters in a file to be compressed have the following frequencies:

A: 25  B: 10  C: 99  D: 87  E: 9  F: 66

The process of building this tree is:
1. Create a leaf node for each symbol and arrange the nodes in order from highest to lowest frequency.
C:99  D:87  F:66  A:25  B:10  E:9

2. Select the two nodes with the lowest frequencies. Create a parent node for these two nodes and assign it a frequency equal to the sum of the frequencies of the two child nodes. Add the parent node to the list and remove the two child nodes from it. Repeat this step until only one node is left.
3. Now label each edge: the left child of each parent is labeled with the digit 0 and the right child with 1. The code word for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter. The Huffman codes are shown below in the table.

Table 1: Huffman Codes.

Symbol  Code
C       00
D       01
F       10
A       110
B       1110
E       1111

4. Arithmetic Coding

Arithmetic Coding is useful for small alphabets with highly skewed probabilities. In this method, a code word is not used to represent each symbol of the text; instead, a single code is produced for an entire message. Arithmetic Coding assigns an interval to each symbol, and a decimal number within the final interval then represents the message. Initially, the interval is [0, 1). A message is represented by a half-open interval [x, y), where x and y are real numbers between 0 and 1. The interval is divided into sub-intervals; the number of sub-intervals equals the number of symbols in the current symbol set, and their sizes are proportional to the symbols' probabilities of appearance. For each symbol of the message, a new subdivision takes place within the last sub-interval. Consider an example illustrating encoding in Arithmetic Coding.
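The tree-building procedure above (steps 1-3) can be condensed into a short sketch using a priority queue. The function name `huffman_codes` is illustrative; note that the exact bit patterns depend on how frequency ties and left/right assignments are resolved, so they may differ from Table 1 even though the code lengths agree:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} dict."""
    tiebreak = count()
    # Each heap entry: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        # Merge: prepend 0 for the left subtree, 1 for the right subtree.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"A": 25, "B": 10, "C": 99, "D": 87, "E": 9, "F": 66})
```

For the frequencies of the worked example, the resulting code lengths are 2, 2, 2, 3, 4, 4 bits, in agreement with Table 1.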
Table 2: Encoding in Arithmetic Coding.

Symbol  Probability  Range
X       0.5          [0.0, 0.5)
Y       0.3          [0.5, 0.8)
Z       0.2          [0.8, 1.0)

Table 3: Encoding the string YXX.

Symbol  Range  Low Value  High Value
-       -      0          1
Y       1      0.5        0.8
X       0.3    0.5        0.65
X       0.15   0.5        0.575

In Table 3, the range, high value, and low value are calculated as:

Range = High value - Low value
High value = Low value + Range * high end of the range of the symbol being encoded
Low value = Low value + Range * low end of the range of the symbol being encoded

The string YXX is represented by any number within the interval [0.5, 0.575).

Figure 1: Graphical display of the shrinking ranges.
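The calculations in Table 3 can be reproduced with a small sketch. The name `arithmetic_interval` is illustrative, and this computes only the final interval, not a complete arithmetic coder:

```python
def arithmetic_interval(message, ranges):
    """Narrow the half-open interval [low, high) once per symbol, as in Table 3."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low                 # Range = High value - Low value
        sym_low, sym_high = ranges[sym]
        high = low + width * sym_high      # new high value
        low = low + width * sym_low        # new low value
    return low, high

# Symbol ranges from Table 2
ranges = {"X": (0.0, 0.5), "Y": (0.5, 0.8), "Z": (0.8, 1.0)}
low, high = arithmetic_interval("YXX", ranges)   # the interval [0.5, 0.575)
```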
5. Measuring Compression Performance

There are various criteria for measuring the performance of a compression algorithm. The main concerns, however, have always been space efficiency and time efficiency. The following measurements are used to evaluate the performance of lossless algorithms:
1. Compression ratio: the ratio between the size of the compressed file and the size of the source file.
2. Compression factor: the inverse of the compression ratio.
3. Saving percentage: the percentage by which the source file shrinks.

6. Comparing the Algorithms

1. Run Length Encoding: In the worst case, RLE produces output up to twice the size of the input. This happens when the source file contains few runs; such files end up with very high compression ratios, and the algorithm provides no significant improvement over the original file.
2. Huffman Coding vs. Arithmetic Coding: The Huffman coding algorithm uses a static table for the whole coding process, so it is faster. However, it does not produce an efficient compression ratio. Arithmetic coding, on the contrary, can achieve a high compression ratio, but its compression speed is slow. Table 4 presents a simple comparison between these compression methods.

Table 4: Huffman Coding vs. Arithmetic Coding.

Criterion                    Arithmetic  Huffman
Compression ratio            Very good   Poor
Compression speed            Slow        Fast
Decompression speed          Slow        Fast
Memory space                 Very low    Low
Compressed pattern matching  No          Yes
Permits random access        No          Yes
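The three measurements of Section 5 can be computed directly; the byte counts below are made-up illustrative values:

```python
def compression_metrics(original_size, compressed_size):
    """Return the three measures from Section 5 for given file sizes."""
    ratio = compressed_size / original_size    # compression ratio
    factor = original_size / compressed_size   # compression factor (its inverse)
    saving = (1 - ratio) * 100                 # saving percentage
    return ratio, factor, saving

# Hypothetical sizes: a 1000-byte file compressed to 250 bytes
ratio, factor, saving = compression_metrics(1000, 250)
# ratio = 0.25, factor = 4.0, saving = 75.0 (%)
```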
Conclusion

Arithmetic coding outperforms Huffman coding and Run Length Encoding: its compression ratio is better than that of the other two algorithms examined above. Among the selected algorithms, Arithmetic Coding is found to be the most efficient.