Analysis and Comparison of Algorithms for Lossless Data Compression


International Journal of Information and Computation Technology. ISSN 0974-2239, Volume 3, Number 3 (2013), pp. 139-146. International Research Publications House, http://www.irphouse.com/ijict.htm

Hyderabad, INDIA.

Abstract

Data compression is the art of reducing the size of a file. Its goal is to eliminate the redundancy in a file's encoding in order to reduce its size. It is useful in reducing the data storage space and the time needed to transmit the data. Data compression can be either lossless or lossy. Lossless data compression recreates the exact original data from the compressed data, while lossy data compression cannot reconstruct the original data perfectly from the compressed data. Lossy methods are mainly used for compressing sound, images, or video. Many data compression algorithms are available to compress files of different formats. This paper discusses and compares a selected set of lossless data compression algorithms.

Keywords: Data compression, Lossless Compression, Lossy Compression, Huffman Coding, Arithmetic Coding, Run Length Encoding.

1. Introduction

Data compression is the art of representing information in compact form. It reduces file size, which in turn reduces the required storage space and makes the transmission of data quicker. Compression techniques try to find redundant data and remove these redundancies. Data compression can be divided into two broad classes: lossless data compression and lossy data compression. In lossless compression, the exact original data can be recovered from the compressed data. It is used when any difference between the original data and the decompressed data cannot be tolerated. Medical images, text needed for legal purposes, and computer executable files are compressed using lossless

compression techniques. Lossy compression, as the name suggests, involves loss of information. It is used in applications where imperfect reconstruction is not an issue; video and audio are compressed using lossy compression. The extremely fast growth of data that needs to be stored and transferred has increased the demand for better transmission and storage techniques. Various lossless data compression algorithms have been proposed and used. Huffman Coding, Arithmetic Coding, the Shannon-Fano algorithm, and Run Length Encoding are some of the techniques in use. This paper examines Huffman Coding, Arithmetic Coding, and Run Length Encoding.

2. Run Length Encoding

Run Length Encoding (RLE) is the simplest of the data compression algorithms. It replaces a run of identical characters with a number representing the length of the run, followed by the character itself; single characters are coded as runs of length 1. The major task of this algorithm is to identify the runs in the source file and to record the symbol and length of each run. The algorithm uses these runs to compress the source file, while the non-runs are kept unchanged.

Example of RLE:
Input: AAABBCCCCD
Output: 3A2B4C1D

3. Huffman Coding

The Huffman coding algorithm was developed by David Huffman in 1951. Huffman coding is an entropy encoding algorithm used for lossless data compression. In this algorithm, fixed-length codes are replaced by variable-length codes. When using variable-length codewords, it is desirable to create a prefix code, which avoids the need for a separator to determine codeword boundaries; Huffman coding uses such a prefix code. The Huffman procedure has two key properties:
1. Symbols with a high frequency are given shorter encodings than symbols which occur less frequently.
2. The two symbols that occur least frequently are assigned codewords of the same length.
The Huffman algorithm uses a greedy approach, i.e. at each step the algorithm chooses the best available option, and the binary tree is built from the bottom up. To see how Huffman coding works, let us take an example. Assume that the characters in a file to be compressed have the following frequencies:

A: 25  B: 10  C: 99  D: 87  E: 9  F: 66

The process of building the tree is:
1. Create a leaf node for each symbol and arrange the nodes in order from highest to lowest frequency.

C:99  D:87  F:66  A:25  B:10  E:9

2. Select the two nodes with the lowest frequencies. Create a parent node for these two nodes and assign it a frequency equal to the sum of the frequencies of the two child nodes. Add the parent node to the list and remove the two child nodes from it. Repeat this step until only one node is left.

3. Now label each edge: the left child of each parent is labeled with the digit 0 and the right child with 1. The codeword for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter. The resulting Huffman codes are shown in Table 1.

Table 1: Huffman Codes.

Symbol  Code
C       00
D       01
F       10
A       110
B       1110
E       1111

4. Arithmetic Coding

Arithmetic coding is useful for small alphabets with highly skewed probabilities. In this method, a codeword is not used to represent each symbol of the text; instead, a single code is produced for an entire message. Arithmetic coding assigns an interval to each symbol, and a decimal number within that interval then identifies the message. Initially, the interval is [0, 1). A message is represented by a half-open interval [x, y), where x and y are real numbers between 0 and 1. The interval is divided into sub-intervals, one for each symbol in the current symbol set, with sizes proportional to the symbols' probabilities of appearance. For each symbol of the message, a new interval division takes place within the last sub-interval. Consider an example illustrating encoding in arithmetic coding.

Table 2: Encoding in Arithmetic Coding.

Symbol  Probability  Range
X       0.5          [0.0, 0.5)
Y       0.3          [0.5, 0.8)
Z       0.2          [0.8, 1.0)

Table 3: Encoding the symbol sequence YXX.

Symbol  Range  Low Value  High Value
               0.0        1.0
Y       1.0    0.5        0.8
X       0.3    0.5        0.65
X       0.15   0.5        0.575

In Table 3, the range, high value, and low value are calculated as:

Range = High value - Low value
High value = Low value + Range * high range of the symbol being coded
Low value = Low value + Range * low range of the symbol being coded

The string YXX is represented by any number within the interval [0.5, 0.575).

Figure 1: Graphical display of the shrinking ranges.
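The interval-narrowing updates above can be expressed in a few lines of code. The following is an illustrative sketch (the function name `arithmetic_encode` is ours, not from the paper); the symbol ranges are taken from Table 2:

```python
# Interval-narrowing step of arithmetic coding, reproducing Table 3.
# Symbol ranges from Table 2 of the paper.
RANGES = {"X": (0.0, 0.5), "Y": (0.5, 0.8), "Z": (0.8, 1.0)}

def arithmetic_encode(message, ranges=RANGES):
    """Return the final half-open interval [low, high) for the message."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low                 # Range = High value - Low value
        sym_low, sym_high = ranges[symbol]
        high = low + span * sym_high      # High = Low + Range * symbol's high range
        low = low + span * sym_low        # Low  = Low + Range * symbol's low range
    return low, high

low, high = arithmetic_encode("YXX")
# The interval [low, high) is approximately [0.5, 0.575), matching Table 3;
# any number inside it identifies the string YXX.
```

Note that a practical encoder would then emit the shortest binary fraction inside the final interval, and a real implementation must deal with finite-precision arithmetic; this sketch only shows the shrinking-interval idea.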

5. Measuring Compression Performance

There are various criteria for measuring the performance of a compression algorithm, but the main concerns have always been space efficiency and time efficiency. The following measurements are used to evaluate the performance of lossless algorithms:
1. Compression ratio: the ratio between the size of the compressed file and the size of the source file.
2. Compression factor: the inverse of the compression ratio.
3. Saving percentage: the percentage by which the source file shrinks.

6. Comparing the Algorithms

1. Run Length Encoding: In the worst case, RLE generates output that is twice the size of the input. This happens when the source file contains few runs; such files end up with very high compression ratios, and the algorithm then provides no significant improvement over the original file.
2. Huffman Coding vs. Arithmetic Coding: The Huffman coding algorithm uses a static table for the whole coding process, so it is faster; however, it does not produce an efficient compression ratio. By contrast, arithmetic coding can achieve a high compression ratio, but its compression speed is slow. Table 4 presents a simple comparison between these compression methods.

Table 4: Huffman Coding vs. Arithmetic Coding.

Compression method           Arithmetic  Huffman
Compression ratio            Very good   Poor
Compression speed            Slow        Fast
Decompression speed          Slow        Fast
Memory space                 Very low    Low
Compressed pattern matching  No          Yes
Permits random access        No          Yes
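The three measures above, and RLE's worst-case behavior, can be illustrated with a short sketch (file sizes are taken to be character counts here; `rle_encode` is a hypothetical helper implementing the scheme of Section 2):

```python
from itertools import groupby

def rle_encode(text):
    # Run Length Encoding as in Section 2: run length followed by the symbol.
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

def compression_ratio(compressed, original):
    # Compressed size / source size; smaller is better.
    return compressed / original

def compression_factor(compressed, original):
    # Inverse of the compression ratio.
    return original / compressed

def saving_percentage(compressed, original):
    # How much the source file shrank, in percent.
    return (original - compressed) / original * 100

good = rle_encode("AAABBCCCCD")  # "3A2B4C1D": 8 chars from 10
bad = rle_encode("ABCD")         # "1A1B1C1D": worst case, 8 chars from 4
print(compression_ratio(len(good), 10))  # 0.8
print(compression_ratio(len(bad), 4))    # 2.0, i.e. twice the input size
print(saving_percentage(len(good), 10))  # 20.0
```

The second call shows the worst case discussed above: a run-free input doubles in size, giving a compression ratio of 2.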

Conclusion

Arithmetic coding outperforms Huffman coding and Run Length Encoding: its compression ratio is better than that of the other two algorithms examined above. In this paper, arithmetic coding is found to be the most efficient of the selected algorithms.
