
Volume 3, Issue 7, July 2013                                                  ISSN: 2277-128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com

Greedy Algorithm: Huffman Algorithm

Annu Malik, Neeraj Goyat, Prof. Vinod Saroha (Guide)
Computer Science and Engineering (Network Security)
B.P.S.M.V., Khanpur Kalan, Haryana, India

Abstract: This paper presents a survey of greedy algorithms. The discussion is centered on an overview of Huffman codes, the Huffman algorithm, and applications of greedy algorithms. A greedy algorithm is an algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage in the hope of finding a global optimum. For many problems a greedy strategy does not in general produce an optimal solution, but a greedy heuristic may nonetheless yield locally optimal solutions that approximate a globally optimal solution in reasonable time. Greedy algorithms determine the minimum number of coins to give while making change. These are the steps a human would take to emulate a greedy algorithm to represent 36 cents using only coins with values {1, 5, 10, 20}: the coin of the highest value not exceeding the remaining change owed is the local optimum. (Note that in general the change-making problem requires dynamic programming or integer programming to find an optimal solution; however, most currency systems, including the Euro and the US Dollar, are special cases where the greedy strategy does find an optimal solution.)

Keywords: Greedy, Huffman, activity, optimal, algorithm

I. INTRODUCTION
A greedy algorithm solves a problem by making the choice that seems best at the particular moment. Many optimization problems can be solved using a greedy algorithm. Some problems have no efficient solution, but a greedy algorithm may provide an efficient solution that is close to optimal. A greedy algorithm works if a problem exhibits the following two properties:
1) Greedy choice property: A globally optimal solution can be arrived at by making locally optimal choices. In other words, an optimal solution can be obtained by making greedy choices.
2) Optimal substructure: Optimal solutions contain optimal sub-solutions. In other words, solutions to subproblems of an optimal solution are themselves optimal.
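To make the change-making illustration from the abstract concrete, here is a minimal Python sketch of the greedy choice at work (the function name and structure are our own, not from the paper):

    def greedy_change(amount, denominations):
        # Repeatedly take the largest coin that does not exceed
        # the remaining amount -- the locally optimal choice.
        coins = []
        for coin in sorted(denominations, reverse=True):
            while amount >= coin:
                coins.append(coin)
                amount -= coin
        return coins

    # The paper's example: 36 cents with coin values {1, 5, 10, 20}
    print(greedy_change(36, {1, 5, 10, 20}))   # [20, 10, 5, 1], i.e., four coins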

II. HUFFMAN CODES
Data can be encoded efficiently using Huffman codes. Huffman coding is a widely used and very effective technique for compressing data; savings of 20% to 90% are typical, depending on the characteristics of the file being compressed. Huffman's greedy algorithm uses a table of the frequencies of occurrence of each character to build up an optimal way of representing each character as a binary string.
Suppose we have a data file of 10^5 characters. Normal storage is 8 bits per character (ASCII), i.e., 8 x 10^5 bits for the whole file. But we want to compress the file and store it compactly. Suppose only 6 characters appear in the file, with the following frequencies (in thousands):

    Character:        a    b    c    d    e    f
    Total frequency:  45   13   12   16   9    5

How can we represent the data in a compact way?
[1] Fixed-length code: each letter is represented by an equal number of bits. With a fixed-length code we need at least 3 bits per character, for example a = 000, b = 001, c = 010, d = 011, e = 100, f = 101. For a file of 10^5 characters we then need 3 x 10^5 bits.
[2] Variable-length code: it can do considerably better than a fixed-length code by giving frequent characters short codewords and infrequent characters long codewords, for example a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100. The number of bits is then

    (45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000 = 2.24 x 10^5 bits.

Thus, 224,000 bits represent the file, a saving of approximately 25%. In fact, this is an optimal character code for this file.
Let us denote the characters by C_1, C_2, ..., C_n and denote their frequencies by f_1, f_2, ..., f_n. Suppose there is an encoding E in which a bit string S_i of length s_i represents C_i. Then the length of the file compressed using encoding E is

    L(E, F) = sum of s_i * f_i, for i = 1 to n.

III. PREFIX CODES
The prefix of the encoding of one character must not be equal to the complete encoding of another character; e.g., 1 and 10 could not both be codewords, because 1 is a prefix of 10. This constraint is called the prefix constraint. Codes in which no codeword is also a prefix of some other codeword are called prefix codes. Shortening the encoding of one character may lengthen the encodings of others. The problem is to find an encoding E that satisfies the prefix constraint and minimizes L(E, F).
Prefix codes are desirable because they simplify both encoding (compression) and decoding. Encoding is always simple for any binary character code: we just concatenate the codewords representing each character of the file. Decoding is also quite simple with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous. We can simply identify the initial codeword, and repeat the decoding process on the remainder of the encoded file.
The decoding process needs a convenient representation of the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that character, where 0 means "go to the left child" and 1 means "go to the right child". Note that these are not binary search trees, since the leaves need not appear in sorted order and internal nodes do not contain character keys.
An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children. The fixed-length code in our example is not optimal, because its tree is not a full binary tree: there are codewords beginning 10..., but none beginning 11.... Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| - 1 internal nodes.

Fig. 1: Trees corresponding to the two coding schemes. Each leaf is labeled with a character and its frequency of occurrence; each internal node is labeled with the sum of the frequencies of the leaves in its subtree. (a) The tree corresponding to the fixed-length code a = 000, ..., f = 101; not optimal, since it is not a full binary tree. (b) The tree corresponding to the optimal prefix code a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100; a full binary tree with root 100, leaf a:45 directly under the root, and internal nodes 55, 25, 30, and 14 above the remaining leaves.
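The savings computed above can be checked directly. The following short Python sketch (our own illustration; the names are not from the paper) evaluates L(E, F) for both coding schemes of Fig. 1:

    # Frequencies in thousands of occurrences, as in the table above.
    freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}

    fixed    = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
    variable = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

    def encoded_length(code):
        # L(E, F): codeword length times frequency, summed over all characters.
        return sum(len(code[ch]) * f for ch, f in freq.items())

    print(encoded_length(fixed) * 1000)     # 300000 bits
    print(encoded_length(variable) * 1000)  # 224000 bits, about 25% smaller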

GREEDY ALGORITHM FOR CONSTRUCTING A HUFFMAN CODE
Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. At each step the construction replaces the two subtrees A and B of least frequency by a single subtree A+B whose children are A and B.
The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| - 1 merging operations to create the final tree. In the pseudocode HUFFMAN(C), we assume that C is a set of n characters and that each character c in C is an object with a defined frequency f[c]. A priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.

HUFFMAN(C)
1  n <- |C|
2  Q <- C
3  for i <- 1 to n - 1
4      do z <- ALLOCATE-NODE()
5         x <- left[z] <- EXTRACT-MIN(Q)
6         y <- right[z] <- EXTRACT-MIN(Q)
7         f[z] <- f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)

The analysis of the running time of Huffman's algorithm assumes that Q is implemented as a binary heap. For a set of n characters, the initialization of Q in line 2 can be performed in O(n) time using the BUILD-HEAP operation. The for loop in lines 3-8 is executed exactly n - 1 times, and since each heap operation requires time O(lg n), the loop contributes O(n lg n) to the running time. Thus, the total running time of HUFFMAN on a set of n characters is O(n lg n).
The algorithm is based on a reduction of a problem with n characters to a problem with n - 1 characters: a new character replaces two existing ones.

[Figure: the steps of Huffman's algorithm on the frequencies of Fig. 1. The two least-frequent subtrees are merged at each step: f:5 and e:9 into 14; c:12 and b:13 into 25; 14 and d:16 into 30; 25 and 30 into 55; and finally a:45 and 55 into the root 100.]
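For readers who want to run the algorithm, here is a compact Python translation of HUFFMAN(C) that uses the standard-library heapq module as the priority queue (a sketch under our own naming conventions; the paper itself gives only the pseudocode above):

    import heapq
    import itertools

    def huffman(freq):
        # Build a Huffman tree from a dict mapping character -> frequency.
        # A leaf is (f, char); an internal node is (f, left, right).
        counter = itertools.count()  # tie-breaker so equal frequencies never compare nodes
        heap = [(f, next(counter), (f, ch)) for ch, f in freq.items()]
        heapq.heapify(heap)                 # line 2: Q <- C, built in O(n)
        for _ in range(len(freq) - 1):      # lines 3-8: n - 1 merging operations
            fx, _, x = heapq.heappop(heap)  # x <- EXTRACT-MIN(Q)
            fy, _, y = heapq.heappop(heap)  # y <- EXTRACT-MIN(Q)
            z = (fx + fy, x, y)             # f[z] <- f[x] + f[y]
            heapq.heappush(heap, (fx + fy, next(counter), z))
        return heap[0][2]                   # line 9: the root of the tree

Each heappop and heappush costs O(lg n), matching the O(n lg n) bound derived above.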

Example: Find an optimal Huffman code for the following set of frequencies: a: 50, b: 25, c: 15, d: 40, e: 75.
Solution: We are given C = {a, b, c, d, e} and f(C) = {50, 25, 15, 40, 75}, so n = 5. Q <- C; in order of increasing frequency, the queue is c:15, b:25, d:40, a:50, e:75.
For i <- 1 to 4:

i = 1: z <- ALLOCATE-NODE(); x <- EXTRACT-MIN(Q) = c:15; y <- EXTRACT-MIN(Q) = b:25. Set left[z] <- x, right[z] <- y, and f[z] <- f[x] + f[y] = 15 + 25 = 40. The queue is now z:40, d:40, a:50, e:75, where z is an internal node with children c:15 and b:25.
i = 2: x <- EXTRACT-MIN(Q) = z:40; y <- EXTRACT-MIN(Q) = d:40. The new node has frequency 40 + 40 = 80, with the subtree (c:15, b:25) as its left child and d:40 as its right child. The queue is now 80, a:50, e:75.
Applying the same process again: for i = 3, a:50 and e:75 are merged into a node of frequency 125, and for i = 4 the nodes 80 and 125 are merged into the root, of frequency 205. The resulting Huffman tree has root 205; its left subtree 80 contains the node 40 (children c:15 and b:25) and the leaf d:40, and its right subtree 125 contains the leaves a:50 and e:75.
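Running the heapq sketch from the previous section on these frequencies reproduces this result (since the two nodes of frequency 40 tie, the exact left/right layout may differ from the hand trace):

    root = huffman({'a': 50, 'b': 25, 'c': 15, 'd': 40, 'e': 75})
    print(root[0])   # 205, the total frequency, as at the root of the hand-built tree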

Using Huffman Codes
Each message has a different tree, and the tree must be saved with the message. Huffman codes are effective for long files, where the savings in the message can offset the cost of storing the tree. Files are decoded by starting at the root and proceeding down the tree according to the bits in the message (0 = left, 1 = right). When a leaf is encountered, the character at that leaf is output and decoding restarts at the root. Huffman codes are also effective when the tree can be pre-computed and used for a large number of messages, e.g., a tree based on the frequency of occurrence of characters in the English language. Huffman codes are not very good for random files, where each character occurs with about the same frequency.
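Decoding as just described can be sketched in a few lines of Python, using the tuple trees produced by the earlier huffman() sketch (again our own illustration, not code from the paper):

    def decode(root, bits):
        # Walk the tree: 0 = left child, 1 = right child.
        # Internal nodes are (f, left, right); leaves are (f, char).
        out, node = [], root
        for bit in bits:
            node = node[1] if bit == '0' else node[2]
            if len(node) == 2:       # reached a leaf
                out.append(node[1])  # output its character
                node = root          # restart at the root
        return ''.join(out)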

IV. APPLICATIONS
Greedy algorithms mostly (but not always) fail to find the globally optimal solution, because they usually do not operate exhaustively on all the data. They can commit to certain choices too early, which prevents them from finding the best overall solution later. For example, all known greedy coloring algorithms for the graph coloring problem, and for all other NP-complete problems, fail to consistently find optimum solutions. Nevertheless, greedy algorithms are useful because they are quick to devise and often give good approximations to the optimum. If a greedy algorithm can be proven to yield the global optimum for a given problem class, it typically becomes the method of choice, because it is faster than other optimization methods like dynamic programming. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm for finding minimum spanning trees, Dijkstra's algorithm for finding single-source shortest paths, and the algorithm for finding optimal Huffman trees.

V. CONCLUSION
Greedy algorithms are usually easy to think of, easy to implement, and fast. Proving their correctness, however, may require rigorous mathematical proofs and is sometimes insidiously hard. In addition, greedy algorithms are infamous for being tricky: missing even a very small detail can be fatal. But when you have nothing else at your disposal, they may be the only salvation. With backtracking or dynamic programming you are on relatively safe ground; with greedy algorithms, instead, it is more like walking through a minefield: everything looks fine on the surface, but the hidden part may backfire on you when you least expect it. While there are some standardized problems, most of the problems solvable by this method call for heuristics. There is no general template for how to apply the greedy method to a given problem, though the problem specification may give you good insight. In some cases there are many greedy assumptions one can make, but only a few of them are correct. They can provide excellent challenge opportunities.