Module: LOSSLESS IMAGE COMPRESSION SYSTEMS
Lesson 3: Lossless Compression - Huffman Coding
Instructional Objectives

At the end of this lesson, the students should be able to:
1. Define and measure source entropy.
2. State Shannon's Coding theorem for noiseless channels.
3. Measure the coding efficiency of an encoding scheme.
4. State the basic principles of Huffman coding.
5. Assign Huffman codes to a set of symbols of known probabilities.
6. Encode a string of symbols into a Huffman coded bit stream.
7. Decode a Huffman coded bit stream.

3.0 Introduction

In lesson-2, we learnt the basic difference between lossless and lossy image compression schemes. There are some applications, such as satellite image analysis, medical and business document archival, medical images for diagnosis etc., where loss may not be tolerable and lossless compression techniques are to be used. Some of the popular lossless image compression techniques being used are (a) Huffman coding, (b) Arithmetic coding, (c) Ziv-Lempel coding, (d) Bit-plane coding, (e) Run-length coding etc. In this lesson, we first begin with the information-theoretic basis of lossless coding and see how to measure the information content of a set of source symbols based on the probabilities of the symbols. We shall thereafter present Shannon's theorem, which provides the theoretical lower bound of the bit-rate achievable through lossless compression techniques. The rest of the lesson is devoted to a detailed treatment of Huffman coding, which is one of the most popular lossless compression techniques adopted in multimedia standards.

3.1 Source entropy - a measure of information content

Generation of information is generally modeled as a random process that has a probability associated with it. If P(E) is the probability of an event E, its information content I(E), also known as its self-information, is measured as

I(E) = log (1 / P(E))    ...(3.1)

If P(E) = 1, that is, the event always occurs (like saying "The sun rises in the east"), then we obtain from the above that I(E) = 0, which means that there is no information associated with it. The base of the logarithm expresses the unit of information; if the base is 2, the unit is bits. For other values m of the base,
the information is expressed in m-ary units. Unless otherwise mentioned, we shall be using the base-2 system to measure information content.

Now, suppose that we have an alphabet of n symbols {ai | i = 1, 2, ..., n} having probabilities of occurrence P(a1), P(a2), ..., P(an). If k is the number of source outputs generated, which is considered to be sufficiently large, then the average number of occurrences of the symbol ai is kP(ai) and the average self-information obtained from the k outputs is given by

-k Σ (i=1 to n) P(ai) log P(ai)

and the average information per source output for the source z is given by

H(z) = -Σ (i=1 to n) P(ai) log P(ai)    ...(3.2)

The above quantity is defined as the entropy of the source and measures the uncertainty of the source. The relationship between uncertainty and entropy can be illustrated by a simple example of two symbols a1 and a2, having probabilities P(a1) and P(a2) respectively. Since the summation of the probabilities is equal to 1, we have P(a2) = 1 - P(a1), and using equation (3.2), we obtain

H(z) = -P(a1) log P(a1) - (1 - P(a1)) log (1 - P(a1))    ...(3.3)

If we plot H(z) versus P(a1), we obtain the graph shown in Fig.3.1.
It is interesting to note that the entropy is equal to zero for P(a1) = 0 and for P(a1) = 1. These correspond to the cases where at least one of the two symbols is certain to be present. H(z) assumes its maximum value of 1 bit for P(a1) = 1/2. This corresponds to the most uncertain case, where both symbols are equally probable.

3.1.1 Example: Measurement of source entropy

If the probabilities of the source symbols are known, the source entropy can be measured using equation (3.2). Say, we have five symbols a1, a2, ..., a5 having the following probabilities:

P(a1) = 0.2, P(a2) = 0.1, P(a3) = 0.05, P(a4) = 0.6, P(a5) = 0.05

Using equation (3.2), the source entropy is given by

H(z) = -0.2 log 0.2 - 0.1 log 0.1 - 0.05 log 0.05 - 0.6 log 0.6 - 0.05 log 0.05 bits = 1.67 bits

3.2 Shannon's Coding Theorem for noiseless channels

We are now going to present a very important theorem by Shannon, which expresses the lower limit of the average code word length of a source in terms of its entropy. Stated formally, the theorem says that in any coding scheme, the average code word length of a source of symbols can at best be equal to the source entropy and can never be less than it. The theorem assumes the coding to be lossless and the channel to be noiseless.
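The entropy of equation (3.2), and the value of 1.67 bits obtained in Section-3.1.1, can be checked numerically. Below is a minimal Python sketch; the function name is ours, not from the lesson:

```python
import math

def entropy(probs):
    """H(z) = -sum of p * log2(p), in bits per symbol (equation 3.2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Probabilities from the example in Section-3.1.1
print(round(entropy([0.2, 0.1, 0.05, 0.6, 0.05]), 2))   # 1.67

# The two-symbol source of Fig.3.1 peaks at 1 bit when both are equally probable
print(entropy([0.5, 0.5]))                              # 1.0
```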
If m(z) is the minimum of the average code word lengths obtained from different uniquely decipherable coding schemes, then as per Shannon's theorem, we can state that

m(z) ≥ H(z)    ...(3.4)

3.3 Coding efficiency

The coding efficiency (η) of an encoding scheme is expressed as the ratio of the source entropy H(z) to the average code word length L(z) and is given by

η = H(z) / L(z)    ...(3.5)

Since L(z) ≥ H(z) according to Shannon's Coding theorem, and both L(z) and H(z) are positive,

0 ≤ η ≤ 1    ...(3.6)

3.4 Basic principles of Huffman Coding

Huffman coding is a popular lossless Variable Length Coding (VLC) scheme (Section-2.4.3), based on the following principles:

(a) Shorter code words are assigned to more probable symbols and longer code words are assigned to less probable symbols.
(b) No code word of a symbol is a prefix of another code word. This makes Huffman coding uniquely decodable.
(c) Every source symbol must have a unique code word assigned to it.

In image compression systems (Section-2.4), Huffman coding is performed on the quantized symbols. Quite often, Huffman coding is used in conjunction with other lossless coding schemes, such as run-length coding, to be discussed in lesson-4. In terms of Shannon's noiseless coding theorem, Huffman coding is optimal for a fixed alphabet size, subject to the constraint that the source symbols are coded one at a time.
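Principle (b), the prefix property, is easy to verify mechanically for any proposed code table. A small Python sketch (the function name and the example tables are ours, for illustration only):

```python
def is_prefix_free(codes):
    """Check principle (b): no code word is a prefix of another code word."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Hypothetical three-symbol code tables
print(is_prefix_free({"a1": "0", "a2": "10", "a3": "11"}))   # True
print(is_prefix_free({"a1": "0", "a2": "01", "a3": "11"}))   # False: "0" prefixes "01"
```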
3.5 Assigning Binary Huffman codes to a set of symbols

We shall now discuss how Huffman codes are assigned to a set of source symbols of known probability. If the probabilities are not known a priori, they should be estimated from a sufficiently large set of samples. The code assignment is based on a series of source reductions, and we shall illustrate this with reference to the example shown in Section-3.1.1. The steps are as follows:

Step-1: Arrange the symbols in decreasing order of their probabilities.

Symbol   Probability
a4       0.6
a1       0.2
a2       0.1
a3       0.05
a5       0.05

Step-2: Combine the two lowest-probability symbols into a single compound symbol that replaces them in the next source reduction.

Symbol    Probability
a4        P(a4) = 0.6
a1        P(a1) = 0.2
a2        P(a2) = 0.1
a3 ∨ a5   P(a3) + P(a5) = 0.1

In this example, a3 and a5 are combined into a compound symbol of probability 0.1.

Step-3: Continue the source reductions of Step-2 until we are left with only two symbols.

Symbol           Probability
a4               P(a4) = 0.6
a1               P(a1) = 0.2
a2 ∨ (a3 ∨ a5)   P(a2) + P(a3) + P(a5) = 0.2

Symbol                  Probability
a4                      P(a4) = 0.6
a1 ∨ (a2 ∨ (a3 ∨ a5))   P(a1) + P(a2) + P(a3) + P(a5) = 0.4

The second symbol in this table denotes a compound symbol of probability 0.4. We are now in a position to assign codes to the symbols.
Step-4: Assign codes 0 and 1 to the last two symbols.

Symbol                  Probability   Assigned Code
a4                      0.6           0
a1 ∨ (a2 ∨ (a3 ∨ a5))   0.4           1

In this case, 0 is assigned to the symbol a4 and 1 is assigned to the compound symbol a1 ∨ (a2 ∨ (a3 ∨ a5)). All the elements within this compound symbol will therefore have 1 as a prefix.

Step-5: Work backwards along the table to assign the codes to the elements of the compound symbols. Continue till codes are assigned to all the elementary symbols.

Symbol           Probability   Assigned Code
a4               0.6           0
a1               0.2           10
a2 ∨ (a3 ∨ a5)   0.2           11

11 is therefore going to be the prefix of a2, a3 and a5, since this is the code assigned to the compound symbol of these three.

Symbol    Probability   Assigned Code
a4        0.6           0
a1        0.2           10
a2        0.1           110
a3 ∨ a5   0.1           111

Symbol   Probability   Assigned Code
a4       0.6           0
a1       0.2           10
a2       0.1           110
a3       0.05          1110
a5       0.05          1111

This completes the Huffman code assignment pertaining to this example. From the final table, it is evident that the shortest code word (length = 1) is assigned to the most probable symbol a4 and the longest code words (length = 4) are assigned to the two least probable symbols a3 and a5. Also, each symbol has a unique code and no code word is a prefix of the code word of another symbol. The coding has therefore fulfilled the basic requirements of Huffman coding, stated in Section-3.4.

For this example, we can compute the average code word length. If L(ai) is the code word length of symbol ai, then the average code word length is given by
L(z) = Σ (i=1 to n) P(ai) L(ai) = 0.6 × 1 + 0.2 × 2 + 0.1 × 3 + 0.05 × 4 + 0.05 × 4 = 1.7 bits

The coding efficiency is given by

η = H(z) / L(z) = 1.67 / 1.7 = 0.98

3.6 Encoding a string of symbols using Huffman codes

After obtaining the Huffman codes for each symbol, it is easy to construct the encoded bit stream for a string of symbols. For example, if we have to encode the string of symbols a4 a3 a5 a4 a1 a4 a2, we start from the left, taking one symbol at a time. The code corresponding to the first symbol a4 is 0, the second symbol a3 has the code 1110, and so on. Proceeding as above, we obtain the encoded bit stream 0111011110100110. In this example, 16 bits were used to encode the string of 7 symbols. A straight binary encoding of 7 symbols chosen from an alphabet of 5 symbols would have required 21 bits (3 bits/symbol), and this encoding scheme therefore demonstrates substantial compression.
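The source reductions of Section-3.5 and the encoding of Section-3.6 can be sketched programmatically. This is a Python illustration (the function names are ours, not from the lesson); note that since a3 and a5 have equal probabilities, a tie may be broken differently from the worked example, yielding a different but equally optimal code with the same code word lengths:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Huffman code construction by repeated source reduction.

    probs maps symbol -> probability. At every step the two least
    probable entries are merged; the more probable side of a merge
    receives bit '0' and the less probable side bit '1', following
    the convention used in the lesson.
    """
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_lo, _, lo = heapq.heappop(heap)   # least probable group
        p_hi, _, hi = heapq.heappop(heap)   # next least probable group
        merged = {s: "0" + c for s, c in hi.items()}
        merged.update({s: "1" + c for s, c in lo.items()})
        heapq.heappush(heap, (p_lo + p_hi, next(tie), merged))
    return heap[0][2]

def huffman_encode(symbols, codes):
    """Concatenate the code words of the symbols, left to right."""
    return "".join(codes[s] for s in symbols)

probs = {"a1": 0.2, "a2": 0.1, "a3": 0.05, "a4": 0.6, "a5": 0.05}
codes = huffman_codes(probs)
print(sorted(len(c) for c in codes.values()))   # [1, 2, 3, 4, 4], as in the example

# Encoding with the exact code table derived in Section-3.5
table = {"a4": "0", "a1": "10", "a2": "110", "a3": "1110", "a5": "1111"}
print(huffman_encode(["a4", "a3", "a5", "a4", "a1", "a4", "a2"], table))
# 0111011110100110
```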
In the encoded bit stream example of Section-3.6, if we receive the bit stream 0111011110100110 and follow the steps described above, we shall first decode a4 ("0"), then a3 ("1110"), followed by a5 ("1111"), a4 ("0"), a1 ("10"), a4 ("0") and a2 ("110"). This is exactly what we had encoded.

3.8 Discussions and Further Reading

In this lesson, we have discussed Huffman coding, one of the very popular lossless Variable Length Coding (VLC) schemes, named after the scientist who proposed it. The details of the coding scheme can be read from his original paper, listed in [1]. For a better understanding of coding theory in the light of Shannon's theorem, the reader is referred to Shannon's original paper, listed in [2].

We have discussed how Huffman codes can be constructed from the probabilities of the symbols. The symbol probabilities can be obtained from the relative frequencies of occurrence of the symbols, and these essentially make a first-order estimate of the source entropy. Better source entropy estimates can be obtained if we examine the relative frequencies of occurrence of a group of symbols, say by considering two consecutive symbols at a time. With reference to images, we can form pairs of gray level values, considering two consecutive pixels at a time, and thus form a second-order estimate of the source entropy. In this case, Huffman codes will be assigned to pairs of symbols, instead of individual symbols. Although third, fourth and even higher order estimates would make better approximations of the source entropy, the convergence will be slow and excessive computations are involved.

We have also seen that Huffman code assignment is based on successive source reductions. For an n-symbol source, (n-2) source reductions must be performed. When n is large, as in the case of the gray level values of images, for which n = 256, we require 254 steps of reduction, which is excessively high. In such cases, Huffman coding is done only for a few symbols of higher probability, and for the remaining symbols, a suitable prefix code, followed by a fixed-length code, is adopted.
This scheme is referred to as truncated Huffman coding. It is somewhat less optimal as compared to Huffman coding, but the code assignment is much easier. There are other variants of Huffman coding. In one of the variants, the source symbols, arranged in order of decreasing probabilities, are divided into a few blocks. Special shift-up and/or shift-down symbols are used to identify each block, and symbols within a block are assigned Huffman codes. This encoding scheme is referred to as Shift Huffman Coding. The shift symbol is the most probable symbol and is assigned the shortest code word. Interested readers may refer to [3] for further discussions on Huffman coding variants.
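The two decoding steps of Section-3.7 amount to growing a group of bits until it matches a code word. A minimal Python sketch (the function name is ours):

```python
def huffman_decode(bitstream, codes):
    """Decode by greedy prefix matching, as in Steps 1 and 2 of Section-3.7."""
    inverse = {c: s for s, c in codes.items()}
    decoded, group = [], ""
    for bit in bitstream:
        group += bit                  # Step-2: append the next bit from the left
        if group in inverse:          # the group is a complete code word
            decoded.append(inverse[group])
            group = ""                # go back to Step-1
    if group:
        raise ValueError("bit stream ended in the middle of a code word")
    return decoded

# Code table from Section-3.5, applied to the bit stream of Section-3.6
codes = {"a4": "0", "a1": "10", "a2": "110", "a3": "1110", "a5": "1111"}
print(huffman_decode("0111011110100110", codes))
# ['a4', 'a3', 'a5', 'a4', 'a1', 'a4', 'a2']
```

Because the code is prefix-free, the first complete match is always the correct one, so no backtracking is ever needed.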
Questions

NOTE: The students are advised to thoroughly read this lesson first and then answer the following questions. Only after attempting all the questions should they click the solution button and verify their answers.

PART-A

A.1. Define the entropy of a source of symbols.
A.2. How is entropy related to uncertainty?
A.3. State Shannon's coding theorem on noiseless channels.
A.4. Define the coding efficiency of an encoding scheme.
A.5. State the basic principles of Huffman coding.

PART-B: Multiple Choice

In the following questions, click the best out of the four choices.

B.1 The entropy of a source of symbols is dependent upon
(A) The number of source outputs generated.
(B) The average codeword length.
(C) The probabilities of the source symbols.
(D) The order in which the source outputs are generated.

B.2 We have two sources of symbols and wish to compare their entropies. Source-1 has three symbols a1, a2 and a3 with probabilities P(a1) = 0.9, P(a2) = P(a3) = 0.05. Source-2 also has three symbols a1, a2 and a3, but with probabilities P(a1) = 0.4, P(a2) = P(a3) = 0.3.
(A) The entropy of source-1 is higher than that of source-2.
(B) The entropy of source-1 is lower than that of source-2.
(C) The entropies of source-1 and source-2 are the same.
(D) It is not possible to compute the entropies from the given data.
B.3 Shannon's coding theorem on noiseless channels provides us with
(A) A lower bound on the average codeword length.
(B) An upper bound on the average codeword length.
(C) A lower bound on the source entropy.
(D) An upper bound on the source entropy.

B.4 Which one of the following is not true for Huffman coding?
(A) No codeword of an elementary symbol is a prefix of the codeword of another elementary symbol.
(B) Each symbol has a one-to-one mapping with its corresponding codeword.
(C) The symbols are encoded as a group, rather than encoding one symbol at a time.
(D) Shorter code words are assigned to more probable symbols.

B.5 A source of 4 symbols a1, a2, a3, a4 having probabilities P(a1) = 0.5, P(a2) = 0.25, P(a3) = P(a4) = 0.125 is encoded by four different encoding schemes and the corresponding codes are shown below. Which of the following gives us the best coding efficiency?
(A) a1 = 00, a2 = 01, a3 = 10, a4 = 11
(B) a1 = 1, a2 = 01, a3 = 001, a4 = 000
(C) a1 = 00, a2 = 010, a3 = 0110, a4 = 0111
(D) a1 = 1, a2 = 10, a3 = 100, a4 = 1000

B.6 Which of the following must be ensured before assigning binary Huffman codes to a set of symbols?
(A) The channel is noiseless.
(B) There must be exactly 2^n symbols to encode.
(C) No two symbols should have the same probability.
(D) The probabilities of the symbols should be known a priori.
B.7 Refer to the Huffman code words assigned to the five symbols a1, a2, ..., a5 in the example shown in Section-3.5. The bit stream assigned to the sequence of symbols a4 a5 a4 a1 a4 a2 a4 a1 a4 a2 a4 is
(A) 01111010011001001100
(B) 01111010011001001010
(C) 01111001011001001100
(D) 01111010011010001100

B.8 A 4-symbol alphabet has the probabilities P(a1) = 0.5, P(a2) = 0.25, P(a3) = 0.125, P(a4) = 0.125, and the codes a1 = 0, a2 = 10, a3 = 110, a4 = 111 are assigned to the symbols. The average code word length for this source is
(A) 1.5 (B) 2.5 (C) 1.75 (D) 2.0

B.9 Decode the Huffman encoded bit stream 110100111110100100, which follows the code assignment of the above problem. The sequence of symbols is
(A) a3 a2 a3 a1 a3 a2 a1 a2 a1
(B) a3 a2 a1 a4 a3 a2 a1 a3 a1
(C) a3 a2 a1 a3 a2 a1 a2 a1 a1
(D) a3 a2 a1 a4 a3 a2 a1 a2 a1

PART-C: Problems

C-1. A long sequence of symbols generated from a source is seen to have the following occurrences:

Symbol   Occurrences
a1       3003
a2       996
a3       2017
a4       1487
a5       2497

(a) Assign Huffman codes to the above symbols, following the convention that the group/symbol with higher probability is assigned a 0 and that with lower probability is assigned a 1.
(b) Calculate the entropy of the source.
(c) Calculate the average code word length obtained from Huffman coding.
(d) Calculate the coding efficiency.
(e) Why is the coding efficiency less than 1?

SOLUTIONS

A.1 The entropy of a source of symbols is defined as the average information per source output. If we have an alphabet z of n symbols {ai | i = 1, 2, ..., n} having probabilities of occurrence P(a1), P(a2), ..., P(an), the entropy of the source H(z) is given by

H(z) = -Σ (i=1 to n) P(ai) log P(ai)

The unit of entropy is dependent upon the base of the logarithm. For a base of 2, the unit is bits. In general, for a base m, the unit of entropy is m-ary units.

A.2 The entropy of a source is related to the uncertainty of the symbols associated with it. The greater the uncertainty, the higher is the entropy. We can illustrate this with the two-symbol example discussed in Section-3.1.

A.3 Shannon's coding theorem on noiseless channels states that in any coding scheme, the average code word length of a source of symbols can at best be equal to the source entropy and can never be less than it. If m(z) is the minimum of the average code word lengths obtained from different uniquely decipherable coding schemes, then Shannon's theorem states that

m(z) ≥ H(z)

A.4 Refer to Section-3.3.

A.5 Refer to Section-3.4.

B.1 (C)  B.2 (B)  B.3 (A)  B.4 (C)  B.5 (B)  B.6 (D)  B.7 (A)  B.8 (C)  B.9 (D)

C-1. (a) Since the symbols are observed over a sufficiently long sequence, the probabilities can be estimated from their relative frequencies of occurrence:

P(a1) = 0.3, P(a2) = 0.1, P(a3) = 0.2, P(a4) = 0.15, P(a5) = 0.25

Based on these probabilities, the source reductions can be done as follows:
Symbol   Prob.   Reduction-1          Reduction-2             Reduction-3
a1       0.3     a1          0.3      a4 ∨ a2 ∨ a3   0.45     a1 ∨ a5        0.55
a5       0.25    a5          0.25     a1             0.3      a4 ∨ a2 ∨ a3   0.45
a3       0.2     a4 ∨ a2     0.25     a5             0.25
a4       0.15    a3          0.2
a2       0.1

We can now work backwards to assign Huffman codes to the compound symbols and proceed to the elementary symbols:

Reduction-3           Reduction-2           Reduction-1        Original
Symbol         Code   Symbol         Code   Symbol      Code   Symbol   Code
a1 ∨ a5        0      a1             00     a1          00     a1       00
a4 ∨ a2 ∨ a3   1      a5             01     a5          01     a5       01
                      a4 ∨ a2 ∨ a3   1      a4 ∨ a2     10     a3       11
                                            a3          11     a4       100
                                                               a2       101

(b) The source entropy is given by

H(z) = -Σ (i=1 to n) P(ai) log P(ai)
     = -0.3 log 0.3 - 0.1 log 0.1 - 0.2 log 0.2 - 0.15 log 0.15 - 0.25 log 0.25
     = 2.23 bits/symbol

(c) The average code word length is given by

L(z) = Σ (i=1 to n) P(ai) L(ai)
     = 0.3 × 2 + 0.1 × 3 + 0.2 × 2 + 0.15 × 3 + 0.25 × 2
     = 2.25 bits/symbol

(d) The coding efficiency is

η = H(z) / L(z) = 2.23 / 2.25 = 0.99
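The numbers in parts (b)-(d) can be cross-checked with a short script, using the probabilities and the code table from the solution to Problem C-1 (a Python sketch for verification only):

```python
import math

# Probabilities and Huffman codes from the solution to Problem C-1
probs = {"a1": 0.3, "a2": 0.1, "a3": 0.2, "a4": 0.15, "a5": 0.25}
codes = {"a1": "00", "a2": "101", "a3": "11", "a4": "100", "a5": "01"}

H = -sum(p * math.log2(p) for p in probs.values())       # source entropy, part (b)
L = sum(probs[s] * len(c) for s, c in codes.items())     # average length, part (c)
print(round(H, 2), round(L, 2), round(H / L, 2))         # 2.23 2.25 0.99
```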
(e) The student should think of this reason and check later.

References

1. Huffman, D.A., "A Method for the Construction of Minimum-Redundancy Codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101, 1952.
2. Shannon, C.E., "A Mathematical Theory of Communication," The Bell System Technical Journal, vol. XXVII, no. 3, pp. 379-423, 1948.
3.