CS106B, Spring 2012                                              Handout #
May 3, 2012

Huffman Encoding and Data Compression
Handout by Julie Zelenski with minor edits by Keith Schwarz

In the early 1980s, personal computers had hard disks that were no larger than 10MB; today, the puniest of disks are still measured in hundreds of gigabytes. Even though hard drives are getting bigger, the files we want to store (funny pictures of cats, videos, music and so on) seem to keep pace with that growth, which makes even today's gargantuan disks seem too small to hold everything. One technique to use our storage more optimally is to compress the files. By taking advantage of redundancy or patterns, we may be able to "abbreviate" the contents in such a way as to take up less space yet maintain the ability to reconstruct a full version of the original when needed. Such compression could be useful when trying to cram more things on a disk or to shorten the time needed to copy/send a file over a network.

There are compression algorithms that you may already have heard of. Some compression formats, such as JPEG, MPEG, or MP3, are specifically designed to handle a particular type of data file. They tend to take advantage of known features of that type of data (such as the propensity for pixels in an image to be the same or similar colors to their neighbors) to compress it. Other tools such as compress, zip, or pack and programs like StuffIt or ZipIt can be used to compress any sort of file. These algorithms have no a priori expectations and usually rely on studying the particular data file contents to find redundancy and patterns that allow for compression.

Some of the compression algorithms (e.g. JPEG, MPEG) are lossy: decompressing the compressed result doesn't recreate a perfect copy of the original. Such an algorithm compresses by "summarizing" the data. The summary retains the general structure while discarding the more minute details.
For sound, video, and images, this imprecision may be acceptable because the bulk of the data is maintained and a few missed pixels or milliseconds of video delay is no big deal. For text data, though, a lossy algorithm usually isn't appropriate. An example of a lossy algorithm for compressing text would be to remove all the vowels. Compressing the previous sentence by this scheme results in:

n xmpl f lssy lgrthm fr cmprssng txt wld b t rmv ll th vwls.

This shrinks the original 87 characters down to just 61 and requires only 70% of the original space. To decompress, we could try matching the consonant patterns to English words with vowels inserted, but we cannot reliably reconstruct the original in this manner. Is the compressed word "fr" an abbreviation for the word "four" or the word "fir" or "far"? An intelligent reader can usually figure it out by context, but, alas, a brainless computer can't be sure and would not be able to faithfully reproduce the original. For files containing text, we usually want a lossless scheme so that there is no ambiguity when recreating the original meaning and intent.

An Overview

The standard ASCII character encoding uses the same amount of space (one byte or eight bits, where each bit is either a 0 or a 1) to store each character. Common characters don't get any special treatment; they require the same 8 bits that are used for much rarer characters such as 'ü'. A file of 1000 characters encoded using the ASCII scheme will take 1000 bytes (8000 bits); no more, no less, whether it be a file of 1000 spaces or a file containing 20 instances each of 50 different characters. A fixed-length encoding like ASCII is convenient because the boundaries between characters are easily determined and the pattern used for each character is completely fixed (i.e. 'a' is always exactly 97).

- 1 -
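Before moving on, the lossy vowel-dropping scheme from the previous paragraphs can be sketched in a few lines of Python. This is purely an illustration; the function name `compress_lossy` is made up for this example:

```python
# A minimal sketch of the lossy "drop the vowels" scheme described above.
# compress_lossy is a hypothetical name, not part of any standard library.
def compress_lossy(text):
    # Keep every character that is not a vowel.
    return "".join(ch for ch in text if ch.lower() not in "aeiou")

sentence = ("An example of a lossy algorithm for compressing "
            "text would be to remove all the vowels.")
shrunk = compress_lossy(sentence)
print(len(sentence), len(shrunk))  # 87 characters down to 61
```

Note that the compression is one-way: nothing in `shrunk` records where the vowels were, which is exactly why the original cannot be reliably reconstructed.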
In practice, it is not the case that all 256 characters in the ASCII set occur with equal frequency. In an English text document, it might be the case that only 90 or so distinct characters are used at all (meaning 166 characters in the ASCII set never even appear) and within those 90 there are likely to be significant differences in the character counts. The Huffman encoding scheme takes advantage of the disparity between frequencies and uses less storage for the frequently occurring characters at the expense of having to use more storage for each of the more rare characters. Huffman is an example of a variable-length encoding: some characters may only require 2 or 3 bits and other characters may require 7, 10, or 12 bits. The savings from not having to use a full 8 bits for the most common characters makes up for having to use more than 8 bits for the rare characters, and the overall effect is that the file almost always requires less space.

ASCII Encoding

The example we're going to use throughout this handout is encoding the particular string "happy hip hop" (don't ask me what it means, I just made it up!). Using the standard ASCII encoding, this 13-character string requires 13 * 8 = 104 bits total. The table below shows the relevant subset of the standard ASCII table.

char   ASCII (decimal)   bit pattern (binary)
h      104               01101000
a      97                01100001
p      112               01110000
y      121               01111001
i      105               01101001
o      111               01101111
space  32                00100000

The string "happy hip hop" would be encoded in ASCII as 104 97 112 112 121 32 104 105 112 32 104 111 112. Although not easily readable by humans, it would be written as the following stream of bits (each byte is separated by a space to show the boundaries):

01101000 01100001 01110000 01110000 01111001 00100000 01101000 01101001 01110000 00100000 01101000 01101111 01110000

To decode such a string (i.e. translate the binary encoding back to the original characters), we merely need to break the encoded stream of bits up into 8-bit bytes, and then convert each byte using the fixed ASCII encoding.
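This byte-by-byte scheme is easy to mimic in code. The sketch below (Python, purely for illustration) converts each character of the example string to its fixed 8-bit pattern and then decodes it again:

```python
# Sketch: fixed-length ASCII encoding and decoding of the example string.
text = "happy hip hop"

# Each character becomes exactly eight bits.
bits = " ".join(format(ord(ch), "08b") for ch in text)
print(bits)      # 01101000 01100001 01110000 ...

# Decoding just chops the stream back into 8-bit bytes.
decoded = "".join(chr(int(byte, 2)) for byte in bits.split())
print(decoded)   # happy hip hop
```

Because every character occupies exactly eight bits, no extra bookkeeping is needed to find the character boundaries.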
The first 8 bits are 01101000, which is the pattern for the number 104, and position 104 in the ASCII set is assigned to lowercase 'h'. A file encoded in ASCII does not require any additional information to be decoded since the mapping from binary to characters is the same for all files and computers.

A more compact encoding

The first thing you might notice about ASCII encoding is that using 8 bits per character can be excessively generous. Although it allows for the possibility of representing 256 different characters, we only have seven distinct characters in the phrase we're trying to encode, and thus could distinguish among these patterns with fewer bits. We could set up a special coding table just for this phrase using 3 bits for each character. Creating such an encoding is trivial: we create a list of the unique characters, and then go through and assign each a distinct encoded number from 0 to N-1. For example, here is one possible 3-bit encoding (of the 7! possible permutations):

- 2 -
char   number  bit pattern
h      0       000
a      1       001
p      2       010
y      3       011
i      4       100
o      5       101
space  6       110

Using this table, "happy hip hop" is encoded as 0 1 2 2 3 6 0 4 2 6 0 5 2, or in binary:

000 001 010 010 011 110 000 100 010 110 000 101 010

Using three bits per character, the encoded string requires 39 bits instead of the original 104 bits, compressing to 38% of its original size. However, to decode this binary representation, one would need to know the special mapping used, since using 000 for 'h' is not standard practice and, in fact, in this scheme, each compressed string uses its own special-purpose mapping that is not necessarily like any other. Some sort of header or auxiliary file would have to be attached or included with the encoded representation that provides the mapping information. That header would take up some additional space that would cut into our compression savings. For a large enough file, though, the savings from trimming down the per-character cost would likely outweigh the expense of the additional table storage.

A variable-length encoding

What if we drop the requirement that all characters take up the same number of bits? By using fewer bits to encode characters like 'p', 'h', and space that occur frequently and more to encode characters like 'y' and 'o' that occur less frequently, we may be able to compress even further. We'll later show how we generated the table below, but for now just take our word for it that it represents an optimal Huffman encoding for the string "happy hip hop":

char   bit pattern
h      01
a      000
p      10
y      1111
i      001
o      1110
space  110

Each character has a unique bit pattern encoding, but not all characters use the same number of bits. The string "happy hip hop" encoded using the above variable-length code table is:

01 000 10 10 1111 110 01 001 10 110 01 1110 10

The encoded phrase requires a total of 34 bits, saving a few more bits from the fixed-length version. What is tricky about a variable-length code is that we no longer can easily determine the boundaries between characters in the encoded stream of bits when decoding.
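To make the bit counts concrete, here is a sketch that applies both tables above to the example string; the dictionaries simply transcribe the two tables:

```python
text = "happy hip hop"

# The special-purpose 3-bit table from above.
fixed = {'h': '000', 'a': '001', 'p': '010', 'y': '011',
         'i': '100', 'o': '101', ' ': '110'}

# The optimal variable-length Huffman table from above.
variable = {'h': '01', 'a': '000', 'p': '10', 'y': '1111',
            'i': '001', 'o': '1110', ' ': '110'}

enc_fixed = "".join(fixed[ch] for ch in text)
enc_variable = "".join(variable[ch] for ch in text)
print(len(enc_fixed), len(enc_variable))  # 39 34
```

The variable-length table wins because the frequent characters 'p' and 'h' cost only two bits each, more than paying for the four-bit 'y' and 'o'.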
I separated each character's bit pattern above with spaces to help you visualize the encoding, but without this aid, you might wonder how you will know whether the first character is encoded with the two bits 01 or the three bits 010 or perhaps just the first bit 0? If you look at the encoding in the table above, you will see that only one of these

- 3 -
options is possible. There is no character that encodes to the single bit 0 and no character that encodes to the sequence 010 or 0100 or 01000, for that matter. There is, however, a character that encodes to 01, and that is 'h'.

One of the important features of the table produced by Huffman coding is the prefix property: no character's encoding is a prefix of any other (i.e. if 'h' is encoded with 01, then no other character's encoding will start with 01 and no character is encoded to just 0). With this guarantee, there is no ambiguity in determining where the character boundaries are. We start reading from the beginning, gathering bits in a sequence until we find a match. That indicates the end of a character and we move on to decoding the next character.

Like the special-purpose fixed-length encoding, a Huffman-encoded file will need to provide a header with the information about the table used so we will be able to decode the file. Each file's table will be unique since it is explicitly constructed to be optimal for that file's contents.

Encoding seen as a tree

One way to visualize any particular encoding is to diagram it as a binary tree. Each character is stored at a leaf node. Any particular character encoding is obtained by tracing the path from the root to its node. Each left-going edge represents a 0, each right-going edge a 1. For example, this tree diagrams the compact fixed-length encoding we developed previously:

[Tree diagram: a binary tree three levels deep whose leaves, read left to right, are h, a, p, y, i, o, and space, corresponding to the codes 000 through 110.]

In the above tree, the encoding for 'y' can be determined by tracing the path from the root to the 'y' node. Going left then right then right again represents a 011 encoding. A similar, much larger tree could be constructed for the entire ASCII set; it would be 8 levels deep and at the bottom would be 256 leaf nodes, one for each character. The node for the character 'a' (97 or 01100001 in binary) would be at the end of the left-right-right-left-left-left-left-right path from the root. We're starting to see why they're called binary trees!
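The "gather bits until we find a match" loop described above relies entirely on the prefix property. A short sketch of that decoding loop, with the variable-length table transcribed from earlier:

```python
# Decode a prefix-free bitstring by accumulating bits until they match a code.
code = {'h': '01', 'a': '000', 'p': '10', 'y': '1111',
        'i': '001', 'o': '1110', ' ': '110'}
inverse = {pattern: ch for ch, pattern in code.items()}

def decode(bits):
    result, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:               # the prefix property guarantees
            result.append(inverse[buffer])  # the first match is the right one
            buffer = ""
    return "".join(result)

print(decode("0100010101111110010011011001111010"))  # happy hip hop
```

If some code were a prefix of another, the early match could fire too soon; the prefix property is what makes this greedy matching safe.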
Now, let's diagram such a tree for the variable-length Huffman encoding we were using:

- 4 -
[Tree diagram: the Huffman tree for the variable-length code above. From the root, 'h' is at 01, 'p' at 10, 'a' at 000, 'i' at 001, space at 110, 'o' at 1110, and 'y' at 1111.]

The path to 'h' is just left-right or 01; the path to 'y' is right-right-right-right or 1111. Notice that the prefix property of the Huffman encoding is visually represented by the fact that characters only occupy leaf nodes (i.e. those nodes which are not a prefix of any further nodes).

The tree shown above for "happy hip hop" is, in fact, an optimal tree: there are no other tree encodings by character that use fewer than 34 bits to encode this string. There are other trees that use exactly 34 bits; for example you can simply swap any sibling nodes in the above tree and get a different but equally optimal encoding.

The Huffman tree doesn't appear as balanced as the fixed-length encoding tree. You've heard in our discussion on binary search trees that an unbalanced tree is a bad thing. However, when a tree represents a character encoding, that lopsidedness is actually a good thing. The shorter paths represent those frequently occurring characters that are being encoded with fewer bits, and the longer paths are used for more rare characters. Our plan is to shrink the total number of bits required by shortening the encoding for some characters at the expense of lengthening others. If all characters occurred with equal frequency, we would have a balanced tree where all paths were roughly equal. In such a situation we can't achieve much compression since there are no real repetitions or patterns to be exploited.

Decoding using the tree

A particularly compelling reason to diagram an encoding as a tree is the ease with which it supports decoding. Let's use the fixed-length tree to decode the stream of bits 011100010010011. Start at the beginning of the bits and at the root of the tree. The first bit is 0, so trace one step to the left; the next bit is 1, so follow right from there; the following bit is 1, so take another right. We have now landed at a leaf, which indicates that we have just completed reading the bit pattern for a single character. Looking at the leaf's label, we learn we just read a 'y'.
Now we pick up where we left off in the bits and start tracing again from the root. Tracing 100 leads us to 'i'. Continuing through the remaining bits, we decode the string "yippy". The same path-following strategy works equally well on the Huffman tree. Decoding the stream of bits 111100110101111 will first trace four steps down to the right to hit the 'y' leaf, then a left-left-right path to the 'i' leaf, and so on. Again the decoded string is "yippy". Even though the encoded characters don't start and end at evenly spaced boundaries in the Huffman-encoded bits, we have no trouble determining where each character ends because we can easily detect when the path hits a leaf node in the encoding tree.

- 5 -
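The leaf-detection decoding just described can be sketched with an explicit tree node type. The names here (`Node`, `build_tree`) are hypothetical, and the tree is built from the fixed-length table rather than read from a file:

```python
class Node:
    """A binary tree node; ch is None for interior nodes."""
    def __init__(self):
        self.ch = None
        self.zero = None  # left edge, bit 0
        self.one = None   # right edge, bit 1

def build_tree(code):
    root = Node()
    for ch, pattern in code.items():
        node = root
        for bit in pattern:  # walk (creating as needed) the path for ch
            attr = "zero" if bit == "0" else "one"
            if getattr(node, attr) is None:
                setattr(node, attr, Node())
            node = getattr(node, attr)
        node.ch = ch
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.zero if bit == "0" else node.one
        if node.ch is not None:  # hit a leaf: one character finished
            out.append(node.ch)
            node = root
    return "".join(out)

fixed = {'h': '000', 'a': '001', 'p': '010', 'y': '011',
         'i': '100', 'o': '101', ' ': '110'}
print(decode("011100010010011", build_tree(fixed)))  # yippy
```

The same `decode` works unchanged for the variable-length tree, since hitting a leaf, not counting bits, is what marks a character boundary.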
Generating an optimal tree

The pertinent question now: how is that special tree constructed? We need an algorithm for constructing the optimal tree giving a minimal per-character encoding for a particular file. The algorithm we will use here was invented by D. Huffman in 1952. To begin generating the Huffman tree, each character gets a weight equal to the number of times it occurs in the file. For example, in the "happy hip hop" example, the character 'p' has weight 4, 'h' has weight 3, the space has weight 2, and the other characters have weight 1. Our first task is to calculate these weights, which we can do with a simple pass through the file to get the frequency counts. For each character, we create an unattached tree node containing the character value and its corresponding weight. You can think of each node as a tree with just one entry. The idea is to combine all these separate trees into an optimal tree by wiring them together from the bottom upwards. The general approach is as follows:

1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.
2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two trees and with the two trees as its left and right subtrees.
3. Add the new combined tree back into the collection.
4. Repeat steps 2 and 3 until there is only one tree left.
5. The remaining node is the root of the optimal encoding tree.

Sounds simple, doesn't it? Let's walk through building the optimal tree for our example string "happy hip hop". We start with this collection of singletons; the weight of each node is labeled underneath:

[Singleton nodes: p (4), h (3), space (2), a (1), i (1), o (1), y (1)]

We start by choosing the two smallest nodes. There are four nodes with the minimal weight of one; it doesn't matter which two we pick. We choose 'o' and 'y' and combine them into a new tree whose root's weight is the sum of the weights of its children. We replace those two nodes with the combined tree.
The nodes remaining in the collection are shown at each stage.

[Remaining nodes: p (4), h (3), space (2), a (1), i (1), and the combined tree (o, y) of weight 2]

Now we repeat that step. This time there is no choice for the minimal nodes; it must be 'a' and 'i'. We take those out and combine them into a weight 2 tree. Note how the collection of nodes shrinks by one each iteration (we remove two nodes and add a new one back in).

- 6 -
[Remaining nodes: p (4), h (3), space (2), and the combined trees (o, y) and (a, i), each of weight 2]

Again, we pull out the two smallest nodes and build a tree of weight 4:

[Remaining nodes: p (4), h (3), space (2), and a weight 4 tree whose subtrees are (o, y) and (a, i)]

Note that when we build a combined node, it doesn't represent a character like the leaf nodes do. These interior nodes are used along the paths that eventually lead to valid encodings, but the prefix itself does not encode a character.

One more iteration combines the weight 3 and weight 2 trees into a combined tree of weight 5:

[Remaining nodes: p (4), the weight 4 tree (o, y, a, i), and a weight 5 tree whose subtrees are 'h' and the space]

Combining the two 4s gets a tree of weight 8:

[Remaining nodes: a weight 8 tree whose subtrees are 'p' and (o, y, a, i), and the weight 5 tree (h, space)]

- 7 -
And finally, we combine the last two to get our final tree. The root node of the final tree will always have a weight equal to the number of characters in the input file.

[Final tree of weight 13: the root's subtrees are the weight 8 tree ('p' with (o, y, a, i)) and the weight 5 tree ('h' with the space)]

Note that this tree is different from the tree shown earlier, and has slightly different bit patterns, but both trees are optimal and the total number of bits required to encode "happy hip hop" is the same for either tree. When we have choices among equally weighted nodes (such as in the first step choosing among the four characters with weight 1), picking a different two will result in a different, but still optimal, encoding. Similarly, when combining two subtrees, it is as equally valid to put one of the trees on the left and the other on the right as it is to reverse them.

Remember that it is essential that you use the same tree to do both encoding and decoding of your files. Since each Huffman tree creates a unique encoding of a particular file, you need to ensure that your decoding algorithm generates the exact same tree, so that you can get back the file.

Practical Considerations: The Pseudo-EOF

The preceding discussion of Huffman coding is correct from a theoretical perspective, but there are a few real-world details we need to address before moving on. One important concern is what happens when we try to store a Huffman-encoded sequence on-disk in a file. Each file on your computer is typically stored as a number of bytes (groups of eight bits); files are usually measured in megabytes and gigabytes rather than megabits or gigabits. As a result, if you try to write a Huffman-encoded string of bits into a file and you don't have exactly a multiple of eight bits in your encoding, the operating system will typically pad the rest of the bits with random bits. For example, suppose that we want to encode the string "ahoy" using the above Huffman tree. This results in the following sequence of bits:

1101001100111

This is exactly thirteen bits, which means that, when stored on-disk, the sequence would be padded with three extra random bits.
Suppose that those bits are 111. In that case, the bit sequence would be written to disk as

1101001100111111

If we were to then load this back from disk and decode it into a sequence of characters, we would get

- 8 -
the string "ahoyi", which is not the same string that we started with! Even worse, if those random bits end up being 000, then the stored bit sequence would be

1101001100111000

The problem is that as we decode this, we read the first thirteen bits back as "ahoy", but encounter an error when reading the last three bits because 000 is not a character in our encoding scheme. To fix this problem, we have to have some way of knowing when we've finished reading back all of the bits that encode our sequence. One way of doing this is to transform our original input string by putting some special marker at the end. This marker won't appear anywhere else in the string and serves purely as an indicator that there is nothing left to read. For example, we might actually represent the string "happy hip hop" as "happy hip hop■", where ■ marks the end of the input. When we build up our Huffman encoding tree for this string, we will proceed exactly as before, but would add in an extra node for the ■ marker. Here is one possible encoding tree for the characters in this new string:

[Tree of weight 14: 'h' is encoded as 00, ■ as 010, space as 011, 'p' as 10, 'a' as 1100, 'i' as 1101, 'o' as 1110, and 'y' as 1111.]

Now, if we want to encode "happy hip hop■", we get the following bitstring:

001100101011110110011011001100111010010

This does not come out to a multiple of eight bits (specifically, it's 39 bits long), which means that it will be padded with extra bits when stored on-disk. However, this is of no concern to us because we have written the ■ marker at the end of the string; as we're decoding, we can tell when to stop reading bits. For example, here is how we might decode the above string:

00    h
1100  a
10    p
10    p
1111  y
011   (space)
00    h
1101  i
10    p
011   (space)
00    h
1110  o
10    p
010   ■ (extra bits ignored; we knew to stop when seeing ■)

- 9 -
This character is called a pseudo-end-of-file character or pseudo-eof character, since it marks where the logical end of the bitstream is, even if the file containing that bitstream contains some extra garbage bits at the end. When you actually implement Huffman encoding in Assignment 6, you will have to make sure to insert a pseudo-eof character into your encoding tree and will have to take appropriate steps to ensure that you stop decoding bits when you reach it.

Practical Considerations: The Encoding Table

There is one last issue we have not discussed yet. Suppose that I want to compress a message and send it to you. Using Huffman coding, I can convert the message (plus the pseudo-eof) into a string of bits and send it to you. However, you cannot decompress the message, because you don't have the encoding tree that I used to send the message. There are many ways to resolve this. We could agree on an encoding tree in advance, but this only works if we already know the distribution of the letters in advance. This might be true if we were always compressing normal English text, but in general it is not possible to do. A second option, and the option used in Assignment 6, is to prefix the bit sequence with a header containing enough information to reconstruct the Huffman encoding tree.

There are many options you have for reading and writing the encoding table. You could store the table at the head of the file in a long, human-readable string format using the ASCII characters '0' and '1', one entry per line, like this:

h = 01
a = 000
p = 10
y = 1111
...

Reading this back in would allow you to recreate the tree path by path. You could have a line for every character in the ASCII set; characters that are unused would have an empty bit pattern. Or you could conserve space by only listing those characters that appear in the encoding. In such a case you must record a number that tells how many entries are in the table or put some sort of sentinel or marker at the end so you know when you have processed them all.
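One way to picture the human-readable header option is a sketch like the following. The exact layout (a count line followed by "char = bits" lines) is just one of the choices described above, not a required format, and the function names are made up:

```python
# Sketch: writing and reading back a human-readable code-table header.
code = {'h': '01', 'a': '000', 'p': '10', 'y': '1111',
        'i': '001', 'o': '1110', ' ': '110'}

def write_header(code):
    # First line: how many entries follow; then one "char = bits" line each.
    lines = [str(len(code))]
    for ch, pattern in code.items():
        lines.append("%s = %s" % (ch, pattern))
    return "\n".join(lines)

def read_header(text):
    lines = text.split("\n")
    count = int(lines[0])
    table = {}
    for line in lines[1:1 + count]:
        ch, pattern = line[0], line.split(" = ")[1]
        table[ch] = pattern
    return table

header = write_header(code)
assert read_header(header) == code  # the table round-trips
```

The leading count plays the role of the "number that tells how many entries are in the table"; a sentinel line at the end would work equally well.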
As an alternative to storing sequences of ASCII '0' and '1' characters for the bit patterns, you could store just the character frequency counts and rebuild the tree again from those counts in order to decompress. Again we might include the counts for all characters (including those that are zero) or optimize to only record those that are non-zero. Here is how we might encode the non-zero character counts for the "happy hip hop" string (the 7 at the front says there are 7 entries to follow: 6 alphabetic characters and the space character):

7 h3 a1 p4 y1 i1 o1 ␣2

- 10 -
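A counts-based header can be sketched the same way. The format below (one tab-separated entry per line, sorted so the header, and hence any tree rebuilt from it, is deterministic) is an illustrative choice, not the Assignment 6 format:

```python
from collections import Counter

def write_counts(text):
    counts = Counter(text)
    # One entry per line: the character, a tab, then its count.
    # Sorting makes the header deterministic, so both ends can
    # rebuild the exact same tree from it.
    lines = [str(len(counts))]
    for ch, n in sorted(counts.items()):
        lines.append("%s\t%d" % (ch, n))
    return "\n".join(lines)

def read_counts(header):
    lines = header.split("\n")
    counts = {}
    for line in lines[1:1 + int(lines[0])]:
        ch, n = line.split("\t")
        counts[ch] = int(n)
    return counts

header = write_counts("happy hip hop")
assert read_counts(header) == {'h': 3, 'a': 1, 'p': 4, 'y': 1,
                               'i': 1, 'o': 1, ' ': 2}
```

Determinism is the crucial property here: as the handout notes, decoding only works if the receiver reconstructs exactly the tree the sender used.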
Greedy algorithms

Huffman's algorithm is an example of a greedy algorithm. In general, greedy algorithms make small-grained, locally minimal/maximal choices in an attempt to arrive at a global minimum/maximum. At each step, the algorithm makes the near-term choice that appears to lead toward the goal in the long term. In Huffman's case, the insight behind its strategy is that combining the two smallest nodes makes both of those character encodings one bit longer (because of the added parent node above them), but given that these are the most rare characters, it is a better choice to assign them longer bit patterns than the more frequent characters. The Huffman strategy does, in fact, lead to an overall optimal character encoding. Even when a greedy strategy may not result in the overall best result, it still can be used to approximate when the true optimal solution requires an exhaustive or expensive traversal. In a time- or space-constrained situation, we might be willing to accept the quick and easy-to-determine greedy solution as an approximation.

- 11 -
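The greedy combining loop at the heart of Huffman's algorithm can be sketched with a priority queue. This version (a sketch, not the Assignment 6 interface) tracks only each character's eventual code length, which is enough to check the 34-bit total from earlier:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Greedily merge the two lightest trees until one remains."""
    counts = Counter(text)
    # Each heap entry: (weight, tiebreaker, {char: depth so far}).
    # The tiebreaker keeps the heap from ever comparing two dicts.
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # the two smallest weights...
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1           # ...each sink one level deeper
                  for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("happy hip hop")
total = sum(lengths[ch] for ch in "happy hip hop")
print(total)  # 34
```

Note how the greedy choice is visible in the two `heappop` calls: at every step, only the two currently lightest trees are considered, yet the resulting code lengths are globally optimal.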