CSE 100: HUFFMAN CODES
READING QUIZ (NO TALKING, NO NOTES)

Q1: What do the symbol frequencies used in designing optimal codes represent?
A. The frequency of occurrence of symbols in the source file to be encoded
B. The probability of searching for a symbol in the source file containing the symbols
C. The inverse of the code length for each symbol
D. The lower bound on the average code length

Q2: (True or False) Given a set of symbols, the best possible average code length is minimum when the frequency of occurrence of all symbols is uniformly distributed.
A. True
B. False

Q3: (True or False) Prefix codes can always be uniquely decoded.
A. True
B. False

Q4: Which of the following indicates that a code is NOT a prefix code?
A. The binary tree representation of the code is not balanced
B. In the binary tree representation of the code, all the symbols appear as leaf nodes
C. In the binary tree representation of the code, one or more symbols appear as intermediate nodes (nodes with at least one child)
Symbol   Code A   Code B   Code C
S        00       0        0
P        01       1        10
A        10       10       110
M        11       11       111

[Figure: corresponding binary tree for each code]
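One way to tell these three codes apart is to test the prefix property directly: a code is a prefix code when no codeword is the beginning of a different codeword. The following is a minimal sketch of such a check; the function name `isPrefixCode` is an illustrative assumption and does not come from the slides.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if no codeword is a prefix of a different codeword.
// (Two identical codewords also count as a violation.)
bool isPrefixCode(const std::vector<std::string>& codewords) {
    for (std::size_t i = 0; i < codewords.size(); ++i) {
        for (std::size_t j = 0; j < codewords.size(); ++j) {
            if (i == j) continue;
            const std::string& a = codewords[i];
            const std::string& b = codewords[j];
            // a is a prefix of b when b starts with all of a's characters
            if (a.size() <= b.size() && b.compare(0, a.size(), a) == 0) {
                return false;
            }
        }
    }
    return true;
}
```

Applied to the table above, Code A and Code C pass this check, while Code B fails because the codeword for P (1) is a prefix of the codewords for A (10) and M (11).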
Problem Definition (revisited)
Input: The frequency (p_i) of occurrence of each symbol (S_i)
Output: Binary tree T that minimizes the following objective function:
$$L(T) = \sum_{i=1}^{n} p_i \cdot \mathrm{Depth}(S_i \text{ in } T)$$
Solution: Huffman Codes
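The objective function can be evaluated for any given code, because the depth of a symbol's leaf equals the length of its codeword. The sketch below does this; the function name `averageCodeLength` and the frequencies used in the note after it are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Evaluates L(T) = sum over i of p_i * Depth(S_i in T).
// freqs[i] is p_i and codewords[i] is the codeword of S_i;
// the depth of a leaf equals the length of its codeword.
double averageCodeLength(const std::vector<double>& freqs,
                         const std::vector<std::string>& codewords) {
    double total = 0.0;
    for (std::size_t i = 0; i < freqs.size() && i < codewords.size(); ++i) {
        total += freqs[i] * static_cast<double>(codewords[i].size());
    }
    return total;
}
```

With assumed frequencies 0.6, 0.2, 0.1, 0.1 for S, P, A, M, the fixed-length Code A (00, 01, 10, 11) gives L(T) = 2.0, while Code C (0, 10, 110, 111) gives 0.6·1 + 0.2·2 + 0.1·3 + 0.1·3 = 1.6.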
The David Huffman Story!

TEXT FILE: map smppam ssampamsmam

Letter   freq
s        0.6
p        0.2
a        0.1
m        0.1

"Huffman coding is one of the fundamental ideas that people in computer science and data communications are using all the time" - Donald Knuth
Not quite Huffman's algorithm
The basic idea is to put the frequent symbols near the root (short codes) and the less frequent symbols deeper in the tree (long codes).
A simple idea is the top-down approach:
[Figure: top-down tree over the symbols A, B, C, G, H]
Counts: A: 6; B: 4; C: 4; D: 0; E: 0; F: 0; G: 1; H: 2
Message: AAAAABBAHHBCBGCCC
Pretty good, but NOT optimal!
Huffman's algorithm: Bottom up construction
Build the tree from the bottom up!
Start with a forest of trees, all with just one node:

Symbol   A   B   C   D   E   F   G   H
Count    6   4   4   0   0   0   1   2

Merge trees in the forest two at a time to get a single tree. (The zero-count symbols D, E, F are left out of the forest.)

Symbol   A   B   C   G   H
Count    6   4   4   1   2

What should be the merge criterion?
Huffman's algorithm: Bottom up construction
[Figure: forest with A (6), B (4), C (4), and T1, whose children are G (1) and H (2)]
T1 now represents the meta symbol GH.
What is the count associated with T1?
A. Max(count(G), count(H))
B. (count(G) + count(H))/2
C. (count(G) + count(H))
Huffman's algorithm: Bottom up construction
Build the tree from the bottom up!
Start with a forest of trees, all with just one node
Choose the two smallest trees in the forest and merge them
Repeat until all nodes are in the tree
[Figure: T2 (count 7) merges T1 and C, leaving A (6) and B (4) in the forest]
[Figure: final tree. The root T4 (count 17) has children T3 (over A and B) and T2 (over T1 and C, with T1 over G and H)]
You Try It!

Letter   Count
u        40
c        20
s        15
d        15
y        6
a        4

Build the tree and write down the code for each symbol.
Then encode the string "cya" using this code.
Rules for building the tree in a deterministic way:
Huffman's algorithm: Building the Huffman Tree
0. Determine the count of each symbol in the input message.
1. Create a forest of single-node trees containing symbols and counts for each non-zero-count symbol.
2. Loop while there is more than 1 tree in the forest:
   2a. Remove the two lowest count trees.
   2b. Combine these two trees into a new tree (summing their counts).
   2c. Insert this new tree in the forest, and go to 2.
3. Return the one tree in the forest as the Huffman code tree.
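Steps 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions, not the course's reference implementation: the names `Node` and `huffmanCodeLengths` are hypothetical, nodes are stored in a vector and referred to by index, and ties between equal counts are broken by node index, which may yield a different (but equally short) tree than the one drawn on the slides. The function returns each symbol's code length, that is, its depth in the final tree.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

struct Node {
    int count;    // total count of the symbols in this subtree
    int child0;   // index of the "0" child, -1 for a leaf
    int child1;   // index of the "1" child, -1 for a leaf
    char symbol;  // meaningful only for leaves
};

std::map<char, int> huffmanCodeLengths(const std::map<char, int>& counts) {
    typedef std::pair<int, int> Entry;  // (count, node index)
    std::vector<Node> nodes;
    // std::greater turns the queue into a min-heap: lowest count comes out first.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > forest;

    // Step 1: one single-node tree per non-zero-count symbol.
    for (std::map<char, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        if (it->second <= 0) continue;
        Node leaf = {it->second, -1, -1, it->first};
        nodes.push_back(leaf);
        forest.push(Entry(it->second, static_cast<int>(nodes.size()) - 1));
    }

    // Step 2: repeatedly remove the two lowest-count trees and merge them.
    while (forest.size() > 1) {
        Entry a = forest.top(); forest.pop();
        Entry b = forest.top(); forest.pop();
        Node parent = {a.first + b.first, a.second, b.second, '\0'};
        nodes.push_back(parent);
        forest.push(Entry(parent.count, static_cast<int>(nodes.size()) - 1));
    }

    // Step 3: the remaining tree is the code tree; record each leaf's depth.
    // (A single-symbol input yields depth 0 for that symbol.)
    std::map<char, int> lengths;
    if (forest.empty()) return lengths;
    std::vector<Entry> pending;  // (node index, depth)
    pending.push_back(Entry(forest.top().second, 0));
    while (!pending.empty()) {
        Entry cur = pending.back();
        pending.pop_back();
        const Node& n = nodes[cur.first];
        if (n.child0 < 0) {
            lengths[n.symbol] = cur.second;
        } else {
            pending.push_back(Entry(n.child0, cur.second + 1));
            pending.push_back(Entry(n.child1, cur.second + 1));
        }
    }
    return lengths;
}
```

For the counts A: 6, B: 4, C: 4, G: 1, H: 2 (with D, E, F at 0), this sketch gives code length 2 for A, B, and C and code length 3 for G and H, for a total of 37 bits over the 17-symbol message.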
Huffman Algorithm: Forest of Trees
[Figure: forest with A (6), B (4), C (4), and T1, whose children are G (1) and H (2)]
What is a good data structure to use to hold the forest of trees?
A. BST: Supports min, insert and delete in O(log N) (if balanced)
B. Sorted array: Not good for dynamic data
C. Linked list: If unordered, then good for insert (constant time), but min would be O(N). If ordered, then delete and min are constant time, but insert would be O(N)
D. Something else: Heap (new data structure?)
What is a Heap?
Think of a Heap as a binary tree that is as complete as possible and satisfies the following property:
At every node x, Key[x] <= Key[children of x]
So the root has the value
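The heap property can be checked mechanically. The sketch below assumes the common array layout for a complete binary tree, in which the children of the node at index i sit at indices 2i+1 and 2i+2; that layout and the name `isMinHeap` are assumptions made for illustration and are not stated on the slide.

```cpp
#include <cstddef>
#include <vector>

// Checks Key[x] <= Key[children of x] at every node of a complete binary
// tree stored level by level in an array (children of i at 2i+1 and 2i+2).
bool isMinHeap(const std::vector<int>& keys) {
    for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t left = 2 * i + 1;
        std::size_t right = 2 * i + 2;
        if (left < keys.size() && keys[i] > keys[left]) return false;
        if (right < keys.size() && keys[i] > keys[right]) return false;
    }
    return true;
}
```

For example, the keys 1, 2, 4, 6, 4 satisfy the property, and the smallest key sits at the root (index 0); the keys 6, 4, 4, 1, 2 do not.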
Heap vs. BST vs. Sorted Array

Operation                     BST (Balanced)   Sorted Array   Heap
Search                        O(log N)         O(log N)
Selection                     O(log N)         O(1)
Min and Max                   O(log N)         O(1)
Min or Max                    O(log N)         O(1)
Predecessor/Successor         O(log N)         O(1)
Rank                          O(log N)         O(log N)
Output in sorted order        O(N)             O(N)
Insert                        O(log N)         O(N)
Delete                        O(log N)         O(N)
Extract min or extract max

Ref: Tim Roughgarden (Stanford)
The suitability of Heap for our problem
In the Huffman problem we are doing repeated inserts and extract-min!
Perfect setting to use a Heap data structure.
The C++ STL container class priority_queue has a Heap implementation.
The terms Priority Queue and Heap are often used synonymously.
Priority Queues in C++
A C++ priority_queue is a generic container, and can hold any kind of thing as specified with a template parameter when it is created: for example HCNodes, or pointers to HCNodes, etc.

    #include <queue>
    std::priority_queue<HCNode> p;

You can extract the object of highest priority in O(log N) time.
To determine priority, objects in a priority queue must be comparable to each other.
By default, a priority_queue<T> uses operator< defined for objects of type T: if a < b, b is taken to have higher priority than a.
Priority Queues in C++
The C++ priority_queue is synonymous to which of the following Heap data structures:
A. Max-Heap
B. Min-Heap
C. BST
D. Sorted Array
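A quick way to settle this question is to pop everything from a queue and look at the order. The sketch below does so for a default `std::priority_queue<int>` and for one that uses `std::greater<int>` as its comparison; the helper names `popOrderDefault` and `popOrderMinFirst` are illustrative assumptions.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Pops every element from a default priority_queue (comparison: operator<).
std::vector<int> popOrderDefault(const std::vector<int>& values) {
    std::priority_queue<int> pq(values.begin(), values.end());
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();
    }
    return out;
}

// Same, but with std::greater as the comparison, which reverses the priority.
std::vector<int> popOrderMinFirst(const std::vector<int>& values) {
    std::priority_queue<int, std::vector<int>, std::greater<int> > pq(
        values.begin(), values.end());
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();
    }
    return out;
}
```

For the counts 6, 4, 4, 1, 2, the default queue pops 6, 4, 4, 2, 1 (largest first, so it behaves as a max-heap), while the `std::greater` version pops 1, 2, 4, 4, 6. Huffman's algorithm needs the second behavior, which is why the comparison has to be reversed in some way.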
Priority Queues in C++

    #ifndef HCNODE_HPP
    #define HCNODE_HPP

    class HCNode {
    public:
        HCNode* parent;      // pointer to parent; null if root
        HCNode* child0;      // pointer to "0" child; null if leaf
        HCNode* child1;      // pointer to "1" child; null if leaf
        unsigned char symb;  // symbol
        int count;           // count/frequency of symbols in subtree

        // for less-than comparisons between HCNodes
        bool operator<(HCNode const &) const;
    };

    #endif
In HCNode.cpp:

    #include "HCNode.hpp"

    /** Compare this HCNode and other for priority ordering.
     *  Smaller count means higher priority.
     */
    bool HCNode::operator<(HCNode const & other) const {
        // if counts are different, just compare counts
        return count > other.count;
    }

What is wrong with this implementation?
A. Nothing
B. It is non-deterministic (in our algorithm)
C. It returns the opposite of the desired value for our purpose
In HCNode.cpp:

    #include "HCNode.hpp"

    /** Compare this HCNode and other for priority ordering.
     *  Smaller count means higher priority.
     *  Use node symbol for deterministic tiebreaking.
     */
    bool HCNode::operator<(HCNode const & other) const {
        // if counts are different, just compare counts
        if (count != other.count) return count > other.count;
        // counts are equal. use symbol value to break tie.
        // (for this to work, internal HCNodes
        // must have symb set.)
        return symb < other.symb;
    }
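If the queue holds pointers to HCNodes rather than HCNode objects, the default comparison would order the pointers themselves, not the nodes they point to. One possible way around this is a comparison class that dereferences both pointers. The sketch below uses a simplified stand-in for HCNode (only count and symb) and a hypothetical comparator name, `HCNodePtrComp`; neither the stand-in nor the comparator comes from the slides.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Simplified stand-in for the HCNode class (assumption: only the two
// fields that the comparison needs).
struct HCNode {
    int count;
    unsigned char symb;

    // Smaller count means higher priority; symbol value breaks ties.
    bool operator<(HCNode const& other) const {
        if (count != other.count) return count > other.count;
        return symb < other.symb;
    }
};

// Hypothetical comparison class: dereferences the pointers so that
// HCNode::operator< decides the order instead of the pointer values.
struct HCNodePtrComp {
    bool operator()(HCNode* lhs, HCNode* rhs) const {
        return *lhs < *rhs;
    }
};

// Returns the symbols in the order in which they leave the queue.
std::vector<unsigned char> popOrder(std::vector<HCNode>& nodes) {
    std::priority_queue<HCNode*, std::vector<HCNode*>, HCNodePtrComp> pq;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        pq.push(&nodes[i]);
    }
    std::vector<unsigned char> out;
    while (!pq.empty()) {
        out.push_back(pq.top()->symb);
        pq.pop();
    }
    return out;
}
```

For leaves A (6), B (4), C (4), G (1), H (2), this sketch pops G, H, C, B, A: lowest count first, and among the two nodes with count 4 the one with the larger symbol value (C) comes out before B, because `symb < other.symb` gives the larger symbol the higher priority.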