Design and Analysis of Algorithms

Similar documents
International Journal of Advanced Research in Computer Science and Software Engineering

Class Notes CS Creating and Using a Huffman Code. Ref: Weiss, page 433

Binary Trees and Huffman Encoding Binary Search Trees

Arithmetic Coding: Introduction

Data Structures Fibonacci Heaps, Amortized Analysis

Information, Entropy, and Coding

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY

encoding compression encryption

CSE 326: Data Structures B-Trees and B+ Trees

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Image Compression through DCT and Huffman Coding Technique

Binary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * *

Binary Heaps. CSE 373 Data Structures

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran

CS711008Z Algorithm Design and Analysis

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

Practice with Proofs

Lecture 1: Course overview, circuits, and formulas

Full and Complete Binary Trees

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

The Mean Value Theorem

Analysis of Compression Algorithms for Program Data

Heaps & Priority Queues in the C++ STL 2-3 Trees

GRAPH THEORY LECTURE 4: TREES

Algorithms and Data Structures

Binary Search Trees CMPSC 122

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Binary Heap Algorithms

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

Chapter 4: Computer Codes

Cpt S 223. School of EECS, WSU

From Last Time: Remove (Delete) Operation

Analysis of Algorithms I: Optimal Binary Search Trees

Lecture 1: Data Storage & Index

Number Theory. Proof. Suppose otherwise. Then there would be a finite number n of primes, which we may

K-Cover of Binary sequence

Lecture 5 - CPA security, Pseudorandom functions

Questions 1 through 25 are worth 2 points each. Choose one best answer for each.

Critical points of once continuously differentiable functions are important because they are the only points that can be local maxima or minima.

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

6 March Array Implementation of Binary Trees

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

PES Institute of Technology-BSC QUESTION BANK

Physical Data Organization

Solutions for Practice problems on proofs

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015

Sample Questions Csci 1112 A. Bellaachia

Data Structures and Algorithm Analysis (CSC317) Intro/Review of Data Structures Focus on dynamic sets

Rolle s Theorem. q( x) = 1

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

An Implementation of a High Capacity 2D Barcode

How To Solve The \"Problems Of The Power Of The Universe\" Problem In A Computer Program

- 1 - Handout #22 May 23, 2012 Huffman Encoding and Data Compression. CS106B Spring Handout by Julie Zelenski with minor edits by Keith Schwarz

MAC Sublayer. Abusayeed Saifullah. CS 5600 Computer Networks. These slides are adapted from Kurose and Ross

DATA STRUCTURES USING C

Review of Hashing: Integer Keys

COMPSCI 105 S2 C - Assignment 2 Due date: Friday, 23 rd October 7pm

Converting a Number from Decimal to Binary

The Taxman Game. Robert K. Moniot September 5, 2003

FCE: A Fast Content Expression for Server-based Computing

APP INVENTOR. Test Review

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

14.1 Rent-or-buy problem

CS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010

HUFFMAN CODING AND HUFFMAN TREE

I. GROUPS: BASIC DEFINITIONS AND EXAMPLES

Motivation Suppose we have a database of people We want to gure out who is related to whom Initially, we only have a list of people, and information a

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

Binary Coded Web Access Pattern Tree in Education Domain

Algorithms and Data Structures

Introduction to Algorithms March 10, 2004 Massachusetts Institute of Technology Professors Erik Demaine and Shafi Goldwasser Quiz 1.

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Lossless Grey-scale Image Compression using Source Symbols Reduction and Huffman Coding

Optimal Binary Search Trees Meet Object Oriented Programming

Lecture 6: Binary Search Trees CSCI Algorithms I. Andrew Rosenberg

SMALL INDEX LARGE INDEX (SILT)

Classification/Decision Trees (II)

Streaming Lossless Data Compression Algorithm (SLDC)

Reading.. IMAGE COMPRESSION- I IMAGE COMPRESSION. Image compression. Data Redundancy. Lossy vs Lossless Compression. Chapter 8.

6.080 / Great Ideas in Theoretical Computer Science Spring 2008

Comparison of different image compression formats. ECE 533 Project Report Paula Aguilera

1. The RSA algorithm In this chapter, we ll learn how the RSA algorithm works.

MATH10040 Chapter 2: Prime and relatively prime numbers

Section 1.4 Place Value Systems of Numeration in Other Bases

Binary Search Trees (BST)

x a x 2 (1 + x 2 ) n.

EE602 Algorithms GEOMETRIC INTERSECTION CHAPTER 27

Stateful Inspection Firewall Session Table Processing

The use of binary codes to represent characters

Lecture 2: Data Structures Steven Skiena. skiena

CSE 123: Computer Networks Fall Quarter, 2014 MIDTERM EXAM

Data Structure [Question Bank]

Multimedia Systems WS 2010/2011

1. Give the 16 bit signed (twos complement) representation of the following decimal numbers, and convert to hexadecimal:

International Journal of Software and Web Sciences (IJSWS)

Lecture 4 Online and streaming algorithms for clustering

Transcription:

Design and Analysis of Algorithms CSE 5311 Lecture 17 Greedy algorithms: Huffman Coding Junzhou Huang, Ph.D. Department of Computer Science and Engineering CSE5311 Design and Analysis of Algorithms 1

Data Compression Suppose we have 1000000000 (1G) character data file that we wish to include in an email. Suppose file only contains 26 letters {a,,z}. Suppose each letter a in {a,,z} occurs with frequency f a. Suppose we encode each letter by a binary code If we use a fixed length code, we need 5 bits for each character The resulting message length is 5( f + f + L f ) a b z Can we do better? CSE5311 Design and Analysis of Algorithms 2

Huffman Coding The basic idea Instead of storing each character in a file as an 8-bit ASCII value, we will instead store the more frequently occurring characters using fewer bits and less frequently occurring characters using more bits On average this should decrease the filesize (usually ½) Huffman codes can be used to compress information Like WinZip although WinZip doesn t use the Huffman algorithm JPEGs do use Huffman as part of their compression process CSE5311 Design and Analysis of Algorithms 3

Data Compression: A Smaller Example Suppose the file only has 6 letters {a,b,c,d,e,f} with frequencies a.45 000 0 b.13 001 101.12 100 Fixed length 3G=3000000000 bits Variable length c 010 d.16 011 111 e.09 100 1101 f.05 101 1100 Fixed length Variable length (. 45 1+.13 3+.12 3+.16 3+.09 4 +.05 4) = 2.24G CSE5311 Design and Analysis of Algorithms 4

How to decode? At first it is not obvious how decoding will happen, but this is possible if we use prefix codes CSE5311 Design and Analysis of Algorithms 5

Prefix Codes No encoding of a character can be the prefix of the longer encoding of another character, for example, we could not encode t as 01 and x as 01101 since 01 is a prefix of 01101 By using a binary tree representation we will generate prefix codes provided all letters are leaves CSE5311 Design and Analysis of Algorithms 6

Prefix codes A message can be decoded uniquely. Following the tree until it reaches to a leaf, and then repeat! Draw a few more tree and produce the codes!!! CSE5311 Design and Analysis of Algorithms 7

Some Properties Prefix codes allow easy decoding Given a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100 Decode 001011101 going left to right, 0 01011101, a 0 1011101, a a 101 1101, a a b 1101, a a b e An optimal code must be a full binary tree (a tree where every internal node has two children) For C leaves there are C-1 internal nodes The number of bits to encode a file is where f(c) is the freq of c, d T (c) is the tree depth of c, which corresponds to the code length of c CSE5311 Design and Analysis of Algorithms 8

Optimal Prefix Coding Problem Input: Given a set of n letters (c 1,, c n ) with frequencies (f 1,, f n ). Construct a full binary tree T to define a prefix code that minimizes the average code length Average( T ) = = n f i i length T c 1 i ( ) CSE5311 Design and Analysis of Algorithms 9

Greedy Algorithms Many optimization problems can be solved using a greedy approach The basic principle is that local optimal decisions may may be used to build an optimal solution But the greedy approach may not always lead to an optimal solution overall for all problems The key is knowing which problems will work with this approach and which will not We will study The problem of generating Huffman codes CSE5311 Design and Analysis of Algorithms 10

Greedy algorithms A greedy algorithm always makes the choice that looks best at the moment My everyday examples: Driving in Los Angeles, NY, or Boston for that matter Playing cards Invest on stocks Choose a university The hope: a locally optimal choice will lead to a globally optimal solution For some problems, it works Greedy algorithms tend to be easier to code CSE5311 Design and Analysis of Algorithms 11

David Huffman s idea A Term paper at MIT Build the tree (code) bottom-up in a greedy fashion Origami aficionado CSE5311 Design and Analysis of Algorithms 12

Building the Encoding Tree CSE5311 Design and Analysis of Algorithms 13

Building the Encoding Tree CSE5311 Design and Analysis of Algorithms 14

Building the Encoding Tree CSE5311 Design and Analysis of Algorithms 15

Building the Encoding Tree CSE5311 Design and Analysis of Algorithms 16

Building the Encoding Tree CSE5311 Design and Analysis of Algorithms 17

The Algorithm Q :priority queue An appropriate data structure is a binary min-heap Rebuilding the heap is lg n and n-1 extractions are made, so the complexity is O( n lg n ) The encoding is NOT unique, other encoding may work just as well, but none will work better CSE5311 Design and Analysis of Algorithms 18

Correctness of Huffman s Algorithm Since each swap does not increase the cost, the resulting tree T is also an optimal tree CSE5311 Design and Analysis of Algorithms 19

CSE5311 Design and Analysis of Algorithms 20 Lemma 16.2 Without loss of generality, assume f[a] f[b] and f[x] f[y] The cost difference between T and T is 0 )) ( ) ( ])( [ ] [ ( ) ( ] [ ) ( ] [ ) ( ] [ ) ( ] [ ) ( ] [ ) ( ] [ ) ( ] [ ) ( ] [ ) ( ) ( ) ( ) ( ') ( ) ( ' ' ' = + = + = = x d a d x f a f x d a f a d x f a d a f x d x f a d a f x d x f a d a f x d x f c d c f c d c f T B T B T T T T T T T T T T C c T C c T B(T ) B(T), but T is optimal, B(T) B(T ) B(T ) = B(T) Therefore T is an optimal tree in which x and y appear as sibling leaves of maximum depth

Correctness of Huffman s Algorithm Observation: B(T) = B(T ) + f[x] + f[y] B(T ) = B(T)-f[x]-f[y] For each c C {x, y} d T (c) = d T (c) f[c]d T (c) = f[c]d T (c) d T (x) = d T (y) = d T (z) + 1 f[x]d T (x) + f[y]d T (y) = (f[x] + f[y])(d T (z) + 1) = f[z]d T (z) + (f[x] + f[y]) CSE5311 Design and Analysis of Algorithms 21

B(T ) = B(T)-f[x]-f[y] z:14 B(T ) = 45*1+12*3+13*3+(5+9)*3+16*3 = B(T) - 5-9 B(T) = 45*1+12*3+13*3+5*4+9*4+16*3 CSE5311 Design and Analysis of Algorithms 22

Proof of Lemma 16.3 Prove by contradiction. Suppose that T does not represent an optimal prefix code for C. Then there exists a tree T such that B(T ) < B(T). Without loss of generality, by Lemma 16.2, T has x and y as siblings. Let T be the tree T with the common parent x and y replaced by a leaf with frequency f[z] = f[x] + f[y]. Then B(T ) = B(T ) - f[x] f[y] < B(T) f[x] f[y] = B(T ) T is better than T contradiction to the assumption that T is an optimal prefix code for C CSE5311 Design and Analysis of Algorithms 23

Example: Huffman Coding As an example, lets take the string: duke blue devils We first to a frequency count of the characters: e:3, d:2, u:2, l:2, space:2, k:1, b:1, v:1, i:1, s:1 Next we use a Greedy algorithm to build up a Huffman Tree We start with nodes for each character e,3 d,2 u,2 l,2 sp,2 k,1 b,1 v,1 i,1 s,1 CSE5311 Design and Analysis of Algorithms 24

Example: Huffman Coding We then pick the nodes with the smallest frequency and combine them together to form a new node The selection of these nodes is the Greedy part The two selected nodes are removed from the set, but replace by the combined node This continues until we have only 1 node left in the set CSE5311 Design and Analysis of Algorithms 25

Example: Huffman Coding e,3 d,2 u,2 l,2 sp,2 k,1 b,1 v,1 i,1 s,1 CSE5311 Design and Analysis of Algorithms 26

Example: Huffman Coding e,3 d,2 u,2 l,2 sp,2 k,1 b,1 v,1 2 i,1 s,1 CSE5311 Design and Analysis of Algorithms 27

Example: Huffman Coding e,3 d,2 u,2 l,2 sp,2 k,1 2 2 b,1 v,1 i,1 s,1 CSE5311 Design and Analysis of Algorithms 28

Example: Huffman Coding e,3 d,2 u,2 l,2 sp,2 3 2 k,1 2 i,1 s,1 b,1 v,1 CSE5311 Design and Analysis of Algorithms 29

Example: Huffman Coding e,3 d,2 u,2 4 3 2 l,2 sp,2 k,1 2 i,1 s,1 b,1 v,1 CSE5311 Design and Analysis of Algorithms 30

Example: Huffman Coding e,3 4 4 3 2 d,2 u,2 l,2 sp,2 k,1 2 i,1 s,1 b,1 v,1 CSE5311 Design and Analysis of Algorithms 31

Example: Huffman Coding e,3 4 4 5 d,2 u,2 l,2 sp,2 2 3 i,1 s,1 k,1 2 b,1 v,1 CSE5311 Design and Analysis of Algorithms 32

Example: Huffman Coding 7 4 5 e,3 4 l,2 sp,2 2 3 d,2 u,2 i,1 s,1 k,1 2 b,1 v,1 CSE5311 Design and Analysis of Algorithms 33

Example: Huffman Coding 7 9 e,3 4 4 5 d,2 u,2 l,2 sp,2 2 3 i,1 s,1 k,1 2 b,1 v,1 CSE5311 Design and Analysis of Algorithms 34

Example: Huffman Coding 16 7 9 e,3 4 4 5 d,2 u,2 l,2 sp,2 2 3 i,1 s,1 k,1 2 b,1 v,1 CSE5311 Design and Analysis of Algorithms 35

Example: Huffman Coding Now we assign codes to the tree by placing a 0 on every left branch and a 1 on every right branch A traversal of the tree from root to leaf give the Huffman code for that particular leaf character Note that no code is the prefix of another code CSE5311 Design and Analysis of Algorithms 36

Example: Huffman Coding 16 e 00 d 010 7 9 u 011 l 100 e,3 4 4 5 sp 101 i 1100 d,2 u,2 l,2 sp,2 2 3 s 1101 k 1110 i,1 s,1 k,1 2 b,1 v,1 b 11110 v 11111 CSE5311 Design and Analysis of Algorithms 37

Example: Huffman Coding These codes are then used to encode the string Thus, duke blue devils turns into: 010 011 1110 00 101 11110 100 011 00 101 010 00 11111 1100 100 1101 When grouped into 8-bit bytes: 01001111 10001011 11101000 11001010 10001111 11100100 1101xxxx Thus it takes 7 bytes of space compared to 16 characters * 1 byte/char = 16 bytes uncompressed CSE5311 Design and Analysis of Algorithms 38

Example: Huffman Coding Uncompressing works by reading in the file bit by bit Start at the root of the tree If a 0 is read, head left If a 1 is read, head right When a leaf is reached decode that character and start over again at the root of the tree Thus, we need to save Huffman table information as a header in the compressed file Doesn t add a significant amount of size to the file for large files (which are the ones you want to compress anyway) Or we could use a fixed universal set of codes/freqencies CSE5311 Design and Analysis of Algorithms 39