Huffman Coding
David A. Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman Code for lossless data compression. Here’s the pseudocode for Huffman Coding:
Let, C is a Set of n characters.
Frequency of a character k is defined by f|k|
Q is a min-priority que keyed on f ( Q is a list which sorted in ascending order by the frequency)
HUFFMAN( C )
1 n = | C |
2 Q = C
3 for i = 1 to n-1
4 do allocate new node z
5 left[z] = x = EXTRACT-MIN(Q)
6 right[z] = y = EXTRACT-MIN(Q)
7 f(z) = f(x) + f(y)
8 INSERT(Q , z)
9 enddo
10 return EXTRACT-MIN(Q)
Lets analyze the pseudocode line by line:
HUFFMAN( C ) : C is a Set of characters
1 ‘n’ is the total number of chars in ‘C’
2 insert all the chars to the min-priority que ‘Q’
3 n-1 times
4 create a new node z
5 left child of z is the least frequent char popped frm Q
6 now pop another char from Q to create the right child
7 frequency of z is the sum of frequencies of its childrn
8 insert the newly created object into the min -priority que
9 loop ends
10 return the root of the tree
Now, we will see an example of compressing a particular text file using the Huffman Coding algorithm. Suppose, we have a text file containing the words “COMPUTER SCIENCE AND ENGINEERING”. First, we have to calculate the frequencies of different characters and sort them in ascending order.
| O | 1 |
| M | 1 |
| P | 1 |
| U | 1 |
| T | 1 |
| S | 1 |
| A | 1 |
| D | 1 |
| R R | 2 |
| G G | 2 |
| C C C | 3 |
| I I I | 3 |
| N N N N N | 5 |
| E E E E E | 6 |
|
29 |
|
First, we will take the smallest 2 frequencies at once. And create a sub-tree as follow:
(2) OM
/ \
[1] [1]
O M
2 least frequent objects are merged together and the result of the merger of these 2 object is a new node whose frequency is the sum of the frequencies of the 2 objects that were merged. All the objects that are added to the tree are then removed from the table. And all the new objects that are created after merging are then added to the table. The table should be again sorted.
We will continue this process until the frequency table has only one object & that is the root of our desired tree. The following is the final tree that we will get after n-1 iterations:
‘Leaves’ are shown as rectangles[] and ‘internal nodes’ are shown as circles(). From this tree we can get the variable code for each number. Each right edge is labeled as 1 and left edge is labeled as 0. The codeword for a letter is the sequence of labels on the edges connecting the root to the leaf for that letter. If we want to get the variable code for ‘E’, we will traverse edges while going to the leave of ‘E’ from the root. The sequence of labels on the edges is the codeword for ‘E’, in our example it is 01. I f we check out all the codewords we will see that letter of higher frequency has got a smaller codeword.
We interpret the binary codeword for a character as the path from the root to the character. Where 0 means ‘go to the left child’ and 1 means ‘go to the right child’. If we want to decode the codeword 11001, from the root our path will be right-right-left-left-right. Following the path we will reach to ‘U’, so 11001 is the codeword for ‘U’.
About this entry
You’re currently reading “Huffman Coding,” an entry on anupom.toString( );
- Published:
- October 9, 2006 / 2:33 am
- Category:
- Algorithms
- Tags:
8 Comments
Jump to comment form | comments rss [?] | trackback uri [?]