Loss-less data compression using variable length code and greedy technique

Suppose, we have a 100-character data file and the file only contains 6 different characters from A to F. The frequency of those characters are:

A = 45
B = 15
C = 14
D = 16
E = 4
F = 6
————–
100

It means, the total number of A in the file is 45 and so on..

Fixed Length Code Representation:
Most of the binary character codes like ASCII, UTF-8 or GSM-7bit are fixed length codes. In ASCII we use 7 bits to represent 2^7 = 128 characters. For our file, to represent 6 characters using fixed length code, we need at least 3 bits for each character. The coding may look like as follow:

A = 000
B = 001
C = 010
D = 011
E = 100
F = 101

Using fixed length code, the size of our 100-character data file will be
3*100 = 300 bits.

Now we will see how we can reduce the file size using variable length code.

Variable Length Code Representation:
Prefix-free code (sometimes also called prefix code) is a certain type of variable length code. To know what is a prefix-free code, first we will see an example. The coding for our data file using prefix-free code may look like the following:

A = 0
B = 101
C = 100
D = 111
E = 1101
F = 1100

The specialty of prefix-free-code is, in this coding, no code-word is a prefix of any other code-word. As we can see here ‘101’ is a code-word, and ‘101’ is also not a prefix of any other code-words. And that is true for all other keywords too. To ensure data compression, we have followed another technique, and that is, we have assigned smaller code-words to the characters with higher frequencies. A nifty ‘Greedy Technique’! As in our file the number of occurrence of ‘A’ is the highest, ‘A’ has got the smallest code-word.

Using prefix-free code and greedy algorithm, the size of our 100-character data file will be
1*45 + 3*15 + 3*14 + 3*16 + 4*4 + 4*6 = 220 bits
(which is approximately 25% less).

Later we will see how to create a variable length code for a particular data file using Huffman Coding.


About this entry