Unicode is now widely adopted in the field of information technology for representing the characters of different languages. In the UCS-2 encoding, and in UTF-16 for characters of the Basic Multilingual Plane, 16 bits are used to represent each character. The most significant 8 bits (MSB) of the 16-bit value indicate the language/category, whereas the least significant 8 bits (LSB) identify the character within that language/category. For example, the character ‘A’ is represented as 0041hex, wherein the byte 00hex indicates that the character belongs to the Basic Latin script and the byte 41hex identifies the character ‘A’ within Basic Latin (which coincides with ASCII). Hence, each such character requires 16 bits when represented using UTF-16 or UCS-2.
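The byte split described above can be illustrated with a minimal Python sketch. The helper name `split_code_unit` is hypothetical and is introduced here only for illustration; it assumes the character fits in a single 16-bit code unit.

```python
from typing import Tuple


def split_code_unit(ch: str) -> Tuple[int, int]:
    """Return (high_byte, low_byte) of a character's 16-bit value.

    Hypothetical helper for illustration only; it rejects characters
    that do not fit in one 16-bit code unit.
    """
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character does not fit in one 16-bit code unit")
    # High byte: upper 8 bits. Low byte: lower 8 bits.
    return code >> 8, code & 0xFF


high, low = split_code_unit("A")
print(f"{high:02X}hex {low:02X}hex")  # prints "00hex 41hex"
```

For the character ‘A’ the helper returns the bytes 00hex and 41hex, matching the example given above.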
By making use of the Unicode standard, a number of different languages can be represented digitally. However, Indian languages, which are mostly based on the Brahmi script, do not have more than 128 widely used characters. Allocating 16 bits to each character of such a language therefore leads to an unnecessary increase in the size of the text data.
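The overhead can be made concrete with a short Python sketch. It takes Devanagari as an example script (its Unicode block spans 0900hex to 097Fhex, i.e. 128 code points) and a sample word chosen only for illustration; the function name `utf16_profile` is hypothetical.

```python
def utf16_profile(text: str):
    """Return (total_bytes, distinct_high_bytes, max_low_byte) for the
    big-endian UTF-16 encoding of `text`.

    Illustrative helper; assumes every character lies in the Basic
    Multilingual Plane so that each one occupies exactly two bytes.
    """
    encoded = text.encode("utf-16-be")
    high_bytes = {encoded[i] for i in range(0, len(encoded), 2)}
    low_bytes = [encoded[i] for i in range(1, len(encoded), 2)]
    return len(encoded), high_bytes, max(low_bytes)


# Sample Devanagari word, written with escapes: U+0928 U+092E U+0938
# U+094D U+0924 U+0947.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947"
total, highs, max_low = utf16_profile(sample)
print(total, sorted(highs), hex(max_low))
```

For this sample, six characters occupy twelve bytes, every high byte is the same value (09hex), and every low byte is below 80hex, so the high byte carries no information that varies from character to character.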
The existing techniques used to reduce the size of text data are based on Huffman compression, which reduces the size of a text file by making use of a symbol table. However, with Huffman compression a new symbol table needs to be generated each time a text is compressed. Moreover, the level of compression achieved using Huffman compression is not uniform and depends largely on the size of the text file being compressed.
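The dependence of the symbol table on the input can be seen in a minimal sketch of a textbook Huffman construction in Python. This is not the implementation of any particular existing tool; the function name `huffman_table` and the sample strings are assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_table(text: str) -> dict:
    """Build a symbol -> bit-string table from the symbol frequencies
    of `text` (textbook Huffman construction, illustrative only)."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


table_a = huffman_table("aaaabbc")
table_b = huffman_table("abcccc")
print(table_a)
print(table_b)
```

The two inputs use the same three symbols but yield different tables: the most frequent symbol of each text receives the shortest code. A decoder therefore needs the table that was generated for the particular text being decompressed.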