The Unicode Standard is a character coding system that provides a unique number for every character, regardless of platform, program, or language. It is designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages of the modern world, including classical and historical written languages. The Unicode Standard is ubiquitous: it is supported by many operating systems and browsers, and a growing number of tools support it. It is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML, and is the official way to implement ISO/IEC 10646.
Unicode is important because it is extensible and intended to be adequate for all characters and all languages. Before Unicode was invented, there were hundreds of different encoding systems for assigning numbers to characters. No single encoding contained enough characters; even for a single language such as English, no one encoding was adequate for all the letters, punctuation, and technical symbols in common use. Moreover, these earlier encoding systems conflicted with one another: two encodings frequently assigned the same number to two different characters, or used different numbers for the same character. Servers that supported multiple encodings therefore risked data corruption whenever data passed between encodings.
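The conflict between legacy encodings can be illustrated with a short Python sketch. The choice of Latin-1, Windows-1251, and code page 437 is an assumption made for illustration only; any pair of incompatible single-byte encodings would show the same effect.

```python
# One byte value, interpreted under two legacy single-byte encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")  # Western European: U+00E9 (e with acute)
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic: U+0439 (Cyrillic short i)

# Same number, two different characters.
same_number_differs = as_latin1 != as_cp1251

# One character, encoded under two legacy encodings.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("latin-1")  # one byte: 0xE9
in_cp437 = e_acute.encode("cp437")     # one byte: 0x82

# Same character, two different numbers.
same_char_differs = in_latin1 != in_cp437

# In Unicode the character has a single code point on every platform.
code_point = ord(e_acute)  # 0xE9, written U+00E9
```

A server that stores the byte 0xE9 without recording which encoding produced it cannot tell which of the two characters was meant, which is the data corruption risk described above.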
The Unicode Standard currently supports three encoding forms (UTF-8, UTF-16, and UTF-32), and its codespace is sufficient for all known character encoding requirements. Specifically, the majority of common-use characters fit into the first 65,536 (64K) code points, an area of the codespace called the basic multilingual plane, or BMP for short. There are about 6,700 unused code points for future expansion in the BMP, plus over 870,000 unused supplementary code points on the other planes. More characters are under consideration for addition to future versions of the standard. The Unicode Standard also reserves code points for private use.
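The layout of the codespace into planes can be sketched as follows. The helper names (plane_of, is_bmp, is_private_use) are hypothetical and chosen for illustration; the numeric ranges are those published in the Unicode Standard.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode codespace
PLANE_SIZE = 0x10000       # 65,536 code points per plane

# Ranges reserved for private use (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


def is_bmp(code_point: int) -> bool:
    """True if code_point lies in the basic multilingual plane (plane 0)."""
    return plane_of(code_point) == 0


def is_private_use(code_point: int) -> bool:
    """True if code_point falls in one of the private-use ranges."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


# Total number of planes in the codespace.
PLANE_COUNT = (MAX_CODE_POINT + 1) // PLANE_SIZE
```

For example, plane_of(ord("A")) is 0 because the Latin letter A lies in the BMP, while plane_of(0x1F600) is 1 because that code point lies on the first supplementary plane.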
Character encoding standards define the identity of each character and its numeric value, or code point, and how the code point is represented in bits. The three encoding forms allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (i.e., in 8, 16, or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. Unicode is extensible in that new characters can be added and assigned to unused code points.
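The relationship between the three encoding forms can be sketched in Python using the standard codecs. The helper names (code_unit_count, transcode) and the sample string are assumptions made for illustration; the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
# Encoding form -> (Python codec name, bytes per code unit).
FORMS = {
    "UTF-8": ("utf-8", 1),
    "UTF-16": ("utf-16-le", 2),
    "UTF-32": ("utf-32-le", 4),
}


def code_unit_count(text: str, form: str) -> int:
    """Number of code units needed to represent text in the given form."""
    codec, width = FORMS[form]
    return len(text.encode(codec)) // width


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded data from one encoding form to another."""
    return data.decode(FORMS[source][0]).encode(FORMS[target][0])


# U+0041 (BMP), U+20AC (BMP), U+1F600 (supplementary plane).
sample = "A\u20ac\U0001F600"

units_utf8 = code_unit_count(sample, "UTF-8")    # 1 + 3 + 4 code units
units_utf16 = code_unit_count(sample, "UTF-16")  # 1 + 1 + 2 code units
units_utf32 = code_unit_count(sample, "UTF-32")  # 1 + 1 + 1 code units

# Round trip through all three forms; the bytes come back unchanged.
utf8 = sample.encode("utf-8")
utf16 = transcode(utf8, "UTF-8", "UTF-16")
utf32 = transcode(utf16, "UTF-16", "UTF-32")
round_trip = transcode(utf32, "UTF-32", "UTF-8")
```

The same three code points occupy a different number of code units in each form, yet the round trip returns the original bytes, which is what is meant by transformation without loss of data.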
One problem with implementing the Unicode Standard is that the index is large, and when new code points are added, re-indexing to account for them can take hours of processing time. There is a need for a method of accounting for new code points without the hours of processing that re-indexing requires.