Modern computer systems typically provide facilities for representing text, as for example on a monitor or other display device. Because a computer stores text in digital format, a character set encoding is used to map each character to a unique digital representation. Numerous character set encodings (or code pages) have been developed; some well-known examples include Unicode, ISO-10646, ASCII, ISCII, ISO-2022, and EUC. Character set encodings vary significantly in their scope; certain encodings are suited to particular languages and writing systems. At one extreme, the Unicode standard defines a code space of more than one million code points, represented in encoding forms such as UTF-8 and UTF-16, and incorporates most writing systems in contemporary use. By contrast, ASCII defines only 128 characters. In general, two distinct encodings will not support the same set of characters. Because many different encodings for printing and displaying text characters are in use, it is often necessary to convert text from one encoding to another. The growth in worldwide computer-based communications among users working with different languages and writing systems has made the need for effective means of conversion between encodings more critical.
Conversion between encodings is sometimes a straightforward matter. For each character in a string in the source encoding, some mechanism determines the representation of that character in the target encoding. This might involve something as simple as a table lookup or a shift sequence. Conversion presents difficulties, however, when a character in the source encoding has no defined mapping to a character in the target encoding. In such a situation, a “fallback” technique may be applied to the character that is unknown or invalid in the target encoding.
Perhaps the simplest fallback solution involves substituting a space or a default symbol, such as ‘?’ or ‘□’, in place of the unknown or invalid source character. This technique may be called a “replacement fallback” approach. For example, a source string in which “Hello world” is preceded and followed by a character that the target encoding does not recognize or provide a mapping for might be converted to “?Hello world?”. This solution, while easy to apply, is often undesirable. In particular, the information lost in the fallback conversion generally makes it impossible to recover the source when reversing the direction of the conversion.
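Replacement fallback and its loss of information can be demonstrated with Python's built-in "replace" error handler for str.encode. The guillemets (U+00AB and U+00BB) used below are an assumed stand-in for characters that ASCII cannot represent; they are chosen for illustration and are not taken from the text above.

```python
# Source text with two characters that have no ASCII mapping (assumed example).
source = "\u00abHello world\u00bb"

# errors="replace" substitutes '?' for each character ASCII cannot encode.
encoded = source.encode("ascii", errors="replace")

# Reversing the conversion yields the question marks, not the original characters.
decoded = encoded.decode("ascii")

print(encoded)            # b'?Hello world?'
print(decoded == source)  # False: the round trip does not recover the source
```

Because every unmappable character collapses to the same symbol, nothing in the encoded bytes records which character was originally present.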
In another common fallback technique, known as “best fit,” the invalid or unknown input character is converted to the character in the target encoding with the nearest graphical likeness. For example, a source character ‘Ä’ might be represented as ‘A’ in ASCII, which has no A-diaeresis character. Like the replacement fallback technique, best fit has drawbacks in many situations. It can lead to compromises in security. For example, if an account on a system is protected by the password “Bjorn”, an intruder could gain access to the account with the input “Björn” if the input is subjected to a best-fit conversion that maps ‘ö’ to ‘o’. Naive substitution of visually similar characters may also alter or obscure the intended meaning of a sequence of characters in undesirable ways, and decoding the encoded text back to the source may become impossible.
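The password hazard can be reproduced with a small Python sketch. Python has no built-in best-fit table, so the hypothetical function best_fit_ascii below approximates best fit by Unicode decomposition (NFKD) followed by removal of combining marks. This is a swapped-in technique that happens to give the same results for the examples in the text; it is not a lookup by graphical likeness and should not be read as the mechanism any particular system uses.

```python
import unicodedata


def best_fit_ascii(text: str) -> str:
    """Approximate a best-fit conversion to ASCII (illustrative only)."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII falls through to '?' replacement.
    return stripped.encode("ascii", errors="replace").decode("ascii")


stored_password = "Bjorn"
attempt = "Bj\u00f6rn"  # "Björn", a different string from the stored password

print(best_fit_ascii("\u00c4"))                    # A
print(attempt == stored_password)                  # False
print(best_fit_ascii(attempt) == stored_password)  # True: the check is bypassed
```

The two strings differ before conversion and compare equal afterward, which is exactly the kind of collision that makes best fit unsafe for security-sensitive input.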
No single fallback mechanism can be devised that will be suitable or desirable in all encoding conversion situations. Nevertheless, in most encoding conversion systems only one fallback technique is provided. Where some ability to define or select among different fallback approaches has been provided, it has been on a very restricted and non-extensible basis.