Text symbols and scripted glyphs, sometimes referred to as characters, are stored and represented within digital systems in numeric coded form. To facilitate communication between two digital systems, e.g. two computers, it is useful to employ a shared format, or code, that is known to the two digital systems so that an encoded character is interpreted by the receiving digital system in the same way as it was intended by the sending digital system. The shared format may be a standardized format, having a specification that can be easily obtained or that is distributed with computer software and operating systems.
One widely used code for representing a common set of characters in English and other Western languages is ASCII (the American Standard Code for Information Interchange), which has been in use in some form since the early 1960s. The basic ASCII definition is a 7-bit code representing the English alphabet (upper and lower case), the Arabic numerals (0-9), various punctuation and arithmetic symbols (e.g., #, $, +, ~), and a number of control characters (e.g., Line Feed, Escape), for a total of 128 unique code values; common 8-bit (octet) extensions expand the set to 256 code values.
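To illustrate the character-to-code mapping, the following short Python sketch (purely a demonstration, not part of any standard) prints the ASCII code values of a few characters:

```python
# Each ASCII character corresponds to a fixed numeric code value;
# ord() returns the code and chr() performs the reverse lookup.
for ch in ["A", "a", "0", "$"]:
    print(ch, "->", ord(ch))

# Control characters occupy the low end of the table:
# Line Feed is code 10 and Escape is code 27.
assert ord("\n") == 10
assert ord("\x1b") == 27
assert chr(65) == "A"
```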
With the proliferation of digital communication and computing platforms around the world, the shortcomings of limited codes such as ASCII have become apparent. For example, with at most 256 possible code values, an 8-bit code cannot possibly uniquely represent every character or symbol used in every written language, mathematics, commerce, etc. As a result, different character sets have been used in place of the traditional ASCII character set to accommodate other national (local) alphabets. Extensions to the basic code sets have also been employed to expand the repertoire of characters that can be uniformly stored and communicated between digital information platforms. Without standardization on a universal character set, the same code value can be used by two (inconsistent) codes to represent two different characters. Where such inconsistencies (or overlapped use of code values) exist between the character sets used by different users, a document prepared by one user may contain errors or misinterpreted data when read by another user. As a starting point, a larger and more flexible character code space is employed.
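The overlap problem can be demonstrated concretely. Under three common legacy 8-bit character sets (ISO 8859-1, Windows-1251, and Windows-1253, chosen here purely as examples), the single byte value 0xE9 decodes to three different characters:

```python
import unicodedata

# The same byte value denotes different characters under different
# legacy (pre-Unicode) character sets; each decode below yields a
# different letter from the single byte 0xE9.
raw = bytes([0xE9])
for codec in ("latin-1", "cp1251", "cp1253"):
    ch = raw.decode(codec)
    print(codec, hex(ord(ch)), unicodedata.name(ch))
```

A document written under one of these character sets and read under another would silently display the wrong letters, which is precisely the inconsistency a universal code is meant to eliminate.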
One standard code covering a much larger character space than ASCII is Unicode. Unicode is a superset of codes closely associated with the Universal Character Set and is conformant with the ISO/IEC 10646 international standard, among others, defining a very large character repertoire. One main purpose in developing Unicode was to provide a uniform character set large enough that no character duplication or code value overlap is necessary; that is, a code that can reasonably accommodate all current (and some past) languages and written symbols likely to be encountered. Thus, almost any character in almost any written language, as well as multitudes of mathematical, logical, and symbolic characters, is defined in Unicode.
As in other codes, Unicode assigns a numeric value and a name to each of its characters, and includes information regarding each character's case, directionality, and other properties. Unicode is concerned with the interpretation and processing of characters rather than with their physical rendered form or display properties as they would appear, for example, on a computer screen. The default Unicode encoding is 16 bits, providing about 65,000 available characters, with an extension mechanism (called surrogates) further allowing for about one million possible characters. A Unicode Consortium of computing and communication industry representatives and individuals has been established to provide a forum for implementing such a universal code.
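These properties are easy to observe in practice. The sketch below (using only Python's standard unicodedata module) shows a character's numeric value, name, and directionality, and shows how a character beyond the 16-bit range occupies two 16-bit units (a surrogate pair) when encoded in UTF-16:

```python
import unicodedata

# Unicode assigns each character a numeric value, a name, and
# properties such as directionality.
ch = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE
print(hex(ord(ch)), unicodedata.name(ch))
print(unicodedata.bidirectional("\u05d0"))  # HEBREW LETTER ALEF -> "R"

# Characters up to U+FFFF fit in one 16-bit unit; characters beyond
# that are encoded in UTF-16 as a surrogate pair (two 16-bit units).
bmp = "A"                     # U+0041
supplementary = "\U0001d11e"  # U+1D11E, MUSICAL SYMBOL G CLEF
print(len(bmp.encode("utf-16-be")))            # 2 bytes: one unit
print(len(supplementary.encode("utf-16-be")))  # 4 bytes: surrogate pair
```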
Some character code sets, such as Unicode, provide assigned locations in the code tables for various families of characters. For example, locations or blocks are allocated for Basic Latin, Cyrillic, Greek, mathematical operators, musical symbols, Braille symbols, arrows, currency symbols, etc. These are distinct from the familiar “font” variants, which are not encoded by Unicode at this time. Lists of the assigned characters, e.g., in the Unicode Standard Version 3.0, can be found at www.unicode.org.
In addition to the assigned families of character locations, some locations are defined in the Unicode standard as vendor-definable supplementary private use areas. Further, about 7,800 code values are left unused by the current Unicode standard to allow for future expansion of the basic coding space. For example, Unicode Supplementary Private Use Area-A ("Area-A") is one such private use area that contains no character assignments. The locations occupied by Area-A are in the range U+F0000 to U+FFFFD.
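As a rough sketch, a range test against the published Area-A bounds suffices to detect such private-use code points; the constants and function below are illustrative only, restating the U+F0000 to U+FFFFD range from the standard:

```python
import unicodedata

# Supplementary Private Use Area-A spans U+F0000..U+FFFFD; these
# code points carry no standard character assignment and are free
# for vendor- or application-defined use.
PUA_A_START, PUA_A_END = 0xF0000, 0xFFFFD

def in_pua_a(ch):
    """Return True if ch lies in Supplementary Private Use Area-A."""
    return PUA_A_START <= ord(ch) <= PUA_A_END

print(in_pua_a(chr(0xF0001)))  # True
print(in_pua_a("A"))           # False

# Private-use code points report the Unicode general category "Co".
print(unicodedata.category(chr(0xF0001)))
```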
Even when a large code set is defined and standardized, a problem remains in converting and translating text and characters to and from languages having special rules, e.g., mutation rules. A mutation rule is usually context-dependent, defining varying presentation forms of a character or a word as a function of context and environment, e.g., gender, tense, plural/singular form, or isolated/initial/medial/final position. Simple substitution of one character for another during translation can introduce errors when a language has mutation rules, because one language does not generally have or follow the same mutation rules as another. Better internationalization, or localization, capabilities are needed for cross-lingual and multilingual software environments.
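A minimal sketch of a mutation rule follows. The table layout and function name are illustrative only (this is not a real text-shaping engine); the one entry shown uses the standard Arabic Presentation Forms-B code points for the Arabic letter BEH (U+0628):

```python
# Illustrative only: a mutation rule selects a presentation form of
# a character based on its position in a word. The entry below maps
# ARABIC LETTER BEH (U+0628) to its four standard presentation-form
# code points from the Arabic Presentation Forms-B block.
FORMS = {
    "\u0628": {
        "isolated": "\ufe8f",
        "initial":  "\ufe91",
        "medial":   "\ufe92",
        "final":    "\ufe90",
    },
}

def presentation_form(base, position):
    """Apply a context-dependent mutation rule to a single letter.

    Falls back to the base character when no rule is defined, which
    illustrates why naive one-for-one character substitution fails
    for languages governed by such rules.
    """
    return FORMS.get(base, {}).get(position, base)

print(hex(ord(presentation_form("\u0628", "initial"))))  # 0xfe91
```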
In the context of providing programs to a wider international customer base, converting user interfaces and other aspects of a program from one language to another is a challenge for programmers and software vendors. Mere translation of words by looking them up in electronic dictionaries is usually inadequate and can lead to errors and unacceptable output, as language translation involves more than simple word or phrase substitution. Internationalization and localization encompass schemes intended to eliminate such errors and inconsistencies and to provide proper local forms of computer program output and interfaces.
Current systems do not handle conversion from one language to another well. For example, when implementing a computer program in different languages, programmers and software vendors must normally convert the user interface, output messages, etc. to the various languages manually to avoid errors. Mutation rules and other localization nuances make it impossible or impractical to convert computer output from one language to another by mere word or phrase substitution, such as is available using a dictionary. Therefore, improved and generalized ways to handle computer program output and data in multiple-language environments are needed.