A computer keyboard and display are common devices for providing computer input and output, respectively. A keyboard is language-specific such that the keys on the keyboard can be pressed to directly input only those characters in the keyboard's language that are assigned to those keys. For inputting other characters, a user has to press a combination of keys on the keyboard.
Many languages have alphabets that are too large to accommodate on a keyboard. Many languages have modifiers, which, when applied to a character in the language's alphabet, produce additional characters in the language's alphabet. Furthermore, the alphabets of many languages do not use characters to form words in the manner of the English language, but have a collection of characters that represent words. Thus, providing computer input in many languages is not as simple as pressing letter keys on the keyboard; it is an indirect process of pressing combinations of keys to generate characters that are not available as keys on the keyboard.
A display is also language-specific. In order to display characters in certain languages, certain scripts, fonts, and rendering directions can be selected for proper rendering of the characters. Generally, for rendering characters of a given language, specific fonts are installed and presented on the computer display.
Text in digital content is encoded in a variety of ways for capturing from an input method, storage in a data storage system, and rendering on a user interface in a desirable manner. For example, Unicode is a common encoding standard for encoding and handling of text characters from a set of language scripts in digital textual content.
Presently, Unicode encodes or codifies over one hundred thousand characters from over one hundred languages. Generally, in Unicode and other character encoding standards, each character represented in the standard is assigned an encoded value called a code point. For example, in the Unicode standard, UTF-8 encodes each code point using one to four bytes of eight bits each, with a single byte sufficing for characters in the ASCII range. UTF-16 similarly encodes each code point using one or two 16-bit code units, that is, two or four bytes.
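The relationship between a code point and its encoded size can be checked directly. The following is a minimal sketch, not part of any described system: the helper name `encoded_sizes` and the sample characters are chosen here purely for illustration. It reports, for one character, its code point and the number of bytes it occupies in UTF-8 and in UTF-16.

```python
def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character.

    `encoded_sizes` is a hypothetical helper name used only in this sketch.
    """
    code_point = ord(ch)
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-be" is used so that no byte order mark is counted in the length.
    utf16_len = len(ch.encode("utf-16-be"))
    return code_point, utf8_len, utf16_len


# Sample characters from different ranges (illustrative choices only).
for ch in ["A", "\u00e9", "\u0915", "\U0001F600"]:
    cp, n8, n16 = encoded_sizes(ch)
    print(f"U+{cp:04X}: {n8} byte(s) in UTF-8, {n16} byte(s) in UTF-16")
```

The byte count grows with the magnitude of the code point, so the storage size of a character depends on both the character and the encoding form chosen.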
A code point in an encoding standard is unique to a specific character in a specific language represented in that encoding standard. For example, in Unicode, a code point is a number that is conventionally written as a string of hexadecimal digits, such as U+0041 for the Latin capital letter A. Such an alphanumeric string can be generated on commonly used keyboard configurations, such as an English language QWERTY keyboard.
To provide a Unicode code point, the user or an application generally supplies an indication that the alphanumeric string following the indication is a Unicode code point and should be translated using a Unicode table to generate a character. For example, in some implementations, to provide a Unicode code point using a QWERTY keyboard, a user can press the ALT key, keep the ALT key depressed while entering the code point, and release the ALT key when the code point entry is complete.
An application called an input method application, or simply “input method” or “IMA”, intercepts the Unicode code point that the user enters. The IMA looks up a Unicode table to find the character that matches the code point that the user entered. The IMA supplies the character to a target application for which the user is supplying the input.
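The lookup step performed by an input method can be sketched as follows. This is an illustrative assumption rather than the behavior of any particular IMA: the function name `code_point_to_char` is hypothetical, the entry is assumed to be a hexadecimal string with an optional `U+` prefix, and Python's built-in `chr` stands in for the Unicode table lookup.

```python
def code_point_to_char(entry: str) -> str:
    """Translate a typed code point such as '0915' or 'U+0915' into a character.

    Hypothetical helper for this sketch; the entry is assumed to be hexadecimal.
    """
    text = entry.strip().upper()
    if text.startswith("U+"):
        text = text[2:]
    value = int(text, 16)  # raises ValueError if the entry is not hexadecimal
    if not 0 <= value <= 0x10FFFF:
        raise ValueError(f"{entry!r} is outside the Unicode code point range")
    # chr() plays the role of the table lookup in this sketch.
    return chr(value)


# The resulting character would then be handed to the target application.
print(code_point_to_char("U+0041"))
print(code_point_to_char("0915"))
```

Some real input mechanisms accept decimal rather than hexadecimal entries, so the base used here is a design choice of the sketch and not a general rule.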
A character variant is an alternate representation of a character within the given encoding standard. There are two types of character variants: Variants with Different Code Points (VDCP) and Glyphs with Same Code Point (GSCP). A variant is a VDCP type variant when the original character and the variant have different code points, with the same or different language tags, within the encoding standard. A variant is a GSCP type variant when the glyphs of the original character and the variant have the same code point but different language tags in the encoding standard, where a glyph is a manner of scripting or otherwise visually representing the character.
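The distinction between the two variant types can be expressed as a small classification routine. The sketch below is only one possible reading of the definitions above: the `Representation` class, the `classify_variant` function, and the sample character pairs and language tags are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Representation:
    """A character as identified by its code point and a language tag (hypothetical type)."""
    code_point: int
    language_tag: str


def classify_variant(original: Representation, variant: Representation) -> Optional[str]:
    """Return 'VDCP', 'GSCP', or None when the two representations are identical."""
    if original.code_point != variant.code_point:
        # Different code points, regardless of whether the language tags match.
        return "VDCP"
    if original.language_tag != variant.language_tag:
        # Same code point, different language tags: only the glyph differs.
        return "GSCP"
    return None


# Illustrative pairs, assumed here to be variants of one another.
traditional = Representation(0x570B, "zh-Hant")
simplified = Representation(0x56FD, "zh-Hans")
print(classify_variant(traditional, simplified))

chinese_form = Representation(0x76F4, "zh")
japanese_form = Representation(0x76F4, "ja")
print(classify_variant(chinese_form, japanese_form))
```

The routine only decides the type of an already known variant pair; determining whether two representations are variants of each other in the first place would require additional data that this sketch does not model.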