1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for converting byte sequences in IMEs into Unicode code points.
2. Description of Related Art
“Unicode” is a standard encoding format for characters. Computers internally operate only with numbers. Computers store letters and other characters by assigning a number to each character. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. One example of a well-known encoding system is the American Standard Code for Information Interchange, known as ‘ASCII.’ Another well-known encoding system is the IBM system known as the Extended Binary-Coded Decimal Interchange Code, or ‘EBCDIC.’ Other encoding formats include the CCITT encoding system, of the Comité Consultatif International Téléphonique et Télégraphique, and the International Organization for Standardization system known as ‘ISO 8859-1.’
No single encoding system, or encoding format, however, could contain enough characters. The European Union, for example, alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. In addition, these encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters or use different numbers for the same character. Unicode provides a unique encoding number for every character, independent of the platform, independent of the program, independent of the language.
Unicode is an encoding system, or encoding format, for characters. Roughly speaking, characters represent indivisible marks that people use in writing systems to convey information. In western alphabets, for example, the Latin small letter ‘a’ is the name of a character. Characters encoded by Unicode include, not only marks used in writing, but also formatting marks, control characters, and characters usually combined with other characters such as diacritical marks or vowel marks. Formatting marks give an indication of how adjacent characters are to be rendered but do not themselves correspond to what one ordinarily thinks of as a written mark. Control characters have meaning in computing but do not correspond to written marks.
A Unicode “code point” is a numeric value assigned to a character. In the Unicode encoding format, each character receives a unique Unicode code point. Unicode code points have values in the hexadecimal range 000000 to 10FFFF, requiring therefore 21 bits of computer storage for a single Unicode code point. Computers tend to administer computer storage in terms of 8-bit bytes, so it is well to explain a little further how Unicode code points are encoded.
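The 21-bit figure can be checked directly. The following is a minimal sketch in Java, not part of any described method; the class name CodePointRange and the helper bitsNeeded are hypothetical names introduced only for illustration, while Character.MAX_CODE_POINT and Character.isValidCodePoint are standard Java library members.

```java
final class CodePointRange {
    /** Number of bits needed to store a non-negative value (hypothetical helper). */
    static int bitsNeeded(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    public static void main(String[] args) {
        int max = Character.MAX_CODE_POINT; // hexadecimal 10FFFF
        System.out.printf("largest code point: U+%X%n", max);
        System.out.println("bits needed: " + bitsNeeded(max));

        // Values beyond 10FFFF are not Unicode code points.
        System.out.println("0x110000 valid? " + Character.isValidCodePoint(0x110000));

        // The Latin small letter 'a' has the code point hexadecimal 61.
        System.out.printf("code point of 'a': U+%04X%n", "a".codePointAt(0));
    }
}
```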
There are three kinds of Unicode encoding formats defined in standards commonly known as UTF-8, UTF-16, and UTF-32. UTF-8 represents Unicode code points in “code units” of 8 bits. UTF-16 represents Unicode code points in code units of 16 bits. UTF-32 represents Unicode code points in code units of 32 bits. In UTF-32, therefore, each Unicode code point is stored in a single code unit. For emphasis and clarity, in this specification, “code units” are often referred to as “character code units.”
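To illustrate the difference in code unit size, the following Java sketch counts how many code units one code point occupies in each of the three encoding formats. The class and method names are hypothetical and chosen only for this illustration; the UTF-8 count is taken from the standard library encoder, the UTF-16 count from Character.charCount, and the UTF-32 count is always one by definition.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitCounts {
    /** Number of 8-bit code units used by UTF-8 for one code point. */
    static int utf8Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of 16-bit code units used by UTF-16 for one code point. */
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** UTF-32 stores every code point in exactly one 32-bit code unit. */
    static int utf32Units(int codePoint) {
        return 1;
    }

    public static void main(String[] args) {
        // Sample code points: 'a', e-acute, Hiragana 'a', musical symbol G clef.
        int[] samples = {0x61, 0xE9, 0x3042, 0x1D11E};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    cp, utf8Units(cp), utf16Units(cp), utf32Units(cp));
        }
    }
}
```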
For UTF-8 and UTF-16, however, a Unicode representation of a character requires both one or more code units and a rule describing a mapping between sequences of code units and Unicode code points. More particularly, in UTF-8, code points in the range hexadecimal 0000 through 007F are stored in a single code unit (one byte). Other code points in UTF-8 are represented by a sequence of two to four code units, each one byte long. In UTF-16, code points in the range hexadecimal 0000 through FFFF are stored in a single 16-bit code unit. Other code points in UTF-16 are represented by a pair of surrogates, each stored in one code unit.
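The mapping rules for UTF-8 and UTF-16 can be sketched in a few lines of Java. This is an illustrative hand-written encoder following the bit layouts of the Unicode Standard, not the conversion method of any particular IME; the class name EncodingFormsSketch and its methods are hypothetical. In practice the standard library (for example String.getBytes and Character.toChars) performs the same conversions.

```java
final class EncodingFormsSketch {
    /** Encodes one code point as a sequence of one to four UTF-8 code units. */
    static byte[] toUtf8(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {            // one byte, same value as ASCII
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {           // two bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {          // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {          // four bytes: 11110xxx followed by three 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    /** Encodes one code point as one UTF-16 code unit or a surrogate pair. */
    static char[] toUtf16(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a code point");
        }
        if (cp <= 0xFFFF) {
            return new char[] {(char) cp};
        }
        int v = cp - 0x10000;        // 20 bits split into two 10-bit halves
        return new char[] {
            (char) (0xD800 | (v >> 10)),     // high surrogate
            (char) (0xDC00 | (v & 0x3FF))};  // low surrogate
    }

    public static void main(String[] args) {
        int cp = 0x1D11E; // a code point above hexadecimal FFFF
        StringBuilder out = new StringBuilder("UTF-8:");
        for (byte b : toUtf8(cp)) {
            out.append(String.format(" %02X", b & 0xFF));
        }
        out.append("  UTF-16:");
        for (char c : toUtf16(cp)) {
            out.append(String.format(" %04X", (int) c));
        }
        System.out.println(out);
    }
}
```

For the code point hexadecimal 1D11E, the sketch yields the UTF-8 bytes F0 9D 84 9E and the UTF-16 surrogate pair D834 DD1E.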
The single-code-unit mappings in UTF-8, hex 00 through 7F, correspond to the original 128 values of traditional ASCII and in fact have the same values as the corresponding ASCII codes, which keeps UTF-8 compatible with existing ASCII text. Although UTF-32 is the most powerful standard of Unicode, it is probably worthwhile to point out that UTF-16 is almost identical in representational power with UTF-32, because, as a practical matter, the frequency of characters with code points larger than hexadecimal FFFF is small. Readers interested in more detail regarding Unicode or multi-code unit Unicode encodings are directed to the book that sets forth the current standard, “The Unicode Standard, Version 3.0,” ISBN 0-201-61633-5, by the Unicode Consortium, and to the Unicode Consortium's website at http://www.unicode.org.
By use of Unicode, Java supports multilingual applications. Java uses Unicode for storage of character data. Developers can create single binary applications that provide basic enablement for a wide variety of scripts, including Latin, Greek, Japanese, Korean, Chinese, and so on.
Java Input Method Editors (“IMEs”) are software components that interpret user operations such as typing keys, speaking, or writing using a pen device to generate text input for applications. The most common input methods are the ones that let users type text in Chinese, Japanese, or Korean, languages that use thousands of different characters, on a regular-sized keyboard. The text is typed in a form that can be handled by regular-sized keyboards, for example, in a Romanized form, and then converted into the intended form. Typically a sequence of several characters needs to be typed and then converted in one group, and conversion may have to be retried because there may be several possible translations.
While this “composition” process is going on, the text, not having been officially handed off to the application, still logically belongs to the IME, but nevertheless needs to be displayed to the user. A “Java Input Method Framework” or “IMF” cooperates with an IME to provide at least two ways to display composition to a user. The IMF enables text editing components to display text in the context of the document that it will eventually belong to, but in a style, such as highlighted or underscored, that indicates that the text still needs to be converted or confirmed by the IME. This is called “on-the-spot editing.”
An IMF also provides a separate alternative window to display text for applications not equipped to deal with the text until it is confirmed and officially handed over to the application. This second approach is called “root-window editing.” Readers interested in more detail regarding Java IMEs are directed to the “Input Method Framework Design Specification” published by Sun Microsystems, Inc., at http://java.sun.com/products/jdk/1.2/docs/guide/intl/spec.html.
Usefulness of IMEs in software development environments, however, is not without difficulties. There are few tools for verifying that a Java application correctly handles arbitrary Unicode character data. It is typical in Java development environments for a developer or a tester to be required to establish a national language environment, for example, in a Japanese version of Windows, in order to ascertain whether an application supports a particular script. Discovery of enablement problems therefore is delayed until translation verification testing or system verification testing. In addition, enablement problems are difficult to debug because developers must have the correct national language environment in order to reproduce problems. Moreover, some Unicode characters are not available on standard keyboard layouts, although at least some Unicode characters are significant for legacy purposes and data interchange.
It would be advantageous to have a Java IME capable of providing testers, developers, and users with a mechanism for entering Unicode characters into Java applications, any Java application, independent of any underlying national language environment in the operating system of the computer on which the Java application is installed and independent of any particular keyboard layout. Such an IME would assist in identification of enablement problems early in the software development cycle and provide a useful mechanism for recreating enablement problems.