This invention relates in general to inputting text of an Asian language for operation by a program module, such as a word processing program, and in particular to using an Input Method Editor (IME) to convert an input string representing text of the Asian language into the proper characters for that language.
Providing text to a program module, such as a word processing program, is straightforward when a written language has one small character set. For example, the English language system uses twenty-six alphabet characters. Typical keyboards for conventional desktop computers have approximately 101 keys, so each English language alphabet character is assigned to a different key. To enter a word into an electronic document, an author depresses the keys that correspond to the letters of the words. The keystrokes are sent from the keyboard to the word processing program running on the computer.
In contrast to the English language system, some language systems, including East Asian languages, such as Japanese, Chinese, and Korean, have significantly more characters than there are keys on a keyboard. For example, the Japanese language system uses thousands of pictographic, Chinese-derived Kanji characters. The large number of Kanji characters precludes assigning each Kanji character to a different key. The process is further complicated because Japanese text also incorporates three other character sets. The most common is Hiragana, a character set of 46 phonetic syllable characters. Katakana (46 phonetic syllable characters) and Romaji (the 26 character Latin alphabet) are used for words whose origins are neither Japanese nor Chinese. Thus, Japanese computer users require front-end input processing to select the desired character from the appropriate character set for entry into an electronic document. Similarly, other East Asian language computer users, such as a Chinese user, also require front-end input processing to support the entry of characters into an electronic document.
Focusing on electronic document processing issues for Japanese users, typists can work modally, switching from character set to character set and specifying characters by a series of one or more keystrokes. However, the sheer size of the Kanji character set makes this approach impractical for typists to master. Instead, typists use a front-end processor, commonly known as an Input Method Editor (IME), to produce Japanese text from phonetic input. Typically, these front-end input processors convert Romaji alphabet strings into their sound-alike kana (Hiragana and/or Katakana) characters, or accept text directly entered in a kana character set, and then process the kana into Japanese text in a separate step.
Japanese IME conversion is error-prone for two main reasons: homophones and ambiguous word breaks. First, Japanese, like English, contains words that sound alike and might even be appropriate in the same context; for an English example, “I want these two” and “I want these too.” Second, Japanese typists typically do not delimit words; the IME must decide how to group the kana characters into words. Because of this possibility for conversion error, the IME must allow the user to choose among alternate conversions after she has proofread the IME's conversion.
From a user's perspective, the traditional method for Japanese IME operation involves three basic steps. First, the user types a phonetic phrase, in kana or Romaji. This phrase is typically very short because the typist knows that shorter phrases are more successfully converted. Second, the user stops typing and hits the “convert” key. Third, the user proofreads the conversion.
If the conversion is inaccurate, the user can depress the convert key again. The IME reconverts to the next most likely character set. If this is still not the desired character set, the user hits the convert key a third time. On the third conversion attempt, the IME presents a prioritized list of possible conversions. If the desired conversion is absent from the list, the user might manually select desired Japanese pictographs using another conversion mechanism. Once satisfied, the user approves the conversion and returns to typing. The converted text is then given “determined” status, i.e., the input string is discarded and the converted text is maintained.
This IME model has two main drawbacks: reduced typing speed and increased learning time. Speed is compromised because the typist must use extra keystrokes to convert text. Additionally, the input rhythm is broken because the typist must proofread at each conversion or lose the opportunity to choose among alternate conversions. Learning time is increased because prior IME systems typically require user training and experience to gain optimum performance from the IME.
The “IME '97” front-end input processor marketed by Microsoft Corporation of Redmond, Washington offers an improved solution. With this option, text is automatically converted when the IME detects a viable phrase, and automatically determined if the user continues typing for several lines without converting. However, alternate conversions are unavailable for determined text, as in the traditional IME model described above.
Accordingly, there is a need in the art for a method for an IME that operates as an automated background process and avoids the editing difficulties of “determined” text. There is a further need for a background input processor for converting kana to Japanese text and for generating alternate conversions for converted text positions to support efficient error correction.
Generally described, the present invention meets the needs of Asian computer users for both background text processing and convenient and flexible error corrections. An Input Method Editor (IME) can convert an input string representing text of an East Asian language, such as Japanese, Chinese or Korean, into the proper characters for that language. The present invention is equally applicable to other large-character-set languages comprising nonphonetic characters.
The present invention provides a computer-implemented method for converting phonetically-coded input into the proper characters of a selected language for use by a program module, such as a word processor, running on a computer system. The input string is converted into a language text string automatically, i.e., without explicit conversion events prompted by the user.
The present invention also can support a reconversion operation to address inaccurate text conversions. For example, when two or more distinct phrases contain the same phonetic syllables, the automatic conversion may produce an incorrect section of text. The user may correct these conversion mistakes by accessing alternate conversions of any section of text at any time. When text is selected for reconversion, a corresponding all-phonetic string is identified. This phonetic string is used to generate the list of alternate conversions for the selected text. To produce a corrected conversion, the user may select among the alternate conversions provided, or perform a manual conversion by explicitly selecting characters.
For an IME system compatible with the Japanese language, phonetically-coded Japanese character strings are typically entered in Romaji (the same character set used by English) and immediately converted to kana, usually Hiragana. For example, a user typing the letter “k” will see “k” displayed on the screen of a display device. When she follows “k” with “a”, forming the syllable “ka”, the corresponding kana character replaces the “k” in the user's display device. The user sees a constant shift of Romaji to kana characters on the display device. The phonetic input in its intermediate, pre-conversion state will be referred to as a phonetically-coded string, such as a kana string or kana characters. The invention, however, is also applicable to non-Romaji phonetic input methods. For example, voice recognition software and kana keyboards could both produce comparable phonetically-coded strings.
The present invention can initiate the conversion of a kana string, called the “active portion,” as each kana character is received. An “active portion” comprises phonetically-coded characters, such as kana characters, corresponding to the text immediately behind the insertion point, where the insertion point is typically represented by the cursor of a word processing program. An analysis is conducted to determine how much of the active portion may be confidently converted to Japanese text, and what amount should remain in kana form. A conversion is sufficiently confident when it exceeds a predetermined threshold for predicted accuracy.
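The Romaji-to-kana replacement described above can be illustrated with a minimal sketch. The table `ROMAJI_TO_KANA` and the function `type_key` are hypothetical names introduced only for this illustration; the table covers a handful of syllables, and the sketch assumes that the trailing run of Latin letters on the display is the syllable still being typed.

```python
# Hypothetical, deliberately tiny Romaji-to-Hiragana table (an assumption for
# illustration; a real IME covers the full syllabary and many special cases).
ROMAJI_TO_KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
}


def type_key(display: str, key: str) -> str:
    """Return the displayed text after one Romaji keystroke."""
    text = display + key
    # The trailing run of ASCII letters is the syllable still being typed.
    start = len(text)
    while start > 0 and text[start - 1].isascii() and text[start - 1].isalpha():
        start -= 1
    pending = text[start:]
    kana = ROMAJI_TO_KANA.get(pending)
    # Replace the pending Romaji only once it spells a complete syllable.
    return text[:start] + kana if kana is not None else text


display = ""
for key in "kana":
    display = type_key(display, key)
    print(display)
```

Typing “k”, “a”, “n”, “a” in this sketch displays “k”, then “か”, then “かn”, then “かな”, which mirrors the constant shift of Romaji to kana that the user sees.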
The automatic conversion of the active portion may be more particularly described as the identification of the longest substring within the active portion that is eligible for conversion. A substring is eligible for conversion if it both includes the character positioned farthest from the insertion point and exceeds a threshold probability of accurate conversion. The threshold typically decreases as the length of the substring decreases. The automatic conversion can be assisted by analysis of the context of the active portion, i.e., the converted text on one or both sides of the active portion.
From the user's perspective, kana shifts to Japanese text at a point behind the cursor in response to a conversion operation. Also, converted text inside the active portion may shift to a more probable conversion. In contrast, text outside the active portion does not change, unless selected for reconversion.
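The search for the longest eligible substring can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the insertion point is taken to sit at the end of the active portion, so every candidate begins at the first character; `threshold_for`, `toy_score` and `TOY_MODEL` are hypothetical stand-ins with made-up values for whatever probability model an IME would actually use; and context analysis is omitted.

```python
from typing import Callable, Optional, Tuple

# Hypothetical scorer output: (converted text, estimated probability of accuracy).
Scorer = Callable[[str], Tuple[str, float]]


def threshold_for(length: int) -> float:
    """Illustrative threshold that is lower for shorter substrings (assumed values)."""
    return min(0.9, 0.5 + 0.1 * length)


def longest_eligible_substring(active: str, score: Scorer) -> Optional[Tuple[int, str]]:
    """Return (length, converted text) for the longest eligible substring, or None.

    Every candidate starts at index 0, the character farthest from the
    insertion point, which is assumed to sit at the end of the active portion.
    """
    for end in range(len(active), 0, -1):  # try the longest candidate first
        converted, confidence = score(active[:end])
        if confidence > threshold_for(end):
            return end, converted
    return None  # nothing is confident enough; everything stays in kana


# Toy scorer standing in for a real language model (made-up numbers).
TOY_MODEL = {"にほんご": ("日本語", 0.95), "にほん": ("日本", 0.85)}


def toy_score(kana: str) -> Tuple[str, float]:
    return TOY_MODEL.get(kana, (kana, 0.0))


result = longest_eligible_substring("にほんごをべ", toy_score)
print(result)
```

With the toy scorer, the first four kana of the active portion are converted and the remaining two stay in kana form until more input arrives.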
Additionally, the active portion can be fully converted whenever the user moves the insertion point, typically shown by a cursor. The entire active portion kana string is converted to the most probable character set for that string in response to a change in the location of the insertion point. The conversion can be assisted by analysis of the context of the active portion, i.e., the converted text on one or both sides of the active portion.
A conversion error can be corrected in response to selecting the converted character or characters corresponding to the error. In response to a reconvert command, a kana string corresponding to the selected text is identified. This kana string is used to generate alternate conversions for the selected text. These alternate conversions can be presented to the user in an alternate conversion list.
The user has several options for completing the conversion correction. First, a selection made from the alternate conversion list can replace the original conversion. Second, the kana string used to generate the alternate conversions can be edited to produce a new alternate conversion list. Third, individual characters can be specified to produce a “manual” reconversion. For example, individual characters can be specified by selecting characters from a dictionary program module or by drawing selected characters with an input device, such as a mouse in a writing recognition program.
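The first two correction options can be illustrated with a short sketch. The names `CANDIDATES`, `ConvertedSpan`, `alternate_conversions` and `choose_alternate` are hypothetical, the dictionary entries are illustrative only, and the sketch assumes the kana string for the selected text has already been identified. Manual reconversion, the third option, is not shown.

```python
from dataclasses import dataclass, replace
from typing import Dict, List

# Hypothetical homophone dictionary: kana reading -> candidate conversions,
# most probable first (illustrative entries only).
CANDIDATES: Dict[str, List[str]] = {
    "はし": ["橋", "箸", "端"],
    "はな": ["花", "鼻"],
}


@dataclass(frozen=True)
class ConvertedSpan:
    text: str     # converted text as it appears in the document
    reading: str  # kana string identified for that text


def alternate_conversions(span: ConvertedSpan) -> List[str]:
    """Alternates generated from the span's kana, excluding the current text."""
    return [c for c in CANDIDATES.get(span.reading, []) if c != span.text]


def choose_alternate(span: ConvertedSpan, choice: str) -> ConvertedSpan:
    """Option one: replace the original conversion with a listed alternate."""
    if choice not in alternate_conversions(span):
        raise ValueError(f"{choice!r} is not in the alternate conversion list")
    return replace(span, text=choice)


selected = ConvertedSpan(text="橋", reading="はし")
print(alternate_conversions(selected))           # alternates for the same kana
corrected = choose_alternate(selected, "箸")      # option one
edited = replace(selected, reading="はな")        # option two: edit the kana
print(alternate_conversions(edited))             # new list for the edited kana
```

Keeping the kana reading alongside the converted text is the design choice that makes both options cheap in this sketch: the alternate list is always regenerated from the reading rather than from the displayed characters.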
In both maintaining the active portion, and in creating an alternate conversion list, an all-kana string is identified. If some or all of the original input kana characters are not available on a memory storage device coupled to the computer system, those kana characters can be created through a reverse-conversion process. For example, an IME can locate all or part of the phonetic string on a memory storage device, generate any missing phonetic characters using a reverse-conversion operation, and produce a complete phonetic string for the active portion.
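Assembling a complete phonetic string from stored kana plus reverse conversion might look like the sketch below. The segment layout (converted text paired with an optional stored reading), the table `REVERSE_READINGS` and both function names are assumptions made for illustration; a real reverse-conversion operation would have to choose among several possible readings rather than consult a fixed table.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical reverse-conversion table: converted text -> one kana reading.
REVERSE_READINGS: Dict[str, str] = {"日本語": "にほんご", "橋": "はし"}

# A segment pairs converted text with its stored kana, or None if the
# original input kana are no longer available.
Segment = Tuple[str, Optional[str]]


def reverse_convert(text: str) -> str:
    """Recreate kana for text whose original input was not kept.

    Text missing from the table is passed through unchanged (an assumption;
    it is already kana in this toy example).
    """
    return REVERSE_READINGS.get(text, text)


def phonetic_string(segments: List[Segment]) -> str:
    """Build an all-kana string, using stored kana wherever it exists."""
    parts = []
    for text, stored in segments:
        parts.append(stored if stored is not None else reverse_convert(text))
    return "".join(parts)


segments = [("日本語", None), ("を", "を"), ("橋", "はし")]
print(phonetic_string(segments))
```

Here the first segment has lost its input kana and is regenerated by reverse conversion, while the other two reuse what was stored.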
These and other aspects, features and advantages of the present invention may be more clearly understood and appreciated from a review of the following detailed description of the disclosed embodiments and by reference to the appended drawings and claims.