Chinese, Japanese, and Korean (CJK) writing systems each employ large numbers of characters that are either of Chinese origin or that mimic Chinese characters in appearance. For this reason, various strategies have been devised to enable Chinese-type characters to be input into a computer (or looked up) using a keyboard having a limited number of keys. Such conventional input means are typically referred to as input methods. Input methods have been designed for a variety of input devices, such as keyboards, graphic tablets with styluses, and numeric keypads.
The operation of keyboard-based input methods for inputting a target character is typically based on one of three main principles: 1) typing a sequence of keys corresponding to shapes that the target character contains; 2) typing a sequence of keys corresponding to the sound of the target character or word; or 3) typing a sequence of keys corresponding to the strokes that constitute the basic form of the target character. Once the sequence of keys has been typed, a list of candidate characters or words is typically displayed, such as in a text application (e.g., word processor or electronic dictionary) or in a floating input window. A user can then select a desired candidate character or word, usually by typing a number corresponding to the candidate desired, and the character or word becomes part of the text being written. Sometimes, morphological or syntactic information is used by conventional systems in an attempt to reduce the candidate list or to “guess” the intended word.
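By way of illustration only, the type-then-select flow described above for a sound-based input method can be sketched as follows. The lexicon contents and the function names `candidates` and `select` are hypothetical and are not drawn from any particular conventional product; a real input method would draw on a far larger data source.

```python
# Hypothetical toy lexicon mapping a typed key sequence (toneless pinyin)
# to an ordered list of candidate characters.
LEXICON = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "shi": ["是", "十", "时", "事"],
}


def candidates(keys):
    """Return the candidate list for a typed key sequence (empty if unknown)."""
    return LEXICON.get(keys.lower(), [])


def select(keys, number):
    """Pick a candidate by its 1-based number, as in a candidate window."""
    options = candidates(keys)
    if not 1 <= number <= len(options):
        raise ValueError(f"no candidate numbered {number} for {keys!r}")
    return options[number - 1]


if __name__ == "__main__":
    # Display the numbered candidate list, then choose the third entry.
    for i, ch in enumerate(candidates("ma"), start=1):
        print(i, ch)
    print(select("ma", 3))
```

In this sketch the user types "ma", is shown five numbered candidates, and types "3" to commit the third one to the text.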
Alternative forms of input include optical character recognition, in which text on a printed page is scanned in and automatically interpreted; handwriting recognition, in which an input stylus is used to draw characters by hand, at which point recognition software automatically interprets the handwritten strokes and converts them into characters; and speech-to-text conversion, in which spoken audio data is converted to text. It is also possible to convert text to speech using the appropriate software.
Although advances have been made in keeping with the development of new input technologies, a large number of deficiencies may still remain. One or more embodiments of the present invention were conceived in light of deficiencies, problems and limitations in conventional input methods and in other linguistic services, as described below.
Electronic dictionaries and input methods generally do not give the user control over the data sources that store the lexicon of words employed. The resulting dearth of lexical data can make it difficult to find or enter certain words. Place names, proper names, and technical terms, for example, are frequently absent from such lexicons and can prove frustrating to input.
The lexical data sources used for input methods and electronic dictionaries are generally very limited and usually predetermined by the vendor. With conventional systems or input methods it may not be possible to combine data sources from different vendors, nor is it possible to select the kind of data that will be displayed during input. Also, conventional systems may not accommodate lexical data sources having different data structures. Entries are typically displayed verbatim as a monolithic text block as contained in the original dictionary that the electronic dictionary or input method is linked to.
Conventional lexical service systems may also lack modularity. Specifically, conventional systems may not readily enable one to access or link to third party linguistic services of a different kind. Thus, there may be no connection, for example, between input methods and dictionaries, or between speech-to-text modules and input methods.
Conventional systems and methods may provide little or no easy means to quickly check the correctness of a character during input. Some conventional programs do indicate characters or words that are prone to confusion, but the highlighted words are pre-marked. One may not be free to easily access a character or word dictionary of one's choosing, nor is it easy to switch from one lexical source to another to obtain different data.
Conventional systems generally permit character search by radical or phonetic pronunciation. Searching for characters this way can be cumbersome. Viable alternatives to conventional character lookup systems and methods are discussed in a co-pending patent application entitled “System and Method For Classification and Retrieval of Chinese-Type Characters and Character Components” filed by Warren Daniel Child on the same date as the present application, which is referred to herein as the “character lookup application” and which is incorporated herein by reference in its entirety.
Conventional systems and methods may provide little or no way to easily distinguish input candidates by word type. When faced with many homonymous word candidates, a user may typically be required to look through a long list of candidates to pick the target word he or she wants. There may be no way to readily distinguish dissimilar words, as between different types of nouns (common or proper) or different parts of speech, even though doing so would be a great aid to the user in choosing the word desired.
New word (user word) registration functions in conventional systems or software are generally deficient. Some systems may nominally provide users with the ability to register their own words. The process can often be tedious, however, and the user is typically required to manually enter words on their own, with little or no help from the system.
Conventional systems and methods typically provide insufficient control over automatic parsing and registration functions. Although some systems identify novel character combinations not included in their dictionaries, they generally do not distinguish words from phrases and do not enable the user to edit the final registration entry. As a result, false candidates may begin to clutter the system, making text input more tedious.
Conventional systems and methods provide little or no flexibility in the degree of tone marking during search and input. Foreign users of Chinese input methods often struggle with search and input because they may be uncertain of a word or character's tone; even native speakers can have trouble because of dialectal differences in tone realization. Omitting tones entirely, however, often generates too many candidates. For example, U.S. Pat. No. 5,594,642 appears to describe an input method framework that would permit toned or toneless input, but does not appear to describe how to accomplish this, suggesting instead that developers handle the issue. Further, the specification of U.S. Pat. No. 5,594,642 does not appear to provide a mapping to the often useful approach of using partial tone designation as disclosed in one or more embodiments of the present invention.
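For illustration only, the difference between toneless, fully toned, and partially toned queries can be sketched as follows. This is a minimal sketch under stated assumptions and is not the mapping of the embodiments or of U.S. Pat. No. 5,594,642: the lexicon, the convention of space-separated syllables with an optional trailing tone digit, and the function names `syllable_matches` and `lookup` are all hypothetical.

```python
# Hypothetical toy lexicon: word -> pinyin syllables with tone digits.
LEXICON = {
    "时间": ["shi2", "jian1"],
    "事件": ["shi4", "jian4"],
    "实践": ["shi2", "jian4"],
    "世间": ["shi4", "jian1"],
}


def syllable_matches(query, entry):
    """Match one syllable: letters must agree; tone must agree only if given."""
    if query[-1].isdigit():
        return query == entry
    return query == entry.rstrip("12345")


def lookup(query):
    """Return words whose syllables match a space-separated query.

    Each query syllable may carry a tone digit or omit it, so a query can
    be toneless, fully toned, or partially toned.
    """
    parts = query.lower().split()
    return [
        word
        for word, syllables in LEXICON.items()
        if len(parts) == len(syllables)
        and all(syllable_matches(q, s) for q, s in zip(parts, syllables))
    ]


if __name__ == "__main__":
    print(lookup("shi jian"))    # toneless: all four words
    print(lookup("shi2 jian"))   # partial: tone known for first syllable only
    print(lookup("shi2 jian4"))  # fully toned: a single word
```

In this sketch, supplying the tone of only one syllable halves the candidate list even though the user remains uncertain about the other syllable.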
Conventional systems and methods may provide little or no control over the encoding employed. Also, conventional systems and methods may provide little or no ability to access lexical data in contexts other than the original one intended. Thus, for example, input methods and dictionaries cannot be used to mouse-over a word on the screen and obtain lexical information about it. As a result, while a significant quantity of data may be stored in a conventional system, it cannot be readily accessed to find out information about words already entered into text. This lack of accessibility can be a waste of potential resources.
Conventional systems and methods may not provide coherent interfaces so that external natural language processing (NLP) systems can share lexical data. Consequently, handwriting recognition, optical character recognition, speech-to-text conversion, text-to-speech conversion, and keyboard input conventionally all tend to operate as separate systems, and each tends to have its own data store. Also, in contrast to embodiments described herein, conventional systems and methods may not provide a system for revenue sharing that enables an OS developer, lexical data providers, and an input method (IM) developer to cooperate and share jointly in revenue generated by implementing a synthesized system enabling the modular incorporation of various forms of lexical data from different sources. Furthermore, conventional systems and methods may not provide a level of data security that would be necessary or desirable to implement a revenue sharing system, as described above.