1. Field of the Invention
The present invention relates in general to linguistic data, and in particular to storage and use of the linguistic data for text processing and text input.
2. Description of the State of the Art
The growing use of mobile devices and different types of embedded systems challenges the developers and manufacturers of these devices to create products that require minimal memory usage, yet perform well. A key element of these products is the user interface, which typically enables a user to enter text which is processed by the product.
One application of linguistic data is to facilitate text entry by predicting word completions based on the first characters of a word that are entered by a user. Given a set of predictions that are retrieved from the linguistic data, the user may select one of the predictions, and thus not have to enter the remaining characters in the word.
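The prefix-based prediction described above can be illustrated with a minimal sketch. It assumes a small sorted word list with invented frequency counts; the names `WORD_FREQUENCIES`, `SORTED_WORDS`, and `predict_completions` are hypothetical and do not describe any particular existing system.

```python
from bisect import bisect_left

# Hypothetical linguistic data: words with made-up frequency counts.
WORD_FREQUENCIES = {
    "the": 500, "then": 120, "there": 90,
    "they": 200, "theory": 15, "text": 60,
}
# A sorted list lets a binary search find the first word with a given prefix.
SORTED_WORDS = sorted(WORD_FREQUENCIES)


def predict_completions(prefix, limit=3):
    """Return up to `limit` stored words that start with `prefix`,
    ordered from most to least frequent."""
    start = bisect_left(SORTED_WORDS, prefix)
    matches = []
    for word in SORTED_WORDS[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    # Rank by descending frequency, breaking ties alphabetically.
    matches.sort(key=lambda w: (-WORD_FREQUENCIES[w], w))
    return matches[:limit]


print(predict_completions("the"))  # ['the', 'they', 'then']
print(predict_completions("tex"))  # ['text']
```

In this sketch, a prefix that matches no stored word yields an empty list, so only words present in the stored data can ever be predicted.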
The prediction of user input is especially useful when included in a mobile device, since such devices typically have input devices, including keyboards, that are constrained in size. Input prediction minimizes the number of keystrokes required to enter words on such devices.
Input prediction is also useful when text is entered using a reduced keyboard. A reduced keyboard has fewer keys than characters that can be entered, so a given sequence of keystrokes may correspond to more than one word. A system that uses linguistic data for input prediction allows the user to easily resolve such ambiguities. Linguistic data can also be used to disambiguate individual keystrokes that are entered using a reduced keyboard.
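The ambiguity of a reduced keyboard can be shown with a short sketch. It assumes the conventional telephone keypad letter assignment as the reduced keyboard, which the text above does not specify, and uses invented frequency counts; `KEYPAD`, `key_sequence`, and `disambiguate` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Assumed reduced keyboard: the conventional telephone keypad layout.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
CHAR_TO_KEY = {ch: key for key, chars in KEYPAD.items() for ch in chars}

# Hypothetical linguistic data: words with made-up frequency counts.
WORD_FREQUENCIES = {"good": 300, "home": 250, "gone": 100, "hood": 20, "hello": 80}


def key_sequence(word):
    """Map a word to the sequence of keys that would be pressed to enter it."""
    return "".join(CHAR_TO_KEY[ch] for ch in word)


# Index the stored words by key sequence; several words may share one sequence.
INDEX = defaultdict(list)
for stored_word in WORD_FREQUENCIES:
    INDEX[key_sequence(stored_word)].append(stored_word)


def disambiguate(keys):
    """Return the stored words matching `keys`, most frequent first."""
    candidates = INDEX.get(keys, [])
    return sorted(candidates, key=lambda w: (-WORD_FREQUENCIES[w], w))


print(disambiguate("4663"))   # ['good', 'home', 'gone', 'hood']
print(disambiguate("43556"))  # ['hello']
```

Here the single key sequence 4663 matches four stored words, and the frequency data is what allows the most likely candidate to be offered first.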
Existing solutions for storage of linguistic data used for text input and processing typically rely on hash tables, trees, linguistic databases or plain word lists. The number of words covered by these linguistic data formats is limited to the words which have been stored.
The linguistic data that is used in existing text input prediction systems is typically derived from a body of language, either text or speech, known as a corpus. A corpus has uses such as analysis of language to establish its characteristics, analysis of human behavior in terms of use of language in certain situations, training a system to adapt its behavior to particular linguistic circumstances, verifying empirically a theory concerning language, or providing a test set for a language engineering technique or application to establish how well it works in practice. There are national corpora of hundreds of millions of words, and there are also corpora which are constructed for particular purposes. An example of a purpose-specific corpus is one composed of recordings of car drivers speaking to a simulation of a voice-operated control system that recognizes spoken commands. An example of a national corpus is the British National Corpus, which covers the English language.