Electronic devices and the ways in which users interact with them are evolving rapidly. Changes in size, shape, input mechanisms, feedback mechanisms, functionality, and the like have introduced new challenges and opportunities relating to how a user enters information, such as text. Statistical language modeling can play a central role in input prediction and/or recognition, such as keyboard input prediction and speech (or handwriting) recognition. Effective language modeling can thus play a critical role in the overall quality of an electronic device as perceived by the user.
In some examples, statistical language modeling is used to estimate the probability of occurrence in the language of possible strings of n words. Given a vocabulary of interest for an expected domain of use, the probability of occurrence of a string of n words can be determined using a word n-gram model, trained to provide the probability of the current word given the n−1 previous words. Training data can be obtained from machine-readable text databases containing documents representative of the expected domain.
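A word n-gram model of this kind can be sketched, for the simplest non-trivial case of n = 2 (a bigram model), as counting how often each word follows each preceding word and normalizing the counts into conditional probabilities. The following is a hypothetical minimal illustration, not drawn from any particular implementation; the corpus, the `<s>` start marker, and the function name are illustrative assumptions.

```python
from collections import defaultdict

def train_bigram_model(corpus):
    """Estimate P(current word | previous word) by maximum likelihood.

    `corpus` is a list of token lists; "<s>" marks the start of each
    sentence. This is an illustrative sketch only: it performs no
    smoothing, so any word pair unseen in training gets no probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    # Normalize each row of counts into a conditional distribution.
    model = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        model[prev] = {w: c / total for w, c in nexts.items()}
    return model

corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
model = train_bigram_model(corpus)
print(model["the"]["cat"])  # 2 of the 3 occurrences of "the" are followed by "cat"
```

For larger n, the conditioning context is simply the previous n−1 words rather than one; the counting-and-normalizing scheme is otherwise the same.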
Due to the finite size of such databases, however, many n-word strings occur only infrequently in the training data, yielding unreliable prediction results for all but the smallest values of n. Relatedly, gathering a sufficiently large amount of training data can be cumbersome or impractical. Further, the resulting language models may be too large to deploy onto portable electronic devices. Although training data sets and/or n-gram language models can be pruned to an acceptable size, pruned models tend to have reduced predictive power. Grammatically incorrect predictions are particularly problematic, as a bad prediction is often more distracting than no prediction at all.
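One common pruning strategy, used here purely as an illustrative assumption, is count thresholding: n-grams observed fewer than some minimum number of times are dropped. The sketch below shows both why this shrinks the model and why it costs predictive power, since rare but perfectly valid word sequences are discarded along with noise.

```python
def prune_ngram_counts(counts, min_count=2):
    """Drop n-grams observed fewer than `min_count` times.

    `counts` maps n-gram tuples to their occurrence counts. Thresholding
    shrinks the model, but a rare-yet-grammatical sequence is removed
    just as readily as a spurious one, which is one reason pruned models
    tend to have reduced predictive power.
    """
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

counts = {
    ("the", "cat"): 12,
    ("the", "dog"): 5,
    ("the", "okapi"): 1,  # rare but grammatically valid
}
pruned = prune_ngram_counts(counts, min_count=2)
print(sorted(pruned))  # ("the", "okapi") has been pruned away
```

The threshold value trades model size against coverage: raising `min_count` shrinks the table further but discards more of the long tail of valid sequences.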