Electronic devices and the ways in which users interact with them are evolving rapidly. Changes in size, shape, input mechanisms, feedback mechanisms, functionality, and the like have introduced new challenges and opportunities relating to how a user enters information, such as text. Statistical language modeling can play a central role in many text prediction and recognition problems, such as speech or handwriting recognition and keyboard input prediction. An effective language model can be critical to constrain the underlying pattern analysis, guide the search through various (partial) text hypotheses, and/or contribute to the determination of the final outcome. In some examples, statistical language modeling has been used to convey the probability of occurrence in the language of all possible strings of n words.
Given a vocabulary of interest for the expected domain of use, determining the probability of occurrence of all possible strings of n words has been done using a word n-gram model, which can be trained to provide the probability of the current word given the n−1 previous words. Training has typically involved large machine-readable text databases comprising representative documents in the expected domain. Even so, due to the finite size of such databases, many n-word strings can be observed only infrequently, if at all, yielding unreliable parameter values for all but the smallest values of n. Compounding the problem, in some applications it can be cumbersome or impractical to gather a large enough amount of training data. In other applications, the size of the resulting model may exceed what can reasonably be deployed. In some instances, training data sets and n-gram models can be pruned to an acceptable size, but such pruning can negatively impact the predictive power of the resulting models.
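As an illustrative sketch only, a word n-gram estimate of the kind described above can be computed by relative frequency: the count of an n-word string divided by the count of its (n−1)-word history. The function names, the `<s>` padding token, and the toy corpus below are assumptions introduced for illustration; practical systems typically add smoothing, which is omitted here.

```python
from collections import Counter

def train_word_ngram(sentences, n):
    """Count n-word strings and their (n-1)-word histories."""
    ngram_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence padding token.
        words = ["<s>"] * (n - 1) + sentence.split()
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            ngram_counts[history + (words[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts

def word_probability(word, history, ngram_counts, history_counts):
    """Relative-frequency estimate of P(word | history); 0.0 if the history is unseen."""
    history = tuple(history)
    seen = history_counts[history]
    if seen == 0:
        return 0.0
    return ngram_counts[history + (word,)] / seen

# Toy corpus (an assumption, standing in for a large text database).
corpus = ["the cat sat", "the cat ran", "the dog sat"]
ngrams, histories = train_word_ngram(corpus, n=2)

p_cat = word_probability("cat", ["the"], ngrams, histories)     # "the cat" seen 2 of 3 times
p_unseen = word_probability("ran", ["dog"], ngrams, histories)  # "dog ran" never seen
```

In this sketch, a string that never occurs in the training text, such as "dog ran", receives a probability of zero, which illustrates how a finite database can yield unreliable parameter values.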
In such situations, it has often been expedient to rely on a character m-gram model. Just as a word n-gram can be based on strings of n words, a character m-gram can be based on strings of m characters, where typically m>n. Thus a character m-gram can be trained to provide the probability of the current character given the m−1 previous characters encountered. Because the number of characters in the alphabet is typically much smaller than the number of words in the vocabulary, a character m-gram can be much more compact than a word n-gram for usual values of m and n. As a result, a proper estimation can be performed with far less data, which makes such models particularly popular for embedded applications.
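The following sketch, offered under stated assumptions, mirrors the description above at the character level and compares the maximum number of distinct strings each kind of model could need to cover. The function names, the space padding, the sample text, and the sizes used in the comparison (27 symbols with m=5, a 50,000-word vocabulary with n=3) are hypothetical values chosen for illustration.

```python
from collections import Counter

def train_char_mgram(text, m):
    """Count m-character strings and their (m-1)-character histories."""
    mgram_counts = Counter()
    history_counts = Counter()
    padded = " " * (m - 1) + text  # space padding is an illustrative choice
    for i in range(m - 1, len(padded)):
        history = padded[i - m + 1:i]
        mgram_counts[history + padded[i]] += 1
        history_counts[history] += 1
    return mgram_counts, history_counts

def char_probability(char, history, mgram_counts, history_counts):
    """Relative-frequency estimate of P(char | history); 0.0 if the history is unseen."""
    seen = history_counts[history]
    return mgram_counts[history + char] / seen if seen else 0.0

def max_distinct_strings(symbol_count, order):
    """Upper bound on the number of distinct strings of `order` symbols."""
    return symbol_count ** order

chars, char_histories = train_char_mgram("banana bandana", m=3)
p_a_after_an = char_probability("a", "an", chars, char_histories)

# Hypothetical sizes: 26 letters plus space, versus a 50,000-word vocabulary.
char_bound = max_distinct_strings(27, 5)
word_bound = max_distinct_strings(50_000, 3)
```

Under these assumed sizes, the character bound is several orders of magnitude smaller than the word bound even though m>n, which is consistent with the compactness noted above.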
While character m-grams are typically more compact and easier to estimate than word n-grams, they can also be less predictive due to the much coarser granularity involved. On the other hand, character m-grams tend to be more robust, in the sense that they generalize better to out-of-vocabulary words. In text prediction applications in particular, a character model can be coupled with a large domain-appropriate lexicon to provide whole word completions and predictions, rather than semantically meaningless partial fragments. This combination, however, can still suffer from an inherent lack of predictive power, because a character m-gram is restricted to character-level context.
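One way such a coupling could be sketched, purely as an assumption and not as a description of any particular system, is to restrict candidates to lexicon words that begin with the typed prefix and rank them by the character-model probability of the remaining characters. The character bigram, the `^` and `$` boundary markers, the add-one smoothing, the `alphabet_size` default, and the small lexicon below are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def train_char_bigram(words):
    """Count character pairs; '^' and '$' are hypothetical word-boundary markers."""
    pair_counts, history_counts = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[prev + cur] += 1
            history_counts[prev] += 1
    return pair_counts, history_counts

def completion_log_prob(prefix, word, pair_counts, history_counts, alphabet_size):
    """Add-one-smoothed log-probability of the characters that complete `prefix` into `word`."""
    padded = "^" + word + "$"
    start = len(prefix) + 1  # index in `padded` of the first character not yet typed
    total = 0.0
    for i in range(start, len(padded)):
        prev, cur = padded[i - 1], padded[i]
        total += math.log(
            (pair_counts[prev + cur] + 1) / (history_counts[prev] + alphabet_size)
        )
    return total

def complete(prefix, lexicon, pair_counts, history_counts, alphabet_size=28, top_k=3):
    """Return up to `top_k` whole-word completions of `prefix`, best-scoring first."""
    candidates = [w for w in lexicon if w.startswith(prefix)]
    candidates.sort(
        key=lambda w: completion_log_prob(
            prefix, w, pair_counts, history_counts, alphabet_size
        ),
        reverse=True,
    )
    return candidates[:top_k]

# Hypothetical lexicon, also used here as the training text for the character model.
lexicon = ["the", "then", "there", "this", "cat"]
pairs, pair_histories = train_char_bigram(lexicon)
suggestions = complete("th", lexicon, pairs, pair_histories)
no_match = complete("zz", lexicon, pairs, pair_histories)
```

Because every suggestion is drawn from the lexicon, the output consists of whole words rather than partial fragments. The ranking, however, depends only on character-level context, so the sketch has no way to prefer a word that better fits the preceding words.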
Accordingly, using either a word n-gram model or a character m-gram model alone for particular applications can limit overall prediction accuracy: character m-gram models can suffer from an inherent lack of predictive power because they are restricted to character-level context, while word n-gram models can suffer from a de facto lack of predictive power caused by excessive pruning.