Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications including web-browsers, word processing and speech recognition applications. The latest generation of web-browsers, for example, anticipates a uniform resource locator (URL) address entry after a few of the initial characters of the domain name have been entered. Word processors offer improved spelling and grammar checking capabilities, word prediction, and language conversion. Newer speech recognition applications similarly offer a wide variety of features with impressive recognition and prediction accuracy rates. In order to be useful to an end-user, these features must execute in substantially real-time. To provide this performance, many applications rely on a tree-like data structure to build a simple language model.
Simplistically, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence. A common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of text.
The use of a prefix tree data structure (a.k.a. a suffix tree, or a PAT tree) enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above. Simplistically, the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings. Traditionally, a tri-gram (N-gram where N=3) approach involves the following steps:
(a) characters (C) are segmented into words (W) using a pre-defined lexicon, wherein each W is mapped in the tree to one or more C's;
(b) the probability of a sequence of words (W1, W2, . . . WM) is predicted, each word being conditioned on the previous two words:
\[ P(W_1, W_2, \ldots, W_M) \approx \prod_{i} P(W_i \mid W_{i-1}, W_{i-2}) \qquad (1) \]
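By way of illustration only, the counting and prediction steps described above might be sketched as follows. This is a minimal sketch, not the implementation of any particular prior art system: the names `TrieNode`, `build_ngram_trie`, `lookup_count`, `trigram_probability` and `sequence_probability` are hypothetical, the input is assumed to be already segmented into words, probabilities are unsmoothed relative frequencies, and the first two words of a sequence (which lack two words of context) are simply skipped.

```python
class TrieNode:
    """One node of the prefix tree; `count` is how often the path to it was seen."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ngram_trie(words, n=3):
    """Insert every run of up to n consecutive words into a prefix tree of counts."""
    root = TrieNode()
    for start in range(len(words)):
        node = root
        for word in words[start:start + n]:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1
    return root


def lookup_count(root, items):
    """Walk the tree along `items`; return the stored count, or 0 if unseen."""
    node = root
    for item in items:
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count


def trigram_probability(root, w_prev2, w_prev1, w):
    """Estimate P(w | w_prev2, w_prev1) as count(w_prev2 w_prev1 w) / count(w_prev2 w_prev1)."""
    context = lookup_count(root, [w_prev2, w_prev1])
    if context == 0:
        return 0.0  # unseen context; a practical model would smooth here
    return lookup_count(root, [w_prev2, w_prev1, w]) / context


def sequence_probability(root, words):
    """Multiply the tri-gram estimates over a word sequence, as in equation (1).

    Assumption: the first two words are skipped because they lack full context.
    """
    prob = 1.0
    for i in range(2, len(words)):
        prob *= trigram_probability(root, words[i - 2], words[i - 1], words[i])
    return prob


training = "the cat sat on the cat mat".split()
trie = build_ngram_trie(training, n=3)
print(trigram_probability(trie, "the", "cat", "sat"))  # 0.5
```

In this toy training text the pair "the cat" occurs twice and is followed once by "sat", hence the estimate of 0.5. Each lookup walks at most three nodes of the tree, which is the property that makes the structure attractive for substantially real-time use.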
The N-gram language model is limited in a number of respects. First, the counting process utilized in constructing the prefix tree is very time consuming. Thus, only small N-gram models (typically bi-gram, or tri-gram) can practically be achieved. Second, as the string size (N) of the N-gram language model increases, the memory required to store the prefix tree increases by 2N. Thus, the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model, are prohibitively large for N-grams larger than three (i.e., a tri-gram).
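The effect of N on storage can be observed on a toy example. The following sketch counts the nodes a prefix tree would need for a given N, on the assumption that one node is stored per distinct run of one to N consecutive items. The function name `count_trie_nodes` is hypothetical, and the example shows only that storage grows with N on this small text; it does not establish any particular growth rate.

```python
def count_trie_nodes(items, n):
    """Count prefix-tree nodes: one per distinct run of 1..n consecutive items."""
    prefixes = set()
    for start in range(len(items)):
        for length in range(1, n + 1):
            if start + length <= len(items):
                prefixes.add(tuple(items[start:start + length]))
    return len(prefixes)


text = "to be or not to be".split()
for n in (1, 2, 3):
    print(n, count_trie_nodes(text, n))
# prints:
# 1 4
# 2 8
# 3 12
```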
As a consequence of these computational and architectural limitations, prior art implementations of N-gram language models tend to be very rigid. That is, prior art N-gram language models tend to use a standard (small) lexicon and a simplistic segmentation algorithm, and typically rely only on the previous two words to predict the current word (in a tri-gram model).
A small lexicon limits the ability of the model to identify words to those contained in the lexicon. If a word is not in the lexicon, it does not exist as far as the model is concerned. A simplistic segmentation algorithm typically errs in favor of larger words over smaller words. Thus, the model is unable to accurately predict smaller words contained within larger, lexically acceptable strings. Moreover, the lexicon and the segmentation algorithm that converts the characters to words may be error-prone (e.g., it is well accepted that all known segmentation algorithms make errors), and such errors are then propagated through the model, thereby limiting its accuracy and predictive attributes.
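The bias toward larger words can be illustrated with a greedy longest-match segmenter, one common form of simplistic segmentation. This is offered only as an assumed example of such an algorithm, not as the algorithm of any specific prior art system; the function name `longest_match_segment` and the toy lexicon are hypothetical.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right segmentation that always takes the longest lexicon word.

    Characters that begin no lexicon word are emitted as single-character items.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                pos += length
                break
    return words


lexicon = {"the", "table", "theta", "tab"}
print(longest_match_segment("thetable", lexicon))  # ['theta', 'b', 'l', 'e']
```

Here the intended reading "the table" is never considered, because the longer entry "theta" is matched first. The resulting mis-segmentation is exactly the kind of error that is then carried into the counts of the language model.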
Finally, limiting the model to at most the previous two words for context (in a tri-gram language model) is also limiting in that a greater context might be required to accurately predict the likelihood of a word. The limitations on these three aspects of the language model often result in poor predictive qualities of the language model.
Thus, an improved method and apparatus for generating and managing a language model data structure is required, unencumbered by the deficiencies and limitations commonly associated with prior art language modeling techniques. Just such a solution is provided below.