Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications, including web browsers, word processors and speech recognition applications. The latest generation of web browsers, for example, anticipates a uniform resource locator (URL) address entry after a few of the initial characters of the domain name have been entered. Word processors offer improved spelling and grammar checking capabilities, word prediction, and language conversion. Newer speech recognition applications similarly offer a wide variety of features with impressive recognition and prediction accuracy rates. In order to be useful to an end-user, these features must execute in substantially real-time. To provide this performance, many applications rely on a tree-like data structure to build a simple language model.
Simplistically, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence. A common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of a textual corpus.
The use of a prefix tree data structure (a.k.a. a suffix tree, or a PAT tree) enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above. Simplistically, the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings. Traditionally, a tri-gram (N-gram where N=3) approach involves the following steps:
    (a) a textual corpus is dissected into a plurality of items (characters, letters, numbers, etc.);
    (b) the items (e.g., characters (C)) are segmented (e.g., into words (W)) in accordance with a small, pre-defined lexicon and a simple, pre-defined segmentation algorithm, wherein each W is mapped in the tree to one or more C's;
    (c) a language model is trained on the dissected corpus by counting the occurrence of strings of characters, from which the probability of a sequence of words (W_1, W_2, . . . W_M) is predicted, each word being conditioned on the previous two words:
        P(W_1, W_2, W_3, . . . W_M) ≈ Π_i P(W_i | W_{i-1}, W_{i-2})   (1)
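The counting and prediction steps above can be illustrated with a minimal Python sketch. It is not the prior art implementation itself: the names TrieNode, build_trie, conditional_prob and sentence_prob, the "<s>" start marker, and the use of whitespace-separated words in place of the segmentation step (b) are all assumptions made for illustration. Each trie node stores how often its prefix occurred, and the tri-gram probability in equation (1) is estimated as a ratio of counts, without smoothing.

```python
class TrieNode:
    """One node of the prefix tree; count = times this prefix was seen."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_trie(sentences, n=3):
    """Insert every run of up to n consecutive tokens into a prefix tree."""
    root = TrieNode()
    for sentence in sentences:
        # Pad with n-1 start markers so the first real word has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split()
        for start in range(len(tokens)):
            node = root
            for token in tokens[start:start + n]:
                node = node.children.setdefault(token, TrieNode())
                node.count += 1
    return root


def conditional_prob(root, context, word):
    """Estimate P(word | context) as count(context + word) / count(context + any)."""
    node = root
    for token in context:
        node = node.children.get(token)
        if node is None:
            return 0.0  # context never seen in training
    total = sum(child.count for child in node.children.values())
    child = node.children.get(word)
    if child is None or total == 0:
        return 0.0  # unseen continuation (no smoothing in this sketch)
    return child.count / total


def sentence_prob(root, sentence, n=3):
    """Multiply the conditional probabilities of each word, as in equation (1)."""
    tokens = ["<s>"] * (n - 1) + sentence.split()
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        prob *= conditional_prob(root, tokens[i - n + 1:i], tokens[i])
    return prob


corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = build_trie(corpus)
print(sentence_prob(model, "the cat sat"))
print(sentence_prob(model, "the dog ran"))
```

In this sketch a word sequence that never occurred in the training corpus receives probability zero, which hints at why the choice of lexicon, segmentation and context length matters so much to the quality of the model.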
The N-gram language model is limited in a number of respects. First, the counting process utilized in constructing the prefix tree is very time consuming. Thus, only small N-gram models (typically bi-gram, or tri-gram) can practically be achieved. Second, as the string size (N) of the N-gram language model increases, the memory required to store the prefix tree increases by 2^N. Thus, the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model, are prohibitively large for N-grams larger than three (i.e., a tri-gram).
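The dependence of storage on N can be observed empirically with a short Python sketch. The helper name trie_node_count is hypothetical, and the sketch assumes one trie node per distinct item string of length 1 through N; it only counts nodes and says nothing about the exact growth rate in any particular implementation.

```python
def trie_node_count(tokens, n):
    """Count distinct item strings of length 1..n, i.e. nodes in a depth-n prefix tree."""
    prefixes = set()
    for start in range(len(tokens)):
        for length in range(1, n + 1):
            if start + length <= len(tokens):
                prefixes.add(tuple(tokens[start:start + length]))
    return len(prefixes)


text = "the cat sat on the mat and the dog sat on the cat".split()
for n in range(1, 6):
    print(n, trie_node_count(text, n))
```

The node count can only grow as N increases, because every string of length N-1 that is stored for a smaller model is also stored for the larger one.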
Prior art N-gram language models tend to use a fixed (small) lexicon and a simplistic segmentation algorithm, and typically rely on only the previous two words to predict the current word (in a tri-gram model).
A fixed lexicon limits the ability of the model to select the best words in general or specific to a task. If a word is not in the lexicon, it does not exist as far as the model is concerned. Thus, a small lexicon is not likely to cover the intended linguistic context.
The segmentation algorithms are often ad-hoc and not based on any statistical or semantic principles. A simplistic segmentation algorithm typically errs in favor of larger words over smaller words. Thus, the model is unable to accurately predict smaller words contained within larger, lexically acceptable strings.
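The bias toward larger words can be illustrated with greedy forward maximum matching, which is used here only as an assumed example of a simplistic segmentation algorithm; the text above does not name a specific one. The function name max_match and the toy lexicons are hypothetical. At each position the sketch takes the longest lexicon entry that matches and falls back to a single character when nothing matches.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: always take the longest lexicon entry."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


print(max_match("nowhere", {"no", "now", "where", "here", "nowhere"}))
print(max_match("nowhere", {"no", "now", "where", "here"}))
```

With the first lexicon the string is kept as one large word, and with the second it is split after the longest available first match, so the reading "no where" is never produced in either case even though both of its words are in the lexicon.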
As a result of the foregoing limitations, a language model using prior art lexicon and segmentation algorithms tends to be error prone. That is, any errors made in the lexicon or segmentation stage are propagated throughout the language model, thereby limiting its accuracy and predictive attributes.
Finally, limiting the model to at most the previous two words for context (in a tri-gram language model) is also limiting in that a greater context might be required to accurately predict the likelihood of a word. The limitations on these three aspects of the language model often result in poor predictive qualities of the language model.
Thus, a system and method for lexicon, segmentation algorithm and language model joint optimization is required, unencumbered by the deficiencies and limitations commonly associated with prior art language modeling techniques. Just such a solution is provided below.