Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications, including web browsers, word processing applications, and speech recognition applications. The latest generation of web browsers, for example, anticipates a uniform resource locator (URL) address entry after a few of the initial characters of the domain name have been entered. Word processors offer improved spelling and grammar checking capabilities, word prediction, and language conversion. Newer speech recognition applications similarly offer a wide variety of features with impressive recognition and prediction accuracy rates. In order to be useful to an end-user, these features must execute in substantially real-time. To provide this performance, many applications rely on a tree-like data structure to build a simple language model.
Simplistically, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence. The estimation performed by a language model is typically made using a simple lexicon (e.g., a word list) and a segmentation algorithm or set of rules. A common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of a textual corpus.
The use of a prefix tree data structure (a.k.a. a suffix tree, or a PAT tree) enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above. Simplistically, the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings. Development of a typical tri-gram (N-gram where N=3) language model, for example, generally includes the steps of:
(a) dissecting a received textual corpus into a plurality of items (characters, letters, numbers, etc.);
(b) segmenting the items (e.g., characters (C)) into larger units (e.g., words (W)) in accordance with a small, pre-defined lexicon and a simple, pre-defined segmentation algorithm, wherein each W is mapped in the tree to one or more C's; and
(c) training a language model on the dissected corpus by counting the occurrence of strings of characters, from which the probability of a sequence of words (W1, W2, . . . , WM) is predicted, each word being conditioned on the previous two words:
$$P(W_1, W_2, W_3, \ldots, W_M) \approx \prod_{i=1}^{M} P(W_i \mid W_{i-1}, W_{i-2}) \qquad (1)$$
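By way of illustration only, the counting and prediction of step (c) and equation (1) may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular prior art system: the function names `train_trigram` and `sequence_probability`, the start-of-sequence padding symbol `<s>`, and the toy corpus are all hypothetical, and no smoothing is applied, so any tri-gram absent from the training text receives a probability of zero.

```python
from collections import Counter

PAD = "<s>"  # hypothetical start-of-sequence marker (an assumption, not from the text)

def train_trigram(words):
    """Count tri-grams and their two-word contexts in an already segmented corpus."""
    padded = [PAD, PAD] + list(words)
    trigram_counts = Counter()
    context_counts = Counter()
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        trigram_counts[context + (padded[i],)] += 1
        context_counts[context] += 1
    return trigram_counts, context_counts

def sequence_probability(words, trigram_counts, context_counts):
    """Approximate P(W1..WM) as the product of P(Wi | Wi-2, Wi-1), per equation (1)."""
    padded = [PAD, PAD] + list(words)
    probability = 1.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        if context_counts[context] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= trigram_counts[context + (padded[i],)] / context_counts[context]
    return probability

# Toy training text, already dissected and segmented into words.
corpus = "the cat sat on the mat the cat ate".split()
tri, ctx = train_trigram(corpus)

seen = sequence_probability(["the", "cat", "sat"], tri, ctx)
unseen = sequence_probability(["the", "dog"], tri, ctx)
```

In this toy corpus the context "the cat" is followed once by "sat" and once by "ate", so the sequence "the cat sat" is assigned a probability of one half, whereas a sequence containing a word never observed in training is assigned zero.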
The N-gram language model is limited in a number of respects. First, the counting process utilized in constructing the prefix tree is very time-consuming. Thus, only small N-gram models (typically bi-gram or tri-gram) can practically be achieved. Second, as the string size (N) of the N-gram language model increases, the memory required to store the prefix tree increases by 2^N. Thus, the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model, are prohibitively large for N larger than three (i.e., beyond a tri-gram).
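The relationship between the string size N and the storage consumed by the prefix tree may be illustrated with the following minimal sketch. The names `TrieNode`, `build_ngram_trie`, `node_count`, and `lookup`, as well as the sample text, are hypothetical and chosen only for illustration. The sketch does not attempt to reproduce the growth rate stated above; it merely shows that each additional level of context adds nodes that must be stored and traversed.

```python
class TrieNode:
    """One node of a prefix tree; count is the number of length-N windows passing through it."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}

def build_ngram_trie(items, n):
    """Insert every window of n consecutive items into a prefix tree, counting as we go."""
    root = TrieNode()
    for start in range(len(items) - n + 1):
        node = root
        for item in items[start:start + n]:
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root

def node_count(node):
    """Number of nodes below this node (the root itself is not counted)."""
    return sum(1 + node_count(child) for child in node.children.values())

def lookup(root, ngram):
    """Return the stored count for an item string, or 0 if the path does not exist."""
    node = root
    for item in ngram:
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count

# Character-level example: tree size for N = 1 through 4 over the same text.
text = list("abracadabra")
sizes = {n: node_count(build_ngram_trie(text, n)) for n in (1, 2, 3, 4)}
abr_count = lookup(build_ngram_trie(text, 3), "abr")
```

On so short a text the tree remains small, but the node count rises with every increase in N, and on a realistic corpus with a large item inventory the number of distinct paths, and hence the memory and counting time, rises far more steeply.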
As a consequence of these computational and architectural limitations, prior art implementations of N-gram language models tend to be very rigid. That is, prior art N-gram language models tend to use a standard (small) lexicon and a simplistic segmentation algorithm, and typically rely on only the previous two words to predict the current word (in a tri-gram model).
A small lexicon limits the ability of the model to identify words to those contained in the lexicon. If a word is not in the lexicon, it does not exist as far as the model is concerned. Moreover, a basic multipurpose lexicon is not likely to represent the linguistic complexity or syntactic behavior of a particular application or style of writing. Thus, a small lexicon is not likely to cover the intended linguistic content of a given application.
The segmentation algorithms are often ad hoc and not based on any statistical or semantic principles. A simplistic segmentation algorithm typically errs in favor of larger words over smaller words. Thus, the model is unable to accurately predict smaller words contained within larger, lexically acceptable strings.
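The bias toward larger words may be illustrated with a greedy, longest-match-first segmenter of the kind commonly referred to as maximum matching. The following is a minimal sketch under stated assumptions: the function name `max_match`, its `max_len` parameter, the single-item fallback for unknown text, and the toy lexicon are all hypothetical and are not drawn from any particular prior art system.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward segmentation: at each position take the longest lexicon entry."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    max_len = max(1, max_len)  # always allow at least a single-item step
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single item is accepted even if unknown, so the loop always advances.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon in which a short word ("the") is a prefix of a longer one ("them").
lexicon = {"the", "them", "mend", "end"}
greedy = max_match("themend", lexicon)
with_unknown = max_match("thex", lexicon)
```

Given the unsegmented string "themend", the segmenter commits to "them" and then "end", and never considers the equally valid reading "the" followed by "mend", because the longer initial match is always preferred. Text not covered by the lexicon falls through to single items, which illustrates how lexicon gaps surface as segmentation errors.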
As a result of the foregoing limitations, a language model using prior art lexicon and segmentation algorithms tends to be error prone. That is, any errors made in the lexicon or segmentation stage are propagated throughout the language model, thereby limiting its accuracy and predictive attributes.
In addition to the fundamental problems of a limited lexicon and a simplistic segmentation algorithm, the N-gram approach is fundamentally constrained by limiting the predictive features to at most the previous N-1 words. In the instance of a tri-gram (N=3) language model (LM), the LM is limited to only the previous two words for context. These inherent limitations in the prior art of language modeling fundamentally constrain the accuracy of such language models.
In application, the prior art approach to language modeling may provide acceptable results in many alphabet-based languages with an accepted lexicon and well-defined segmentation. The aforementioned limitations inherent in such prior art language models are further exacerbated, however, when applied to numerical or character-based languages such as, for example, many Asian languages. The Chinese language, for example, is a character-based language with an expansive lexicon that is not well-defined, where single characters may form a word or may be combined with another character to form a multi-character word with a distinct and unique meaning in the language, and where there is limited punctuation to provide clues as to sentence structure, and the like. In such a language, lexicon and segmentation clues are difficult, at best, to come by. Prior art language modeling techniques provide very poor results when applied to such languages.
One proposed solution for improving the performance of a language model applied to character-based languages is to simply throw more data at the model, i.e., trade size for accuracy. The thought is that more data provides a larger lexicon and a better basis for maximum match-based segmentation algorithms (to be defined more fully below) to refine the language model. An obvious consequence of this solution, however, and a significant limitation in and of itself, is that simply throwing more data at the model significantly increases the memory required to support the language model. Aside from the cost of providing the additional memory, larger language models place a greater computational burden on the host system/application utilizing the language model. In exchange for these memory and computational costs, the prior art solution typically yields only a modest improvement in predictive capability. Moreover, as above, a huge data set does not necessarily provide an improved language model on a per-application basis.
Thus, a system and method for the joint optimization of language model performance and size is required, unencumbered by the deficiencies and limitations commonly associated with prior art language modeling techniques. Just such a solution is provided below.