The present invention relates to systems and methods for compressing language models and, more particularly, to systems and methods for providing lossless compression of n-gram language models used in a real-time decoding speech recognition system.
Conventional language models which are commonly used in automatic speech (or handwriting) real-time decoders are 3-gram statistical language models. In general, n-gram language modelling involves determining a set of potential choices, using probabilities, of a current word based on a number of immediately preceding words. Specifically, n-gram language modelling looks to a particular history, i.e., a sequence of (n-1) words, to select the most probable words from a given list of words based upon the one or more preceding words. In a preferred embodiment of the present invention, trigrams (i.e., n-grams with n=3) are the basis of the language models to be compressed. In the context of word language models, a trigram is a string of three consecutive words (denoted by w1 w2 w3). Similarly, a bigram is a string of two consecutive words, and a unigram is a single word.
A trigram language model is one which assigns a probability for predicting a future word (denoted by w3) given the past two words (denoted by w1 and w2). Such a model provides the probability of any given word to follow a word string of the form ". . . w1 w2", for all possible words w1 and w2 in a given vocabulary. This is expressed by the conditional probability P(w3 | w1 w2), which represents the probability that w3 occurs given that the previous two words were w1 w2. A detailed explanation of n-gram language models may be found, for example, in F. Jelinek and R. L. Mercer, "Interpolated Estimation of Markov Source Parameters From Sparse Data," Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds., 1980, North-Holland, Amsterdam.
N-gram language models (such as trigrams) are trained using large text corpora. Such training involves inputting training data and tracking every sequence of three words (i.e., trigrams) in the training data. Such training provides counts for all 3-grams, 2-grams and unigrams identified in the training text. The count of a given n-gram is the number of occurrences of a given n-gram in the training data. As stated above, this language model data is then used to assign language model probabilities to strings of words that have a close match to a spoken utterance. This n-gram data is then stored in a decoder in such a way as to allow fast access to the stored probabilities for a list of alternative word strings produced by an acoustic decoding model.
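By way of illustration only, the counting step described above may be sketched as follows. The function name count_ngrams and the toy corpus are assumptions introduced for this example and do not appear in the training procedure itself; the sketch merely shows how counts for unigrams, 2-grams and 3-grams can be collected in one pass per order.

```python
from collections import Counter


def count_ngrams(words, max_n=3):
    """Return {n: Counter} holding the count of every n-gram, n = 1..max_n.

    Hypothetical helper for illustration; a real trainer would stream
    a large corpus rather than hold it in a list.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words over the training text.
        for i in range(len(words) - n + 1):
            counts[n][tuple(words[i:i + n])] += 1
    return counts


# Toy training text (an assumption made for the example).
corpus = "the cat sat on the mat the cat sat down".split()
counts = count_ngrams(corpus)
print(counts[3][("the", "cat", "sat")])
```

In this toy text the trigram "the cat sat" occurs twice, so its count is 2; such counts are the raw material from which the conditional probabilities are later estimated.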
In order to access the stored language model data, a conventional method involves storing preceding tuples of words (i.e., word histories) together with addresses that point to memory locations containing sets of word-probability pairs (i.e., next candidate words with conditional probabilities), with the address and the word-probability pair each occupying 4 bytes of memory storage. By way of example, assume that a table of pairs of words w1 and w2 (i.e., bigrams) that were met in some training corpus is stored. For each such bigram w1 w2 in the table, an address A12 pointing to a set of word-probability pairs (w3, Prob12(w3)) is stored. The term w3 denotes a word that followed the bigram w1 w2 in the training text (assuming that the given bigram has a count exceeding some threshold amount, e.g., 3). The term Prob12(w3) denotes a conditional (log) probability of w3 following w1 w2 (which is estimated from the training corpus). As indicated above, each such address A12 and each word-probability pair can be stored in blocks of 4 bytes. This 4-byte grouping scheme affords an efficient use of the 32-bit memory processing chips which exist in some workstations and personal computers (PCs).
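The conventional layout described above may be sketched, under stated assumptions, as follows. The text specifies only that each address and each word-probability pair occupies 4 bytes; the split of the 4-byte pair into a 16-bit word number and a 16-bit quantized log probability, the SCALE constant, the successor count kept beside each address, and the names build and lookup are all assumptions made for this example.

```python
import struct

# Assumed split of the 4-byte word-probability pair:
# 2-byte unsigned word number + 2-byte signed quantized log probability.
RECORD = struct.Struct("<Hh")
SCALE = 1000  # hypothetical quantization step for log probabilities


def build(model):
    """model: {(w1, w2): [(w3, logprob), ...]} -> (index, blob).

    index maps each bigram to (address A12, number of successors);
    blob holds the 4-byte (w3, Prob12(w3)) records back to back.
    """
    index = {}
    blob = bytearray()
    for bigram, successors in sorted(model.items()):
        index[bigram] = (len(blob), len(successors))
        for w3, logp in sorted(successors):
            blob += RECORD.pack(w3, round(logp * SCALE))
    return index, bytes(blob)


def lookup(index, blob, w1, w2, w3):
    """Return the stored log probability of w3 after w1 w2, or None."""
    entry = index.get((w1, w2))
    if entry is None:
        return None
    addr, n = entry
    for i in range(n):
        word, q = RECORD.unpack_from(blob, addr + i * RECORD.size)
        if word == w3:
            return q / SCALE
    return None


# Word numbers and log probabilities below are made-up sample values.
model = {(1, 2): [(7, -0.5), (9, -2.5)], (3, 4): [(5, -1.25)]}
index, blob = build(model)
print(lookup(index, blob, 1, 2, 9))
```

Each record is exactly 4 bytes, so the three sample records occupy 12 bytes and the address of the i-th successor of a bigram is simply A12 + 4*i.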
Experiments by the present inventors indicate that the storage of the 3-gram language model components (as represented above) requires approximately 5*n megabytes (MB) of memory, where n is a number of trigrams. For example, a speech recognition system which is trained to recognize a vocabulary of 20,000 words admits up to 8 trillion (i.e., 20,000^3) possible trigrams. Having to store all the n-grams produced during training in accordance with the above method (as well as other known methods) requires a significant quantity of memory. Therefore, estimating and storing the probabilities for each of these trigrams is not practical.
Consequently, methods for filtering data are employed to reduce the number of n-grams which must be stored. One such method involves filtering out (i.e., pruning) those n-grams having low counts or those providing a small contribution to the likelihood of the data. By way of example, assume that a training corpus contains several hundred million words. With the pruning method, only bigrams with counts exceeding 10-20, for example, are stored. This would produce approximately 10-20 million allowed bigrams. This pruning method, however, results in a substantial increase in the rate of error for the recognizer. Decoding experiments performed by the present inventors demonstrated a 10-15% improvement in the decoding accuracy using preserved (not filtered) language data as compared to using filtered language data. In practical applications, however, it is necessary to limit the storage of these language models to a certain number of 3-grams and 2-grams.
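Count-based pruning of the kind described above may be sketched as follows. The function name prune, the threshold value and the sample counts are assumptions made for this example; the sketch covers only the low-count criterion, not the likelihood-contribution criterion.

```python
def prune(counts, threshold):
    """Keep only the n-grams whose count exceeds the threshold."""
    return {ngram: c for ngram, c in counts.items() if c > threshold}


# Made-up bigram counts for illustration.
bigram_counts = {
    ("of", "the"): 5200,
    ("the", "cat"): 14,
    ("cat", "sat"): 9,
    ("sat", "down"): 2,
}
kept = prune(bigram_counts, threshold=10)
print(sorted(kept))
```

With a threshold of 10, the two rarest bigrams are discarded, which is exactly the information loss that degrades recognition accuracy.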
One approach to controlling the storage size of the language model without affecting the performance of the real-time recognition system is to compress the language model, i.e., to store the n-gram models in a compressed format. As demonstrated above, there are several distinct components of the language model, such as an index area (W12) (e.g., bigram w1 w2 and address A12), words (e.g., w3) and probability records (e.g., Prob12(w3)). By reducing the storage requirements for each of these components, the language model may be significantly compressed.
Generally, there are conventional techniques for compressing index areas and records. For example, a typical method for address compression is the inverted file compression technique disclosed in "Managing Gigabytes," by Ian H. Witten, New York, 1994, p. 82, ch. 3.3. Further, conventional methods for compressing and storing records are the variable length coding, bitmap coding, prefix coding and Ziv-Lempel coding techniques which are disclosed in "Data Compression: Methods and Theory" by James A. Storer, Computer Science Press, 1988.
These general purpose compression methods, however, are not suitable for compressing n-gram language models in a real-time decoder because such methods do not facilitate fast random access to and decompression of the n-gram records, which is required for performing real-time decoding (e.g., processing and decoding a language model in real-time). In particular, the following requirements should be met when decoding in real-time. First, local data which is frequently accessed should fit into pages of memory or cache. Pages refer to blocks of contiguous locations (commonly ranging from 1K bytes to 8K bytes in length) in the main memory or secondary storage (such as a disk) of a digital computer, which constitute the basic unit of information that is transferred between the main memory and the secondary storage when required. For example, it is not feasible to separate word records and probability records and compress them separately, since corresponding word numbers and probabilities should be located in close proximity so as to expedite processing for real-time decoding.
Further, compression methods that are based on histograms of word and probability distributions are difficult to implement for real-time decoding. The difficulty in using such histograms for words and probabilities lies in the fact that these histograms have different characteristics and may only be implemented if word records are separated from probability records. As indicated above, however, it is not feasible to separate word records and probability records when storing language models (since they should be located in close proximity).
In addition, the fetched data should be stored as a multiple of bytes since standard "seek" and "read" file commands operate with bytes and not fractions of bytes (see, for example, B. W. Kernighan and D. M. Ritchie, "The C Programming Language", Prentice-Hall, Inc., London, 1978). Moreover, some operating systems (OS) work faster with 4-byte blocks of data.
Another requirement for decoding in real-time is that sets of language model records should be stored with each record having the same length. This allows a binary search to be performed for certain records of the language model.
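The benefit of equal-length records may be sketched as follows: because every record has the same size, the k-th record lies at a computable byte offset, and a binary search can jump directly to it. The 4-byte record format (16-bit word number plus 16-bit signed value), the name find_record and the sample data are assumptions made for this example.

```python
import struct

# Assumed fixed-length record: 2-byte word number + 2-byte signed value.
RECORD = struct.Struct("<Hh")


def find_record(blob, start, count, target_word):
    """Binary search over `count` records sorted by word number.

    Returns the value stored with target_word, or None if absent.
    """
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Fixed record size makes the offset of record `mid` trivial.
        word, value = RECORD.unpack_from(blob, start + mid * RECORD.size)
        if word == target_word:
            return value
        if word < target_word:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Made-up records, sorted by word number.
records = [(3, -120), (8, -45), (15, -300), (42, -7)]
blob = b"".join(RECORD.pack(w, v) for w, v in records)
print(find_record(blob, 0, len(records), 15))
```

With variable-length records the offset of the middle record could not be computed without scanning from the start, which is why such encodings defeat the binary search.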
These requirements cannot be met by utilizing the aforementioned conventional compression methods. For example, the inverted file compression technique is not applicable since the current n-gram language model uses a binary search on indexes. This requires that the index data be stored with equal lengths of data, which is not possible with standard inverted index methods. Further, the conventional variable length coding technique is not applicable since searching through records having variable lengths (e.g., w3) and fetching such records (e.g., Prob12(w3)) is too slow for real-time decoding applications. The present invention addresses these problems and provides methods for significantly compressing n-gram language models without increasing the decoding time of such compressed models.