In computer systems, compressing data structures reduces memory requirements and processing time. For example, a continuous speech recognition system requires a large language model (LM). For large vocabulary systems, the LM is usually an N-gram language model. The LM is by far the largest data structure stored in a memory of a large vocabulary automatic speech recognition (ASR) system.
However, in many small-sized speech recognition systems, such as desktop computers and hand-held portable devices, memory limits the size of the LM that can be used. Therefore, reducing the memory requirements of the LM, without significantly affecting performance, would be a great benefit to such systems.
As shown in FIG. 1, the LM can be stored as a back-off N-gram 100, see Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 35, No. 3, pp. 400–401, 1987. The N-gram 100 includes unigrams 101, bigrams 102, and trigrams 103. The back-off word trigram LM 100 shows a search for the trigram "the old man."
In the N-gram, probabilities are stored as a tree structure. The tree structure originates from a hypothetical root node, not shown, which branches out into the unigram nodes 101 at a first level of the tree, each of which branches out to the bigram nodes 102 at a second level, and so forth.
Each node in the tree has an associated word identifier (id) 111. The word id represents the N-gram for that word, with the context represented by the sequence of words along the path from the root of the tree up to, but not including, the node itself. For vocabularies with fewer than 65,536 words, the ids generally use a two-byte representation, as shown at the bottom.
In addition, each node has an associated probability (prob) 112 and boundaries (bounds) 114, and each non-terminal node has an associated back-off weight (weight) 113. All these values are floating-point numbers that can be compressed into two bytes, as shown at the bottom. Therefore, each unigram entry requires six bytes of storage, each bigram entry requires eight bytes, and each trigram entry requires four bytes.
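The per-level entry layouts implied by these sizes can be sketched as packed records, as in the following C listing. The field and type names are illustrative assumptions rather than taken from FIG. 1; the unigram word id is assumed to be implicit in the array index, and the terminal trigram entries are assumed to omit the bounds field, which reconciles the six-, eight-, and four-byte figures.

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {        /* unigram entry: 6 bytes; word id implicit in index */
    uint16_t prob;      /* quantized probability */
    uint16_t weight;    /* quantized back-off weight */
    uint16_t bounds;    /* largest index of child bigram entries */
} UnigramEntry;

typedef struct {        /* bigram entry: 8 bytes */
    uint16_t word_id;   /* word identifier (vocabulary < 65,536 words) */
    uint16_t prob;
    uint16_t weight;
    uint16_t bounds;    /* largest index of child trigram entries */
} BigramEntry;

typedef struct {        /* trigram entry: 4 bytes; terminal, no children */
    uint16_t word_id;
    uint16_t prob;
} TrigramEntry;
#pragma pack(pop)

int main(void) {
    printf("unigram: %zu bytes\n", sizeof(UnigramEntry));  /* 6 */
    printf("bigram:  %zu bytes\n", sizeof(BigramEntry));   /* 8 */
    printf("trigram: %zu bytes\n", sizeof(TrigramEntry));  /* 4 */
    return 0;
}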
The information for all nodes at a particular level in the tree is stored in sequential arrays, as shown in FIG. 1. The array at the ith level of the tree holds, in sequence, the entries of the child nodes of the parent nodes at the (i−1)th level. The boundary value stored in each parent node is the largest index of that parent's child entries.
Because the entries are stored consecutively, the boundary value of a parent node at the (i−1)th level, together with the boundary value of the sequentially previous parent node at the same level, specifies the exact location of the children of that node at the ith level.
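A minimal sketch of this indexing scheme is given below, assuming the convention described above, namely that each parent stores the largest index of its children, so the children of the kth parent occupy the entries after the previous parent's boundary, up to and including the kth parent's own boundary. The helper name and array contents are hypothetical.

#include <stdio.h>

/* Computes the inclusive range [*first, *last] of child entries at level i
 * for the k-th parent at level i-1, given the parents' boundary values. */
static void child_range(const int *bounds, int k, int *first, int *last) {
    *first = (k == 0) ? 0 : bounds[k - 1] + 1;  /* one past the previous parent's last child */
    *last  = bounds[k];                         /* this parent's last child */
}

int main(void) {
    int unigram_bounds[] = {2, 2, 7, 9};        /* hypothetical boundary values */
    int first, last;
    child_range(unigram_bounds, 2, &first, &last);
    printf("children of unigram 2: bigram entries %d..%d\n", first, last);  /* 3..7 */
    return 0;
}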
To locate a specific child node, a binary search over the word ids is performed between the two specified boundary values. The binary search in the example of FIG. 1 is for the phrase "the old man."
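This lookup step can be sketched as below, again under the assumption that the children of a given parent are stored in ascending word-id order between the two boundary values; the word ids and index range in the example are invented for illustration.

#include <stdint.h>
#include <stdio.h>

/* Binary search for `target` among the sorted word ids ids[lo..hi] (inclusive).
 * Returns the index of the matching child entry, or -1 if it is absent. */
static int find_child(const uint16_t *ids, int lo, int hi, uint16_t target) {
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (ids[mid] == target) return mid;
        if (ids[mid] < target)  lo = mid + 1;
        else                    hi = mid - 1;
    }
    return -1;
}

int main(void) {
    /* Hypothetical bigram word ids; entries 3..7 are the children of one unigram. */
    uint16_t bigram_ids[] = {5, 9, 14, 17, 20, 31, 42, 58, 60, 77};
    int idx = find_child(bigram_ids, 3, 7, 31);   /* search for word id 31 within the range */
    printf("matching child entry: %d\n", idx);    /* prints 5 */
    return 0;
}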
Lossy compression of the language model has been described by Whittaker et al., "Language Model Compression Techniques," Proceedings of EUROSPEECH, 2001, and Whittaker et al., "Language Model Quantization Analysis," Proceedings of EUROSPEECH, 2001. Those techniques compress the LM through pruning and through quantization of the probabilities and back-off weights.
It is desired to further compress the language model using lossless compression so that large vocabulary ASR is enabled for small-memory devices, without an increase in the word error rate.