Vector quantization (VQ) techniques have been used to encode speech and image signals for the purpose of data bandwidth compression. The success achieved by vector quantization is due to the high degree of redundancy or structure typically found in voiced speech and most images. In contrast with scalar quantization, which typically expands bandwidth requirements because of the sampling and quantization of continuous signals, vector quantization operates over the span of the signal in time or space in order to take advantage of the inherent structure of the signal.
In speech recognition systems, vector quantization has been used for preprocessing of speech data as a means for obtaining compact descriptors through the use of a relatively sparse set of code-book vectors to represent large dynamic range floating point vector elements.
Previous applications of vector quantization have been for coding of structured variable data. The present invention applies vector quantization to the encoding of unstructured but fixed data for the purpose of reducing the memory requirements for hidden Markov models used in a speech recognition system.
Modern hidden Markov model (HMM) speech recognition systems (e.g., Kai-Fu Lee, "Automatic Speech Recognition," Kluwer Academic Publishers, Boston/Dordrecht/London, 1989) are based on the recognition of phonemes as the basic unit of speech. However, phonemes are highly dependent on left-right context. A triphone model for context-dependent modeling was first proposed by Bahl et al. (Bahl, L. R., Baker, J. L., Cohen, P. S., Jelinek, F., Lewis, B. L., Mercer, R. L., "Recognition of a Continuously Read Natural Corpus," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, April, 1978) and was applied by Schwartz et al. by creating a phoneme model for each left/right phonemic context (Schwartz, R. M., Chow, Y. L., Roucos, S., Krasner, M., Makhoul, J., "Improved Hidden Markov Modeling of Phonemes for Continuous Speech Recognition," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, April, 1984; and Schwartz, R., Chow, Y., Kimball, O., Roucos, S., Krasner, M., Makhoul, J., "Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, April, 1985). K. F. Lee (op. cit.) improved upon this model by averaging the context-dependent models with context-independent ones and by using an information-theoretic measure to cluster similar contexts into "generalized" triphone models.
FIG. 1 is a functional block diagram of a simple single code-book speech recognition system based on the use of a cepstral vector quantization (VQ) code-book.
The speech signal is transformed into an electrical signal by microphone 50 and fed into an analog-to-digital converter (ADC) 52 for scalar quantizing, typically converting samples of input data at a 16 kHz rate. The quantized signal is applied to LPC processor 54, which extracts from the quantized data linear predictive coding (LPC) coefficients that describe the vocal tract formant all-pole filter used to create the input sound. The output of LPC processor 54 is applied to cepstral processor 56, which transforms the LPC coefficients into informationally equivalent cepstral coefficients. The set of cepstral coefficients is treated as a vector and applied to VQ encoder 58, which encodes the cepstral vector by finding the closest vector in its code-book. It should be noted that elements 54 and 56 are representative of a large class of speech feature extraction processors that could be used, as indicated by feature processor 55, e.g., processors producing mel-spectral coefficients or Fourier spectral coefficients.
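The code-book lookup performed by VQ encoder 58 can be illustrated with a minimal sketch. The text says only that the encoder finds the "closest" code-book vector; the squared Euclidean distance used here, the function name `vq_encode`, and the tiny two-dimensional code-book are all illustrative assumptions.

```python
import math


def vq_encode(vector, codebook):
    """Return the index of the code-book entry closest to `vector`.

    Closeness is measured by squared Euclidean distance (an assumption;
    the distance measure is not specified in the text).
    """
    best_index, best_dist = -1, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 2-dimensional code-book with three entries.
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# The input vector is replaced by the index (code word) of its nearest entry.
code = vq_encode((0.9, 1.2), toy_codebook)
```

Downstream stages then work with the small integer `code` rather than with the floating point feature vector itself.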
The output of VQ encoder 58 is applied to phoneme probability processor 60, which calculates the probability of the observed set of phonemes representing an input word, based upon the hidden Markov models (HMMs) prestored in HMM memory 62. Search engine 64 searches for the most probable candidate sentence given the phoneme probabilities and outputs the most probable candidate sentence identification.
In many practical speech recognition systems, multiple VQ code-books are used, which may include, for example, differenced cepstral coefficients, power, and differenced power values.
FIG. 2a shows a graph model example of a triphone HMM for the phoneme /ae/ as given by K. F. Lee (op. cit.). The model has three states, represented by the nodes labeled 1, 2, and 3. The process begins at node 0 and always proceeds with a transition probability of 1.00 to node 1. There is a self-loop at node 1 with an associated transition probability of 0.68, which is the probability that the process, having most recently been in state 1, will remain in state 1. Thus, the two transition probabilities (1.00 and 0.68) are associated with the beginning (B) state of the triphone model. The middle (M) portion of the model has a transition probability of 0.32 that the process will move from state 1 to state 2, a self-loop transition probability of 0.80 that, being in state 2, it will remain in state 2, and a state 2 to state 3 transition probability of 0.20. The final end (E) portion has a self-loop probability of 0.70 and an output transition probability of 0.30 that state 3 will be exited. Thus, at each of the three internal nodes (k=1, 2, and 3), by the time the next time interval elapses, the state will remain unchanged with probability $P_{k,k}$ and will change from state k to state k+1 with probability $P_{k,k+1}$, such that $P_{k,k} + P_{k,k+1} = 1$.
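The transition structure just described can be written down and checked with a short sketch. The probabilities are those quoted for FIG. 2a; the dictionary layout, the helper names, and the use of index 4 for the exit node are illustrative assumptions.

```python
# Transition probabilities of the /ae/ triphone model quoted for FIG. 2a.
# Node 0 is the entry node; node 4 (an assumed label) stands for the exit.
transitions = {
    (0, 1): 1.00,
    (1, 1): 0.68, (1, 2): 0.32,   # beginning (B) state
    (2, 2): 0.80, (2, 3): 0.20,   # middle (M) state
    (3, 3): 0.70, (3, 4): 0.30,   # end (E) state
}


def outgoing_total(state):
    """Sum of the probabilities on all arcs leaving `state`."""
    return sum(p for (src, _dst), p in transitions.items() if src == state)


def path_probability(path):
    """Probability of following the given sequence of nodes."""
    prob = 1.0
    for src, dst in zip(path, path[1:]):
        prob *= transitions.get((src, dst), 0.0)
    return prob


# Each internal state's self-loop and forward arc should sum to 1.
row_sums = {k: outgoing_total(k) for k in (1, 2, 3)}

# Example: enter, stay one extra interval in state 1, then pass through 2 and 3.
example = path_probability([0, 1, 1, 2, 3, 4])
```

Here `example` is the product 1.00 x 0.68 x 0.32 x 0.20 x 0.30, which covers the transition probabilities only and ignores the output pdfs.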
In addition to the transition probabilities, there are a number of probability density functions associated with each state. For each triphone state, several VQ code words, each from a distinct code-book, are used to replace the input vector. Each set of three code words represents (points to) three probability density functions (pdfs). In the example of FIG. 2, a three code-book, three pdf HMM is shown. FIG. 2b shows the cepstrum pdfs for each of the three states; FIG. 2c shows the differenced cepstrum coefficient pdfs; and FIG. 2d shows the power (combined power and differenced power) pdfs. Typically, these pdfs and transition probabilities are combined, under the assumption that they are independent, so that the probability of emitting multiple symbols can be computed as the product of the probabilities of producing each symbol. Thus, the probability that the Markov process is in state i, having generated output $y_t^c$ from code-book c, is given by

$$\alpha_i(t) = \sum_j \alpha_j(t-1)\, a_{ji} \prod_c b_{ji}(y_t^c)$$

where $a_{ji}$ is the transition probability from state j to i; $b_{ji}(y_t^c)$ is a value taken from pdf $b_{ji}$ representing the probability of emitting symbol $y_t^c$ at time t from code-book c when taking the transition from state j to i; and $\alpha_j(t-1)$ is the probability that the HMM is in state j at time t-1.
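A minimal sketch of one time step of this computation follows. It assumes the standard forward recursion: for each state i, sum over predecessor states j the previous probability of being in j, times the transition probability from j to i, times the product over code-books of the output probabilities. The function name `forward_step`, the nested-list layout, and the toy numbers are hypothetical.

```python
def forward_step(alpha_prev, a, b, symbols):
    """One time step of the forward recursion.

    alpha_prev[j]  : probability of being in state j at time t-1
    a[j][i]        : transition probability from state j to state i
    b[c][j][i][y]  : probability of emitting symbol y from code-book c
                     when taking the transition from state j to state i
    symbols[c]     : symbol observed at time t from code-book c
    """
    n = len(alpha_prev)
    alpha = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(n):
            if a[j][i] == 0.0:
                continue  # no arc from j to i
            # Independence assumption: multiply the per-code-book probabilities.
            emit = 1.0
            for c, y in enumerate(symbols):
                emit *= b[c][j][i][y]
            total += alpha_prev[j] * a[j][i] * emit
        alpha[i] = total
    return alpha


# Toy example: 2 states, 2 code-books, 2 symbols per code-book.
a = [[0.5, 0.5],
     [0.0, 1.0]]
# For brevity every transition shares the same pdf within a code-book.
pdfs = ([0.25, 0.75], [0.5, 0.5])
b = [[[pdf for _i in range(2)] for _j in range(2)] for pdf in pdfs]

alpha_t = forward_step([1.0, 0.0], a, b, symbols=[1, 0])
```

In a full system each (model, state, code-book) combination has its own pdf, which is what drives the storage requirement.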
The storage of the pdfs places a large memory requirement on the speech recognition system. An HMM speech recognition system having 1024 models, 3 states per model (B, M, E), 3 code-books per state, 256 values per pdf, and 4 bytes per value requires a memory capacity of approximately $9 \times 10^6$ bytes ($1024 \times 3 \times 3 \times 256 \times 4$). FIG. 3 is a memory map of such a pdf memory.
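The arithmetic behind this estimate can be reproduced directly. The variable names below are illustrative; the figures are those stated in the text.

```python
# Parameters of the example system described in the text.
models = 1024          # triphone models
states_per_model = 3   # B, M, E
codebooks = 3          # code-books per state
values_per_pdf = 256   # entries in each pdf
bytes_per_value = 4    # storage for each entry

pdf_count = models * states_per_model * codebooks
total_bytes = pdf_count * values_per_pdf * bytes_per_value

print(f"{pdf_count} pdfs, {total_bytes} bytes ({total_bytes / 1e6:.2f} x 10^6)")
```

The exact product is 9,437,184 bytes, which the text rounds to approximately 9 x 10^6.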
The large memory capacity required to store the pdfs has placed a serious burden on the design of speech recognition systems because it is the largest single requirement for storage and because real-time performance requires that pdf storage be in main memory.
One attempt at reducing the memory burden was implemented by F. Alleva of Carnegie Mellon University (Alleva, F., Hon, H., Huang, X., Hwang, M., Rosenfeld, R., Weide, R., "Applying Sphinx-II to DARPA Wall Street Journal CSR Task," Proc. of the DARPA Speech and NL Workshop, Feb. 1992, Morgan Kaufmann Pub., San Mateo, Calif.). Alleva used scalar quantization to reduce each stored value from 4 bytes to 1 byte, resulting in a 4:1 reduction in pdf memory.
Another attempt applied VQ techniques directly to packets of pdfs. That attempt failed because it resulted in an unacceptable increase in distortion.
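The 4:1 saving from one-byte scalar quantization can be illustrated with a small sketch. The text does not describe the quantizer used in the cited work; the uniform quantization of log-probabilities shown here, the `floor` parameter, and the function names are assumptions made purely for illustration.

```python
import math


def quantize_pdf(pdf, floor=1e-10):
    """Map each probability to one byte by uniformly quantizing its logarithm.

    Assumed scheme (not taken from the cited work): log-probabilities in
    [log(floor), 0] are mapped linearly onto the integers 0..255.
    """
    lo = math.log(floor)
    return bytes(round(255 * (math.log(max(p, floor)) - lo) / (0.0 - lo))
                 for p in pdf)


def dequantize_pdf(codes, floor=1e-10):
    """Recover approximate probabilities from the one-byte codes."""
    lo = math.log(floor)
    return [math.exp(lo + (c / 255) * (0.0 - lo)) for c in codes]


# A hypothetical uniform pdf with 256 entries, stored as 4-byte values.
pdf = [1.0 / 256] * 256
codes = quantize_pdf(pdf)
restored = dequantize_pdf(codes)

original_bytes = 4 * len(pdf)
reduction = original_bytes // len(codes)   # one byte per value after quantization
```

The saving comes entirely from the narrower storage per value; the number of stored values is unchanged, at the cost of a bounded quantization error in each entry.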