The present invention is related to the field of efficient numerical encoding of physical data for use in an automatic recognition system. A particular application is the field of speech encoding for storage or transmission and for recognition. The invention addresses problems of efficient numerical encoding of physically derived data and efficient computation of likelihood scores during automatic recognition.
A high level of detailed technical and mathematical skill is common among practitioners in the art. This application presumes familiarity with known techniques of speech recognition and related techniques of numerically encoding physical data, including physical waveform data. This application briefly reviews some basic types of prior art encoding and recognition schemes in order to make the description of the invention understandable. This review should not be seen as comprehensive, and the reader is referred to the references cited herein as well as to other prior art documents. This review also should not be seen as limiting the invention to the particular examples and techniques described herein, and in no case should the invention be limited except as described in the attached claims and all allowable equivalents.
Two earlier co-assigned U.S. applications, 08/276,742, now U.S. Pat. No. 5,825,978, issued Oct. 20, 1998, entitled METHOD AND APPARATUS FOR SPEECH RECOGNITION USING OPTIMIZED PARTIAL MIXTURE TYING (287-41), and 08/375,908, now U.S. Pat. No. 5,864,810, issued Jan. 26, 1999, entitled METHOD AND APPARATUS FOR ADAPTING A SPEECH RECOGNIZER TO A PARTICULAR SPEAKER (287-40), discuss techniques useful in speech encoding and recognition and are fully incorporated herein by reference.
For purposes of clarity, this discussion refers to devices, concepts, and methods in terms of specific examples. However, the method and apparatus of the present invention may operate with a wide variety of types of digital devices including devices different from the specific examples described below. It is therefore not intended that the invention be limited except as provided in the attached claims.
Basics of Encoding and Recognition
HMM-Based Signal Recognition
Vector Quantized Speech Recognition
FIGS. 1A and 1B illustrate a basic process for encoding physical data, such as speech waveform data, into numerical values and then performing vector quantization (VQ) on those values. A physical signal 2 is sampled at some interval. For speech data, the interval is generally defined by a unit of time t, which in an example system is 10 milliseconds (ms). A signal processor 5 receives the physical data and generates a set of numerical values representing that data. In some known speech recognition systems, a cepstral analysis is performed and an observed vector Xt (10) consisting of a set of cepstral values ($C_1$ to $C_{13}$) is generated for each interval of time t. In one system, each of the 13 values is a real number and may be represented in a digital computer as 32 bits. Thus, in this specific example, each 10 ms interval of speech (sometimes referred to as a frame) is encoded as an observed vector X of thirteen 32-bit values, or 416 bits of data. Other types of signal processing are possible, such as, for example, where the measured interval does not represent time, where the measurement of the interval is different, where more or fewer values or different values are used to represent the physical data, where cepstral coefficients are not used, where cepstral values and their first and/or second derivatives are also encoded, or where different numbers of bits are used to encode values. In speech encoded for audio playback, rather than recognition, different coefficients are typically used.
In some systems, the Xt vectors may be used directly to transmit or store the physical data, or to perform recognition or other types of processing. In the system just described, transmission would require 416*100 bits per second (bps), or a continuous transmission rate of 41.6 kbps. Recognition systems based on original full cepstral vectors often employ Continuous Density Hidden Markov Models (CDHMMs), possibly with the probability functions of each model approximated by mixtures of Gaussians. The 08/276,742 application, incorporated above, discusses a method for sharing mixtures in such a system to enhance performance.
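The bit-rate arithmetic in this example can be checked with a short sketch. The function and constant names below are illustrative only; the numbers (13 values, 32 bits each, one frame per 10 ms) are the ones given in the text.

```python
# Example parameters taken from the text; the names are illustrative.
NUM_COEFFS = 13        # cepstral values per frame
BITS_PER_COEFF = 32    # each value stored as a 32-bit number
FRAME_MS = 10          # one frame every 10 ms


def frame_bits(num_coeffs=NUM_COEFFS, bits_per_coeff=BITS_PER_COEFF):
    """Bits needed to store one unquantized observation vector."""
    return num_coeffs * bits_per_coeff


def bit_rate_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Bits per second when one frame is sent every frame_ms milliseconds."""
    return bits_per_frame * 1000 / frame_ms


bits = frame_bits()                 # 13 * 32 = 416 bits per frame
rate = bit_rate_bps(bits)           # 416 * 100 = 41600 bps
print(bits, "bits per frame,", rate / 1000, "kbps")
```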
However, often it is desirable to perform further encoding of the vectors in order to reduce the number of bits needed to represent the vectors and in order to simplify further processing. One known method for doing this is called vector quantization, a type of which is shown in FIG. 1B.
Vector quantization (VQ) takes advantage of the fact that in most physical systems of interest, the values ($C_1$ to $C_{13}$) that make up a particular vector Xt are not independent but instead have a relationship one to another, such that the individual value of $C_3$, for example, will have some non-random correlation to other values in that vector.
VQ also takes advantage of the fact that in most physical systems of interest, not all possible vectors will be observed. When encoding human speech in a particular language for example, many ranges of vector values (representing sounds that are not part of human speech) will never be observed, while other ranges of vectors will be common. Such relationships can be understood geometrically by imagining a continuous 13-dimensional space, which, though hard to visualize, shares many properties with real 3-dimensional space. In this continuous 13-dimensional space, every possible vector X will represent a point in the space. If one were to measure a large number of X vectors for a physical system of interest, such as human speech in a particular language, and plot a point for each measured X, the points plotted in space would not be evenly or randomly distributed, but would instead form distinct clusters. Areas of space that represented common sounds in human speech would have many points while areas of space that represented sounds that were never part of human speech would have no points.
In standard VQ, an analogous procedure is used to plot clusters and use those clusters to divide the space into a finite number of volumes. In the 13-dimensional example described above, a sample of human speech data is gathered, processed, and plotted in the 13-dimensional space and 13-dimensional volumes are drawn around dense clusters of points. The size and shape of a particular volume may be determined by the density of points in a particular region. In many systems, a predetermined number of volumes, such as 256, are drawn in the space in such a way as to completely fill the space. Each volume is assigned an index number (also referred to as a codeword) and a "central" point (or centroid) is computed for each volume, either geometrically from the volume or taking into account the actual points plotted and finding a central point. The codewords, the descriptions of the volumes to which they relate, and the centroids to which they are mapped, are sometimes referred to in the art as a codebook. Some systems use multiple codebooks, using a separate codebook for each feature that is quantized. Some systems also use different codebooks for different speakers or groups of speakers, for example using one codebook or set of codebooks for male speakers and another for female speakers.
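One common way to place centroids around dense clusters is an iterative nearest-centroid procedure (Lloyd's algorithm, often called k-means). The text does not say which training procedure is used, so the following is only a minimal sketch under that assumption, on toy 2-dimensional data rather than 13-dimensional cepstral vectors. The function names and the sample points are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def train_codebook(points, initial_centroids, iterations=10):
    """Lloyd-style training (an assumed technique, not named in the text)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assign every training point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of the points assigned to it.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                dim = len(old)
                updated.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                updated.append(old)  # keep a centroid that attracted no points
        centroids = updated
    return centroids


# Made-up training data forming two clusters.
training_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
codebook = train_codebook(training_points, initial_centroids=[(0.0, 0.0), (1.0, 1.0)])
print(codebook)
```

In this sketch the list position of each centroid plays the role of the codeword, and the implied volumes are the regions of space closer to that centroid than to any other.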
Once the volumes are determined from training data, new speech data may be encoded by mathematically plotting the 13-value vector in the 13-dimensional space, determining which volume the point falls in (or which centroid the point is closest to), and storing for that point the VQ index value (in one example, simply an 8-bit number from 0 to 255) for that volume. Thus Xt is encoded as VQt. When it is time to decode the data, the 8-bit VQ index is used to look up the centroid for that volume, and the (416-bit) value of the centroid can be used as an approximation of the actual observed vector Xt. First and second derivatives can be computed from these decoded centroids, or those values can initially be encoded and stored similarly to the centroids, possibly using separate codebooks.
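Encoding by nearest centroid and decoding by table look-up can be sketched as follows. This is a toy illustration with a hypothetical 4-entry, 2-dimensional codebook; the example in the text would use 256 centroids of 13 values each, so that every index fits in 8 bits.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Return the index (codeword) of the centroid nearest to vector."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def vq_decode(index, codebook):
    """Return the centroid stored for index, used as an approximation of the original vector."""
    return codebook[index]


# Hypothetical toy codebook: 4 centroids in 2 dimensions.
toy_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

observed = (0.9, 0.2)
index = vq_encode(observed, toy_codebook)
approximation = vq_decode(index, toy_codebook)
print(index, approximation)
```

Only the index needs to be stored or transmitted; the decoder recovers an approximation of the vector, and the difference between `observed` and `approximation` is the quantization error.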
After the physical data is encoded, it may be presented to an automatic recognition system, such as a speech recognition system. State-of-the-art speech recognizers are based on statistical techniques, with Hidden Markov Models (HMMs) being the dominant approach. The typical components of a speech recognition and understanding system are the front-end processor, the decoder with its acoustic and language models, and the language understanding component.
The front-end processor typically performs a short-time Fourier analysis and extracts a sequence of observation (or acoustic) vectors. Many choices exist for the acoustic vectors, but the cepstral coefficients have exhibited the best performance to date. The decoder is based on a communication theory view of the recognition problem, trying to extract the most likely sequence of words $W = [w_1, w_2, \ldots, w_N]$ given the series of acoustic vectors X. This can be done using Bayes' rule:

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(W)\,P(X \mid W)}{P(X)}$$
The probability $P(W)$ of the word sequence W is obtained from the language model, whereas the acoustic model determines the probability $P(X \mid W)$.
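Because P(X) does not depend on W, a decoder can rank candidate word sequences by the product of the language-model and acoustic-model probabilities alone. The sketch below illustrates that ranking in the log domain. The candidate word sequences and all probability values are made up for illustration, and the dictionary layout is hypothetical.

```python
import math

# Made-up scores for two hypothetical candidate word sequences.
# "lm" stands for P(W) from the language model,
# "ac" stands for P(X | W) from the acoustic model.
candidates = {
    "recognize speech": {"lm": 1.0e-2, "ac": 2.0e-8},
    "wreck a nice beach": {"lm": 1.0e-4, "ac": 5.0e-8},
}


def log_score(entry):
    """log P(W) + log P(X | W); the constant term log P(X) is omitted."""
    return math.log(entry["lm"]) + math.log(entry["ac"])


def most_likely(cands):
    """Candidate word sequence with the highest combined score."""
    return max(cands, key=lambda w: log_score(cands[w]))


print(most_likely(candidates))
```

Here the second candidate has the higher acoustic score, but the language model favors the first strongly enough to change the outcome.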
In HMM-based recognizers, the probability of an observation sequence for a given word is obtained by building a finite-state model, possibly by concatenating models of the elementary speech sounds or phones. The state sequence $S = [s_1, s_2, \ldots, s_T]$ is modeled as a Markov chain, and is not observed. At each state $s_t$ and time t, an acoustic vector is observed based on the distribution $b_{s_t} = P(X_t \mid s_t)$, which is called the output distribution.
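Since the state sequence is hidden, the probability of an observation sequence is obtained by summing over all possible state sequences. A standard way to do this is the forward recursion, which the text does not name; the following is a minimal sketch of it on a hypothetical 2-state model with made-up parameters. It assumes a non-empty observation sequence and uses discrete symbols only to keep the toy output distribution small.

```python
def forward_probability(observations, initial, transition, output_prob):
    """P(observations | model), summed over all hidden state sequences.

    initial[s]        : probability of starting in state s
    transition[r][s]  : probability of moving from state r to state s
    output_prob(s, x) : output distribution of state s evaluated at x
    """
    num_states = len(initial)
    # alpha[s] holds P(x_1 .. x_t, s_t = s) for the current time t.
    alpha = [initial[s] * output_prob(s, observations[0]) for s in range(num_states)]
    for x in observations[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(num_states)) * output_prob(s, x)
            for s in range(num_states)
        ]
    return sum(alpha)


# Hypothetical left-to-right model with two states and made-up parameters.
initial = [1.0, 0.0]
transition = [[0.5, 0.5],
              [0.0, 1.0]]
outputs = [{"a": 0.9, "b": 0.1},   # state 0
           {"a": 0.2, "b": 0.8}]   # state 1

prob = forward_probability(["a", "b"], initial, transition,
                           lambda s, x: outputs[s][x])
print(prob)
```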
If the front-end processor quantizes the acoustic vectors as described above, the output distributions take the form of discrete probability distributions. If the acoustic vector generated is instead passed to the acoustic model before quantization, then continuous-density output distributions are used, with the multivariate-mixture Gaussians of the following form a common choice:

$$b_s(x) = \sum_{i} p(\omega_i \mid s)\, N(x; \mu_i, \Sigma_i)$$

where $p(\omega_i \mid s)$ is the weight of the i-th mixture component in state s, and $N(x; \mu, \Sigma)$ is the multivariate Gaussian with mean $\mu$ and covariance $\Sigma$. In work prior to the present invention, continuous-density HMMs (CDHMMs) with mixture components that are shared across HMM states were used because continuous density HMMs were generally believed to exhibit superior recognition performance over their discrete-density counterparts.
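A weighted sum of Gaussians of this kind can be evaluated as in the sketch below. It assumes diagonal covariance matrices, a simplification that the text does not state, and uses a hypothetical 2-component, 2-dimensional mixture with made-up parameters.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (an assumed simplification)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def mixture_output_prob(x, weights, means, variances):
    """Sum over components of weight_i * N(x; mean_i, var_i)."""
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Made-up mixture for one state: two components in two dimensions.
weights = [0.6, 0.4]                  # mixture weights, summing to 1
means = [(0.0, 0.0), (2.0, 2.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]  # diagonal entries of each covariance

density = mixture_output_prob((0.0, 0.0), weights, means, variances)
print(density)
```

Every frame requires one such evaluation for each state (or each shared mixture) under consideration, which is the source of the computational load of continuous-density systems.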
In CDHMM speech recognition, even with some mixture tying, it is known that computing probabilities can be extremely computationally intensive. While such systems have been shown to perform accurately, with recognition accuracy approaching 90-95% for a recognition vocabulary of five to twenty words, the amount of processing required has generally made them unsuitable for some applications.
It is known that systems using VQ, however, can compute probabilities much more quickly, by using discrete HMMs. In a VQ system, a probability for each model ($P_{\mathrm{HMM}_n}$) can be computed during training for each centroid and stored in a table indexed by the VQ index. Determining the probabilities for a particular observed vector of speech is then reduced to determining the VQ index for that vector and looking up the probabilities in a table. While discrete HMM systems have been shown to perform very quickly, their error rate is generally two times higher than that of continuous-density HMMs, and this high error rate is not acceptable in many applications.
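The table look-up can be sketched as follows. The table layout, its size (2 states, 4 codewords), and all values are hypothetical; the example in the text would have 256 columns. Only the output-probability part of a score is shown, and transition probabilities are omitted.

```python
import math

# Hypothetical precomputed table: output_table[state][vq_index] is the
# probability of observing that VQ index in that state (made-up values).
output_table = [
    [0.70, 0.10, 0.10, 0.10],   # state 0
    [0.05, 0.05, 0.60, 0.30],   # state 1
]


def output_prob(state, vq_index):
    """Output probability obtained by a single table look-up."""
    return output_table[state][vq_index]


def path_log_score(states, vq_indices):
    """Sum of log output probabilities along a given state path."""
    return sum(math.log(output_prob(s, v)) for s, v in zip(states, vq_indices))


# Three frames, already quantized to VQ indices, scored along one state path.
score = path_log_score(states=[0, 0, 1], vq_indices=[0, 1, 2])
print(score)
```

No density has to be evaluated at recognition time, which is why this approach is fast.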
The degradation in accuracy of the discrete-density HMMs can be attributed to the low resolution with which the space of observation features (the acoustic space) is represented. A typical discrete-density HMM uses a VQ codebook with 256 codewords to represent a 13-dimensional space. Increasing the codebook size is not a feasible solution: the computation and memory requirements of the vector quantizer are proportional either to the number of codewords, if a linear vector quantizer is used, or to their logarithm (i.e. the number of bits), when a tree-structured vector quantizer is used. Most significant, however, is the cost of storing the look-up tables with the precomputed probabilities. The number of parameters for a discrete-density HMM is proportional to the number of codewords in the quantizer. For medium to large vocabulary applications, there are millions of parameters in discrete-density HMMs, and hence increasing the codebook size is not a feasible solution.
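The scaling argument can be made concrete with a small sketch. The number of HMM states below is a hypothetical figure chosen only for illustration, and the tree search cost is taken as the depth of a binary tree.

```python
import math

NUM_STATES = 10_000   # hypothetical number of HMM states, for illustration only


def linear_search_cost(num_codewords):
    """Distance computations per frame for a linear vector quantizer."""
    return num_codewords


def tree_search_cost(num_codewords):
    """Levels visited per frame in a binary tree-structured quantizer."""
    return math.ceil(math.log2(num_codewords))


def table_entries(num_states, num_codewords):
    """Precomputed output probabilities: one per state and codeword."""
    return num_states * num_codewords


for index_bits in (8, 12, 16):
    n = 2 ** index_bits
    print(index_bits, "bits:",
          linear_search_cost(n), "linear steps,",
          tree_search_cost(n), "tree levels,",
          table_entries(NUM_STATES, n), "table entries")
```

Under these assumptions, a 256-codeword codebook already needs about 2.6 million table entries, and each added index bit doubles that figure.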
One particular need for efficient, accurate speech recognition has arisen in the field of client/server recognition applications over a network such as the Internet (or WWW).
What is needed is a new type of encoding and modeling system for physical data such as speech that allows for efficient transmission of observed features and improved accuracy of recognition.
There is a voluminous scientific literature related to speech recognition, some of which is referenced in the previously cited co-assigned patents. Literature more directly related to aspects of the invention is listed below. The listing of a reference below is not to be construed as a statement by applicants that the reference constitutes prior art for the purposes of evaluating the patentability of the present invention.
[1] D. Goddeau, W. Goldenthal and C. Weikart, "Deploying Speech Applications over the Web," Proceedings Eurospeech, pp. 685-688, Rhodes, Greece, September 1997.
[2] L. Julia, A. Cheyer, L. Neumeyer, J. Dowding and M. Charafeddine, "http://www.speech.sri.com/demos/atis.html," Proceedings AAAI'97, Stanford, Calif., March 1997.
[3] E. Hurley, J. Polifroni and J. Glass, "Telephone Data Collection Using the World Wide Web," Proceedings ICSLP, pp. 1898-1901, Philadelphia, Pa., October 1996.
[4] S. Bayer, "Embedding Speech in Web Interfaces," Proceedings ICSLP, pp. 1684-1687, Philadelphia, Pa., October 1996.
[5] M. Sokolov, "Speaker Verification on the World Wide Web," Proceedings Eurospeech, pp. 847-850, Rhodes, Greece, September 1997.
[6] C. Hemphill and Y. Muthusamy, "Developing Web-Based Speech Applications," Proceedings Eurospeech, Rhodes, Greece, September 1997.
[7] The Aurora Project, announced at Telecom 95, "http://gold.ity.int/TELECOM/wt95", Geneva, October 1995. See also "http://fipa.comtec.cojp/fipa/yorktown/nyws029.htm".
[8] D. Stallard, "The BBN SPIN System," presented at the Voice on the Net Conference, Boston, Mass., September 1997.
[9] S. J. Young, "A Review of Large-Vocabulary Continuous-Speech Recognition," IEEE Signal Processing Magazine, pp. 45-57, September 1996.
[10] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoustics Speech and Signal Processing, Vol. ASSP-28(4), pp. 357-366, August 1980.
[11] V. Digalakis and H. Murveit, "Genones: Optimizing the Degree of Mixture Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer," IEEE Trans. Speech Audio Processing, pp. 281-289, July 1996.
[12] A. Gersho and R. M. Gray, "Vector Quantization and Signal Compression," Kluwer Academic Publishers, 1991.
[13] J. Makhoul, S. Roucos and H. Gish, "Vector Quantization in Speech Coding," Proceedings of the IEEE, Vol. 73, No. 11, pp. 1551-1588, November 1985.
[14] P. Price, "Evaluation of spoken language systems: The ATIS domain," Proceedings of the Third DARPA Speech and Natural Language Workshop, Hidden Valley, Pa., June 1990, Morgan Kaufmann.
[15] V. Digalakis, L. Neumeyer and M. Perakakis, "Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web," ICASSP'98.
[16] V. Digalakis, L. Neumeyer and M. Perakakis, "Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web," submitted to Journal of Selected Areas in Communications.
In references [15] and [16] listed above, some of the present inventors discussed and evaluated various coding techniques for transmitting and recognizing speech in a client-server speech recognition application over the World Wide Web (WWW). The approaches evaluated included linear and non-linear scalar quantization algorithms and a more advanced algorithm, based on product codes, that comprises a part of the present invention.