FIG. 1 is a functional block diagram of the "front-end" of a voice processing system suitable for use in the encoding (sending) end of a vocoder system or as a data acquisition subsystem for a speech recognition system. (In the case of a vocoder system, a pitch extraction subsystem is also required.)
The acoustic voice signal is transformed into an electrical signal by microphone 11 and fed into an analog-to-digital converter (ADC) 13 that quantizes the data, typically at a sampling rate of 16 kHz (ADC 13 may also include an anti-aliasing filter). The quantized sampled data is applied to a single-zero pre-emphasis filter 15 for "whitening" the spectrum. The pre-emphasized signal is applied to unit 17, which produces segmented blocks of data, each block overlapping the adjacent blocks by 50%. Windowing unit 19 applies a window, commonly of the Hamming type, to each block supplied by unit 17 for the purpose of controlling spectral leakage. The output is processed by LPC unit 21, which extracts the LPC coefficients {a_k} that describe the vocal-tract formant all-pole filter represented by the z-transform transfer function

H(z) = √α / A(z)

where

A(z) = 1 + a_1 z^-1 + a_2 z^-2 + ... + a_m z^-m    (1)
√α is a gain factor and, typically, 8 ≤ m ≤ 12.
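The FIG. 1 chain up to LPC unit 21 can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the frame length, pre-emphasis constant, and the use of the autocorrelation (Levinson-Durbin) method are assumptions; the patent itself specifies only the 16 kHz rate, 50% block overlap, Hamming window, and 8 ≤ m ≤ 12.

```python
import numpy as np

def lpc_front_end(x, order=10, frame_len=320, pre_emph=0.97):
    """Sketch of the FIG. 1 chain: pre-emphasis, 50%-overlap framing,
    Hamming windowing, and LPC analysis. frame_len=320 assumes 20 ms
    frames at 16 kHz; parameter values are illustrative."""
    # Single-zero pre-emphasis filter 15: y[n] = x[n] - pre_emph * x[n-1]
    y = np.append(x[0], x[1:] - pre_emph * x[:-1])
    hop = frame_len // 2                      # unit 17: 50% block overlap
    win = np.hamming(frame_len)               # unit 19: spectral-leakage control
    frames = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * win
        # Unit 21: autocorrelation method of LPC analysis
        r = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        frames.append(levinson_durbin(r, order))
    return np.array(frames)

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a_1 z^-1 + ... + a_m z^-m."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for k in range(1, order + 1):
        acc = r[k] + np.dot(a[1:k], r[k - 1:0:-1])
        lam = -acc / err
        a[1:k + 1] += lam * a[k - 1::-1][:k]   # reflection-coefficient update
        err *= (1.0 - lam * lam)
    return a
```

For an AR(1)-like autocorrelation sequence r = (1, 0.5, 0.25), the solver returns a_1 = -0.5 and a_2 = 0, as expected.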
Cepstral processor 23 transforms the LPC coefficients {a_k} into a set of informationally equivalent cepstral coefficients by use of the following iterative relationship

c(k) = -a_k - (1/k) Σ_{i=1}^{k-1} i c(i) a_(k-i),  k = 1, 2, ...    (2)

where a_0 = 1 and a_k = 0 for k > m. The set of cepstral coefficients, {c(k)}, defines the filter in terms of the logarithm of the filter transfer function, or

ln H(z) = Σ_{k=0}^{∞} c(k) z^-k    (3)

where the k = 0 term, c(0) = ln √α, carries the gain. For further details, refer to Markel, J. D. and Gray, Jr., A. H., "Linear Prediction of Speech," Springer, Berlin/Heidelberg/New York, 1976, pp. 229-233.
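The recursion of equation (2) can be sketched directly; this is an illustrative implementation assuming a[0] = 1 as stated above:

```python
import numpy as np

def lpc_to_cepstrum(a, p):
    """Convert LPC coefficients {a_k} (with a[0] = 1) into p cepstral
    coefficients via the recursion of equation (2); a_k = 0 for k > m."""
    m = len(a) - 1
    c = np.zeros(p + 1)              # c[0] (the gain term) is not computed here
    for k in range(1, p + 1):
        ak = a[k] if k <= m else 0.0
        s = sum(i * c[i] * a[k - i] for i in range(1, k) if k - i <= m)
        c[k] = -ak - s / k
    return c[1:]
```

As a check, for the single-pole case A(z) = 1 + a_1 z^-1 the series expansion of -ln A(z) gives c(1) = -a_1, c(2) = a_1^2/2, c(3) = -a_1^3/3, which the recursion reproduces.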
The output of cepstral processor 23 is a cepstral data vector, C = [c_1 c_2 . . . c_P]^T, that is applied to VQ 20 for vector quantization of the cepstral data vector C into a VQ vector selected from a stored code-book.
The purpose of VQ 20 is to reduce the degrees of freedom that may be present in the cepstral vector C. For example, the P vector components, {c_k}, of C are typically floating-point numbers, so each may assume a very large range of values (far in excess of the quantization range at the output of ADC 13). This reduction is accomplished by using a relatively sparse code-book, represented by memory unit 27, that spans the vector space of the set of C vectors. VQ matching unit 25 compares an input cepstral vector C_i with the set of vectors {C_j} stored in unit 27 and selects the specific VQ vector C_j = [c_1 c_2 . . . c_P]_j^T that is nearest to cepstral vector C_i. Nearness is measured by a distance metric. The usual distance metric is of the quadratic form

d(C_i, C_j) = (C_i - C_j)^T W (C_i - C_j)    (4)
where W is a positive definite weighting matrix, often taken to be the identity matrix, I. Once the closest vector, C_j, of code-book 27 is found, the index, j, is sufficient to represent it. Thus, for example, if the cepstral vector C has 12 components, [c_1 c_2 . . . c_12]^T, each represented by a 32-bit floating-point number, the 384-bit C-vector is typically replaced by an index j = 1, 2, . . . , 256 requiring only 8 bits. This compression is achieved at the price of increased distortion (error), represented by the difference between the input vector C_i and the selected code-book vector C_j.
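A minimal sketch of the code-book search under the metric of equation (4); the function and variable names are illustrative:

```python
import numpy as np

def vq_encode(c, codebook, W=None):
    """Return the index of the code-book vector nearest to c under the
    quadratic metric of equation (4). With W = None the metric reduces
    to the unweighted Euclidean form (W = I)."""
    best_j, best_d = -1, np.inf
    for j, cj in enumerate(codebook):
        e = c - cj
        d = e @ W @ e if W is not None else e @ e
        if d < best_d:
            best_j, best_d = j, d
    return best_j

# A 12-component cepstral vector is replaced by an 8-bit index into 256 entries.
codebook = np.random.randn(256, 12)
c = np.random.randn(12)
j = vq_encode(c, codebook)
```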
Obviously, generation of the entries in code-book 27 is critical to the performance of VQ 20. One widely used method, known as the LBG algorithm, has been described by Linde, Y., Buzo, A., and Gray, R. M., "An Algorithm for Vector Quantization," IEEE Trans. Commun., COM-28, No. 1 (January 1980), pp. 84-95. It is an iterative procedure that requires an initial training sequence and an initial set of VQ code-book vectors.
FIG. 2 is a flow diagram of the basic LBG algorithm. The process begins in step 90 with an initial set of code-book vectors, {C_j}_0, and a set of training vectors, {C_ti}. The components of these vectors represent their coordinates in the multidimensional vector space. In encode step 92, each training vector is compared with the initial set of code-book vectors. Step 94 measures an overall error based on the distance between the coordinates of each training vector and the code-book vector to which it was assigned in step 92. Test step 96 checks whether the overall error is within acceptable limits and, if so, ends the process. If not, the process moves to step 98, where a new set of code-book vectors, {C_j}_k, is generated corresponding to the centroids of the coordinates of each subset of training vectors previously assigned in step 92 to a specific code-book vector. The process then returns to step 92 for another iteration.
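A minimal sketch of this loop; the relative-improvement convergence test and the handling of empty clusters are assumptions, since FIG. 2 does not specify them:

```python
import numpy as np

def lbg(training, codebook, eps=1e-3, max_iter=50):
    """Generalized Lloyd (LBG) iteration per FIG. 2: assign each training
    vector to its nearest code vector (step 92), measure overall distortion
    (step 94), test for convergence (step 96), and otherwise replace each
    code vector with the centroid of its assigned cluster (step 98)."""
    codebook = codebook.copy()
    d_old = np.inf
    for _ in range(max_iter):
        # Step 92: nearest-neighbor assignment (squared Euclidean distance)
        dists = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Step 94: overall (average) distortion
        d = dists[np.arange(len(training)), assign].mean()
        # Step 96: stop when the relative improvement is small
        if (d_old - d) / max(d, 1e-12) < eps:
            break
        d_old = d
        # Step 98: recompute centroids (empty clusters keep their old vector)
        for j in range(len(codebook)):
            members = training[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook
```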
FIG. 3 is a flow diagram of a binary-tree variation on the LBG training algorithm in which the size of the initial code-book is progressively doubled until the desired code-book size is attained, as described by Rabiner, L., Sondhi, M., and Levinson, S., "Note on the Properties of a Vector Quantizer for LPC Coefficients," BSTJ, Vol. 62, No. 8, October 1983, pp. 2603-2615. The process begins at step 100 and proceeds to step 102, where two (M=2) candidate code vectors (centroids) are established. In step 104, each vector of the training set {T} is assigned to the closest candidate code vector, and then the average error (distortion, d(M)) is computed in step 106 using the candidate vectors and the assumed assignment of the training vectors into M clusters. Step 108 compares the normalized difference between the computed average distortion, d(M), and the previously computed average distortion, d_old. If the normalized absolute difference does not exceed a preset threshold, ε, d_old is set equal to d(M), a new set of candidate centroids is computed in step 112, and a new iteration through steps 104, 106, and 108 is performed. If the threshold is exceeded, indicating a significant increase in distortion or divergence over the prior iteration, the centroids last computed in step 112 are stored, and if the value of M is less than the maximum preset value M*, test step 114 advances the process to step 116, where M is doubled. Step 118 splits the existing centroids last computed in step 112 and then proceeds to step 104 for a new set of inner-loop iterations. If the required number of centroids (code-book vectors) is equal to M*, step 114 causes the process to terminate.
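The binary-tree procedure of FIG. 3 can be sketched as follows. The perturbation factor used to split centroids and the convergence test (which here follows the usual LBG convention of iterating until the change in distortion is small) are assumptions, not details taken from the figure:

```python
import numpy as np

def binary_split_lbg(training, M_star=4, perturb=0.01, eps=1e-3):
    """Binary-tree LBG per FIG. 3: start from the global centroid, refine
    with Lloyd iterations until the relative change in average distortion
    d(M) is below eps, then split each centroid into a perturbed pair,
    doubling M (steps 116/118) until M reaches M_star (step 114)."""
    centroids = training.mean(axis=0, keepdims=True)
    while True:
        # Inner loop (steps 104-112): refine the current M centroids
        d_old = np.inf
        while True:
            dists = ((training[:, None] - centroids[None]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            d = dists[np.arange(len(training)), assign].mean()
            if abs(d_old - d) / max(d, 1e-12) <= eps:
                break
            d_old = d
            for j in range(len(centroids)):
                members = training[assign == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        if len(centroids) >= M_star:       # step 114: required size reached
            return centroids
        # Steps 116/118: double M by splitting each centroid in two
        centroids = np.concatenate([centroids * (1 + perturb),
                                    centroids * (1 - perturb)])
```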
The present invention may be practiced with other VQ code-book generating (training) methods based on distance metrics. For example, Bahl et al. describe a "supervised VQ" wherein the code-book vectors (centroids) are chosen to best correspond to phonetic labels (Bahl, L. R., et al., "Large Vocabulary Natural Language Continuous Speech Recognition," Proceedings of the IEEE ICASSP 1989, Glasgow). Also, the k-means method, or a variant thereof, may be used, in which an initial set of centroids is selected from widely spaced vectors of the training sequence (Gray, R. M., "Vector Quantization," IEEE ASSP Magazine, April 1984, Vol. 1, No. 2, p. 10).
Once a "training" procedure such as outlined above has been used to generate a VQ code-book, it may be used for the encoding of data.
For example, in a speech recognition system such as the SPHINX, described in Lee, K., "Automatic Speech Recognition: The Development of the SPHINX System," Kluwer Academic Publishers, Boston/Dordrecht/London, 1989, the VQ code-book contains 256 vector entries, and each cepstral vector has 12 component elements.
The vector code to be assigned by VQ 20 is properly determined by measuring the distance between each code-book vector, C_j, and the candidate vector, C_i. The distance metric used is the unweighted (W = I) Euclidean quadratic form

d(C_i, C_j) = (C_i - C_j)^T · (C_i - C_j)    (5)
which may be expanded as follows:

d(C_i, C_j) = C_i^T · C_i + C_j^T · C_j - 2 C_j^T · C_i    (6)
If the two vector sets, {C_i} and {C_j}, are normalized so that C_i^T · C_i and C_j^T · C_j are fixed values for all i and j, the distance is minimum when C_j^T · C_i is maximum. Thus, the essential computation for finding the code-book vector C_j that minimizes d(C_i, C_j) is finding the value of j that maximizes

C_j^T · C_i = Σ_{k=1}^{P} c_k^(j) c_k^(i)    (7)
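Under that normalization, the full distance computation of equation (5) collapses to a single inner-product search per equation (7); a minimal sketch (the function name is illustrative):

```python
import numpy as np

def vq_encode_dot(c, codebook):
    """When all vectors are normalized to constant energy, minimizing the
    Euclidean distance of equation (5) is equivalent to maximizing the
    inner product of equation (7), so the search reduces to one
    matrix-vector product and an argmax."""
    return int(np.argmax(codebook @ c))
```

For unit-norm vectors, d = 2 - 2 C_j^T · C_i, so the index found this way agrees with the index found by the full distance search.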
The ability of speech encoding and recognition systems to function reliably is affected by the training data available as well as by the environment in which the speech data is acquired. The quality of training data may be improved by increasing the variety of speakers and the quantity of data used. However, variations in the acoustic environment, which includes the acoustical properties of the rooms in which the speech sound is generated, the microphone and signal-conditioning equipment, and the placement of the speaker, will in general affect the performance of the speech recognition apparatus. Also, the presence of noise within the acoustical environment, such as that created by typewriters, fans, and extraneous voice sources, will contribute to the unreliability of speech recognition.
Using increasing amounts of speech recognition training data representative of each possible combination of acoustic-environment variations and noise should improve speech recognition. As a practical matter, however, the ability to predict all of the combinations of environmental and noise characteristics that may be encountered is limited, and the number of possible combinations is so large that it is desirable to find an adaptive, robust means for adjusting the recognition process as the actual noise and acoustical environment is encountered.
It has been demonstrated that the two major factors degrading the performance of speech recognition systems using desktop microphones in normal office environments are noise and unknown filtering (Liu et al., "Efficient Joint Compensation of Speech for the Effects of Additive Noise and Linear Filtering," IEEE ICASSP-92, Mar. 23-26, 1992, San Francisco, Calif., Vol. 1, pp. I-257-I-260). It has also been shown that simultaneous joint compensation for the effects of additive noise and linear filtering is needed to achieve maximal robustness with respect to these acoustical signal differences between training and testing environments (Acero et al., "Environmental Robustness in Automatic Speech Recognition," IEEE ICASSP-90, April 1990, pp. 849-852). This precludes the cascading of separate processes for dealing with additive noise and convolutional distortion.
The conventional (prior art) method for dealing with noise uses a spectral subtraction technique in which the noise spectrum is estimated during the "silence" intervals between speech segments and is subtracted from the spectrum of the noisy speech. These methods generally lead to problems in estimating the speech spectrum because they can introduce negative spectral values (Hermansky et al., "Recognition of Speech in Additive and Convolutional Noise Based on RASTA Spectral Processing," IEEE ICASSP-93, April 1993, Minneapolis, Minn., Vol. II, pp. II-83-II-86). Ad hoc procedures are required to eliminate the negative coefficients.
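A sketch of the spectral-subtraction approach and the ad hoc flooring it requires; the flooring fraction is an illustrative choice, not a value from the cited work:

```python
import numpy as np

def spectral_subtract(noisy_power, noise_power, floor=0.01):
    """Classical spectral subtraction: subtract a noise power spectrum
    estimated during 'silence' from the power spectrum of the noisy
    speech. The subtraction can go negative, which is why an ad hoc
    flooring step is required."""
    clean = noisy_power - noise_power
    # Ad hoc repair: clamp negative values to a small fraction of the noise
    return np.maximum(clean, floor * noise_power)
```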
Hermansky et al. further argue that the correction can be done in the spectral domain by filtering out very low frequency components, which are due to relatively stationary features not attributable to speech, and by filtering out non-speech high-frequency components, which reflect activity occurring faster than humans can manipulate their speech articulators. This technique may be suited to stationary linear convolutional noise and stationary additive-noise effects, providing a coarse correction that is the same regardless of the spoken utterance, but it does not provide for adaptation to a non-stationary acoustic and noise environment.
An approach to cepstral correction based on classifying (grouping) the speech frames by their respective signal-to-noise ratio for both training and testing data is described by Liu et al. (op. cit.). Once the speech frames are grouped, the mean cepstral vector for each group is computed and a histogram of the number of frames per group is constructed for both the training set and the testing set. The histograms are then aligned using dynamic time-warping techniques in order to determine a correspondence between the groups of training data and the test data. The corrections are found by subtracting the mean vector of the test data from the mean vector of the training data for each corresponding group.
New data is corrected by determining the signal-to-noise ratio of each new frame of speech and applying the previously computed correction that corresponds to the determined signal-to-noise ratio. This technique requires the accumulation of a large amount of test data because all of the signal-to-noise-ratio groups must be determined together. This requirement results in slow adaptation. Also, if one or more signal-to-noise-ratio groups have few exemplars, the dynamic time-warping alignment and the subsequently determined corrections may be inadequately computed.
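A simplified sketch of this SNR-grouped correction scheme; it omits the dynamic time-warping histogram alignment and uses fixed SNR bin edges, both simplifications of the Liu et al. procedure, and all names here are illustrative:

```python
import numpy as np

def snr_group_corrections(train_frames, train_snr, test_frames, test_snr, edges):
    """Group frames into SNR bins and compute, for each bin, the correction
    as the difference between the mean training cepstral vector and the
    mean testing cepstral vector of that bin."""
    corrections = {}
    bins_train = np.digitize(train_snr, edges)
    bins_test = np.digitize(test_snr, edges)
    for b in np.unique(bins_test):
        tr = train_frames[bins_train == b]
        te = test_frames[bins_test == b]
        if len(tr) and len(te):
            corrections[b] = tr.mean(axis=0) - te.mean(axis=0)
    return corrections

def apply_correction(frame, snr, corrections, edges):
    """Correct a new frame by adding the correction for its SNR bin."""
    b = np.digitize([snr], edges)[0]
    return frame + corrections.get(b, 0.0)
```

Because every bin's correction depends on having enough frames in that bin, the slow-adaptation and sparse-group problems described above fall directly out of this structure.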
The present invention uses test data more efficiently and increases the rate of adaptation by employing a multistage correction process in which an initial coarse correction is applied and then refined. Both the coarse and the refined corrections adapt to and model the actual current acoustic and noise conditions.