In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in making these decisions correctly lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. This method is commonly called discriminant analysis. Such a method is illustrated in D. P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis," Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As described in that article, a frame of speech is declared voiced if a weighted sum of classifiers is greater than a specified threshold, and unvoiced otherwise. The weights and threshold are chosen to maximize performance on a training set of speech where the voicing of each frame is known.
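By way of illustration only, the fixed weighted sum decision described above can be sketched as follows. The function name, the argument names, and the numeric values in the usage note are assumptions made for this sketch and are not taken from the cited article; in practice the weights and threshold would come from training on speech whose voicing is known.

```python
from typing import Sequence


def is_voiced(classifiers: Sequence[float],
              weights: Sequence[float],
              threshold: float) -> bool:
    """Declare a frame voiced if the weighted sum of its classifiers
    is greater than the threshold, and unvoiced otherwise.

    All names here are hypothetical; the weights and threshold are
    assumed to have been fixed beforehand on a training set.
    """
    if len(classifiers) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    # Weighted sum of the per-frame classifier values.
    score = sum(w * c for w, c in zip(weights, classifiers))
    # Strictly greater than the threshold means voiced.
    return score > threshold
```

For example, with illustrative weights (2.0, -1.0) and a threshold of 1.0, a frame with classifier values (1.0, 0.5) gives a weighted sum of 1.5 and is declared voiced, while a frame with values (0.2, 0.5) gives -0.1 and is declared unvoiced.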
A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. The reason is that the threshold is determined from the training set, which differs from speech that is subject to background noise, non-linear distortion, and filtering.
One method for adapting the threshold value to a changing speech environment is disclosed in the paper of H. Hassanein, et al., "Implementation of the Gold-Rabiner Pitch Detector in a Real Time Environment Using an Improved Voicing Detector," IEEE Transactions on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. ASSP-33, No. 1, pp. 319-320. This paper discloses an empirical method which compares three different parameters against independent thresholds associated with these parameters and, on the basis of each comparison, either increments or decrements an adaptive threshold value by one. The three parameters utilized are the energy of the signal, the first reflection coefficient, and the zero-crossing count. For example, if the energy of the speech signal is less than a predefined energy level, the adaptive threshold is incremented by one. On the other hand, if the energy of the speech signal is greater than another predefined energy level, the adaptive threshold is decremented by one. After the adaptive threshold has been calculated, it is subtracted from the output of an elementary pitch detector. If the result of the subtraction is a positive number, the speech frame is declared voiced; otherwise, the speech frame is declared unvoiced. The problem with the disclosed method is that the parameters themselves are not used in the elementary pitch detector. Hence, the adjustment of the adaptive threshold is ad hoc and is not directly linked to the physical phenomena from which it is calculated. In addition, the threshold cannot adapt to rapidly changing speech environments.
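The energy-based adjustment and the subsequent decision described above can be sketched as follows. This is a minimal sketch only: the function names, argument names, and numeric levels are assumptions made for illustration and do not come from the cited paper. Only the energy rule is shown, because the text above does not state the direction of adjustment for the first reflection coefficient or the zero-crossing count.

```python
def update_threshold(threshold: int,
                     energy: float,
                     low_energy: float,
                     high_energy: float) -> int:
    """Adjust the adaptive threshold by one step based on frame energy.

    low_energy and high_energy stand in for the two predefined energy
    levels; their values are hypothetical.
    """
    if energy < low_energy:
        # Low-energy frame: increment the adaptive threshold by one.
        return threshold + 1
    if energy > high_energy:
        # High-energy frame: decrement the adaptive threshold by one.
        return threshold - 1
    # Energy between the two levels: leave the threshold unchanged.
    return threshold


def declare_voiced(detector_output: float, threshold: int) -> bool:
    """Subtract the adaptive threshold from the output of the elementary
    pitch detector; a positive result means the frame is voiced."""
    return detector_output - threshold > 0
```

Because each comparison moves the threshold by only one step per frame, many frames are needed to follow a large change in conditions, which is consistent with the observation above that the threshold cannot adapt to rapidly changing speech environments.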