In many real-time processes, it is difficult to estimate the present state of the process in a changing environment from present and past samples of the process. One example of such a process is the generation of speech by the human vocal tract. The sound produced by the vocal tract can have a fundamental frequency (the voiced state) or no fundamental frequency (the unvoiced state). Further, a third state may exist if no sound is being produced (the silence state). The problem of determining which of these three states applies is referred to as the voicing/silence decision. In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in making these voicing decisions correctly lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. Such a method is illustrated in D. P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis," Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As described in that article, a frame of speech is declared voiced if a weighted sum of speech classifiers is greater than a specified threshold, and unvoiced otherwise. Mathematically, this relationship may be expressed as a'x+b>0, where "a" is a vector comprising the weights, "x" is a vector comprising the classifiers, and "b" is a scalar representing the threshold value. The weights are chosen to maximize performance on a training set of speech where the voicing of each frame is known. These weights form a decision rule which provides significant speech quality improvements in speech coders compared to those using a single parameter.
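The weighted-sum rule a'x+b>0 can be illustrated with a minimal sketch. The function name `is_voiced` and any numeric values used with it are hypothetical and chosen only for illustration; the cited article's actual classifiers and trained weights are not reproduced here.

```python
from typing import Sequence


def is_voiced(a: Sequence[float], x: Sequence[float], b: float) -> bool:
    """Fixed weighted-sum voicing decision for one frame.

    a: weight vector (assumed to have been trained offline)
    x: classifier values measured on the current frame
    b: scalar threshold term
    Returns True (voiced) when a'x + b > 0, otherwise False (unvoiced).
    """
    if len(a) != len(x):
        raise ValueError("weights and classifiers must have the same length")
    # Inner product a'x computed as a plain sum of element-wise products.
    weighted_sum = sum(w * c for w, c in zip(a, x))
    return weighted_sum + b > 0
```

With the illustrative inputs a = [0.5, -0.25], x = [2.0, 1.0], and b = -0.5, the sum a'x+b is 0.25, so the frame would be declared voiced. Because a and b are fixed after training, the same rule is applied to every frame regardless of the speech environment.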
A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. Such changes may result from a telephone conversation being carried on in a car via a mobile telephone, or may be due to different telephone transmitters. The reason that fixed weighted sum methods do not perform well in changing environments is that many speech classifiers are influenced by background noise, non-linear distortion, and filtering. If voicing is to be determined for speech with characteristics different from those of the training set, the weights, in general, will not yield satisfactory results.
One method for adapting the fixed weighted sum method to a changing speech environment is disclosed in the paper of J. P. Campbell, et al., "Voiced/Unvoiced Classification of Speech with Application to the U.S. Government LPC-10E Algorithm," IEEE International Conference on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. 9.11.4, pp. 473-476. This paper discloses the utilization of different sets of weights and threshold values, each set having been predetermined from the same training data with a different level of white noise added to the training data for each set. For each frame, the speech samples are processed by one set of weights and its threshold value, that set being chosen on the basis of the value of a signal-to-noise ratio (SNR). The range of possible values that the SNR can have is subdivided into subranges, with each subrange being assigned to one of the sets. For each frame, the SNR is calculated, the subrange containing it is determined, and the detector associated with that subrange is then used to determine whether the frame is unvoiced or voiced. The problem with this method is that it is only valid for the training data plus white noise and cannot adapt to a wide range of speech environments and speakers. Therefore, there exists a need for a voiced detector that can reliably determine whether speech is unvoiced or voiced for a varying environment and different speakers.
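The SNR-based selection described above can be sketched as a lookup over subranges followed by a weighted-sum decision. This is an illustration under stated assumptions, not the cited paper's implementation: the names `Detector` and `select_detector`, the use of decibels, and the convention that a boundary value belongs to the higher subrange are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Detector:
    """One predetermined set of weights plus its threshold term (hypothetical type)."""
    weights: Tuple[float, ...]
    threshold_term: float

    def is_voiced(self, classifiers: Sequence[float]) -> bool:
        # Same weighted-sum rule as the fixed method: a'x + b > 0.
        total = sum(w * c for w, c in zip(self.weights, classifiers))
        return total + self.threshold_term > 0


def select_detector(snr_db: float,
                    boundaries: Sequence[float],
                    detectors: Sequence[Detector]) -> Detector:
    """Pick the detector assigned to the SNR subrange containing snr_db.

    boundaries: ascending SNR values that split the range into
                len(boundaries) + 1 subranges (assumed layout).
    detectors:  one detector per subrange, ordered from lowest SNR to highest.
    """
    if len(detectors) != len(boundaries) + 1:
        raise ValueError("need exactly one detector per SNR subrange")
    # bisect_right returns the index of the subrange; a value equal to a
    # boundary is assigned to the higher subrange (an assumption).
    return detectors[bisect_right(boundaries, snr_db)]
```

For each frame, a caller would compute the SNR, call `select_detector`, and then call `is_voiced` on the returned detector. Every detector is still fixed in advance, which is why the approach cannot follow environments that differ from the training data plus white noise.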