1. Field of the Invention
This invention relates to a method and apparatus for deciding whether an input speech signal is voiced or unvoiced, and to a speech encoding method employing the voiced/unvoiced decision method.
2. Description of the Related Art
There are presently known a variety of encoding methods for compressing audio signals, including both speech signals and acoustic signals, by exploiting statistical characteristics of the audio signals in the time domain and in the frequency domain and characteristics of the human hearing mechanism. These encoding methods may roughly be divided into encoding in the time domain, encoding in the frequency domain and analysis/synthesis encoding.
For encoding speech signals, decision information includes information as to whether the input speech signal is voiced or unvoiced. A voiced sound is a sound accompanied by vibration of the vocal cords, while an unvoiced sound is a sound not accompanied by vibration of the vocal cords.
In general, the process of discriminating voiced (V) sound from unvoiced (UV) sound (V/UV decision) is carried out in conjunction with pitch extraction, the V/UV decision being made using, for example, peaks of the autocorrelation function as a measure of periodicity or non-periodicity. However, since no effective decision can be made in this manner when the input sound is voiced but non-periodic, other parameters, such as the energy of the speech signal or the number of zero-crossings, are also used.
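By way of illustration only, the three parameters named above (frame energy, number of zero-crossings, and autocorrelation peak) may be computed for a single frame as in the following minimal sketch. The function names, the 8 kHz sampling rate, the 256-sample frame length, the lag search range of 20 to 147 samples, and the two synthetic test frames are all assumptions introduced for this example and are not taken from any particular prior-art system.

```python
import math
import random


def frame_energy(frame):
    """Frame-averaged energy: mean squared amplitude of the samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def autocorr_peak(frame, min_lag, max_lag):
    """Largest autocorrelation value in the lag range, normalized by lag 0."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0  # silent frame: no periodicity measure available
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / r0)
    return best


# Assumed illustration parameters (hypothetical, not from the source text).
SAMPLE_RATE = 8000
FRAME_LEN = 256

# A 100 Hz sinusoid stands in for a strongly periodic (voiced-like) frame.
voiced_like = [
    math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.3) for n in range(FRAME_LEN)
]

# Seeded uniform noise stands in for a non-periodic (unvoiced-like) frame.
rng = random.Random(0)
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)]

for name, frame in (("voiced-like", voiced_like), ("noise-like", noise_like)):
    print(
        name,
        frame_energy(frame),
        zero_crossings(frame),
        autocorr_peak(frame, 20, 147),
    )
```

In this sketch the periodic frame yields a high autocorrelation peak and few zero-crossings, whereas the noise frame yields a low peak and many zero-crossings, which is the contrast the conventional parameters are intended to capture.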
Meanwhile, since the voiced/unvoiced (V/UV) decision is conventionally made by a fixed rule that performs a logical operation on the decision results of the respective parameters, it is difficult to reach a comprehensive decision based on the input parameters as a whole. For example, under a rule which states: "if the frame averaged energy is larger than a pre-set threshold value and the autocorrelation peak value of the residual is larger than a pre-set threshold value, the sound is voiced", the sound is not judged to be voiced if the frame averaged energy significantly exceeds its threshold value but the autocorrelation peak value of the residual falls short of its threshold value even by a small amount.
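The limitation of such a rule may be illustrated by the following minimal sketch. The function name, the threshold values, and the parameter values of the two example frames are hypothetical and chosen only to show the behavior described above.

```python
def rule_based_vuv(energy, residual_peak, energy_threshold, peak_threshold):
    """Conventional rule: voiced only if BOTH parameters exceed their thresholds."""
    return energy > energy_threshold and residual_peak > peak_threshold


# Hypothetical thresholds, assumed for illustration only.
ENERGY_THRESHOLD = 0.01
PEAK_THRESHOLD = 0.30

# Frame A: energy is 50 times its threshold, but the autocorrelation peak
# of the residual falls short of its threshold by 0.01.
frame_a = rule_based_vuv(0.50, 0.29, ENERGY_THRESHOLD, PEAK_THRESHOLD)

# Frame B: both parameters exceed their thresholds only marginally.
frame_b = rule_based_vuv(0.011, 0.31, ENERGY_THRESHOLD, PEAK_THRESHOLD)

print("frame A voiced:", frame_a)
print("frame B voiced:", frame_b)
```

Because each parameter is reduced to a yes/no result before the logical operation, the margin by which the energy of frame A exceeds its threshold cannot compensate for the small shortfall in the autocorrelation peak, so frame A is rejected while the marginal frame B is accepted.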
In addition, a particular input speech requires a rule suited to it, so that accommodating all possible kinds of input speech requires a correspondingly large number of rules, which complicates the decision.
On the other hand, the V/UV decision employing spectral similarity, that is, the results of band-based V/UV decisions, as used in, for example, multiband excitation (MBE) encoding, presupposes correct pitch detection. In practice, however, it is extremely difficult to detect the pitch correctly and with high precision.