The present invention relates to a background noise/speech classification method of deciding whether an input signal belongs to a background noise period or a speech period, in encoding/decoding the speech signal, a voiced/unvoiced classification method of deciding whether an input signal belongs to a voiced period or an unvoiced period, a background noise decoding method of obtaining comfort background noise by decoding.
The present invention relates to a speech encoding method of compression-encoding a speech signal and a speech encoding apparatus, particularly including processing of obtaining a pitch period in encoding the speech signal.
High-efficiency, low-bit-rate encoding for speech signals is an important technique for an increase in channel capacity and a reduction in communication cost in mobile telephone communications and local communications. A speech signal can be divided into a background noise period in which no speech is present and a speech period in which speech is present. A speech period is a significant period for speech communication, but the bit rate in a background noise period can be decreased as long as the comfort of the speech communication is maintained. By decreasing the bit rate in each background noise period, the overall bit rate can be decreased to attain a further increase in channel capacity and a further reduction in communication cost.
In this case, if background noise/speech classification fails, for example, a speech period is classified as a background noise period, the speech period is encoded at a low bit rate, resulting in a serious deterioration in speech quality. In contrast to this, if a background noise period is classified as a speech period, the overall bit rate increases, resulting in a decrease in encoding efficiency. For this reason, an accurate background noise/speech classification technique must be established.
According to a conventional background noise/speech classification method, a change in power information of a signal is monitored to perform background noise period/speech period classification. For example, according to J. F. Lynch Jr et al., "Speech/Silence Segmentation for Real-time Coding via Rule Based Adaptive Endpoint Detection", Proc. ICASSP, '87, pp. 31.7.1-31.7.4 (reference 1), background noise/speech classification is performed by using a speech metrics and a background noise metrics which are calculated from the frame power of an input signal.
In the method of performing background noise/speech classification by using only the power information of a signal, no problem is posed in a state in which background noise is scarcely heard. This is because, in such a case, the signal power in a speech period is sufficiently larger than the signal power in a background noise period, and hence the speech period can be easily identified. In reality, however, large background noise is present in some case. In such a state, accurate background noise/speech classification cannot be realized. In addition, background noise is not always white noise. For example, background noise whose spectrum is not flat, e.g., sounds produced when cars or a train passes by or other people talk, may be present. According to the conventional background noise/speech classification method, proper classification is very difficult to perform in the presence of such background noise.
A speech signal can be divided into a voiced period having high periodicity and corresponding to a vowel and an unvoiced period having low periodicity and corresponding to a consonant. The signal characteristics in a voiced period clearly differ from those in an unvoiced period. If, therefore, encoding methods and bit rates suited for these periods are set, a further improvement in speech quality and a further decrease in bit rate can be attained.
In this case, if voiced/unvoiced classification fails, and a voiced period is classified as an unvoiced period, or an unvoiced period is classified as a voiced period, the speech quality seriously deteriorates or the bit rate undesirably increases. For this reason, it is important to establish an accurate voiced/ unvoiced classification method.
For example, a conventional voiced/unvoiced classification method is disclosed in J. P. Campbell et al., "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm", Proc. ICASSP, '86, vol. 1, pp. 473-476 (reference 2). According to reference 2, a plurality of types of acoustical parameters for speech are calculated, and the weighted average value of these acoustical parameters is obtained. This value is then compared with a predetermined threshold to perform voiced/unvoiced classification.
It is, however, clear that the voiced/unvoiced classification performance is greatly influenced by the balance between a weighting value used for each acoustical parameter for weighted average calculation and a threshold. It is difficult to determine optimal weighting values and an optimal threshold.
A conventional background noise decoding method will be described next. In a background noise period, encoding is performed at a very low bit rate to decrease the overall bit rate, as described above. For example, according to E. Paksoy et al., "Variable Rate Speech Coding with Phonetic Segmentation", Proc. ICASSP, '93, pp. II-155-158 (reference 3), background noise is encoded at a bit rate as low as 1.0 kbps. On the decoding side, the background noise information is decoded by using the decoded parameter expressed at such a low bit rate.
In such a speech decoding method for a background noise period, since a decoded parameter is expressed at a very low bit rate, the update cycle of each parameter is prolonged. If, for example, the update cycle of a decoded parameter for a gain is prolonged, a change in gain in a background noise period cannot properly follow. As a result, a change in gain becomes discontinuous. If background noise information is decoded by using such a gain, a discontinuous change in gain becomes offensive to the ear, resulting in a great deterioration in subjective quality.
As described above, according to the conventional background noise/speech classification method using only the power information of a signal, accurate background noise/speech classification cannot be realized in the presence of large background noise. In addition, it is very difficult to perform proper classification in the presence of background noise whose spectrum is not that of white noise, e.g., sounds produced when cars or a train passes by or other people talk.
In the conventional voiced/unvoiced classification method using the technique of comparing the weighted average value of acoustical parameters with a threshold, classification becomes unstable and inaccurate depending on the balance between a weighting value used for each acoustical parameter and a threshold.
In the conventional speech decoding method for a background noise period, since a decoded parameter for background noise is expressed at a very low bit rate, the update cycle of each parameter is prolonged. If, therefore, the update cycle of a decoded parameter for a gain is long, in particular, a change in gain in a background noise period cannot properly follow, and a change in gain becomes discontinuous. As a result, a great deterioration in subjective quality occurs.