The present invention relates to a background noise/speech classification method of deciding whether an input signal belongs to a background noise period or a speech period in encoding/decoding a speech signal, a voiced/unvoiced classification method of deciding whether an input signal belongs to a voiced period or an unvoiced period, and a background noise decoding method of obtaining comfort background noise by decoding.
The present invention relates to a speech encoding method of compression-encoding a speech signal and a speech encoding apparatus, particularly including processing of obtaining a pitch period in encoding the speech signal.
High-efficiency, low-bit-rate encoding for speech signals is an important technique for an increase in channel capacity and a reduction in communication cost in mobile telephone communications and local communications. A speech signal can be divided into a background noise period in which no speech is present and a speech period in which speech is present. A speech period is a significant period for speech communication, but the bit rate in a background noise period can be decreased as long as the comfort of the speech communication is maintained. By decreasing the bit rate in each background noise period, the overall bit rate can be decreased to attain a further increase in channel capacity and a further reduction in communication cost.
In this case, if background noise/speech classification fails, for example, a speech period is classified as a background noise period, the speech period is encoded at a low bit rate, resulting in a serious deterioration in speech quality. In contrast to this, if a background noise period is classified as a speech period, the overall bit rate increases, resulting in a decrease in encoding efficiency. For this reason, an accurate background noise/speech classification technique must be established.
According to a conventional background noise/speech classification method, a change in power information of a signal is monitored to perform background noise period/speech period classification. For example, according to J. F. Lynch Jr. et al., "Speech/Silence Segmentation for Real-time Coding via Rule Based Adaptive Endpoint Detection", Proc. ICASSP, '87, pp. 31.7.1-31.7.4 (reference 1), background noise/speech classification is performed by using a speech metric and a background noise metric which are calculated from the frame power of an input signal.
In the method of performing background noise/speech classification by using only the power information of a signal, no problem is posed in a state in which background noise is scarcely heard. This is because, in such a case, the signal power in a speech period is sufficiently larger than the signal power in a background noise period, and hence the speech period can be easily identified. In reality, however, large background noise is present in some cases. In such a state, accurate background noise/speech classification cannot be realized. In addition, background noise is not always white noise. For example, background noise whose spectrum is not flat, e.g., sounds produced when cars or a train pass by or other people talk, may be present. According to the conventional background noise/speech classification method, proper classification is very difficult to perform in the presence of such background noise.
A speech signal can be divided into a voiced period having high periodicity and corresponding to a vowel and an unvoiced period having low periodicity and corresponding to a consonant. The signal characteristics in a voiced period clearly differ from those in an unvoiced period. If, therefore, encoding methods and bit rates suited for these periods are set, a further improvement in speech quality and a further decrease in bit rate can be attained.
In this case, if voiced/unvoiced classification fails, and a voiced period is classified as an unvoiced period, or an unvoiced period is classified as a voiced period, the speech quality seriously deteriorates or the bit rate undesirably increases. For this reason, it is important to establish an accurate voiced/unvoiced classification method.
For example, a conventional voiced/unvoiced classification method is disclosed in J. P. Campbell et al., "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm", Proc. ICASSP, '86, vol. 1, pp. 473-476 (reference 2). According to reference 2, a plurality of types of acoustical parameters for speech are calculated, and the weighted average value of these acoustical parameters is obtained. This value is then compared with a predetermined threshold to perform voiced/unvoiced classification.
It is, however, clear that the voiced/unvoiced classification performance is greatly influenced by the balance between a weighting value used for each acoustical parameter for weighted average calculation and a threshold. It is difficult to determine optimal weighting values and an optimal threshold.
A conventional background noise decoding method will be described next. In a background noise period, encoding is performed at a very low bit rate to decrease the overall bit rate, as described above. For example, according to E. Paksoy et al., "Variable Rate Speech Coding with Phonetic Segmentation", Proc. ICASSP, '93, pp. II-155-158 (reference 3), background noise is encoded at a bit rate as low as 1.0 kbps. On the decoding side, the background noise information is decoded by using the decoded parameter expressed at such a low bit rate.
In such a speech decoding method for a background noise period, since a decoded parameter is expressed at a very low bit rate, the update cycle of each parameter is prolonged. If, for example, the update cycle of a decoded parameter for a gain is prolonged, the decoded gain cannot properly follow a change in gain in a background noise period. As a result, the change in gain becomes discontinuous. If background noise information is decoded by using such a gain, the discontinuous change in gain becomes offensive to the ear, resulting in a great deterioration in subjective quality.
As described above, according to the conventional background noise/speech classification method using only the power information of a signal, accurate background noise/speech classification cannot be realized in the presence of large background noise. In addition, it is very difficult to perform proper classification in the presence of background noise whose spectrum is not that of white noise, e.g., sounds produced when cars or a train passes by or other people talk.
In the conventional voiced/unvoiced classification method using the technique of comparing the weighted average value of acoustical parameters with a threshold, classification becomes unstable and inaccurate depending on the balance between a weighting value used for each acoustical parameter and a threshold.
In the conventional speech decoding method for a background noise period, since a decoded parameter for background noise is expressed at a very low bit rate, the update cycle of each parameter is prolonged. If, therefore, the update cycle of a decoded parameter for a gain is long, in particular, the decoded gain cannot properly follow a change in gain in a background noise period, and the change in gain becomes discontinuous. As a result, a great deterioration in subjective quality occurs.
It is the principal object of the present invention to provide a background noise/speech classification method capable of properly performing background noise period/speech period classification regardless of the magnitude and characteristics of background noise.
It is another object of the present invention to provide a voiced/unvoiced classification method capable of performing stable, accurate voiced period/unvoiced period classification.
It is still another object of the present invention to provide a background noise decoding method capable of obtaining background noise with excellent subjective quality by decoding even if a decoded parameter of the background noise is expressed at a very low bit rate.
It is still another object of the present invention to provide a speech encoding method and apparatus which can properly obtain a frame period of a speech signal with a small calculation amount, and express a pitch period with a small information amount.
According to the present invention, there is provided a background noise/speech classification method including calculating power and spectral information of an input signal as feature amounts, and comparing the calculated feature amounts with estimated feature amounts constituted by pieces of estimated power and estimated spectral information in a background noise period, thereby deciding whether the input signal belongs to speech or background noise.
More specifically, calculated feature amounts are compared with estimated feature amounts to analyze power and spectral fluctuation amounts. If both the analysis results on the power and spectral fluctuation amounts indicate that the input signal is background noise, it is decided that the input signal is background noise. Otherwise, it is decided that the input signal is speech. For example, the spectral information is updated by an LSP coefficient.
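By way of illustration only, the decision rule described above can be sketched as follows. The function name, the use of a decibel power margin, the mean squared distance between feature vectors, and both default thresholds are assumptions introduced for this sketch and do not appear in the description.

```python
def classify_frame(power_db, spectrum, est_power_db, est_spectrum,
                   power_margin_db=6.0, spectral_threshold=0.1):
    """Return "noise" only when both the power test and the spectral test say noise.

    All thresholds are hypothetical values chosen for illustration.
    """
    # Power fluctuation: how far the frame power rises above the noise estimate.
    power_is_noise = (power_db - est_power_db) < power_margin_db

    # Spectral fluctuation: mean squared distance between the frame's spectral
    # feature vector (e.g., LSP coefficients) and the estimated noise vector.
    distortion = sum((a - b) ** 2 for a, b in zip(spectrum, est_spectrum)) / len(spectrum)
    spectral_is_noise = distortion < spectral_threshold

    # Background noise requires agreement of both tests; otherwise speech.
    return "noise" if (power_is_noise and spectral_is_noise) else "speech"
```

With these assumed values, a frame whose power and spectrum both stay close to the estimates is labeled as background noise, while a deviation in either one is sufficient to label the frame as speech.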
When background noise/speech classification is performed by using spectral information as well as power information, even a speech period with small power can be accurately decided because the spectrum in a background noise period clearly differs from that in a speech period.
In this background noise/speech classification method, estimated feature amounts are preferably updated by different methods depending on whether it is decided that an input signal belongs to background noise or speech. In addition, the update amount to be set when it is decided that an input signal belongs to background noise is preferably set to be smaller than that to be set when it is decided that the input signal belongs to speech. With this setting, even if an input signal has a long speech period and undergoes a change to "background noise" after the long speech period, since the estimated feature amounts are hardly influenced by the feature amounts in the speech period, background noise can be easily identified.
A spectral fluctuation amount can be accurately analyzed by comparing a predetermined threshold with the spectral distortion between a spectral envelope obtained from the spectral information of an input signal and a spectral envelope obtained from estimated spectral information in a background noise period. With this operation, more accurate background noise/speech classification can be realized.
In this case, if the threshold is changed in accordance with estimated power information, e.g., the threshold is increased when the estimated power is small and vice versa, decision errors caused by changes in spectral fluctuation due to changes in estimated power can be reduced. More accurate background noise/speech classification can therefore be realized.
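One possible way to vary the threshold with the estimated power is sketched below. The linear mapping, the function name, and all numeric limits are assumptions made for illustration; the description only states that the threshold is increased when the estimated power is small and vice versa.

```python
def spectral_threshold(est_power_db, low_db=20.0, high_db=60.0,
                       max_threshold=0.2, min_threshold=0.05):
    """Map estimated noise power to a spectral-distortion threshold.

    Quiet noise estimates get the larger threshold, loud ones the smaller.
    Every numeric default here is a hypothetical value.
    """
    # Clamp the estimated power to the range covered by the mapping.
    p = min(max(est_power_db, low_db), high_db)
    # Fraction of the way from the quiet end to the loud end.
    frac = (p - low_db) / (high_db - low_db)
    # Linearly decrease the threshold as the estimated power grows.
    return max_threshold - frac * (max_threshold - min_threshold)
```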
In the present invention, when a decision result indicating that an input signal belongs to speech or background noise changes from "speech" to "background noise", the decision result may be forcibly changed to "speech" only for a specific period (to be referred to as a hangover period). In this case, the hangover period is changed in accordance with pieces of estimated power and estimated spectral information in a background noise period. For example, when estimated frame power or the formant spectral power of a spectral envelope obtained from estimated spectral information is large, the hangover period is prolonged to prevent omission of the end of a sentence which occurs when the background noise power is large or the background noise spectrum is not that of white noise.
In a voiced/unvoiced classification method according to the present invention, a voiced appearance probability table and an unvoiced appearance probability table in which voiced and unvoiced appearance probabilities are respectively written in correspondence with speech feature amounts are prepared, and voiced and unvoiced probabilities are obtained by referring to the voiced appearance probability table and the unvoiced appearance probability table by using a feature amount calculated from input speech as a key, thereby deciding on the basis of the voiced and unvoiced probabilities whether the input speech belongs to a voiced period or an unvoiced period.
In this case, for example, voiced/unvoiced decision is manually performed on actual speech data to prepare a voiced appearance probability table and an unvoiced appearance probability table on the basis of the decision results. Since the most likely class (voiced or unvoiced) can be determined by using these tables, the classification performance is not influenced by an empirically determined weighting value or threshold, unlike the conventional method. Stable, accurate voiced/unvoiced classification can therefore be realized.
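The hangover operation can be illustrated with the following minimal sketch. The function names, the two-level choice of hangover length, and the numeric values are hypothetical; the description only specifies that the hangover period is prolonged when the estimated power (or formant spectral power) is large.

```python
def hangover_length(est_power_db, loud_db=50.0, short_frames=2, long_frames=8):
    """Pick a longer hangover when the estimated noise power is large (assumed values)."""
    return long_frames if est_power_db >= loud_db else short_frames


def apply_hangover(decisions, hangover_frames):
    """Force "speech" for hangover_frames frames after each speech-to-noise change."""
    out = []
    remaining = 0
    for decision in decisions:
        if decision == "speech":
            # Re-arm the hangover counter on every speech frame.
            remaining = hangover_frames
            out.append("speech")
        elif remaining > 0:
            # Still inside the hangover period: override the noise decision.
            remaining -= 1
            out.append("speech")
        else:
            out.append("noise")
    return out
```

For example, with a hangover of two frames, the sequence speech, noise, noise, noise becomes speech, speech, speech, noise, so that the tail of an utterance is not encoded as background noise.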
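The table lookup can be sketched as follows for a single feature amount. The uniform quantization of the feature, the function names, and the table contents are assumptions introduced for illustration; in practice the tables would be built from hand-labeled speech data as described above.

```python
# Hypothetical tables: probability of each class for four feature bins.
VOICED_TABLE = [0.05, 0.20, 0.60, 0.90]
UNVOICED_TABLE = [0.95, 0.80, 0.40, 0.10]


def quantize(value, lo, hi, bins):
    """Map a feature value to a table index, clamping out-of-range values."""
    idx = int((value - lo) / (hi - lo) * bins)
    return min(max(idx, 0), bins - 1)


def classify_voicing(feature, voiced_table=VOICED_TABLE,
                     unvoiced_table=UNVOICED_TABLE, lo=0.0, hi=1.0):
    """Use the feature as a key into both tables and pick the likelier class."""
    idx = quantize(feature, lo, hi, len(voiced_table))
    return "voiced" if voiced_table[idx] >= unvoiced_table[idx] else "unvoiced"
```

No weighting values or decision threshold need to be tuned in this sketch; the decision follows directly from the two tabulated probabilities.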
In a background noise decoding method of the present invention, an excitation signal for driving a synthesis filter for synthesizing background noise, a gain by which the excitation signal is to be multiplied, and information of the synthesis filter are decoded to smooth the gain to be used when background noise information is decoded. When background noise information is decoded in this manner, since the gain changes smoothly, the subjective quality of background noise obtained by decoding is improved.
In smoothing a gain in this manner, the gain is gradually increased when the gain increases, whereas the gain is quickly decreased when the gain decreases. With this operation, an unnecessary increase in gain due to smoothing of the gain can be prevented, and the subjective quality is improved more effectively.
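A minimal sketch of such asymmetric gain smoothing is given below. The first-order recursive form, the function name, and the two step sizes are assumptions made for illustration only.

```python
def smooth_gain(decoded_gains, rise=0.1, fall=0.9, initial=0.0):
    """Smooth a sequence of decoded gains with a slow rise and a fast fall.

    rise and fall are hypothetical step sizes in the range (0, 1].
    """
    smoothed = []
    g = initial
    for target in decoded_gains:
        # Move only a small fraction of the way when the gain increases,
        # but most of the way when it decreases.
        step = rise if target > g else fall
        g = g + step * (target - g)
        smoothed.append(g)
    return smoothed
```

Because the smoothed gain approaches a higher target slowly and a lower target quickly, it does not remain above the decoded gain for long after the decoded gain drops.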
The present invention provides a speech encoding processing method including dividing an input speech signal into frames each having a predetermined length, obtaining the pitch period of the input speech signal, obtaining the pitch period of a future frame with respect to the current frame to be encoded, and encoding the pitch period.
The present invention provides a speech encoding method including dividing an input speech signal into frames each having a predetermined length, dividing a speech signal of each frame into subframes, and obtaining the pitch period of the speech signal, the predictive pitch period of a subframe in the current frame being obtained by using the pitch periods of at least two frames of the current frame to be encoded and past and future frames with respect to the current frame, and the pitch period of the subframe in the current frame being obtained by using the predictive pitch period.
As described above, according to the present invention, the pitch period of a future frame with respect to the current frame is obtained. The predictive pitch period of a subframe in the current frame is obtained by interpolation using the pitch periods of both the current and previous frames, and the pitch period of the subframe in the current frame is obtained by using this predictive pitch period. Even if the pitch period varies within a frame, therefore, the pitch period of a subframe can be accurately obtained with a small calculation amount and can be expressed with a small information amount.
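The interpolation step can be illustrated as follows. Linear interpolation, the assumption that each frame pitch period applies at the end of its frame, and the function name are choices made for this sketch and are not specified in the description.

```python
def predict_subframe_pitches(prev_pitch, cur_pitch, num_subframes=4):
    """Linearly interpolate a predictive pitch period for each subframe.

    Assumes prev_pitch applies at the start of the current frame and
    cur_pitch at its end (an illustrative convention).
    """
    return [prev_pitch + (cur_pitch - prev_pitch) * (k + 1) / num_subframes
            for k in range(num_subframes)]
```

For a previous frame pitch period of 40 samples and a current frame pitch period of 48 samples, this sketch yields predictive subframe pitch periods of 42, 44, 46, and 48 samples.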
In addition, since a predicted subframe pitch period approximates the actual pitch period with considerable accuracy, no problem is posed even if the search range for the pitch period of a subframe is limited to, e.g., eight candidates. Assume that the search range for a subframe pitch period is set to eight candidates. In this case, since the pitch period of each frame is expressed by seven bits, and the pitch period of each subframe is expressed by three bits, if four subframes constitute one frame, the pitch periods of the subframes in each frame can be expressed with an information amount of 7 bits + 3 bits × 4 = 19 bits, unlike the prior art in which 28 bits are required per frame. In addition, since the search range for subframe pitch periods is as small as eight candidates, the calculation amount can be greatly reduced.
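The bit count above can be checked with the short sketch below, which also shows one way to form an eight-candidate search window around a predicted subframe pitch period. The function names and the placement of the window around the rounded prediction are assumptions; the prior-art figure is reproduced here on the reading that each of the four subframes is coded independently with seven bits.

```python
def pitch_bits_per_frame(frame_bits=7, subframe_bits=3, num_subframes=4):
    """Bits per frame: one frame pitch period plus a short index per subframe."""
    return frame_bits + subframe_bits * num_subframes


def prior_art_bits_per_frame(bits_per_subframe=7, num_subframes=4):
    """Bits per frame when every subframe pitch period is coded in full (assumed reading)."""
    return bits_per_subframe * num_subframes


def search_candidates(predicted_pitch, subframe_bits=3):
    """Return 2**subframe_bits integer candidates around the predicted pitch period."""
    count = 2 ** subframe_bits
    start = round(predicted_pitch) - count // 2
    return list(range(start, start + count))
```

With the default values, the first function returns 19 and the second returns 28, matching the figures given above, and each subframe search covers only eight lags.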
In the present invention, the pitch period of a subframe in the current frame, which is obtained in the above manner, may be encoded. When a pitch filter is to be used to emphasize the pitch period component of an input speech signal, a transfer function for the pitch filter may be determined by using the pitch period of a subframe in the current frame, which is obtained in the above manner. The pitch filter is known as a constituent element of a perceptual weighting filter or a post filter.
The present invention provides a speech encoding method including preparing an adaptive codebook storing a plurality of adaptive vectors generated by repeating a past excitation signal series at a period included in a predetermined range, and searching a predetermined search range for an adaptive vector with a period that minimizes the error between a target vector and a signal obtained by filtering the adaptive vector extracted from the adaptive codebook through a predetermined filter, wherein an input speech signal is divided into frames each having a predetermined length, a speech signal of each frame is further divided into subframes, the predictive pitch period of a subframe in the current frame is obtained by using the pitch periods of at least two frames of the current frame to be encoded and past and future frames with respect to the current frame, and the search range for subframes in the current frame is determined by using the predictive pitch period.
In the present invention, when the pitch periods of frames are to be obtained, the pitch period analysis position may be adaptively determined in units of frames. More specifically, the pitch period analysis position is decided on the basis of the magnitude of the power of a speech signal, a prediction error signal, or the short-term power of a prediction error signal obtained through a low-pass filter. With this operation, a pitch period can be obtained more accurately, and hence an improvement in the quality of decoded speech can be attained.
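One possible way to choose the analysis position from short-term power is sketched below. Selecting the window with the maximum short-term power, the window length, and the function name are assumptions; the description only states that the position is decided on the basis of the magnitude of the power.

```python
def pitch_analysis_position(frame, window=4):
    """Return the center index of the window with the largest short-term power.

    frame may hold speech samples or prediction error samples.
    """
    best_start, best_power = 0, -1.0
    for start in range(0, len(frame) - window + 1):
        # Short-term power of the samples inside the current window.
        power = sum(x * x for x in frame[start:start + window])
        if power > best_power:
            best_start, best_power = start, power
    return best_start + window // 2
```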
A method of obtaining the pitch period of a subframe in the current frame may be selected in accordance with the continuity of pitch periods. If, for example, it is decided that a change in pitch period is continuous, a predicted subframe pitch period is obtained, and a range near this value is searched to obtain a subframe pitch period. In contrast to this, if it is decided that a change in pitch period is discontinuous, a subframe pitch period is obtained by searching all subframes. With this adaptive processing, an optimal subframe pitch period search method is selected in accordance with the continuity of pitch periods, and the quality of decoded speech is improved.
Furthermore, a relative pitch pattern codebook storing a plurality of relative pitch patterns representing fluctuations in the pitch periods of a plurality of subframes may be prepared, and a change in pitch period of a subframe may be expressed with one relative pitch pattern selected from the relative pitch pattern codebook on the basis of a predetermined index, thereby further decreasing the number of bits of information expressing a subframe pitch period.
More specifically, the relative pitch pattern codebook stores, for example, relative pitch patterns with high appearance frequencies as vectors. These vectors are matched with the pitch periods of a plurality of subframes as vectors to express the pitch periods of a plurality of subframes by optimal relative pitch patterns. If, for example, three bits are required to express the pitch period of each subframe, 12 bits are required for four subframes. If, however, this four-dimensional vector is expressed by one relative pitch pattern selected from a relative pitch pattern codebook whose size corresponds to seven bits, five bits can be saved per frame.
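The matching of subframe pitch periods against a relative pitch pattern codebook can be illustrated as follows. The squared-error criterion, the function name, and the four-entry toy codebook are assumptions made for this sketch; a codebook indexed by seven bits would hold 128 patterns.

```python
# Hypothetical toy codebook of relative pitch patterns (offsets per subframe).
TOY_CODEBOOK = [
    (0, 0, 0, 0),
    (-2, -1, 1, 2),
    (2, 1, -1, -2),
    (-1, -1, -1, -1),
]


def best_relative_pattern(subframe_pitches, frame_pitch, codebook=TOY_CODEBOOK):
    """Return the index of the pattern closest to the observed pitch offsets."""
    # Express each subframe pitch period relative to the frame pitch period.
    relative = [p - frame_pitch for p in subframe_pitches]

    def squared_error(pattern):
        return sum((r - c) ** 2 for r, c in zip(relative, pattern))

    # Exhaustive search over the codebook for the minimum-error pattern.
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))
```

Only the selected index is transmitted in this sketch, so four subframe pitch periods cost one codebook index instead of four separate three-bit values.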
According to the present invention, there is provided a computer-readable recording medium on which a program for performing speech encoding processing including processing of dividing an input speech signal into frames each having a predetermined length, and obtaining the pitch period of the input speech signal is recorded. A program for executing processing of obtaining the pitch period of a future frame with respect to the current frame to be encoded, and processing of encoding the pitch period is recorded on the recording medium.
According to the present invention, there is provided a computer-readable recording medium on which a program for performing speech encoding processing including processing of dividing an input speech signal into frames each having a predetermined length, further dividing the speech signal of each frame into subframes, and obtaining the pitch period of the input speech signal is recorded. A program for executing processing of obtaining the predictive pitch period of a subframe in the current frame by using the pitch periods of at least two frames of the current frame to be encoded and past and future frames with respect to the current frame, and obtaining the pitch period of a subframe in the current frame by using the predictive pitch period is recorded on the recording medium.
According to the present invention, there is provided a computer-readable recording medium which has an adaptive codebook storing a plurality of adaptive vectors generated by repeating a past excitation signal string at a period included in a predetermined range, and on which a program for performing speech encoding processing including processing of searching a predetermined range for an adaptive vector with a period that minimizes the error between a target vector and a signal obtained by filtering an adaptive vector extracted from the adaptive codebook through a predetermined filter is recorded. A program for executing processing of dividing an input speech signal into frames each having a predetermined length, further dividing the speech signal of each frame into subframes, obtaining the predictive pitch period of a subframe in the current frame by using the pitch periods of at least two frames of the current frame to be encoded and past and future frames with respect to the current frame, and determining the search range for subframes in the current frame by using the predictive pitch period is recorded on the recording medium.
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.