1. Field of the Invention
The present invention relates to a speech coding method and a speech coding apparatus for compressing a digital speech signal to a smaller quantity of information, and more particularly to the encoding of the excitation in the speech coding method and speech coding apparatus.
2. Description of Related Art
Conventional speech coding methods and speech coding apparatuses generally generate speech codes by dividing an input speech into spectrum envelope information and excitation, and by coding them separately on a frame by frame basis. As for the coding of the excitation, to maintain the coding quality for input speech with various types of behavior, including background noise, so-called multi-mode coding has been studied, which prepares a plurality of excitation modes with different expressions and selects one of them frame by frame. Speech coding methods and speech coding apparatuses for carrying out the conventional multi-mode coding are disclosed in Japanese patent application laid-open No. 3-156498/1991 and international publication No. WO98/40877.
FIG. 8 is a block diagram showing a configuration of a conventional speech coding apparatus disclosed in Japanese patent application laid-open No. 3-156498/1991. In this figure, the reference numeral 1 designates an input speech, 2 designates a linear prediction analyzing unit, 3 designates a linear prediction coefficient coding unit, 7 designates a multiplexer, 8 designates a speech code, and 47 designates an excitation coding section. In the excitation coding section 47, 48 designates a classifying unit, 49 and 50 each designate a switch, 51 designates a multi-pulse excitation coding unit, and 52 designates a vowel segment excitation coding unit.
Next, the operation of the conventional speech coding apparatus disclosed in Japanese patent application laid-open No. 3-156498 will be described.
The conventional speech coding apparatus with the configuration as shown in FIG. 8 carries out its processing for each frame with a fixed length, a 10 ms long frame, for example.
First, the input speech 1 is supplied to the linear prediction analyzing unit 2, the classifying unit 48 and the switch 49. The linear prediction analyzing unit 2 analyzes the input speech 1, and extracts the linear prediction coefficients constituting the spectrum envelope information of the speech. The linear prediction coefficient coding unit 3 encodes the extracted linear prediction coefficients, and supplies the code to the multiplexer 7. In addition, it outputs linear prediction coefficients which are quantized for the encoding of the excitation.
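By way of illustration only, the linear prediction analysis described above can be sketched as follows. The cited publication does not state which analysis method is used; the autocorrelation method with the Levinson-Durbin recursion is assumed here because it is a common choice. The function names `autocorrelation` and `levinson_durbin` are hypothetical, and the quantization performed by the linear prediction coefficient coding unit 3 is omitted.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order,
    where x[n] is predicted as sum_k a_k * x[n - k].
    Returns (coefficients, residual prediction error energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or perfectly predicted frame
            break
        # Reflection coefficient for this recursion step.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


# Usage: a decaying exponential behaves as a first-order process,
# so the first coefficient should come out close to the decay factor.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), order=2)
```

In an actual coder the coefficients would then be quantized, and the quantized values would be used by the synthesis filter in the excitation coding.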
The classifying unit 48 analyzes the acoustic characteristic of the input speech 1, classifies it into a vowel signal and the other signal, and supplies the classified result to the switches 49 and 50. The switch 49 connects the input speech 1 to the vowel segment excitation coding unit 52 when the classified result by the classifying unit 48 is the vowel signal, and connects the input speech 1 to the multi-pulse excitation coding unit 51 when the classified result by the classifying unit 48 is other than the vowel signal.
The multi-pulse excitation coding unit 51 encodes the excitation by combining a plurality of pulse trains, and supplies the encoded result to the switch 50. The vowel segment excitation coding unit 52 calculates segment lengths with variable duration, encodes the excitation of the segments using a multi-pulse excitation model with improved pitch interpolation, and supplies the encoded result to the switch 50.
The switch 50 connects the encoded result fed from the vowel segment excitation coding unit 52 to the multiplexer 7 when the classified result by the classifying unit 48 is a vowel signal, and the encoded result fed from the multi-pulse excitation coding unit 51 to the multiplexer 7 when the classified result is not the vowel signal. The multiplexer 7 multiplexes the code supplied from the linear prediction coefficient coding unit 3 and the encoded result fed from the switch 50, and outputs a resultant speech code 8.
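The switching operation of FIG. 8 described above may be sketched, for illustration only, as an open-loop selection: the mode is decided from the input speech before any encoding is carried out. The classifier below, which uses a zero-crossing rate with an arbitrary threshold, is merely a placeholder assumption; the cited publication does not disclose the classification rule in this passage. The encoder callables stand in for the multi-pulse excitation coding unit 51 and the vowel segment excitation coding unit 52.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(frame) - 1, 1)


def classify(frame, zcr_threshold=0.25):
    """Placeholder for the classifying unit 48 (hypothetical rule)."""
    return "vowel" if zero_crossing_rate(frame) < zcr_threshold else "other"


def encode_excitation_open_loop(frame, encoders):
    """Route the frame to exactly one encoder, as switches 49 and 50 do.
    `encoders` maps a class name to an encoding function."""
    mode = classify(frame)
    return mode, encoders[mode](frame)
```

Only the selected encoder runs in this arrangement, so a classification error cannot be corrected afterwards.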
It is reported that the conventional speech coding apparatus disclosed in Japanese patent application laid-open No. 3-156498/1991 can represent the speech signal in a smaller quantity of information by selecting one of the previously prepared excitation models in accordance with the acoustic characteristics of the input speech 1, and by carrying out encoding using the selected excitation model.
FIG. 9 is a block diagram showing a configuration of another conventional speech coding apparatus disclosed in international publication No. WO98/40877. In this figure, the reference numeral 1 designates an input speech, 2 designates a linear prediction analyzing unit, 3 designates a linear prediction coefficient coding unit, 4 designates an adaptive excitation coding unit, 7 designates a multiplexer, 8 designates a speech code, 53 and 54 each designate a driving excitation coding unit, 55 and 56 each designate a gain coding unit, and 57 designates a minimum distortion selecting unit.
Next, the operation of the conventional speech coding apparatus disclosed in the international publication No. WO98/40877 will be described.
The conventional speech coding apparatus with the configuration as shown in FIG. 9 carries out its processing on a frame by frame basis, the frame consisting of a speech segment with the duration of about 5–50 ms. As for the encoding of the excitation, it carries out its processing for each sub-frame with the duration of half the frame. For the sake of simplicity, the two terms “frame” and “sub-frame” are not distinguished, and are called “frame” from now on.
First, the input speech 1 is supplied to the linear prediction analyzing unit 2, adaptive excitation coding unit 4 and driving excitation coding unit 53. The linear prediction analyzing unit 2 analyzes the input speech 1, and extracts the linear prediction coefficients constituting the spectrum envelope information of the speech. The linear prediction coefficient coding unit 3 encodes the linear prediction coefficients, supplies its code to the multiplexer 7, and outputs the linear prediction coefficients that are quantized for the coding of the excitation.
The adaptive excitation coding unit 4 stores previous excitation with a predetermined length as an adaptive excitation codebook. Receiving an adaptive excitation code represented by a binary number of a few bits, the adaptive excitation codebook calculates a repetition period from the adaptive excitation code, and generates a time-series vector that cyclically repeats the previous excitation by using the repetition period. The adaptive excitation coding unit 4 produces a temporary synthesized signal by passing each of the time-series vectors, which are obtained by inputting the individual adaptive excitation codes into the adaptive excitation codebook, through the synthesis filter that uses the quantized linear prediction coefficients fed from the linear prediction coefficient coding unit 3. Then, it detects the distortion between the input speech 1 and the signal obtained by multiplying the temporary synthesized signal by an appropriate gain. It carries out the processing for all the adaptive excitation codes, selects the adaptive excitation code that gives the minimum distortion, and outputs the time-series vector corresponding to the selected adaptive excitation code as the adaptive excitation. In addition, it outputs, as a target signal to be encoded, the signal obtained by subtracting from the input speech 1 the signal produced by multiplying the synthesized signal based on the adaptive excitation by an appropriate gain.
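The adaptive excitation search described above may be sketched as follows, for illustration only. Several points are assumptions not stated in the cited publication: the adaptive excitation code is taken to be the repetition period (lag) itself, the synthesis filter starts from a zero state, the "appropriate gain" is taken to be the least-squares optimal gain, and no perceptual weighting is applied. All function names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter with zero initial state:
    y[n] = x[n] + sum_k lpc[k-1] * y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def adaptive_vector(past_excitation, lag, length):
    """Cyclically repeat the last `lag` samples of the previous excitation.
    Assumes 1 <= lag <= len(past_excitation)."""
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(length)]


def search_adaptive_codebook(speech, past_excitation, lpc, lags):
    """Try every lag and keep the one with minimum distortion."""
    best = None
    for lag in lags:
        vec = adaptive_vector(past_excitation, lag, len(speech))
        synth = synthesize(vec, lpc)
        energy = sum(s * s for s in synth)
        corr = sum(t * s for t, s in zip(speech, synth))
        gain = corr / energy if energy > 0.0 else 0.0   # least-squares gain
        dist = sum((t - gain * s) ** 2 for t, s in zip(speech, synth))
        if best is None or dist < best["distortion"]:
            # Target for the next stage: speech minus the scaled synthesis.
            target = [t - gain * s for t, s in zip(speech, synth)]
            best = {"lag": lag, "gain": gain, "distortion": dist,
                    "adaptive_excitation": vec, "target": target}
    return best
```

The returned `target` corresponds to the signal to be encoded that is passed on to the driving excitation search.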
The driving excitation coding unit 54 stores a plurality of time-series vectors as a driving excitation codebook. The driving excitation codebook, receiving the driving excitation code represented by a binary number of a few bits, reads the time-series vector stored in the position corresponding to the driving excitation code and outputs it. The driving excitation coding unit 54 obtains the individual time-series vectors by supplying the driving excitation codebook with the individual driving excitation codes, and obtains the temporary synthesized signal by passing them through the synthesis filter using the quantized linear prediction coefficients fed from the linear prediction coefficient coding unit 3. Then, the driving excitation coding unit 54 detects the distortion between the signal, which is obtained by multiplying the temporary synthesized signal by the appropriate gain, and the target signal to be encoded supplied from the adaptive excitation coding unit 4. It carries out the processing for all the driving excitation codes, selects the driving excitation code that gives the minimum distortion, and outputs the time-series vector corresponding to the selected driving excitation code as the driving excitation.
The gain coding unit 56 stores a plurality of gain vectors, each representing two gain values corresponding to the adaptive excitation and the driving excitation, as the gain codebook. The gain codebook, receiving the gain code represented by a binary number of a few bits, reads the gain vector stored in the position corresponding to the gain code, and outputs it. The gain coding unit 56 obtains the gain vectors by supplying the gain codebook with the individual gain codes, multiplies the adaptive excitation fed from the adaptive excitation coding unit 4 by the first element of the gain vector, multiplies the driving excitation fed from the driving excitation coding unit 54 by the second element of the gain vector, and generates the temporary excitation by adding the two signals. Then, it obtains the temporary synthesized signal by passing the temporary excitation through the synthesis filter using the quantized linear prediction coefficients fed from the linear prediction coefficient coding unit 3, and detects the distortion between the temporary synthesized signal and the input speech 1 fed via the driving excitation coding unit 54. It carries out the processing for all the gain codes, and selects the gain code that gives the minimum distortion. The gain coding unit 56 supplies the minimum distortion selecting unit 57 with the selected gain code, the adaptive excitation code fed from the adaptive excitation coding unit 4 via the driving excitation coding unit 54, the driving excitation code fed from the driving excitation coding unit 54, the minimum distortion, and the temporary excitation corresponding to the selected gain code.
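For illustration only, the gain code search of the gain coding unit 56 may be sketched as an exhaustive search over a small table of gain vectors. The codebook contents, the zero-state synthesis filter, the plain squared-error distortion, and the function names are assumptions made for this sketch.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter with zero initial state:
    y[n] = x[n] + sum_k lpc[k-1] * y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def search_gain_codebook(speech, adaptive_exc, driving_exc, gain_codebook, lpc):
    """Return (gain_code, distortion, temporary_excitation) for the gain
    vector whose synthesized signal is closest to the input speech."""
    best = None
    for code, (g_adaptive, g_driving) in enumerate(gain_codebook):
        # Temporary excitation: weighted sum of the two excitations.
        excitation = [g_adaptive * a + g_driving * d
                      for a, d in zip(adaptive_exc, driving_exc)]
        synth = synthesize(excitation, lpc)
        distortion = sum((s - y) ** 2 for s, y in zip(speech, synth))
        if best is None or distortion < best[1]:
            best = (code, distortion, excitation)
    return best
```

The index of the winning entry plays the role of the gain code, and the corresponding temporary excitation and distortion are what would be handed to the minimum distortion selecting unit 57.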
On the other hand, the driving excitation coding unit 53 stores a plurality of time-series vectors as a driving excitation codebook. The driving excitation codebook, receiving the driving excitation code represented by a binary number of a few bits, reads the time-series vector stored in the position corresponding to the driving excitation code, and outputs it. The driving excitation coding unit 53 obtains the individual time-series vectors by supplying the driving excitation codebook with the individual driving excitation codes, and obtains the temporary synthesized signal by passing them through the synthesis filter using the quantized linear prediction coefficients fed from the linear prediction coefficient coding unit 3. Then, the driving excitation coding unit 53 detects the distortion between the signal, which is obtained by multiplying the temporary synthesized signal by the appropriate gain, and the input speech 1. It carries out the processing for all the driving excitation codes, selects the driving excitation code that gives the minimum distortion, and outputs the time-series vector corresponding to the selected driving excitation code as the driving excitation.
The gain coding unit 55 stores a plurality of gain values for the driving excitation as a first gain codebook. The gain codebook, receiving the gain code represented by a binary number of a few bits, reads the gain value stored in the position corresponding to the gain code, and outputs it. The gain coding unit 55 obtains the gain values by supplying the gain codebook with the individual gain codes, multiplies the gain value by the driving excitation fed from the driving excitation coding unit 53, and produces the resultant signal as the temporary excitation. Then, it obtains the temporary synthesized signal by passing the temporary excitation through the synthesis filter using the quantized linear prediction coefficients fed from the linear prediction coefficient coding unit 3, and detects the distortion between the temporary synthesized signal and the input speech 1 fed via the driving excitation coding unit 53. It carries out the processing for all the gain codes, and selects the gain code that gives the minimum distortion. The gain coding unit 55 supplies the minimum distortion selecting unit 57 with the excitation code that includes the selected gain code and the driving excitation code fed from the driving excitation coding unit 53, and with the minimum distortion, and the temporary excitation corresponding to the gain code selected.
The minimum distortion selecting unit 57 compares the minimum distortion supplied from the gain coding unit 55 with the minimum distortion supplied from the gain coding unit 56, selects the gain coding unit 55 or 56 that outputs the lesser distortion, and supplies the multiplexer 7 with the excitation code fed from the selected gain coding unit 55 or 56. The minimum distortion selecting unit 57 supplies the adaptive excitation coding unit 4 with the temporary excitation fed from the selected gain coding unit 55 or 56 as the final excitation. The adaptive excitation coding unit 4 updates the internal adaptive excitation codebook using the excitation fed from the minimum distortion selecting unit 57.
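The selection and codebook update described above may be sketched as follows, for illustration only. The dictionary layout, the mode labels, and the fixed-length buffer update are assumptions of this sketch; the cited publication does not specify how the adaptive excitation codebook is stored.

```python
def select_min_distortion(candidates):
    """Pick the candidate with the smaller distortion. Each candidate is a
    dict with keys 'mode', 'code', 'distortion' and 'excitation'
    (hypothetical layout standing in for the outputs of units 55 and 56)."""
    return min(candidates, key=lambda c: c["distortion"])


def update_adaptive_codebook(past_excitation, final_excitation):
    """Append the final excitation and keep the buffer length unchanged."""
    keep = len(past_excitation)
    combined = list(past_excitation) + list(final_excitation)
    return combined[len(combined) - keep:]


# Usage with made-up values for the two excitation modes.
from_unit_55 = {"mode": "driving_only", "code": 12,
                "distortion": 0.8, "excitation": [0.1, -0.2]}
from_unit_56 = {"mode": "adaptive_plus_driving", "code": 7,
                "distortion": 0.3, "excitation": [0.4, 0.5]}
chosen = select_min_distortion([from_unit_55, from_unit_56])
history = update_adaptive_codebook([1, 2, 3, 4, 5, 6], chosen["excitation"])
```

Unlike a selection based on prior classification, both modes are fully encoded here before the choice is made, and the only criterion is the waveform distortion.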
After that, the multiplexer 7 multiplexes the code of the linear prediction coefficients supplied from the linear prediction coefficient coding unit 3 and the excitation code output from the minimum distortion selecting unit 57, and outputs the resultant speech code 8.
Thus, it is reported that the conventional speech coding apparatus disclosed in the international publication No. WO98/40877 carries out encoding in both of the two excitation modes, and selects the excitation mode that gives the smaller distortion, thereby making it possible to select the mode that provides the best encoding characteristics, and to improve the coding quality.
As documents relevant to such a speech coding apparatus, there are Japanese patent application laid-open Nos. 9-319396 and 2000-175598, for example. The former generates target speech vectors with a length corresponding to a delay parameter from the input speech, and carries out adaptive excitation search and driving excitation search. The latter selects a gain quantization table corresponding to the driving excitation from a plurality of gain quantization tables in accordance with the power information of the adaptive excitation signal.
With the foregoing configuration, the conventional speech coding apparatuses have the following problems.
As for the conventional speech coding apparatus disclosed in Japanese patent application laid-open No. 3-156498, since it selects one of the plurality of excitation models prepared in advance in accordance with the acoustic characteristics of the input speech 1, it has a problem in that the subjective quality, that is, the quality of the decoded speech produced when the speech decoding apparatus decodes the resultant speech code, is not always optimum. In other words, since classification in accordance with the acoustic characteristics of the input speech 1 always involves classification errors, an excitation model inappropriate for the input speech may be selected. In addition, even when the classification of the input speech 1 is correct, an unselected excitation model may well produce higher quality decoded speech than the selected excitation model when the speech decoding apparatus performs decoding. For example, when a vowel segment includes a lot of waveform distortion, such as in transitions, it is probable that the multi-pulse excitation coding unit 51 can handle the variations better and produce a more satisfactory encoded result than the vowel segment excitation coding unit 52.
As for the conventional speech coding apparatus disclosed in the international publication No. WO98/40877, it carries out encoding in the two excitation modes, and selects the excitation mode that provides the smaller distortion. Accordingly, although it can achieve the minimum coding distortion, it has a problem in that the subjective quality (speech quality) of the decoded speech, which is obtained by decoding the resultant speech code by the speech decoding apparatus, is not always the best. The problem will be described in more detail with reference to FIG. 7.
FIG. 7(a) shows an input speech; FIG. 7(b) shows a decoded speech (a result of decoding the speech code by the speech decoding apparatus) when an excitation mode prepared to express noisy speech is selected; and FIG. 7(c) shows a decoded speech when an excitation mode prepared to express vowel-like speech is selected. Here, the input speech as shown in FIG. 7(a) is associated with a segment with a noisy characteristic, in which large and small amplitudes are mixed often in a frame.
In the example of FIG. 7, the distortion value between the signals of FIGS. 7(a) and 7(b), which is obtained as the power of their difference signal, is greater than that between the signals of FIGS. 7(a) and 7(c). This is because the large-amplitude portion of the input speech (see FIG. 7(a)) has a smaller difference from the corresponding portion of FIG. 7(c). However, the decoded speech of FIG. 7(b) sounds better to the human ear than that of FIG. 7(c), because the latter gives a pulse-like, degraded sound. Thus, the conventional speech coding apparatus that selects the excitation mode with the minimum distortion can select a mode in which the subjective quality (speech quality) of the decoded speech, which is obtained by decoding the resultant speech code by the speech decoding apparatus, is not optimum.
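The distortion measure referred to above, the power of the difference signal, may be sketched as follows for illustration only. The sample values are invented for this sketch and are not taken from FIG. 7; they merely mimic a noisy input with mixed large and small amplitudes, a noise-like decoded signal, and a pulse-like decoded signal that follows only the large-amplitude samples.

```python
def distortion(reference, decoded):
    """Mean power of the difference signal between two equal-length signals."""
    return sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)


# Hypothetical sample values (assumptions, not data from the figure).
input_speech = [0.2, -0.3, 1.0, -0.2, 0.3, -0.9, 0.2, -0.3]
noise_like   = [0.4, -0.4, 0.4, -0.4, 0.4, -0.4, 0.4, -0.4]
pulse_like   = [0.0,  0.0, 1.0,  0.0, 0.0, -0.9, 0.0,  0.0]

d_noise = distortion(input_speech, noise_like)
d_pulse = distortion(input_speech, pulse_like)
pulse_mode_wins = d_pulse < d_noise   # the waveform criterion prefers the pulse-like signal
```

With these values the pulse-like signal yields the smaller distortion because it matches the two large samples exactly, even though it discards all of the low-level, noise-like content of the input.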