1. Field of the Invention
The present invention relates to a speech decoding apparatus and a speech decoding method for decoding digital speech data that has been coded on the basis of excitation parameter information, in accordance with ITU-T Recommendation G.723.1 and CELP (Code Excited Linear Prediction) coding.
2. Related Art
One of the Recommendations concerning speech coding techniques is ITU-T Recommendation G.723.1, which specifies the speech codec used under ITU-T Recommendation H.324 for videophones operating primarily over analogue lines. In this speech coding technique, speech signals are coded at dual rates of 6.3 kbps and 5.3 kbps so as to represent the human vocal mechanism.
A conventional coding apparatus is explained below with reference to the functional block diagram in FIG. 1.
In a coding section, a speech signal is input to LPC analysis section 1101 and perceptual weighting filter 1102. LPC analysis section 1101 performs linear prediction on the speech signal to model the human vocal tract (the shape of the throat). LSP quantizer 1104 quantizes the linear prediction result to obtain LSP information, which is one of the speech parameters.
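By way of illustration only, the linear prediction performed by LPC analysis section 1101 may be sketched as follows. The sketch assumes the autocorrelation method solved by the Levinson-Durbin recursion, which the above description does not specify; the function names, the frame content and the prediction order are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order].

    The predictor is s_hat[n] = sum over k of a[k] * s[n - k].
    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop refining
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Illustrative frame: a decaying signal in which each sample is 0.9 times
# the previous one, so the first coefficient should come out close to 0.9.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```

In an actual coder the resulting coefficients would then be converted to LSP form before quantization; that conversion is omitted from this sketch.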
On the other hand, perceptual weighting filter 1102 modifies the frequency characteristic of the speech signal to improve its perceptual quality. Pitch estimator 1103 computes the pitch of the speech signal that has passed through filter 1102. Harmonic noise shaping filter 1105 adjusts the distortion of the speech signal so that noise or the like contained in the perceptually weighted speech signal output from filter 1102 stays below a threshold. In other words, filter 1105 adjusts the speech quality. Pitch predictor 1106 obtains fed-back speech data that was processed previously. Using this previously processed speech data, pitch predictor 1106 computes the pitch of the current speech signal and generates pitch information (a pitch length and an index for determining whether the sound is voiced or voiceless). Based on the generated pitch information, excitation parameter generator 1107 generates an excitation signal and outputs it to pseudo decoder 1108. Excitation parameter generator 1107 computes the energy of the excitation signal as an excitation parameter (Mamp), and determines the index with which the excitation signal is coded according to the excitation parameter (Mamp). Excitation parameter generator 1107 has an index table in which index numbers and excitation parameters (Mamp) are registered in correspondence with each other. Pseudo decoder 1108 decodes the index once to obtain the excitation signal and returns the excitation signal to pitch predictor 1106 for pitch prediction of the following speech data.
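The relationship between the excitation parameter (Mamp) and its index may be sketched, for illustration only, as a table lookup. The table values, the use of mean energy as Mamp, the nearest-entry selection rule and all function names are assumptions made for this sketch and are not taken from ITU-T Recommendation G.723.1.

```python
# Hypothetical index table: list position = index number, value = Mamp.
# The values are illustrative only.
MAMP_TABLE = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]


def excitation_energy(excitation):
    """Mean energy of the excitation samples, used here as Mamp (assumed)."""
    return sum(x * x for x in excitation) / len(excitation)


def encode_mamp(mamp, table=MAMP_TABLE):
    """Return the index whose registered Mamp is closest to the input."""
    return min(range(len(table)), key=lambda i: abs(table[i] - mamp))


def decode_index(index, table=MAMP_TABLE):
    """Look up the Mamp registered for an index (pseudo decoder side)."""
    return table[index]


# Coding side: compute Mamp, pick an index, and decode it again so that the
# same quantized value is available for predicting the following data.
mamp = excitation_energy([3.0, -3.0, 3.0, -3.0])
index = encode_mamp(mamp)
reconstructed = decode_index(index)
```

Only the index is transmitted; the receiving side is assumed to hold the same table, so that the index alone suffices to recover the quantized Mamp.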
As described above, in coding in accordance with ITU-T Recommendation G.723.1, LSP information, pitch information and excitation parameter information (an index) are generated and transmitted from a transmitting side to a receiving side over a line. The receiving side decodes the information received from the transmitting side to reproduce the speech signal.
In the decoder, the LSP information is input to LSP decoder 1121, the pitch information is input to pitch decoder 1122, and the excitation parameter information is input to excitation parameter decoder 1123. Synthesis filter 1124 is constructed with coefficients corresponding to the decoded LSP information. A signal synthesized from the pitch data decoded by pitch decoder 1122 and the excitation signal decoded by excitation parameter decoder 1123 is input to synthesis filter 1124. The speech signal synthesized by synthesis filter 1124 is then corrected by perceptual weighting filter 1125 to improve its perceptual quality.
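For illustration only, the operation of a synthesis filter of this kind may be sketched as an all-pole recursion driven by the combined excitation. The sketch assumes that the decoded LSP information has already been converted into predictor coefficients; that conversion, the function name and the coefficient values are hypothetical.

```python
def synthesis_filter(excitation, lpc, state=None):
    """All-pole synthesis: s[n] = e[n] + sum over k of lpc[k-1] * s[n-k].

    `state` holds the most recent output samples, newest first, so that
    consecutive frames can be filtered without a discontinuity.
    Returns (output_samples, updated_state).
    """
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order
    out = []
    for e in excitation:
        s = e + sum(a * h for a, h in zip(lpc, history))
        out.append(s)
        history = ([s] + history)[:order]  # keep only the last `order` outputs
    return out, history


# A unit impulse through a first-order filter decays geometrically.
speech, state = synthesis_filter([1.0, 0.0, 0.0, 0.0], [0.5])
```

Returning the filter state allows the next frame to continue from where the previous frame ended, which a frame-by-frame decoder would need.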
As described above, in ITU-T Recommendation G.723.1, the speech signal is decomposed into a plurality of parameters for coding, and the speech signal is decoded on the basis of these parameters.
This coding method is a kind of CELP (Code Excited Linear Prediction) coding. CELP coding combines characteristics of coding that models the speech generation process and of waveform coding, and it generates the excitation parameter in the same way as coding in accordance with ITU-T Recommendation G.723.1.
In speech coding in accordance with ITU-T Recommendation G.723.1, a difference in speech volume arises between the receiving side and the transmitting side, owing to line deterioration or other causes, when speech is communicated over a telephone line or the like. In other words, because the speech at one side is recorded at a higher level while the speech at the other side is recorded at a lower level, the speech that is coded and then decoded becomes hard to listen to.
The above problem is caused by a volume difference between the original speech signals. Controlling the gain of the low-volume speech is expected to prevent the problem. The following method can be considered for this gain control.
A speech signal in which high-volume speech and low-volume speech coexist is reproduced as a waveform. The waveform of the speech signal is sampled and the energy of each sample is computed. The energy of each sample is then subjected to gain control. Specifically, the gain control raises the energy of the low-volume speech to the same level as that of the high-volume speech while keeping the energy of the high-volume speech at its existing level.
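The waveform-domain gain control described above may be sketched, for illustration only, as follows. The sketch measures the level over short blocks of samples rather than sample by sample, and takes the loudest block as the target level; the block length, the silence floor and the function names are assumptions made for this sketch.

```python
import math


def block_rms(block):
    """Root-mean-square level of one block of waveform samples."""
    return math.sqrt(sum(x * x for x in block) / len(block))


def equalize_volume(samples, block_len, floor=1e-9):
    """Raise quiet blocks to the level of the loudest block.

    The loudest block keeps a gain of 1; blocks whose level is at or
    below `floor` are treated as silence and left unchanged.
    """
    blocks = [samples[i:i + block_len]
              for i in range(0, len(samples), block_len)]
    levels = [block_rms(b) for b in blocks]
    target = max(levels, default=0.0)
    out = []
    for block, level in zip(blocks, levels):
        gain = target / level if level > floor else 1.0
        out.extend(x * gain for x in block)
    return out


# A loud block followed by a quiet block; the quiet block is amplified.
loud_then_quiet = [1.0, -1.0, 1.0, -1.0, 0.25, -0.25, 0.25, -0.25]
levelled = equalize_volume(loud_then_quiet, block_len=4)
```

Every reproduced sample is read once to measure the level and once more to apply the gain, and the whole waveform must be held in memory while this is done.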
As described above, when high-volume speech and low-volume speech are both present, the volume of the decoded speech signal is made constant by controlling the gain of the low-volume speech signal. Applying this method to speech decoding in accordance with ITU-T Recommendation G.723.1 can be considered.
However, in this case the following problems remain.
That is, it is necessary to sample the waveform of the reproduced speech signal. This sampling must be performed at a high sampling frequency, which results in a large number of samples. Consequently, a large memory capacity must be reserved to store the sampled data, and a large amount of computation is required to process the sampled data for the gain control, resulting in a heavy CPU load and a low decoding rate.
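The scale of this burden may be estimated roughly as follows. The sketch is illustrative only: the sampling frequency, the sample width and the count of two multiplications per sample (one to accumulate energy and one to apply the gain) are assumptions, and the function name is hypothetical.

```python
def waveform_gain_cost(sample_rate_hz, duration_ms, bytes_per_sample=2):
    """Estimate the storage and arithmetic needed for waveform gain control."""
    samples = sample_rate_hz * duration_ms // 1000
    return {
        "samples": samples,
        "buffer_bytes": samples * bytes_per_sample,
        # Assumed: one multiply for the energy, one for applying the gain.
        "multiplies": 2 * samples,
    }


# One second of speech at an assumed 8 kHz sampling frequency.
cost = waveform_gain_cost(8000, 1000)
```

Under these assumptions the cost grows in direct proportion to the sampling frequency and to the duration of the speech being processed.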