FIGS. 1 and 2 show the configuration of an example of a conventional mobile phone.
In this mobile phone, a transmission process of coding speech into a predetermined code by a CELP method and transmitting the codes, and a receiving process of receiving codes transmitted from other mobile phones and decoding the codes into speech are performed. FIG. 1 shows a transmission section for performing the transmission process, and FIG. 2 shows a receiving section for performing the receiving process.
In the transmission section shown in FIG. 1, speech produced by a user is input to a microphone 1, whereby the speech is converted into a speech signal as an electrical signal, and the signal is supplied to an A/D (Analog/Digital) conversion section 2. The A/D conversion section 2 samples the analog speech signal from the microphone 1, for example, at a sampling frequency of 8 kHz, so that the analog speech signal undergoes A/D conversion into a digital speech signal. Furthermore, the A/D conversion section 2 quantizes the signal with a predetermined number of bits and supplies the signal to an arithmetic unit 3 and an LPC (Linear Prediction Coefficient) analysis section 4.
The LPC analysis section 4 assumes a length of, for example, 160 samples of the speech signal from the A/D conversion section 2 to be one frame, divides that frame into subframes of 40 samples each, and performs LPC analysis for each subframe in order to determine linear predictive coefficients α1, α2, . . . , αP of the P-th order. Then, the LPC analysis section 4 supplies a vector in which these linear predictive coefficients αp (p=1, 2, . . . , P) are elements, as a speech feature vector α, to a vector quantization section 5.
The vector quantization section 5 stores a codebook in which code vectors having linear predictive coefficients as elements correspond to codes, performs vector quantization on the feature vector α from the LPC analysis section 4 on the basis of the codebook, and supplies the code (hereinafter referred to as an "A code" as appropriate) obtained as a result of the vector quantization to a code determination section 15.
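The codebook search described above can be sketched as a nearest-neighbor lookup. The following is a minimal illustration, not the actual quantizer of the specification; the codebook values and the two-tap order are made up for the example.

```python
import numpy as np

def vector_quantize(alpha, codebook):
    """Return (a_code, code_vector): the index and entry of the codebook
    row nearest (in squared error) to the feature vector alpha."""
    distances = np.sum((codebook - alpha) ** 2, axis=1)  # squared error per entry
    a_code = int(np.argmin(distances))                   # index serves as the A code
    return a_code, codebook[a_code]

# Toy 2-tap codebook (hypothetical values, for illustration only)
codebook = np.array([[0.9, -0.2], [0.5, 0.1], [-0.3, 0.4]])
a_code, alpha_q = vector_quantize(np.array([0.52, 0.08]), codebook)
```

Here `alpha_q` plays the role of the code vector α′ whose elements α1′, . . . , αP′ are passed on to the speech synthesis filter.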
Furthermore, the vector quantization section 5 supplies linear predictive coefficients α1′, α2′, . . . , αP′, which are elements forming the code vector α′ corresponding to the A code, to a speech synthesis filter 6.
The speech synthesis filter 6 is, for example, an IIR (Infinite Impulse Response) type digital filter, which assumes a linear predictive coefficient αp′ (p=1, 2, . . . , P) from the vector quantization section 5 to be a tap coefficient of the IIR filter and assumes a residual signal e supplied from an arithmetic unit 14 to be an input signal, to perform speech synthesis.
More specifically, the LPC analysis performed by the LPC analysis section 4 assumes that, for the sample value sn of the speech signal at the current time n and the past P sample values sn−1, sn−2, . . . , sn−P adjacent to it, a linear combination represented by the following equation holds:

sn+α1sn−1+α2sn−2+ . . . +αPsn−P=en  (1)

When linear prediction of a prediction value (linear prediction value) sn′ of the sample value sn at the current time n is performed using the past P sample values sn−1, sn−2, . . . , sn−P on the basis of the following equation:

sn′=−(α1sn−1+α2sn−2+ . . . +αPsn−P)  (2)

a linear predictive coefficient αp that minimizes the square error between the actual sample value sn and the linear prediction value sn′ is determined.
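The minimization in equations (1) and (2) can be sketched as an ordinary least-squares problem. The direct `lstsq` solve below is for clarity only; practical coders typically use the Levinson-Durbin recursion on autocorrelation values instead.

```python
import numpy as np

def lpc_analyze(s, P):
    """Return alpha_1..alpha_P minimizing the sum of e_n^2 in equation (1)."""
    # Each row of X holds the past P samples [s_{n-1}, ..., s_{n-P}]
    X = np.column_stack([s[P - p:len(s) - p] for p in range(1, P + 1)])
    y = s[P:]
    # Equation (2): s_n' = -(alpha_1 s_{n-1} + ... + alpha_P s_{n-P}),
    # so minimizing (s_n - s_n')^2 means solving X @ alpha ≈ -y.
    alpha, *_ = np.linalg.lstsq(X, -y, rcond=None)
    return alpha

# A pure AR(1) signal s_n = 0.9 s_{n-1} recovers alpha_1 = -0.9 exactly,
# since then s_n + alpha_1 s_{n-1} = e_n = 0 in equation (1).
s = 0.9 ** np.arange(64)
alpha = lpc_analyze(s, 1)
```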
Here, in equation (1), {en} ( . . . , en−1, en, en+1, . . . ) are random variables that are uncorrelated with each other, whose average value is 0 and whose variance is a predetermined value σ2.
Based on equation (1), the sample value sn can be expressed by the following equation:

sn=en−(α1sn−1+α2sn−2+ . . . +αPsn−P)  (3)

When this is subjected to Z-transformation, the following equation is obtained:

S=E/(1+α1z−1+α2z−2+ . . . +αPz−P)  (4)

where, in equation (4), S and E represent the Z-transforms of sn and en in equation (3), respectively.
Here, based on equations (1) and (2), en can be expressed by the following equation:

en=sn−sn′  (5)

and this is called the "residual signal" between the actual sample value sn and the linear prediction value sn′.
Therefore, based on equation (4), the speech signal sn can be determined by assuming the linear predictive coefficient αp to be a tap coefficient of the IIR filter and by assuming the residual signal en to be an input signal of the IIR filter.
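The all-pole filtering of equations (3) and (4) can be sketched directly as a sample-by-sample recursion. This is an illustrative implementation of the filter's difference equation, not the actual circuit of the speech synthesis filter 6.

```python
import numpy as np

def synthesize(e, alpha):
    """Compute s_n = e_n - (alpha_1 s_{n-1} + ... + alpha_P s_{n-P}),
    i.e. equation (3), with the residual e as the filter input."""
    P = len(alpha)
    s = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for p in range(1, P + 1):
            if n - p >= 0:
                acc -= alpha[p - 1] * s[n - p]
        s[n] = acc
    return s

# With a single tap alpha_1 = -0.9 and a unit-impulse residual,
# the synthesized signal decays as 0.9**n, as equation (3) predicts.
s = synthesize(np.array([1.0, 0.0, 0.0, 0.0]), np.array([-0.9]))
```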
Therefore, as described above, the speech synthesis filter 6 assumes the linear predictive coefficient αp′ from the vector quantization section 5 to be a tap coefficient, assumes the residual signal e supplied from the arithmetic unit 14 to be an input signal, and computes equation (4) in order to determine a speech signal (synthesized speech data) ss.
In the speech synthesis filter 6, a linear predictive coefficient αp′ as a code vector corresponding to the code obtained as a result of the vector quantization is used instead of the linear predictive coefficient αp obtained as a result of the LPC analysis by the LPC analysis section 4. As a result, basically, the synthesized speech signal output from the speech synthesis filter 6 does not become the same as the speech signal output from the A/D conversion section 2.
The synthesized speech data ss output from the speech synthesis filter 6 is supplied to the arithmetic unit 3. The arithmetic unit 3 subtracts the speech signal s output by the A/D conversion section 2 from the synthesized speech data ss from the speech synthesis filter 6 (that is, subtracts from each sample of the synthesized speech data ss the corresponding sample of the speech data s), and supplies the difference to a square-error computation section 7. The square-error computation section 7 computes the sum of squares of the differences from the arithmetic unit 3 (the sum of squares of the difference at each sample value of the k-th subframe) and supplies the resulting square error to a least-square error determination section 8.
The least-square error determination section 8 has stored therein an L code (L_code) as a code indicating a long-term prediction lag, a G code (G_code) as a code indicating a gain, and an I code (I_code) as a code indicating a codeword (excitation codebook) in such a manner as to correspond to the square error output from the square-error computation section 7, and outputs the L code, the G code, and the I code corresponding to the square error output from the square-error computation section 7. The L code is supplied to an adaptive codebook storage section 9, the G code to a gain decoder 10, and the I code to an excitation-codebook storage section 11. Furthermore, the L code, the G code, and the I code are also supplied to the code determination section 15.
The adaptive codebook storage section 9 has stored therein an adaptive codebook in which, for example, a 7-bit L code corresponds to a predetermined delay time (lag). The adaptive codebook storage section 9 delays the residual signal e supplied from the arithmetic unit 14 by a delay time (a long-term prediction lag) corresponding to the L code supplied from the least-square error determination section 8 and outputs the signal to an arithmetic unit 12.
Here, since the adaptive codebook storage section 9 delays the residual signal e by a time corresponding to the L code and outputs the signal, the output signal becomes close to a periodic signal whose period is the delay time. In speech synthesis using linear predictive coefficients, this signal becomes mainly a driving signal for generating synthesized speech of voiced sound. Therefore, the L code conceptually represents the pitch period of the speech. According to the CELP standards, the L code takes an integer value in the range 20 to 146.
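The adaptive codebook operation can be sketched as repeating the most recent lag's worth of past residual to produce a nearly periodic driving signal. This is an illustrative simplification; per the text, the actual 7-bit L code indexes lags in the range 20 to 146.

```python
import numpy as np

def adaptive_codebook_output(past_residual, lag, length):
    """Repeat the last `lag` residual samples to fill `length` output
    samples, yielding a signal whose period equals the lag."""
    segment = past_residual[-lag:]
    reps = int(np.ceil(length / lag))
    return np.tile(segment, reps)[:length]

# With a lag of 20, the output repeats the last 20 past-residual samples.
out = adaptive_codebook_output(np.arange(50, dtype=float), lag=20, length=40)
```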
The gain decoder 10 has stored therein a table in which the G code corresponds to predetermined gains β and γ, and outputs the gains β and γ corresponding to the G code supplied from the least-square error determination section 8. The gains β and γ are supplied to the arithmetic units 12 and 13, respectively. Here, the gain β is what is commonly called a long-term filter status output gain, and the gain γ is what is commonly called an excitation codebook gain.
The excitation-codebook storage section 11 has stored therein an excitation codebook in which, for example, a 9-bit I code corresponds to a predetermined excitation signal, and outputs, to the arithmetic unit 13, the excitation signal which corresponds to the I code supplied from the least-square error determination section 8.
Here, the excitation signal stored in the excitation codebook is, for example, a signal close to white noise, and becomes mainly a driving signal for generating synthesized speech of unvoiced sound in the speech synthesis using linear predictive coefficients.
The arithmetic unit 12 multiplies the output signal of the adaptive codebook storage section 9 by the gain β output from the gain decoder 10 and supplies the multiplied value l to the arithmetic unit 14. The arithmetic unit 13 multiplies the output signal of the excitation-codebook storage section 11 by the gain γ output from the gain decoder 10 and supplies the multiplied value n to the arithmetic unit 14. The arithmetic unit 14 adds the multiplied value l from the arithmetic unit 12 to the multiplied value n from the arithmetic unit 13, and supplies the added value as the residual signal e to the speech synthesis filter 6 and the adaptive codebook storage section 9.
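The combination performed by arithmetic units 12 to 14 can be sketched in a few lines: the adaptive-codebook output is scaled by β, the excitation-codebook output by γ, and their sum forms the residual signal e. The signal and gain values below are illustrative, not taken from any real codebook.

```python
import numpy as np

def build_residual(adaptive_out, excitation_out, beta, gamma):
    """Form the residual e = beta * (adaptive output) + gamma * (excitation)."""
    l = beta * adaptive_out        # arithmetic unit 12
    n = gamma * excitation_out     # arithmetic unit 13
    return l + n                   # arithmetic unit 14

e = build_residual(np.array([1.0, 2.0]), np.array([0.5, -0.5]),
                   beta=0.8, gamma=0.4)
```

The resulting `e` is what feeds both the synthesis filter (as its input signal) and the adaptive codebook (as future past residual).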
In the speech synthesis filter 6, in the manner described above, the residual signal e supplied from the arithmetic unit 14 is filtered by the IIR filter in which the linear predictive coefficient αp′ supplied from the vector quantization section 5 is a tap coefficient, and the resulting synthesized speech data is supplied to the arithmetic unit 3. Then, in the arithmetic unit 3 and the square-error computation section 7, processes similar to the above-described case are performed, and the resulting square error is supplied to the least-square error determination section 8.
The least-square error determination section 8 determines whether or not the square error from the square-error computation section 7 has become a minimum (local minimum). Then, when the least-square error determination section 8 determines that the square error has not become a minimum, the least-square error determination section 8 outputs the L code, the G code, and the I code corresponding to the square error in the manner described above, and hereafter, the same processes are repeated.
On the other hand, when the least-square error determination section 8 determines that the square error has become a minimum, the least-square error determination section 8 outputs a determination signal to the code determination section 15. The code determination section 15 latches the A code supplied from the vector quantization section 5 and sequentially latches the L code, the G code, and the I code supplied from the least-square error determination section 8. When the determination signal is received from the least-square error determination section 8, the code determination section 15 supplies the A code, the L code, the G code, and the I code, which are latched at this time, to a channel encoder 16. The channel encoder 16 multiplexes the A code, the L code, the G code, and the I code from the code determination section 15 and outputs them as code data. This code data is transmitted via a transmission path.
Based on the above, the code data is coded data having the A code, the L code, the G code, and the I code, which are information used for decoding, in units of subframes.
Here, the A code, the L code, the G code, and the I code are determined for each subframe. However, there is a case in which, for example, the A code is determined for each frame; in this case, the same A code is used to decode the four subframes which form that frame. Even so, each of the four subframes forming that one frame can be regarded as having the same A code, so the code data can still be regarded as coded data having the A code, the L code, the G code, and the I code, which are information used for decoding, in units of subframes.
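One hypothetical way the channel encoder 16 might multiplex the four codes of one subframe is simple bit packing. The 7-bit L code and 9-bit I code widths come from the text above; the A and G widths used here are assumptions for illustration only.

```python
# Assumed bit widths: L (7) and I (9) are stated in the text; A and G are
# hypothetical choices for this sketch.
A_BITS, L_BITS, G_BITS, I_BITS = 10, 7, 4, 9

def pack_codes(a, l, g, i):
    """Concatenate the A, L, G, and I codes of one subframe into one word."""
    word = a
    word = (word << L_BITS) | l
    word = (word << G_BITS) | g
    word = (word << I_BITS) | i
    return word

def unpack_codes(word):
    """Recover (a, l, g, i) from a packed subframe word."""
    i = word & ((1 << I_BITS) - 1); word >>= I_BITS
    g = word & ((1 << G_BITS) - 1); word >>= G_BITS
    l = word & ((1 << L_BITS) - 1); word >>= L_BITS
    return word, l, g, i

packed = pack_codes(513, 100, 11, 300)
```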
Here, in FIG. 1 (the same applies also in FIGS. 2, 5, 9, 11, 16, 18, and 21, which will be described later), [k] is assigned to each variable so that the variable is an array variable. This k represents the subframe number, but in the specification, a description thereof is omitted where appropriate.
Next, the code data transmitted from the transmission section of another mobile phone in the above-described manner is received by a channel decoder 21 of the receiving section shown in FIG. 2. The channel decoder 21 separates the L code, the G code, the I code, and the A code from the code data, and supplies them to an adaptive codebook storage section 22, a gain decoder 23, an excitation codebook storage section 24, and a filter coefficient decoder 25, respectively.
The adaptive codebook storage section 22, the gain decoder 23, the excitation codebook storage section 24, and arithmetic units 26 to 28 are formed similarly to the adaptive codebook storage section 9, the gain decoder 10, the excitation-codebook storage section 11, and the arithmetic units 12 to 14 of FIG. 1, respectively. As a result of the same processes as in the case described with reference to FIG. 1 being performed, the L code, the G code, and the I code are decoded into the residual signal e. This residual signal e is provided as an input signal to a speech synthesis filter 29.
The filter coefficient decoder 25 has stored therein the same codebook as that stored in the vector quantization section 5 of FIG. 1, so that the A code is decoded into a linear predictive coefficient αp′ and this is supplied to the speech synthesis filter 29.
The speech synthesis filter 29 is formed similarly to the speech synthesis filter 6 of FIG. 1. The speech synthesis filter 29 assumes the linear predictive coefficient αp′ from the filter coefficient decoder 25 to be a tap coefficient, assumes the residual signal e supplied from an arithmetic unit 28 to be an input signal, and computes equation (4), thereby generating the synthesized speech data obtained when the square error is determined to be a minimum in the least-square error determination section 8 of FIG. 1. This synthesized speech data is supplied to a D/A (Digital/Analog) conversion section 30. The D/A conversion section 30 subjects the synthesized speech data from the speech synthesis filter 29 to D/A conversion from a digital signal into an analog signal, and supplies the analog signal to a speaker 31, whereby the analog signal is output.
When the A codes in the code data are arranged in frame units rather than in subframe units, in the receiving section of FIG. 2, the linear predictive coefficients corresponding to the A code arranged in a frame can be used to decode all four subframes which form that frame. Alternatively, interpolation can be performed for each subframe by using the linear predictive coefficients corresponding to the A codes of adjacent frames, and the linear predictive coefficients obtained as a result of the interpolation can be used to decode each subframe.
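The interpolation mentioned above can be sketched as a per-subframe blend between the coefficient vectors decoded from the A codes of two successive frames. The linear weighting used here is an assumption for illustration; practical coders often interpolate in a transformed domain such as LSP rather than directly on the coefficients.

```python
import numpy as np

def interpolate_lpc(alpha_prev, alpha_cur, num_subframes=4):
    """Return one linear-predictive-coefficient vector per subframe,
    blending from the previous frame's vector toward the current one."""
    out = []
    for k in range(num_subframes):
        w = (k + 1) / num_subframes  # weight moves toward the current frame
        out.append((1 - w) * alpha_prev + w * alpha_cur)
    return out

# With previous-frame coefficients [0, 0] and current [1, -1], the four
# subframes receive weights 0.25, 0.5, 0.75, and 1.0.
sub = interpolate_lpc(np.array([0.0, 0.0]), np.array([1.0, -1.0]))
```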
As described above, in the transmission section of the mobile phone, the residual signal and the linear predictive coefficients, which are provided to the speech synthesis filter 29 of the receiving section as its input signal and tap coefficients, are coded and then transmitted, and in the receiving section the codes are decoded into a residual signal and linear predictive coefficients. However, since the decoded residual signal and linear predictive coefficients (hereinafter referred to as the "decoded residual signal" and "decoded linear predictive coefficients", respectively, as appropriate) contain errors such as quantization errors, they do not match the residual signal and the linear predictive coefficients obtained by performing LPC analysis on the speech.
For this reason, the synthesized speech data output from the speech synthesis filter 29 of the receiving section has deteriorated sound quality containing distortion and the like.