A number of conventional speech encoding apparatuses generate speech codes by separating input speech into spectrum envelope information and sound source information, and by encoding them frame by frame with a specified length. The most typical speech encoding apparatuses are those that use a CELP (Code Excited Linear Prediction) scheme.
FIG. 1 is a block diagram showing a configuration of a conventional CELP speech encoding apparatus. In FIG. 1, the reference numeral 1 designates a linear prediction analyzer for analyzing the input speech to extract linear prediction coefficients constituting the spectrum envelope information of the input speech. The reference numeral 2 designates a linear prediction coefficient encoder for encoding the linear prediction coefficients the linear prediction analyzer 1 extracts, and for supplying the encoding result to a multiplexer 6. It also supplies the quantized values of the linear prediction coefficients to an adaptive excitation encoder 3, fixed excitation encoder 4 and gain encoder 5.
The reference numeral 3 designates the adaptive excitation encoder for generating temporary synthesized speech using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs. It selects adaptive excitation code that will minimize the distance between the temporary synthesized speech and input speech and supplies it to the multiplexer 6. It also supplies the gain encoder 5 with an adaptive excitation signal (time series vectors formed by cyclically repeating the past excitation signal with a specified length) corresponding to the adaptive excitation code. The reference numeral 4 designates the fixed excitation encoder for generating temporary synthesized speech using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs. It selects the fixed excitation code that will minimize the distance between the temporary synthesized speech and a target signal to be encoded (signal obtained by subtracting the synthesized speech based on the adaptive excitation signal from the input speech), and supplies it to the multiplexer 6. It also supplies the gain encoder 5 with the fixed excitation signal consisting of the time series vectors corresponding to the fixed excitation code.
The reference numeral 5 designates a gain encoder for generating an excitation signal by multiplying the adaptive excitation signal the adaptive excitation encoder 3 outputs and the fixed excitation signal the fixed excitation encoder 4 outputs by the individual elements of gain vectors, and by summing up the products of the multiplications. It also generates temporary synthesized speech from the excitation signal using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs. Then, it selects the gain code that will minimize the distance between the temporary synthesized speech and input speech, and supplies it to the multiplexer 6. The reference numeral 6 designates the multiplexer for outputting the speech code by multiplexing the code of the linear prediction coefficients the linear prediction coefficient encoder 2 encodes, the adaptive excitation code the adaptive excitation encoder 3 outputs, the fixed excitation code the fixed excitation encoder 4 outputs and the gain code the gain encoder 5 outputs.
FIG. 2 is a block diagram showing an internal configuration of the fixed excitation encoder 4. In FIG. 2, the reference numeral 11 designates a fixed excitation codebook, 12 designates a synthesis filter, 13 designates a distortion calculator and 14 designates a distortion estimator.
Next, the operation will be described.
The conventional speech encoding apparatus carries out its processing frame by frame with a length of about 5-50 ms.
First, encoding of the spectrum envelope information will be described.
Receiving the input speech, the linear prediction analyzer 1 analyzes the input speech to extract the linear prediction coefficients constituting the spectrum envelope information of the speech.
When the linear prediction analyzer 1 extracts the linear prediction coefficients, the linear prediction coefficient encoder 2 encodes the linear prediction coefficients, and supplies the code to the multiplexer 6. In addition, it supplies the quantized values of the linear prediction coefficients to the adaptive excitation encoder 3, fixed excitation encoder 4 and gain encoder 5.
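As a rough illustration of the two steps above, the following Python sketch extracts linear prediction coefficients from autocorrelation values by the Levinson-Durbin recursion and then quantizes them. The function names, the sign convention of the predictor, and the uniform scalar quantizer with a step of 1/64 are assumptions made for this sketch only; the description does not specify how the linear prediction analyzer 1 or the linear prediction coefficient encoder 2 is implemented.

```python
def autocorrelation(frame, order):
    """Return the autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Return (a[1..order], residual energy) for s[n] ~ sum(a[k] * s[n-k])."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:          # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def quantize_uniform(coeffs, step=1.0 / 64):
    """Hypothetical encoder: return (integer codes, quantized values)."""
    codes = [round(c / step) for c in coeffs]
    return codes, [c * step for c in codes]


# r[k] = 0.5 ** k is the autocorrelation shape of a first-order process.
lpc, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
codes, quantized = quantize_uniform(lpc)
```

In this sketch, `codes` stands for what would be sent to the multiplexer 6, and `quantized` for the values shared with the adaptive excitation encoder 3, fixed excitation encoder 4 and gain encoder 5.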
Next, encoding of the sound source information will be described.
The adaptive excitation encoder 3 includes an adaptive excitation codebook for storing past excitation signals with a specified length. It generates the time series vectors by cyclically repeating the past excitation signals in response to the internally generated adaptive excitation codes, each of which is represented by a few-bit binary number.
Subsequently, the adaptive excitation encoder 3 multiplies the individual time series vectors by an appropriate gain factor. Then, it generates the temporary synthesized speech by passing the individual time series vectors through a synthesis filter that uses the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs.
The adaptive excitation encoder 3 further detects, as the encoding distortion, the distance between the temporary synthesized speech and the input speech, for example. It then selects the adaptive excitation code that will minimize the distance, and supplies it to the multiplexer 6. At the same time, it supplies the gain encoder 5 with a time series vector corresponding to the adaptive excitation code as the adaptive excitation signal.
In addition, the adaptive excitation encoder 3 supplies the fixed excitation encoder 4 with the signal which is obtained by subtracting the synthesized speech based on the adaptive excitation signal from the input speech, as the target signal to be encoded.
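The adaptive excitation search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the apparatus itself: the synthesis filter starts from a zero state, the "appropriate gain factor" is taken to be the least-squares gain for each candidate, each adaptive excitation code is assumed to index a list of candidate repetition lengths (`lags`), and all function names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter driven by `excitation` (zero initial state)."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def adaptive_vector(past_excitation, lag, length):
    """Cyclically repeat the last `lag` samples of the past excitation."""
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(length)]


def search_adaptive(speech, past_excitation, lpc, lags):
    """Return (code, lag, gain, target) minimizing the squared distance."""
    best = None
    for code, lag in enumerate(lags):
        vector = adaptive_vector(past_excitation, lag, len(speech))
        synth = synthesize(vector, lpc)
        energy = sum(y * y for y in synth)
        corr = sum(x * y for x, y in zip(speech, synth))
        gain = corr / energy if energy > 0.0 else 0.0
        dist = sum((x - gain * y) ** 2 for x, y in zip(speech, synth))
        if best is None or dist < best[0]:
            best = (dist, code, lag, gain, synth)
    _, code, lag, gain, synth = best
    # Target for the fixed excitation encoder: the input speech minus the
    # synthesized speech based on the adaptive excitation signal.
    target = [x - gain * y for x, y in zip(speech, synth)]
    return code, lag, gain, target


past = [1.0, 0.0, 0.0, 0.0] * 3          # past excitation with period 4
lpc = [0.5]
speech = [2.0 * y for y in synthesize(adaptive_vector(past, 4, 8), lpc)]
code, lag, gain, target = search_adaptive(speech, past, lpc, lags=[3, 4, 5])
```

With this toy input, the search picks the lag that matches the period of the past excitation, and the remaining target is essentially zero.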
Next, the operation of the fixed excitation encoder 4 will be described.
The fixed excitation codebook 11 of the fixed excitation encoder 4 stores the fixed code vectors consisting of multiple noise-like time series vectors. It sequentially outputs the time series vectors in response to the individual fixed excitation codes which are each represented by a few-bit binary number output from the distortion estimator 14. The individual time series vectors are multiplied by an appropriate gain factor, and supplied to the synthesis filter 12.
The synthesis filter 12 generates a temporary synthesized speech composed of the gain-multiplied individual time series vectors using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs.
The distortion calculator 13 calculates, as the encoding distortion, the distance between the temporary synthesized speech and the target signal to be encoded that the adaptive excitation encoder 3 outputs, for example.
The distortion estimator 14 selects the fixed excitation code that will minimize the distance, calculated by the distortion calculator 13, between the temporary synthesized speech and the target signal to be encoded, and supplies it to the multiplexer 6. It also provides the fixed excitation codebook 11 with an instruction to supply the time series vector corresponding to the selected fixed excitation code to the gain encoder 5 as the fixed excitation signal.
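The loop formed by the fixed excitation codebook 11, synthesis filter 12, distortion calculator 13 and distortion estimator 14 can be sketched as below. The sketch assumes a squared-error distance, a zero initial filter state, and a least-squares gain per candidate as the "appropriate gain factor"; the function names and the tiny three-entry codebook are hypothetical.

```python
def synthesize(excitation, lpc):
    """Synthesis filter 12: all-pole filter with zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def distortion(target, synth):
    """Distortion calculator 13: (squared distance, gain that achieves it)."""
    energy = sum(y * y for y in synth)
    if energy == 0.0:
        return sum(t * t for t in target), 0.0
    gain = sum(t * y for t, y in zip(target, synth)) / energy
    return sum((t - gain * y) ** 2 for t, y in zip(target, synth)), gain


def search_fixed(target, codebook, lpc):
    """Distortion estimator 14: try every code and keep the minimum."""
    best_code, best_dist, best_gain = -1, float("inf"), 0.0
    for code, vector in enumerate(codebook):
        dist, gain = distortion(target, synthesize(vector, lpc))
        if dist < best_dist:
            best_code, best_dist, best_gain = code, dist, gain
    # The selected vector is the fixed excitation signal for the gain encoder.
    return best_code, codebook[best_code], best_gain


codebook = [[1.0, -1.0, 1.0, -1.0],
            [1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
lpc = [0.5]
target = synthesize([0.0, 0.0, 3.0, 0.0], lpc)
code, fixed_signal, gain = search_fixed(target, codebook, lpc)
```

Here the third vector reproduces the target exactly once scaled, so it is the one selected.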
The gain encoder 5 includes a gain codebook for storing gain vectors, and sequentially reads the gain vectors from the gain codebook in response to the internally generated gain codes, each of which is represented by a few-bit binary number.
Subsequently, the gain encoder 5 generates the excitation signal by multiplying the adaptive excitation signal the adaptive excitation encoder 3 outputs and the fixed excitation signal the fixed excitation encoder 4 outputs by the elements of the individual gain vectors, and by summing up the resultant products of the multiplications.
Then, the excitation signal is passed through a synthesis filter using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs, to generate temporary synthesized speech.
Subsequently, the gain encoder 5 detects, as the encoding distortion, the distance between the temporary synthesized speech and the input speech, for example. It then selects the gain code that will minimize the distance, and supplies it to the multiplexer 6. In addition, the gain encoder 5 supplies the excitation signal corresponding to the gain code to the adaptive excitation encoder 3. In response to the excitation signal corresponding to the gain code the gain encoder 5 selects, the adaptive excitation encoder 3 updates its adaptive excitation codebook.
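The gain search and the subsequent update of the adaptive excitation codebook can be sketched as follows. It is assumed here, for illustration only, that each gain vector is a pair (adaptive gain, fixed gain), that the distance is a squared error, and that the adaptive excitation codebook is a fixed-length history of past excitation samples; the function names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter with zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def search_gain(speech, adaptive, fixed, gain_codebook, lpc):
    """Return (gain code, excitation signal) minimizing the squared distance."""
    best = None
    for code, (g_adaptive, g_fixed) in enumerate(gain_codebook):
        # Weight both excitation signals by the gain vector and sum them.
        excitation = [g_adaptive * a + g_fixed * f
                      for a, f in zip(adaptive, fixed)]
        synth = synthesize(excitation, lpc)
        dist = sum((x - y) ** 2 for x, y in zip(speech, synth))
        if best is None or dist < best[0]:
            best = (dist, code, excitation)
    return best[1], best[2]


def update_adaptive_codebook(past_excitation, excitation):
    """Append the new excitation and keep the history length unchanged."""
    return (past_excitation + excitation)[-len(past_excitation):]


adaptive = [1.0, 0.0, 0.0, 0.0]
fixed = [0.0, 1.0, 0.0, 0.0]
gain_codebook = [(0.5, 0.5), (1.0, 2.0), (2.0, 1.0)]
lpc = [0.5]
speech = synthesize([1.0, 2.0, 0.0, 0.0], lpc)
gain_code, excitation = search_gain(speech, adaptive, fixed, gain_codebook, lpc)
history = update_adaptive_codebook([0.0] * 6, excitation)
```

The updated `history` is what the adaptive excitation search would draw on when encoding the next frame.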
The multiplexer 6 multiplexes the code of the linear prediction coefficients the linear prediction coefficient encoder 2 encodes, the adaptive excitation code the adaptive excitation encoder 3 outputs, the fixed excitation code the fixed excitation encoder 4 outputs, and the gain code the gain encoder 5 outputs, thereby outputting the multiplexing result as the speech code.
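One simple way to picture the multiplexing is as packing the four codes into a single bit field. The sketch below assumes hypothetical bit widths (4, 3, 5 and 2 bits) and a most-significant-first field order; neither is taken from the description, which does not define the format of the speech code.

```python
def multiplex(fields):
    """Pack (value, bit_width) pairs into one integer, first field on top."""
    word, total_bits = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("code does not fit in its bit width")
        word = (word << width) | value
        total_bits += width
    return word, total_bits


def demultiplex(word, widths):
    """Inverse of multiplex(): recover the codes given the same bit widths."""
    values = []
    for width in reversed(widths):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return values[::-1]


# Hypothetical per-frame codes: LPC, adaptive, fixed, gain.
speech_code, bits = multiplex([(5, 4), (3, 3), (9, 5), (2, 2)])
decoded = demultiplex(speech_code, [4, 3, 5, 2])
```

A decoder that knows the same widths can split the speech code back into its four component codes, as `demultiplex` does here.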
Next, a conventional technique that improves the foregoing CELP speech encoding apparatus will be described.
Japanese patent application laid-open No. 5-108098/1993 (Reference 1), and Ehara et al., “An Improved Low Bit-rate ACELP Speech Coding”, page 1,227 of Information and System 1 of the Proceeding of the 1999 IEICE General Conference of the Institute of Electronics, Information and Communication Engineers of Japan, (Reference 2) each disclose a CELP speech encoding apparatus that includes fixed excitation codebooks as multiple fixed excitation generators, for the purpose of providing high-quality speech even at a low bit rate. These conventional configurations include a fixed excitation codebook for generating a plurality of noise-like time series vectors and a fixed excitation codebook for generating a plurality of non-noise-like (pulse-like) time series vectors.
The non-noise-like time series vectors are time series vectors consisting of a pulse train with a pitch period in Reference 1, and time series vectors with an algebraic excitation structure consisting of a small number of pulses in Reference 2.
FIG. 3 is a block diagram showing an internal configuration of the fixed excitation encoder 4 including a plurality of fixed excitation codebooks. The speech encoding apparatus has the same configuration as that of FIG. 1 except for the fixed excitation encoder 4.
In FIG. 3, the reference numeral 21 designates a first fixed excitation codebook for storing multiple noise-like time series vectors; 22 designates a first synthesis filter; 23 designates a first distortion calculator; 24 designates a second fixed excitation codebook for storing multiple non-noise-like time series vectors; 25 designates a second synthesis filter; 26 designates a second distortion calculator; and 27 designates a distortion estimator.
Next, the operation will be described.
The first fixed excitation codebook 21 stores the fixed code vectors consisting of the multiple noise-like time series vectors, and sequentially outputs the time series vectors in response to the individual fixed excitation codes the distortion estimator 27 outputs. Subsequently, the individual time series vectors are multiplied by an appropriate gain factor and supplied to the first synthesis filter 22.
The first synthesis filter 22 generates temporary synthesized speech corresponding to the gain-multiplied individual time series vectors using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs.
The first distortion calculator 23 calculates as the encoding distortion, the distance between the temporary synthesized speech and the target signal to be encoded the adaptive excitation encoder 3 outputs, and supplies it to the distortion estimator 27.
On the other hand, the second fixed excitation codebook 24 stores the fixed code vectors consisting of the multiple non-noise-like time series vectors, and sequentially outputs the time series vectors in response to the individual fixed excitation codes the distortion estimator 27 outputs. Subsequently, the individual time series vectors are multiplied by an appropriate gain factor, and supplied to the second synthesis filter 25.
The second synthesis filter 25 generates temporary synthesized speech corresponding to the gain-multiplied individual time series vectors using the quantized values of the linear prediction coefficients the linear prediction coefficient encoder 2 outputs.
The second distortion calculator 26 calculates as the encoding distortion, the distance between the temporary synthesized speech and the target signal to be encoded the adaptive excitation encoder 3 outputs, and supplies it to the distortion estimator 27.
The distortion estimator 27 selects the fixed excitation code that will minimize the distance between the temporary synthesized speech and the target signal to be encoded, and supplies it to the multiplexer 6. It also provides the first fixed excitation codebook 21 or second fixed excitation codebook 24 with an instruction to supply the gain encoder 5 with the time series vectors corresponding to the selected fixed excitation code as the fixed excitation signal.
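The selection across the two fixed excitation codebooks of FIG. 3 can be sketched as a single minimum search over both. The sketch assumes a squared-error distance with a least-squares gain per candidate and a zero initial filter state, and it represents the result as a (codebook index, code) pair; that representation, the function names and the toy codebooks are assumptions for illustration.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter with zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        y = x + sum(a * out[n - k]
                    for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(y)
    return out


def distortion(target, synth):
    """Squared distance after applying the least-squares gain."""
    energy = sum(y * y for y in synth)
    if energy == 0.0:
        return sum(t * t for t in target)
    gain = sum(t * y for t, y in zip(target, synth)) / energy
    return sum((t - gain * y) ** 2 for t, y in zip(target, synth))


def search_two_codebooks(target, noise_codebook, pulse_codebook, lpc):
    """Return (codebook index, code, vector) with the smallest distortion."""
    best = None
    for book_id, codebook in enumerate((noise_codebook, pulse_codebook)):
        for code, vector in enumerate(codebook):
            dist = distortion(target, synthesize(vector, lpc))
            if best is None or dist < best[0]:
                best = (dist, book_id, code, vector)
    return best[1], best[2], best[3]


noise_codebook = [[0.3, -0.7, 0.5, -0.2], [-0.6, 0.1, 0.4, 0.8]]
pulse_codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
lpc = [0.5]
target = synthesize([0.0, 0.0, 2.0, 0.0], lpc)
book_id, code, fixed_signal = search_two_codebooks(
    target, noise_codebook, pulse_codebook, lpc)
```

Because the comparison uses the raw distance only, whichever codebook happens to contain the closest vector wins, regardless of whether its vectors are noise-like or pulse-like.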
Japanese patent application laid-open No. 5-273999/1993 (Reference 3) discloses the following method in the configuration including the multiple fixed excitation codebooks. To prevent the fixed excitation codebooks from being switched frequently in steady sections of vowels and the like, it categorizes the input speech according to its acoustic characteristics, and reflects the resultant categories in the distortion evaluation for selecting the fixed excitation code.
With the foregoing configurations, the conventional speech encoding apparatuses each include multiple fixed excitation codebooks that generate different types of time series vectors, and select the time series vectors that will give the minimum distance between the temporary synthesized speech generated from the individual time series vectors and the target signal to be encoded (see FIG. 3). Here, the non-noise-like (pulse-like) time series vectors are likely to have a smaller distance between the temporary synthesized speech and the target signal to be encoded than the noise-like time series vectors, and hence to be selected more frequently.
However, when the non-noise-like (pulse-like) time series vectors are selected frequently, the sound quality also becomes pulse-like, posing a problem in that the subjective sound quality is not always the best.
In addition, in the sections where the target signal to be encoded or the input speech has noise-like quality, there arises a problem in that the subjective degradation of the sound quality becomes conspicuous due to the pulse-like characteristic resulting from the frequent selection of non-noise-like (pulse-like) time series vectors.
Furthermore, when the apparatus includes multiple fixed excitation codebooks, the ratios at which the individual fixed excitation codebooks are selected depend on the number of the time series vectors the individual fixed excitation codebooks generate, and the fixed excitation codebooks that generate a larger number of time series vectors are likely to be selected more often.
Thus, it would be possible to achieve the best subjective quality by adjusting the ratios at which the individual fixed excitation codebooks are selected, by varying the number of the time series vectors the individual fixed excitation codebooks generate.
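The dependence of the selection ratio on codebook size can be illustrated with a small simulation. To isolate the size effect, the sketch assumes that both codebooks hold random Gaussian vectors of the same kind, omits the synthesis filter, and uses a squared-error distance with a least-squares gain; the sizes 4 and 28, the vector dimension and the trial count are arbitrary choices, and the function names are hypothetical.

```python
import random


def best_book(target, books):
    """Return the index of the codebook holding the closest vector."""
    best_dist, best_id = float("inf"), -1
    target_energy = sum(t * t for t in target)
    for book_id, book in enumerate(books):
        for vec in book:
            energy = sum(v * v for v in vec)
            corr = sum(t * v for t, v in zip(target, vec))
            # Squared distance left after applying the least-squares gain.
            gain_term = corr * corr / energy if energy > 0.0 else 0.0
            dist = target_energy - gain_term
            if dist < best_dist:
                best_dist, best_id = dist, book_id
    return best_id


def selection_counts(sizes, dim=8, trials=200, seed=0):
    """Count how often each codebook wins over `trials` random targets."""
    rng = random.Random(seed)
    counts = [0] * len(sizes)
    for _ in range(trials):
        target = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        books = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(size)] for size in sizes]
        counts[best_book(target, books)] += 1
    return counts


counts = selection_counts([4, 28])   # small codebook vs. large codebook
```

Under these assumptions every vector is equally likely to be the closest one, so the larger codebook is expected to win in roughly the proportion of vectors it holds.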
However, even if the number of the time series vectors to be generated is the same, different configurations of the individual fixed excitation codebooks will require different memory capacities and processing loads for encoding. For example, when using the fixed excitation codebook for generating a pulse train with a pitch period, both the memory capacity and processing load are very small. In contrast, when storing and using time series vectors obtained through distortion minimization learning on speech, both the memory capacity and processing load are large. Accordingly, the number of the time series vectors the individual fixed excitation codebooks can generate is restricted by the scale and performance of the hardware that implements the speech coding scheme. Consequently, the ratios at which the individual fixed excitation codebooks are selected cannot be optimized, posing a problem in that the subjective quality is not always the best.
Japanese patent application laid-open No. 5-273999/1993 (Reference 3) can circumvent the frequent switching of the fixed excitation codebooks to be selected in the steady sections of the vowels. However, it does not try to improve the subjective quality of the encoding result of the individual frames. On the contrary, it has a problem of degrading the subjective quality because of successive pulse-like sound sources.
Moreover, the foregoing problems are not solved at all when the target signal to be encoded or the input speech has noise-like quality, or the hardware has restrictions.
The present invention is implemented to solve the foregoing problems. Therefore, an object of the present invention is to provide a speech encoding apparatus and speech encoding method capable of obtaining subjectively high-quality speech code by making effective use of the multiple fixed excitation codebooks.