The present invention relates to a low-bit-rate speech encoding/decoding method used for digital telephony, voice memos, and the like.
In recent years, encoding techniques that compress speech and music to a low bit rate for transmission and storage have found wide application in portable telephones and on the Internet. Such techniques include the CELP (Code Excited Linear Prediction) method (M. R. Schroeder and B. S. Atal, “Code Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates”, Proc. ICASSP, pp. 937-940, 1985 (reference 1); and W. B. Kleijn, D. J. Krasinski et al., “Improved Speech Quality and Efficient Vector Quantization in SELP”, Proc. ICASSP, pp. 155-158, 1988 (reference 2)).
CELP is an encoding scheme based on linear predictive analysis. Linear predictive analysis separates an input speech signal into linear prediction coefficients, which represent the phoneme information, and a prediction residual signal, which represents the sound level and the like. The linear prediction coefficients define a recursive digital filter called a synthesis filter, and supplying the prediction residual signal to this filter as an excitation signal restores the original input speech signal.
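The relationship between the synthesis filter and the prediction residual can be illustrated with a minimal Python sketch. The function names `synthesize` and `residual`, the zero initial filter state, and the sign convention A(z) = 1 + sum of lpc[k-1] z^-k are assumptions made for illustration only; they are not taken from the references.

```python
def synthesize(excitation, lpc):
    """Recursive (all-pole) synthesis filter 1/A(z).

    Assumed convention: s[n] = e[n] - sum_k lpc[k-1] * s[n-k],
    with zero initial state.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def residual(signal, lpc):
    """Analysis filter A(z): recovers the prediction residual from the signal."""
    res = []
    for n, s in enumerate(signal):
        acc = s
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc += a * signal[n - k]
        res.append(acc)
    return res


# Exciting the filter with the residual reproduces the signal.
speech = synthesize([1.0, 0.0, 0.0, 0.0], [-0.5])
excitation = residual(speech, [-0.5])
```

With a single coefficient of -0.5, a unit impulse decays by half at each sample, and passing that output back through `residual` returns the impulse.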
For low-bit-rate encoding, both the synthesis filter information (the linear prediction coefficients, which represent the characteristics of the synthesis filter) and the prediction residual signal that excites the synthesis filter must be encoded with as few bits as possible. In the CELP scheme, two types of signal, the pitch vector and the noise vector, are each multiplied by an appropriate gain and added together to generate an excitation signal, which is the encoded form of the prediction residual signal. A method of generating the pitch vector is described in detail in reference 2, for example. Apart from the method of reference 2, a method has also been proposed that uses a fixed code vector at a rising portion (onset portion) of speech. In the present invention, such vectors are also treated as pitch vectors.
The noise vector is normally generated by storing a multiplicity of candidates in a stochastic codebook and selecting the optimum one. To search for a noise vector, each candidate noise vector in turn is added to the pitch vector, and a synthesized speech signal is generated through the synthesis filter. The error of this synthesized speech signal with respect to the input signal is evaluated, and the noise vector that yields the synthesized speech signal with the smallest error is selected. What is most important for the CELP scheme, therefore, is how efficiently the noise vectors are stored in the stochastic codebook.
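The search described above can be sketched as follows. This is a simplified illustration under stated assumptions: the gains are fixed rather than optimized per candidate, the error is a plain squared error without perceptual weighting, and the names `synthesize` and `search_noise_vector` are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter; assumed convention s[n] = e[n] - sum a_k s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def search_noise_vector(target, pitch, g_pitch, codebook, g_noise, lpc):
    """Return (index, error) of the codebook entry with the smallest squared error."""
    best_index, best_error = None, float("inf")
    for index, noise in enumerate(codebook):
        # Excitation = gain-scaled pitch vector + gain-scaled noise vector.
        excitation = [g_pitch * p + g_noise * c for p, c in zip(pitch, noise)]
        synth = synthesize(excitation, lpc)
        error = sum((t - s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0]]
pitch = [0.5, 0.0, 0.25, 0.0]
# Build a target from entry 2 so the search has a known answer.
target = synthesize([1.0 * p + 0.5 * c for p, c in zip(pitch, codebook[2])], [-0.5])
best = search_noise_vector(target, pitch, 1.0, codebook, 0.5, [-0.5])
```

Because every candidate requires one pass through the synthesis filter, the cost of the search grows with the codebook size, which is why the structure of the codebook matters.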
The algebraic codebook (J-P. Adoul et al., “Fast CELP Coding based on algebraic codes”, Proc. ICASSP '87, pp. 1957-1960 (reference 3)) has a simple structure in which the noise vector is indicated only by the presence or absence of a pulse and its sign (+, −). Compared with a stochastic codebook in which a plurality of noise vectors are stored, the algebraic codebook need not store any code vector and requires very little calculation. Moreover, the sound quality of a system using the algebraic codebook is not inferior to that of the prior art, and the algebraic codebook has therefore recently been adopted in various standard schemes.
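As an illustration of why no code vectors need to be stored, the sketch below rebuilds a noise vector from position indices and signs alone. The interleaved two-track layout and the function name `decode_algebraic` are assumptions chosen for the example, not the layout of any particular standard.

```python
def decode_algebraic(length, tracks, position_indices, signs):
    """Rebuild a pulse vector from per-pulse position indices and signs.

    tracks[k] lists the positions allowed for pulse k (assumed layout);
    position_indices[k] selects one of them; signs[k] is +1 or -1.
    """
    vector = [0.0] * length
    for track, index, sign in zip(tracks, position_indices, signs):
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        vector[track[index]] += float(sign)
    return vector


# Hypothetical 8-sample frame with two interleaved tracks of four positions each.
tracks = [[0, 2, 4, 6], [1, 3, 5, 7]]
noise = decode_algebraic(8, tracks, [1, 2], [1, -1])
```

In this layout each pulse costs two bits for its position and one bit for its sign, so the whole vector is described by six bits.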
In the algebraic codebook, however, the deterioration of sound quality becomes more conspicuous as the encoding bit rate decreases. One reason is the shortage of pulse position information. Although the algebraic simplification of the pulse position information gives the advantages described above, at low bit rates position candidates sometimes exist at points where no pulse is needed and are missing at points where one is needed. This degrades both the encoding efficiency and the sound quality.
Another reason for the deterioration of sound quality when using the algebraic codebook is the shortage of pulses. A shortage of pulses gives rise to pulse-like noise in the decoded speech. This is because the excitation signal is generated from a pulse train, and the presence or absence of an individual pulse becomes easier to perceive as the number of pulses decreases. To improve the sound quality, it is necessary to alleviate this pulse-like noise.
As described above, the conventional algebraic codebook has the advantages of a simple structure and a small amount of calculation, but at a low bit rate it poses the problem that the quality of the decoded speech deteriorates due to the shortage of pulses and of positional information for the pulse train making up the excitation signal for the synthesis filter.
The object of the present invention is to provide a speech encoding/decoding method which can secure superior sound quality even in low-bit-rate encoding.
According to a first aspect of the invention, there is provided a speech encoding method comprising the steps of generating at least information representing the characteristics of a synthesis filter for a speech signal, and generating an excitation signal for exciting the synthesis filter, including a pulse train generated by setting pulses at a predetermined number of pulse positions selected from the pulse position candidates adaptively changed in accordance with the characteristics of the speech signal.
According to another aspect of the invention, there is provided a speech decoding method for inputting an excitation signal to a synthesis filter and decoding a speech signal, the excitation signal containing a pulse train generated by setting pulses at a predetermined number of pulse positions selected from the pulse position candidates adaptively changed in accordance with the characteristics of the speech signal.
In a speech encoding/decoding method according to this invention, the excitation signal for exciting the synthesis filter contains a pulse train generated by setting pulses at a predetermined number of pulse positions selected from pulse position candidates that are adaptively changed in accordance with the characteristics of the speech signal. More specifically, the pulse position candidates are assigned in such a manner that more candidates exist in regions where the power of the speech signal is larger.
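One simple way to place more candidates where the power is larger is to keep the samples with the largest squared amplitude. The sketch below shows only this idea. The function name `adaptive_candidates` and the top-M selection rule are assumptions for illustration; the text above does not prescribe this particular rule.

```python
def adaptive_candidates(signal, num_candidates):
    """Return the num_candidates positions with the largest instantaneous power.

    Ties are broken in favor of the earlier position; the result is sorted
    so that a candidate index maps to a position in time order.
    """
    ranked = sorted(range(len(signal)), key=lambda n: (-signal[n] ** 2, n))
    return sorted(ranked[:num_candidates])


frame = [0.1, 0.9, -0.8, 0.05, 0.0, 0.3, -0.7, 0.02]
candidates = adaptive_candidates(frame, 4)
```

With four candidates, a pulse position can be sent in two bits regardless of the frame length, and the candidates follow the energy of the frame rather than a fixed grid.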
Also, the excitation signal can be configured to include a pulse train generated by setting pulses at all the pulse position candidates, which change adaptively in accordance with the characteristics of the speech signal, and optimizing the amplitude of each pulse with predetermined means. In such a case, more specifically, the pulse position candidates are assigned so that more candidates exist in regions where the power of the speech signal is larger.
Alternatively, the excitation signal can be generated by use of either a pulse train generated by setting pulses at a predetermined number of pulse positions selected from first pulse position candidates, which change adaptively in accordance with the characteristics of the speech signal, or a pulse train generated by setting pulses at a predetermined number of pulse positions selected from second pulse position candidates, which include a part or the whole of the positions not used as the first pulse position candidates. In this case, more specifically, the first pulse position candidates are arranged so that more candidates exist in regions where the power of the speech signal is larger.
Also, in the case where the excitation signal includes a pitch vector and a noise vector, the noise vector is generated by setting pulses at a predetermined number of pulse positions selected from pulse position candidates that are changed in accordance with the shape of the pitch vector. More specifically, more pulse position candidates are located in regions where the power of the pitch vector is larger.
Also, the noise vector can be configured by use of a pulse train generated by setting pulses at a predetermined number of pulse positions selected from position candidates set on the basis of a position candidate density function determined from the shape of the pitch vector. In such a case, more specifically, the pulse position candidates are arranged in such a manner that more candidates exist where the value of the position candidate density function is larger. The position candidate density function is a function describing the relationship between the probability of arranging the pulses and the power of the pitch vector.
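A density-driven placement of candidates can be sketched by accumulating the density along the frame and emitting a candidate each time the running sum passes the next equally spaced threshold. The density definition (power plus a small floor), the threshold rule, and the names `density_from_pitch` and `candidates_from_density` are all assumptions made for this illustration, not the function defined by the invention.

```python
def density_from_pitch(pitch, floor=1e-3):
    """Hypothetical density: instantaneous power of the pitch vector plus a floor."""
    return [x * x + floor for x in pitch]


def candidates_from_density(density, num_candidates):
    """Place up to num_candidates positions, denser where the density is larger.

    A candidate is emitted when the cumulative density reaches the next
    threshold (k + 0.5) * total / num_candidates; at most one per position.
    """
    total = sum(density)
    candidates, cumulative, k = [], 0.0, 0
    for n, d in enumerate(density):
        cumulative += d
        if k < num_candidates and cumulative >= (k + 0.5) * total / num_candidates:
            candidates.append(n)
            k += 1
    return candidates


uniform = candidates_from_density([1.0] * 8, 4)
skewed = candidates_from_density([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0], 4)
```

A flat density yields an evenly spaced grid, while the skewed density moves three of the four candidates into the high-density half of the frame.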
Further, in the case of using a compensation filter such as a pitch period emphasis filter, an inverse-corrected (modified) pitch vector is generated by passing the pitch vector through a filter having the inverse characteristic of the compensation filter, and the noise vector is generated by setting pulses at a predetermined number of pulse positions selected from pulse position candidates that change in accordance with the shape of the inverse-corrected pitch vector. In such a case, more specifically, the pulse position candidates are arranged in such a manner that more candidates exist in regions where the power of the inverse-corrected pitch vector is larger.
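To make the inverse-characteristic step concrete, the sketch below assumes a common one-tap comb form for the pitch period emphasis filter, y[n] = x[n] + gain * y[n - lag], whose inverse is the FIR filter x[n] = y[n] - gain * y[n - lag]. The filter form, the parameter names `gain` and `lag`, and both function names are assumptions for illustration; the actual compensation filter may differ.

```python
def pitch_emphasis(vector, gain, lag):
    """Assumed one-tap comb emphasis: y[n] = x[n] + gain * y[n - lag]."""
    out = []
    for n, x in enumerate(vector):
        feedback = gain * out[n - lag] if n >= lag else 0.0
        out.append(x + feedback)
    return out


def inverse_pitch_emphasis(vector, gain, lag):
    """Inverse of the assumed emphasis filter: x[n] = y[n] - gain * y[n - lag]."""
    return [
        y - (gain * vector[n - lag] if n >= lag else 0.0)
        for n, y in enumerate(vector)
    ]


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
emphasized = pitch_emphasis(impulse, 0.5, 2)
restored = inverse_pitch_emphasis(emphasized, 0.5, 2)
```

Under this assumption, the inverse-corrected pitch vector would be obtained by applying `inverse_pitch_emphasis` to the pitch vector, and the candidates would then be chosen from the power of the result.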
By adaptively changing the pulse position candidates in accordance with the characteristics such as the power distribution of the speech signal as described above, the encoding efficiency is improved even when using an algebraic codebook in which the pulse positions and the number of pulses are reduced due to the low bit rate. Thus, the bit rate can be reduced while maintaining the quality of the decoded speech. Also, since the pitch vector is used for producing pulse position candidates, the adaptation of the pulse position candidates becomes possible without any additional information.
In another speech encoding/decoding method according to this invention, an excitation signal including a pitch vector and a noise vector contains a pulse train shaped by a pulse shaping filter having the characteristics determined based on the shape of the pitch vector.
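Pulse shaping can be pictured as convolving the pulse train with a short impulse response, so that each pulse is spread into a small waveform. In the sketch below the response is taken from the samples that follow the peak of the pitch vector, normalized to a leading tap of 1. This derivation rule and the names `shaping_response` and `shape_pulse_train` are purely hypothetical; the text does not specify how the filter characteristics are computed from the pitch vector.

```python
def shaping_response(pitch, num_taps):
    """Hypothetical rule: num_taps samples from the pitch peak, first tap scaled to 1."""
    peak = max(range(len(pitch)), key=lambda n: abs(pitch[n]))
    if pitch[peak] == 0.0:
        return [1.0]  # silent pitch vector: leave the pulses unshaped
    return [t / pitch[peak] for t in pitch[peak:peak + num_taps]]


def shape_pulse_train(pulses, response):
    """Convolve the pulse train with the response, truncated to the frame length."""
    out = [0.0] * len(pulses)
    for n, p in enumerate(pulses):
        if p == 0.0:
            continue
        for k, h in enumerate(response):
            if n + k < len(out):
                out[n + k] += p * h
    return out


response = shaping_response([0.0, 2.0, 1.0, -0.5, 0.0], 3)
shaped = shape_pulse_train([0.0, 1.0, 0.0, 0.0, -1.0, 0.0], response)
```

Because the decoder has the same pitch vector, it could derive the same response without receiving any extra bits.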
With this configuration, the pulse-like noise contained in the decoded speech due to the reduced number of pulses is alleviated, and even in the case where the pulse positions or the number of pulses is reduced due to the low bit rate, the bit rate can be reduced while maintaining the quality of the decoded speech.
Further, in a speech encoding/decoding method according to this invention, an excitation signal is generated, including a pulse train generated by setting pulses at a predetermined number of pulse positions selected from the pulse position candidates adaptively changed in accordance with the characteristics of the speech signal. Also, the pulse train can be shaped by a pulse shaping filter having a characteristic determined based on the shape of the pitch vector.
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out hereinafter.