(1) Field of the Invention
The present invention relates to a speech coding apparatus, a speech processing apparatus, and a speech processing method. More particularly, the present invention relates to a speech coding apparatus which encodes speech signals into lower bitrate signals (e.g., less than 4 kb/s) by applying Analysis-by-Synthesis (AbS) vector quantization techniques to source speech signals containing a plurality of periodic components within a fixed time interval, based on an appropriate speech production model. The present invention also relates to a speech processing apparatus and a speech processing method which perform speech coding with AbS vector quantization techniques, based on a speech production model.
(2) Description of the Related Art
The code-excited linear prediction (CELP) method has been known as a speech coding technique which encodes telephone voice signals with a spectrum ranging from 0.3 to 3.4 kHz into a bitstream with a rate of 4 to 16 kb/s. CELP is widely used in digital mobile communications systems and enterprise communications systems.
What CELP coders actually transmit are: linear predictive coding (LPC) coefficients representing resonance characteristics of the human vocal tract, and parameters representing excitation signals (sound source data) consisting of periodic pitch components and noise components. The CELP algorithm uses an LPC synthesis filter H(z) of equation (1) as a model of the human vocal tract, assuming that the input signal to that filter (sound source signal) can be divided into its periodic pitch components and noise components. The former components represent the periodicity of voices, while the latter the randomness.

$$H(z) = \frac{1}{1 - \sum_{i=1}^{p} \alpha_i z^{-i}} \qquad (1)$$
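As a rough illustration of equation (1), the all-pole filter corresponds to the recursion s[n] = x[n] + Σ αi·s[n−i], where x is the excitation and s the synthesized output. The sketch below is only a minimal example: the function name `lpc_synthesis`, the zero initial filter state, and the sample coefficient value are assumptions made for illustration and are not taken from the coder described here.

```python
def lpc_synthesis(excitation, alpha):
    """Filter `excitation` through H(z) = 1 / (1 - sum_i alpha[i-1] * z^-i).

    Assumption for this sketch: the filter starts from a zero state.
    """
    out = []
    for n, x in enumerate(excitation):
        # s[n] = x[n] + sum_{i=1..p} alpha_i * s[n-i]
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


# Impulse response of a first-order filter with alpha_1 = 0.5 (illustrative value).
print(lpc_synthesis([1.0, 0.0, 0.0, 0.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

With an empty coefficient list the function returns the excitation unchanged, which is a quick sanity check on the recursion.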
Subsequently, the filter coefficients characterizing the LPC synthesis filter, as well as the pitch interval components and noise components of the excitation signal, are extracted and quantized. Here, data compression is accomplished by sending out the quantized data, i.e., quantization indices.
FIG. 18 depicts the CELP coding algorithm. Suppose that a source voice signal Sn is input to an LPC analyzing means 21. Using an all-pole filter characterized by equation (1) representing the human vocal tract model, the LPC analyzing means 21 calculates the coefficients αi (i = 1 to p) of that all-pole filter, where p represents its order. Typically, the filter's order p ranges from 10 to 12 for phone-quality speech signals, and from 16 to 20 for wideband speech signals.
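The text does not say how the LPC analyzing means 21 computes the coefficients αi. One common choice, assumed here purely for illustration, is the autocorrelation method solved with the Levinson-Durbin recursion. The function names `autocorr` and `levinson_durbin` are hypothetical, and the sketch omits windowing and guards against a zero-energy frame.

```python
def autocorr(s, p):
    """Autocorrelation r[0..p] of one frame of samples."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve the normal equations for alpha_1..alpha_p of the predictor
    s[n] ~ sum_i alpha_i * s[n-i]; returns (alpha, residual_energy)."""
    a = [0.0] * (p + 1)  # a[i] holds alpha_i; a[0] is unused
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err  # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Autocorrelation of an ideal first-order process with coefficient 0.5.
alpha, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(alpha, err)  # [0.5, 0.0] 0.75
```

The coefficients returned here use the same sign convention as equation (1), so they can be fed directly to an all-pole synthesis filter of that form.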
The LPC filter coefficients are then quantized with scalar quantization or vector quantization techniques (quantizer not shown in FIG. 18). The resultant quantization indices are transmitted to the decoding end. The excitation signal is also quantized. For the quantization of its pitch interval components, the CELP algorithm employs an adaptive codebook Ba recording the past sound source signal series. For the quantization of noise components, the algorithm provides a noise codebook Bn storing various noise signal patterns.
The codebooks Ba and Bn are used in the AbS vector quantization process as follows. The process begins with a variable-gain multiplication of code vectors read out of the codebooks Ba and Bn. This operation is executed by multipliers 22a and 22b. The sum of the outputs of the multipliers 22a and 22b is calculated by an adder 23 and supplied to an LPC synthesis filter 24, whose response is defined by the LPC filter coefficients. With its filtering algorithm, the LPC synthesis filter 24 reproduces a signal Sn*. This reproduced speech signal Sn* is then subjected to an arithmetic operator 26 for the calculation of its error en with respect to the source speech signal Sn.
An error power evaluation means 25 evaluates the error en for every possible combination of code vectors read out of the two codebooks Ba and Bn, changing the positions of selection switches SWa and SWb from one to another. Through the error evaluation, the error power evaluation means 25 obtains one particular combination of code vectors which exhibit the smallest error value among others. This combination is referred to herein as the "optimal code vector pair," and the gain corresponding to the pair is referred to as the "optimal gain." Finally, a quantizer (not shown) quantizes the obtained optimal code vector pair and optimal gain, thereby yielding quantization indices.
That is, the coder produces quantization indices of LPC filter coefficients, of optimal code vectors, and of optimal gains. The quantization indices of the optimal code vectors actually include: those of the code vectors selected from the noise codebook Bn, and those of what will be explained later as "lag," i.e., a parameter used in extracting optimal vectors from the adaptive codebook Ba. Those quantization indices are transmitted to the decoding end.
The decoder obtains LPC filter coefficients, optimal code vector, and optimal gain by decoding the data received from the encoder. Employing the same codebooks Ba and Bn and LPC synthesis filter as those used at the encoding end, the decoder reproduces the original speech signal.
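The exhaustive AbS search of FIG. 18 can be sketched as follows. This is a simplified illustration under stated assumptions, not the coder's actual implementation: codebooks are plain Python lists, the two gains are obtained jointly by least squares for each candidate pair, quantization of the result is omitted, and all names (`synth`, `dot`, `best_pair`) are hypothetical.

```python
def synth(x, alpha):
    """All-pole synthesis filter of equation (1), starting from zero state."""
    out = []
    for n, v in enumerate(x):
        acc = v
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_pair(target, adaptive_cb, noise_cb, alpha):
    """Try every (adaptive, noise) code vector pair and keep the one with
    the smallest error power. Returns (error, ia, ib, gain_a, gain_b)."""
    best = None
    for ia, ca in enumerate(adaptive_cb):
        ya = synth(ca, alpha)
        for ib, cb in enumerate(noise_cb):
            yb = synth(cb, alpha)
            # 2x2 normal equations for the gains that minimize the error.
            aa, bb, ab = dot(ya, ya), dot(yb, yb), dot(ya, yb)
            ta, tb = dot(target, ya), dot(target, yb)
            det = aa * bb - ab * ab
            if abs(det) < 1e-12:
                continue  # degenerate pair: skipped in this sketch
            ga = (ta * bb - tb * ab) / det
            gb = (tb * aa - ta * ab) / det
            err = sum((t - ga * u - gb * v) ** 2
                      for t, u, v in zip(target, ya, yb))
            if best is None or err < best[0]:
                best = (err, ia, ib, ga, gb)
    return best


# Toy codebooks; the target is built from adaptive entry 1 and noise entry 0.
adaptive_cb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
noise_cb = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
alpha = [0.5]
target = [2.0 * u - 1.0 * v
          for u, v in zip(synth(adaptive_cb[1], alpha), synth(noise_cb[0], alpha))]
print(best_pair(target, adaptive_cb, noise_cb, alpha))
```

The cost of this search grows with the product of the two codebook sizes, which is why the sketch is only practical for toy inputs.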
As described above, the CELP algorithm accomplishes speech compression by establishing a speech production process model and transmitting quantized characteristic parameters of that model. Since the characteristics of human voices exhibit little variation within a short time, e.g., 5 to 10 msec, the CELP algorithm updates vocal tract parameters and excitation parameters only at short intervals of 5 to 10 msec. Such short time segments are referred to as "frames." This method permits the CELP coders to provide coded speech signals without quality deterioration at reduced bitrates as low as 5 to 6 kb/s.
The above-described conventional speech coding algorithm, however, cannot reduce the bitrate further for the following reason. For bitrates of 4 kb/s or lower, the conventional algorithm requires that the frame length be elongated to more than 10 ms. This means that a single frame of a source speech signal is likely to contain two or more different pitch components, introducing quality deterioration to the resultant coded speech signal.
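The text does not spell out the arithmetic behind this constraint. One way to see the pressure toward longer frames, offered here only as an illustration, is the bit budget per frame: at a fixed bitrate, the number of bits available to describe each frame is the bitrate multiplied by the frame length. The function name `bits_per_frame` and the sample values are assumptions chosen for this sketch.

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Bits available per frame at a given bitrate (bit/s) and frame length (ms)."""
    return bitrate_bps * frame_ms // 1000


for bitrate in (6000, 4000):
    for frame_ms in (5, 10, 20):
        print(bitrate, "b/s,", frame_ms, "ms ->", bits_per_frame(bitrate, frame_ms), "bits")
```

At 4 kb/s a 10 ms frame leaves 40 bits for all parameters, while a 20 ms frame leaves 80.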
In other words, the conventional CELP algorithm is weak in modeling the periodicity of a speech signal within a single frame because the periodicity of output signals contained in the adaptive codebook Ba is strictly confined to one component per frame. For this reason, the conventional algorithm is unable to capture the periodicity of a source speech signal precisely when one frame contains a plurality of periodic pitch components, which limits its coding efficiency.
In view of the foregoing, it is an object of the present invention to provide a speech coding apparatus which encodes given speech signals in an optimal way.
It is another object of the present invention to provide a speech processing apparatus which performs optimal speech processing according to a given source speech signal, so that the signal will be reproduced with high quality at the receiving end.
It is still another object of the present invention to provide a speech processing method which performs speech processing in an optimal way for a given source speech signal, so that the signal will be reproduced with high quality at the receiving end.
To accomplish the above first object, according to the present invention, there is provided a speech coding apparatus. This speech coding apparatus performs speech coding based on a speech production model, in which a given speech signal Sn is divided into fixed-length segments. This speech coding apparatus comprises the following elements: an adaptive codebook Ba which stores a series of signal vectors of a past speech signal; a vector extraction means for extracting a signal vector and neighboring vectors adjacent thereto from the adaptive codebook, the signal vector being stored at a distance given by a lag parameter L from the top entry location O of the adaptive codebook Ba; a long-term prediction synthesis filter with a high order which produces a long-term predicted speech signal Sna−1 from the signal vector and neighboring vectors by applying long-term prediction synthesis concerning the periodicity of the speech signal Sn; a filter coefficient calculation means for calculating filter coefficients of the long-term prediction synthesis filter; a perceptual weighting synthesis filter which processes the long-term predicted speech signal Sna−1 to yield a reproduced coded speech signal Sna, comprising a linear predictive synthesis filter 14a defined through estimation with a linear predictive coding synthesis technique that emulates a vocal tract response, and a first perceptual weighting filter which is coupled in series with the linear predictive synthesis filter and assigns perceptual weights to a signal supplied thereto according to characteristics of a human hearing system; a second perceptual weighting filter which produces a perceptually weighted speech signal Sn′ by assigning perceptual weights to the speech signal Sn; an error calculation means for calculating the error En of the reproduced coded speech signal Sna with reference to the perceptually weighted speech signal Sn′; a minimum error detection means for finding a minimum error point that yields the smallest error among those that the error calculation means has calculated while varying the lag parameter L; and an optimal value transmission means for transmitting optimal values including optimal filter coefficients βa and an optimal lag parameter La, the optimal filter coefficients βa being the filter coefficients at the minimum error point, the optimal lag parameter La being the lag parameter at the minimum error point.
In operation, the adaptive codebook Ba stores a series of signal vectors of a past speech signal. The vector extraction means extracts a signal vector and neighboring vectors adjacent thereto from the adaptive codebook. Here, the signal vector is stored at a distance given by a lag parameter L from the top entry location O of the adaptive codebook Ba. The high-order long-term prediction synthesis filter produces a long-term predicted speech signal Sna−1 from the signal vector and neighboring vectors by applying long-term prediction synthesis concerning the periodicity of the speech signal Sn. The filter coefficient calculation means calculates the filter coefficients of the long-term prediction synthesis filter. The perceptual weighting synthesis filter 14 processes the long-term predicted speech signal Sna−1 to yield a reproduced coded speech signal Sna. This perceptual weighting synthesis filter comprises the following elements: the linear predictive synthesis filter 14a defined through estimation with a linear predictive coding synthesis technique that emulates a vocal tract response, the first perceptual weighting filter which is coupled in series with the linear predictive synthesis filter and assigns perceptual weights to a signal supplied thereto according to characteristics of a human hearing system, and the second perceptual weighting filter which produces a perceptually weighted speech signal Sn′ by assigning perceptual weights to the speech signal Sn. The error calculation means calculates the error En of the reproduced coded speech signal Sna with reference to the perceptually weighted speech signal Sn′. The error calculation means repeatedly executes this calculation while varying the lag parameter L. The minimum error detection means finds a minimum error point that yields the smallest error among the calculated errors. The optimal value transmission means transmits optimal values including optimal filter coefficients βa and an optimal lag parameter La. The optimal filter coefficients βa and the optimal lag parameter La are the filter coefficients and lag parameter at the minimum error point.
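The lag search described above can be sketched as follows, under several assumptions that are not stated in the text: the long-term prediction filter is taken to have three taps (the vector at lag L and its neighbors at L−1 and L+1), the top entry location O is modeled as the end of a Python list holding the past signal, the perceptual weighting synthesis filter is modeled as a single all-pole filter with coefficients `alpha_w`, the filter coefficients are found by least squares, and every lag satisfies L−1 ≥ frame length so that no periodic extension is needed. All function and variable names are hypothetical.

```python
import random


def synth(x, alpha):
    """All-pole filter standing in for the perceptual weighting synthesis filter."""
    out = []
    for n, v in enumerate(x):
        acc = v
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Gaussian elimination with partial pivoting; returns None if singular."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ltp_search(target, past, alpha_w, lags, frame_len):
    """For each lag L, fit coefficients for the vectors at L-1, L, L+1 and
    keep the lag with the smallest error. Returns (error, lag, beta)."""
    best = None
    end = len(past)  # models the top entry location O
    for L in lags:
        ys = []
        for d in (-1, 0, 1):
            start = end - (L + d)
            ys.append(synth(past[start:start + frame_len], alpha_w))
        A = [[dot(u, v) for v in ys] for u in ys]
        b = [dot(target, u) for u in ys]
        beta = solve(A, b)
        if beta is None:
            continue
        err = sum((t - sum(bj * y[n] for bj, y in zip(beta, ys))) ** 2
                  for n, t in enumerate(target))
        if best is None or err < best[0]:
            best = (err, L, beta)
    return best


# Synthetic check: build a target from lag 12 with known coefficients.
rng = random.Random(0)
past = [rng.uniform(-1.0, 1.0) for _ in range(32)]
frame_len, true_lag, true_beta, alpha_w = 8, 12, [0.25, 0.5, 0.25], [0.5]
exc = [sum(b * past[len(past) - (true_lag + d) + n]
           for b, d in zip(true_beta, (-1, 0, 1))) for n in range(frame_len)]
target = synth(exc, alpha_w)
print(ltp_search(target, past, alpha_w, range(9, 17), frame_len))
```

Because the coefficients are refitted for every candidate lag, the error comparison across lags reflects the best achievable fit at each lag rather than a fixed gain.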
In addition to the above, there is provided a speech processing apparatus, which performs speech analysis and synthesis based on a speech production model. This speech processing apparatus comprises the following elements:
(a) a speech coding processor comprising:
(a1) a first speech coding means for producing coded data by coding a speech signal when at most one periodic component is contained in a fixed-length segment of the speech signal; and
(a2) a second speech coding means comprising:
(a2a) an adaptive codebook which stores a series of signal vectors of a past speech signal for use in such cases where the fixed-length segment contains a plurality of periodic components;
(a2b) a vector extraction means for extracting a signal vector and neighboring vectors adjacent thereto from the adaptive codebook, the signal vector being stored at a distance given by a lag parameter from the top entry location of the adaptive codebook;
(a2c) a long-term prediction synthesis filter with a high order which produces a long-term predicted speech signal from the signal vector and the neighboring vectors by applying long-term prediction synthesis concerning the periodicity of the speech signal;
(a2d) a filter coefficient calculation means for calculating filter coefficients of the long-term prediction synthesis filter;
(a2e) a perceptual weighting synthesis filter which processes the long-term predicted speech signal to yield a reproduced coded speech signal, comprising:
(a2e1) a linear predictive synthesis filter defined through estimation with a linear predictive coding synthesis technique that emulates a vocal tract response, and
(a2e2) a first perceptual weighting filter which is coupled in series with the linear predictive synthesis filter and assigns perceptual weights to a signal supplied thereto according to characteristics of a human hearing system;
(a2f) a second perceptual weighting filter which produces a perceptually weighted speech signal by assigning perceptual weights to the speech signal;
(a2g) an error calculation means for calculating an error of the reproduced coded speech signal with reference to the perceptually weighted speech signal;
(a2h) a minimum error detection means for finding a minimum error point that yields the smallest error among those that the error calculation means has calculated while varying the lag parameter; and
(a2i) an optimal value transmission means for transmitting optimal values including optimal filter coefficients and an optimal lag parameter, the optimal filter coefficients being the filter coefficients at the minimum error point, the optimal lag parameter being the lag parameter at the minimum error point; and
(b) a speech decoding processor, comprising:
(b1) a first speech decoding means for reproducing the speech signal by decoding the coded data; and
(b2) a second speech decoding means for reproducing the speech signal by decoding the optimal values.
In operation, the first speech coding means produces coded data by coding a speech signal in such cases where a fixed-length segment of the speech signal contains at most one periodic component. The first speech decoding means reproduces the speech signal by decoding the coded data. The second speech decoding means reproduces the speech signal by decoding optimal values.
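How the second speech decoding means might rebuild a frame from the optimal values can be sketched as below. The text does not describe the decoder's internals, so everything here is an assumption made for illustration: the decoder is taken to hold the same past-signal buffer as the encoder, a three-tap long-term predictor over the vectors at lags L−1, L, and L+1 is assumed, the synthesis filter starts each frame from a zero state, and the lag satisfies L−1 ≥ frame length. The names `synth` and `decode_frame` are hypothetical.

```python
def synth(x, alpha):
    """All-pole LPC synthesis filter of equation (1), zero initial state."""
    out = []
    for n, v in enumerate(x):
        acc = v
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def decode_frame(past, lag, beta, alpha, frame_len):
    """Rebuild one frame from a lag and three coefficients.

    Returns (speech, updated_past); beta[0], beta[1], beta[2] weight the
    vectors at lags lag-1, lag, lag+1 respectively.
    """
    end = len(past)
    excitation = [0.0] * frame_len
    for b, d in zip(beta, (-1, 0, 1)):
        start = end - (lag + d)
        for n in range(frame_len):
            excitation[n] += b * past[start + n]
    speech = synth(excitation, alpha)
    # Append the new excitation so later frames can refer back to it.
    return speech, past + excitation


past = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
speech, past = decode_frame(past, lag=4, beta=[0.5, 0.0, 0.5], alpha=[0.5], frame_len=2)
print(speech)  # [5.0, 8.5]
```

A practical decoder would carry the synthesis filter state across frames instead of resetting it, which this sketch omits for brevity.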
Furthermore, there is provided a speech processing method, which performs speech analysis and synthesis based on a speech production model. This method comprises the following steps: providing an adaptive codebook which stores a series of signal vectors of a past speech signal; producing coded data by coding a speech signal when at most one periodic component is contained in a fixed-length segment of the speech signal; extracting a signal vector and neighboring vectors adjacent thereto from the adaptive codebook for use in such cases where the fixed-length segment contains a plurality of periodic components, the signal vector being stored at a distance given by a lag parameter from the top entry location of the adaptive codebook; producing a long-term predicted speech signal from the signal vector and the neighboring vectors by using a long-term prediction synthesis filter with a high order to apply long-term prediction synthesis concerning the periodicity of the speech signal; calculating filter coefficients of the long-term prediction synthesis filter; obtaining a reproduced coded speech signal from the long-term predicted speech signal through combined use of a linear predictive synthesis filter defined through estimation with a linear predictive coding synthesis technique that emulates a vocal tract response and a perceptual weighting filter which assigns perceptual weights according to characteristics of a human hearing system; calculating an error of the reproduced coded speech signal with reference to the speech signal; finding a minimum error point that yields the smallest error among those that said step of calculating the error has calculated for various values of the lag parameter; transmitting optimal values including optimal filter coefficients and an optimal delay which are the filter coefficients and the lag parameter at the minimum error point; and reproducing the speech signal by decoding the coded data or the optimal values.
In operation, the proposed speech processing method processes a given speech signal with two algorithms. One algorithm encodes the speech signal to produce coded data when at most one periodic component is contained in a fixed-length segment of the speech signal. The other algorithm, which is activated when a plurality of periodic components are included in a fixed-length segment of the speech signal, executes speech coding with a high-order LTP synthesis filter that is obtained through estimation with long-term predictive analysis and synthesis techniques, thereby yielding optimal values. At the decoding end, the resultant coded data and optimal values are decoded accordingly.
The above and other objects, features and advantages of the present invention will become apparent from the following description when taken in conjunction with the accompanying drawings which illustrate preferred embodiments of the present invention by way of example.