The present invention relates to a method for encoding an input acoustic signal with a small amount of information by an audio coding scheme which determines codebook indices that minimize an error between the input acoustic signal and a signal synthesized from the encoded information, and to a method for decoding the encoded information into a high-quality acoustic signal.
The CELP (Code Excited Linear Prediction) coding is a typical example of conventional low bit rate audio coding through a linear prediction (LP) coding scheme. FIG. 1 is a block diagram for explaining the general outlines of the CELP coding scheme. An input acoustic signal is applied via an input terminal 11 to an LP coding part 12, which performs an LPC analysis of the acoustic signal for each frame of about 5 to 20 ms to obtain p-th order linear predictive (LP) coefficients $\hat{\alpha}_i$, where i=1, . . . , p. The LP coefficients $\hat{\alpha}_i$ are quantized in a quantization part 13, and the resulting quantized LP coefficients $\hat{\alpha}_i$ are set as filter coefficients in an LP synthesis filter 14. The transfer function of the LP synthesis filter 14 is expressed by the following Equation (1):

$$\frac{1}{A(z)} = \frac{1}{1 + \sum_{i=1}^{p} \alpha_i z^{-i}} \tag{1}$$
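As an illustration of Equation (1), the following Python sketch implements the all-pole synthesis filter 1/A(z) and its inverse A(z) as direct difference equations. The function names and the zero initial filter state are assumptions made for this example; they do not come from the description above.

```python
def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z) with A(z) = 1 + sum_{i=1..p} alpha[i-1] * z^-i.

    Assumes a zero initial filter state (an illustrative simplification).
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]  # feedback from past outputs
        out.append(acc)
    return out


def lp_inverse(signal, alpha):
    """Inverse (FIR) filter A(z): returns the prediction residual."""
    residual = []
    for n, s in enumerate(signal):
        acc = s
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * signal[n - i]  # feedforward from past inputs
        residual.append(acc)
    return residual


# First-order example: alpha_1 = -0.5 gives y[n] = x[n] + 0.5 * y[n-1].
impulse = [1.0, 0.0, 0.0, 0.0]
response = lp_synthesis(impulse, [-0.5])   # [1.0, 0.5, 0.25, 0.125]
recovered = lp_inverse(response, [-0.5])   # [1.0, 0.0, 0.0, 0.0]
```

Because A(z) and 1/A(z) are exact inverses under a zero initial state, passing the synthesized output back through the inverse filter returns the original excitation.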
An excitation signal for the LP synthesis filter 14 is stored in an adaptive codebook 15. The excitation signal (vector) is cut out of the adaptive codebook 15 in accordance with input codes from a control part 16, and the cut-out segment (vector) is repeatedly duplicated and connected together to form a pitch component vector of one frame length. The pitch component vector is fed to a multiplier 22, wherein it is multiplied by a gain g1 selected from a gain codebook 17, and the multiplier output is provided as the excitation signal to the synthesis filter 14 via an adder 18. A synthesized signal from the synthesis filter 14 is subtracted by a subtractor 19 from the input acoustic signal to generate an error signal. The error signal is provided to a perceptual weighting filter 20, wherein the error signal is weighted in accordance with the perceptual masking effect. The control part 16 searches the adaptive codebook 15 for the index (i.e., a pitch lag) that will minimize the power of the weighted error signal. Thereafter, the control part 16 fetches noise vectors from a fixed codebook 21 in sequential order. Each noise vector is multiplied in a multiplier 23 by a gain g2 selected from the gain codebook 17, each multiplier output is added by the adder 18 to the pitch component vector previously selected from the adaptive codebook 15, and the adder output is applied as an excitation signal to the synthesis filter 14. As in the adaptive codebook search, the noise vector is chosen which minimizes the energy of the perceptually weighted error signal from the perceptual weighting filter 20. Finally, for the respective excitation vectors selected from the adaptive and fixed codebooks 15 and 21, the gain codebook 17 is searched for the gains g1 and g2, which are determined such that the power of the output from the perceptual weighting filter 20 is minimized.
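The analysis-by-synthesis search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of FIG. 1 itself: the perceptual weighting filter 20 is omitted, the gain is computed in closed form rather than selected from the gain codebook 17, the filter starts from a zero state, and the names `lp_synthesis` and `search_codebook` are hypothetical.

```python
def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z), A(z) = 1 + sum_i alpha[i-1] * z^-i (zero state)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def search_codebook(target, codebook, alpha):
    """Return (index, gain, error) of the vector whose synthesized output,
    scaled by its least-squares gain, is closest to the target signal.

    Assumption: unweighted squared error stands in for the perceptually
    weighted error of the description.
    """
    best = None
    for index, vector in enumerate(codebook):
        synth = lp_synthesis(vector, alpha)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero vector cannot match any target
        corr = sum(t * v for t, v in zip(target, synth))
        gain = corr / energy  # least-squares optimal gain for this vector
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Toy example: the target is exactly 2x the synthesized second vector.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
alpha = [-0.5]
target = [2.0 * v for v in lp_synthesis(codebook[1], alpha)]
index, gain, error = search_codebook(target, codebook, alpha)
```

In this toy case the search selects the second vector (index 1) with a gain of 2, since that combination reproduces the target with zero error.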
FIG. 2 is a block diagram for explaining the general outlines of a decoding scheme for the CELP coded acoustic signal. An LP coefficient code in input codes provided via an input terminal 31 is decoded in a decoding part 32, and the quantized LP coefficients $\alpha_i$ obtained by this decoding are set as filter coefficients in an LP synthesis filter 33. A pitch index in the input codes is used to cut out a pitch component vector from an adaptive codebook 34, and a fixed codebook index is used to select a random component vector from a fixed codebook 35. The pitch component and random component vectors thus provided from the codebooks 34 and 35 are multiplied in multipliers 52 and 53 by gains g1 and g2 selected from a gain codebook 36 in accordance with a gain index in the input codes, and are then added together by an adder 37, whose output is provided as an excitation signal to the LP synthesis filter 33. A post filter processes a synthesized signal from the synthesis filter 33 so as to decrease perceived quantization noise, and provides the processed signal as a decoded acoustic signal to an output terminal 39.
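A minimal Python sketch of the decoding flow described above follows. It is written under several assumptions: the codebook lookups are replaced by already-decoded values passed as arguments, the post filter is omitted, the synthesis filter starts each frame from a zero state, and the function names are hypothetical.

```python
def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z), A(z) = 1 + sum_i alpha[i-1] * z^-i (zero state)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def pitch_vector(past_excitation, lag, frame_len):
    """Cut the last `lag` samples of the past excitation and repeat them
    until one frame is filled (the adaptive codebook output)."""
    if not 1 <= lag <= len(past_excitation):
        raise ValueError("lag must lie within the stored excitation")
    segment = past_excitation[-lag:]
    return [segment[n % lag] for n in range(frame_len)]


def decode_frame(past_excitation, lag, fixed_vector, g1, g2, alpha):
    """Build the excitation g1 * pitch + g2 * fixed and run it through 1/A(z)."""
    pitch = pitch_vector(past_excitation, lag, len(fixed_vector))
    excitation = [g1 * p + g2 * c for p, c in zip(pitch, fixed_vector)]
    return excitation, lp_synthesis(excitation, alpha)


# Toy frame of four samples with a pitch lag of two samples.
excitation, decoded = decode_frame(
    past_excitation=[1.0, 0.0], lag=2,
    fixed_vector=[0.0, 1.0, 0.0, 0.0], g1=0.5, g2=2.0, alpha=[-0.5],
)
```

In a complete decoder the returned excitation would also be appended to the adaptive codebook so that later frames can reference it; that bookkeeping is left out here.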
As described above, in the CELP or similar time-domain audio coding the conventional synthesis filter is formed by a 10th to 20th order LP auto-regressive linear filter for modeling the spectral envelope of speech, or its combination with a comb filter of a single pitch frequency modeled after a glottal source; hence, it is impossible to express a fine spectral structure of a musical sound which has many irregularly-spaced stationary peaks in the frequency domain. A method for reflecting the fine spectral structure in the synthesis filter is proposed by the inventors of this application in Japanese Patent Application Laid-Open Gazette No. 9-258795 and in literature “A 16 KBIT/S WIDEBAND CELP CODER WITH A HIGH-ORDER BACKWARD PREDICTOR AND ITS FAST COEFFICIENT CALCULATION,” IEEE, pp.107-108, 1997 (hereinafter referred to as Literature 1). According to the proposed method, the LP synthesis filter in FIG. 1 is formed by a cascade connection of a p-th order (about 10th to 20th order, for instance) LP synthesis filter and a sufficiently higher n-th order LP synthesis filter. LP coefficients obtained by a p-th order linear prediction coding (LPC) analysis of the input signal are provided as coefficients of the p-th order LP synthesis filter, and LP coefficients obtained by an n-th order LPC analysis of a residual signal resulting from LP inverse filtering of a synthesized signal are provided as coefficients of the n-th order LP synthesis filter. With such cascade-connected synthesis filters, it is possible to express both the spectral envelope and the fine structure of the input signal.
With the above method, in the coding apparatus of FIG. 1 the LP synthesis filter 14 is formed by a cascade connection of a p-th order LP synthesis filter of relatively low order (a 10th to 20th order synthesis filter commonly used in conventional speech coding, hereinafter referred to as a low-order synthesis filter) and an n-th order LP synthesis filter (a 100th or higher order synthesis filter, hereinafter referred to as a high-order synthesis filter). The low-order synthesis filter is used to define the spectral envelope of the input acoustic signal, and the high-order synthesis filter is used to express the fine spectral structure of the synthesized signal that cannot fully be expressed with the p-th order coefficients. Hence, it is possible to achieve higher audio coding quality.
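The cascade connection described above can be illustrated with a short Python sketch. It uses deliberately tiny filter orders (1 and 3) in place of the 10th to 20th and 100th or higher orders of the description, assumes zero initial filter states, and uses hypothetical function names. Under those assumptions it shows that cascading two all-pole filters is equivalent to a single all-pole filter whose denominator is the product of the two polynomials.

```python
def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z), A(z) = 1 + sum_i alpha[i-1] * z^-i (zero state)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def cascade_synthesis(excitation, alpha_low, alpha_high):
    """High-order filter followed by low-order filter.

    The ordering is an assumption; for linear time-invariant filters with
    zero initial state either order gives the same output.
    """
    return lp_synthesis(lp_synthesis(excitation, alpha_high), alpha_low)


def combine(alpha_a, alpha_b):
    """Coefficients of the product polynomial A(z) * B(z), without the leading 1."""
    a = [1.0] + list(alpha_a)
    b = [1.0] + list(alpha_b)
    product = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            product[i + j] += x * y
    return product[1:]


alpha_low = [-0.5]              # toy envelope filter (order 1)
alpha_high = [0.0, 0.0, -0.25]  # toy fine-structure filter with a peak at lag 3
impulse = [1.0] + [0.0] * 7

cascaded = cascade_synthesis(impulse, alpha_low, alpha_high)
combined = lp_synthesis(impulse, combine(alpha_low, alpha_high))
```

Keeping the two stages separate, as the description does, allows their coefficients to be derived from different signals and at different orders, which a single combined filter would obscure.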
This method allows expressing the envelope of the fine spectral structure, and hence it permits high quality encoding of a signal which has a fine spectral structure containing a plurality of pitches, such as a musical sound. However, using the high-order synthesis filter amounts to obtaining an average spectrum of the input signal samples over a long analysis window, which makes it impossible to detect short-time variations in the spectral structure, for example, fine or minute changes in pitch as in the case of speech. For this reason, when this method is applied to a signal that has a component changing abruptly with time, such as human vocal cord vibration or a musical attack sound, the audio coding quality is degraded by an echo-like noise.
In literature by the inventors of this application, “Wideband CELP Coding using Higher Order Backward Prediction of Residual,” Technical Report of IEICE, SP97-64, pp.51-56, November, 1997 (hereinafter referred to as Literature 2), there is disclosed a scheme which employs a synthesis filter formed by a cascade connection of high- and low-order synthesis filters as proposed in the afore-mentioned Japanese patent application laid-open gazette and Literature 1, and it is described that the problem of quality degradation in speech coding can be solved by selectively switching between the cascade-connected synthesis filter and the conventional low-order synthesis filter, depending on whether the input signal is a music or speech signal. However, Literature 2 gives no description of how to distinguish between the music signal and the speech signal, nor does it set forth a method for distinguishing a signal which contains a considerable amount of minute or fine variations in spectral structure from a signal which has a plurality of pitches mixed therein.
In the afore-mentioned Japanese patent application laid-open gazette, there is also described a method according to which: the output from the adaptive codebook 15 in FIG. 1 is given a gain and is applied as an excitation signal to a p-th order LP synthesis filter; the output from a random codebook is given a gain and is applied as an excitation signal to the afore-mentioned cascade-connected synthesis filter; the outputs from these two synthesis filters are added together to produce a synthesized signal; and the synthesized signal is provided to the subtractor 19. With this method, however, when the input acoustic signal is a music signal, the synthesized signal quality would be lower than in the case of using the cascade-connected synthesis filter alone for a composite excitation signal of a pitch vector and a noise vector, and the audio coding quality would be low accordingly.
It is therefore an object of the present invention to provide a method and apparatus for high quality time-domain audio coding based on the linear prediction scheme by selectively using the optimum synthesis filter in accordance with the characteristic of the signal to be encoded, and a method and apparatus for decoding the encoded signal, and a recording medium on which there are recorded programs for implementing such audio coding and decoding methods.
In the coding method and apparatus according to the present invention, at least one of an input acoustic signal and a synthesized acoustic signal is used to determine p-th order LP coefficients for a p-th order LP synthesis filter and p′- and n-th order LP coefficients for p′- and n-th order LP synthesis filters cascaded to each other to form a cascade-connected synthesis filter. The value p′ is comparable to p and the value n is larger than p.
An estimated synthesized acoustic signal, estimated from the input acoustic signal, is subjected to inverse filtering by a first inverse filter of an inverse characteristic to the p-th order LP synthesis filter and by a second inverse filter of an inverse characteristic to the cascade-connected synthesis filter to obtain first and second residual signals. The first and second residual signals are estimated to be the input excitation signals that are applied to the p-th order LP synthesis filter and the cascade-connected synthesis filter, respectively, when the above-mentioned estimated synthesized acoustic signal is output. The first and second residual signals are used to decide which of the p-th order LP synthesis filter and the cascade-connected synthesis filter will provide higher audio coding quality.
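The following Python sketch illustrates how the two residual signals could be produced and compared. The decision rule used here, choosing the filter whose residual has the lower energy, is purely an assumption for illustration; the passage above does not state the criterion. The function names, the mode labels "low" and "cascade", and the toy filter orders are likewise hypothetical.

```python
def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z), A(z) = 1 + sum_i alpha[i-1] * z^-i (zero state)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def lp_inverse(signal, alpha):
    """Inverse (FIR) filter A(z): returns the residual of `signal`."""
    return [
        s + sum(a * signal[n - i] for i, a in enumerate(alpha, start=1) if n - i >= 0)
        for n, s in enumerate(signal)
    ]


def energy(x):
    return sum(v * v for v in x)


def choose_filter(estimated, alpha_p, alpha_p_prime, alpha_n):
    """Return the assumed mode label plus both residuals.

    residual_1: inverse of the p-th order synthesis filter.
    residual_2: inverse of the cascade (p'-th order, then n-th order).
    Assumed rule: lower residual energy wins; ties keep the low-order filter.
    """
    residual_1 = lp_inverse(estimated, alpha_p)
    residual_2 = lp_inverse(lp_inverse(estimated, alpha_p_prime), alpha_n)
    mode = "cascade" if energy(residual_2) < energy(residual_1) else "low"
    return mode, residual_1, residual_2


impulse = [1.0] + [0.0] * 11
alpha_p = alpha_p_prime = [-0.5]
alpha_n = [0.0, 0.0, -0.25]

# A signal with extra periodic structure at lag 3 favours the cascade filter.
tonal = lp_synthesis(lp_synthesis(impulse, alpha_n), alpha_p_prime)
mode_tonal, _, _ = choose_filter(tonal, alpha_p, alpha_p_prime, alpha_n)

# A signal shaped only by the low-order filter favours the low-order filter.
plain = lp_synthesis(impulse, alpha_p)
mode_plain, _, _ = choose_filter(plain, alpha_p, alpha_p_prime, alpha_n)
```

Other measures computed from the two residuals could be substituted for the energy comparison without changing the structure of the sketch.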
An excitation signal is generated from excitation vectors selected from codebook means and is used to drive the decided synthesis filter to generate a synthesized acoustic signal. The codebook means is searched for indices which will minimize the error of the synthesized acoustic signal to the input acoustic signal.
In the above audio coding, the p-th order LP coefficients are computed by a p-th order LPC analysis of the input acoustic signal, the p′-th order LP coefficients are computed by a p′-th order LPC analysis on a previous synthesized acoustic signal, and the n-th order LP coefficients are computed by an n-th order LPC analysis on a residual signal obtained by inverse filtering of the previous synthesized acoustic signal or a previous excitation signal.
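The three analyses described above can be sketched in Python as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is a common way to perform LPC analysis but is not stated in the passage; it also assumes that the residual for the n-th order analysis is obtained by inverse filtering with the p′-th order coefficients, and it omits windowing and quantization. All function names and the toy orders are hypothetical.

```python
import random


def autocorrelation(x, order):
    """Autocorrelation lags 0..order of the frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for alpha in A(z) = 1 + sum_i alpha_i z^-i from autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at zero
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err  # reflection coefficient of stage m
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a[1:]


def lpc(x, order):
    return levinson_durbin(autocorrelation(x, order), order)


def lp_inverse(signal, alpha):
    """Inverse (FIR) filter A(z): returns the residual of `signal`."""
    return [
        s + sum(a * signal[n - i] for i, a in enumerate(alpha, start=1) if n - i >= 0)
        for n, s in enumerate(signal)
    ]


def analyze(input_frame, prev_synth, p, p_prime, n):
    """p-th order from the input; p'-th and n-th order from past synthesized output."""
    alpha_p = lpc(input_frame, p)
    alpha_p_prime = lpc(prev_synth, p_prime)
    residual = lp_inverse(prev_synth, alpha_p_prime)  # assumed inverse filter
    alpha_n = lpc(residual, n)
    return alpha_p, alpha_p_prime, alpha_n


# Toy frames of pseudo-random samples; real orders would be far larger for n.
rng = random.Random(0)
input_frame = [rng.uniform(-1.0, 1.0) for _ in range(160)]
prev_synth = [rng.uniform(-1.0, 1.0) for _ in range(160)]
alpha_p, alpha_p_prime, alpha_n = analyze(input_frame, prev_synth, p=4, p_prime=4, n=16)
```

Because the p′-th and n-th order coefficients depend only on previously synthesized output, a decoder holding the same past signal could repeat these two analyses without receiving the coefficients.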
In the case where p=p′ and one p-th order synthesis filter is used both as the p-th order synthesis filter and as the p′-th order LP synthesis filter, the input acoustic signal or a previous synthesized acoustic signal is LPC analyzed to determine the p-th order LP coefficients, and a residual signal obtained by inverse filtering with the p-th order LP coefficients, or a previous excitation signal, is LPC analyzed to determine the n-th order LP coefficients.
In the decoding method and apparatus according to the present invention, p-th order LP coefficients of a p-th order LP synthesis filter are obtained by decoding input codes or by making an LPC analysis of a previous synthesized acoustic signal. The p′- and n-th order LP coefficients of p′- and n-th order LP synthesis filters forming a cascade-connected synthesis filter are obtained as follows: the p′-th order LP coefficients are produced by decoding the input codes or by making an LPC analysis of the previous synthesized acoustic signal, and the n-th order LP coefficients are produced by decoding the input codes, by making an LPC analysis of a residual signal resulting from inverse filtering of the previous synthesized acoustic signal, or by making an LPC analysis of a previous excitation signal.
The p-th order LP synthesis filter or cascade-connected synthesis filter is selected in accordance with an input mode code. An excitation signal is generated from excitation vectors selected from codebook means corresponding to input codebook indices, and the excitation signal is applied to the selected synthesis filter to generate a synthesized acoustic signal.
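The mode-dependent filter selection in the decoder can be sketched in Python as below. The mode code values 0 and 1, the function names, the zero initial filter state, and the order in which the two cascaded stages are applied are all assumptions made for this illustration.

```python
MODE_LOW = 0      # hypothetical mode code: p-th order filter only
MODE_CASCADE = 1  # hypothetical mode code: cascade-connected filter


def lp_synthesis(excitation, alpha):
    """All-pole filter 1/A(z), A(z) = 1 + sum_i alpha[i-1] * z^-i (zero state)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def synthesize(mode, excitation, alpha_p, alpha_p_prime, alpha_n):
    """Apply the synthesis filter selected by the received mode code."""
    if mode == MODE_LOW:
        return lp_synthesis(excitation, alpha_p)
    if mode == MODE_CASCADE:
        # n-th order stage first, then the p'-th order stage (assumed order)
        return lp_synthesis(lp_synthesis(excitation, alpha_n), alpha_p_prime)
    raise ValueError("unknown mode code: %r" % (mode,))


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
low_out = synthesize(MODE_LOW, impulse, [-0.5], [-0.5], [0.0, 0.0, -0.25])
cascade_out = synthesize(MODE_CASCADE, impulse, [-0.5], [-0.5], [0.0, 0.0, -0.25])
```

The two outputs agree for the first three samples and diverge from the fourth, where the lag-3 term of the toy high-order stage first takes effect.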
In the decoding process, too, it is possible to set p=p′ and use the same p-th order synthesis filter both as the p-th order LP synthesis filter and as the p′-th order LP synthesis filter.