High quality coding of speech signals at low bit rates is of great importance to modern communications. Applications for such coding include mobile telephony, voice storage and secure telephony, among others. These applications would benefit from high quality coders operating at one to five kilobits per second. As a result, there is a strong research effort aimed at the development of coders operating at these rates. Most of this research effort is directed at coders based on either a sinusoidal coding paradigm (e.g., R. J. McAulay and T. F. Quatieri, "Sinusoidal Coding", in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, editors, Elsevier Science, 1995, pages 121-173) or a waveform interpolation paradigm (e.g., W. B. Kleijn, "Encoding Speech Using Prototype Waveforms", IEEE Trans. Speech and Audio Process., vol. 4, pages 386-399, 1993). Furthermore, several standards based on sinusoidal coders already exist, for example, the INMARSAT Mini-M 4 kb/s standard and the APCO Project 25 North American land mobile radio communication system.
Coders operating at bit rates greater than five kilobits per second commonly use coding paradigms for which the reconstructed signal is identical to the original signal when the quantization errors are zero (i.e., when quantization is turned off). In other words, signal reconstruction becomes exact when the operational bit rate approaches infinity. Such coders are referred to as Asymptotically Exact (AE) coders. Examples of standards which conform with such coders are the ITU G.729 and G.728 standards. These standards are based on the commonly known Code-Excited Linear Prediction (CELP) speech-coding paradigm. AE coders have an advantage in that the quality can be improved by increasing the operational bit rate. Thus, any shortcomings in the models of the speech signal used by an AE coder which would otherwise be perceptible to a human listener can be compensated for by increasing the operational bit rate. As a result, any de-tuning of parameter settings in a good AE coder increases the bit rate required to obtain a certain quality of the reconstructed speech. In practice, a majority of AE coders employ bit rates at which the quality of the reconstructed speech is good to excellent. Hereinafter, the meanings of "good" and "excellent" are defined by the descriptions contained in the commonly known Mean Opinion Score (MOS), which is based on a subjective evaluation.
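The AE property described above can be illustrated with a minimal sketch. The example assumes the simplest possible waveform coder, a uniform scalar quantizer applied directly to the samples; this is not CELP and not any coder named above. The function name `quantize`, the test signal and the chosen bit depths are all hypothetical. The sketch only shows that the reconstruction error shrinks toward zero as the number of bits per sample grows.

```python
import math

def quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniform mid-rise quantizer on [lo, hi); returns the reconstructed value."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    # Clamp the cell index so out-of-range inputs map to the nearest end cell.
    index = min(levels - 1, max(0, int((x - lo) / step)))
    return lo + (index + 0.5) * step

# Hypothetical test signal: a sinusoid with a 40-sample period.
signal = [0.9 * math.sin(2 * math.pi * t / 40) for t in range(200)]

def max_error(bits):
    """Largest absolute reconstruction error over the test signal."""
    return max(abs(s - quantize(s, bits)) for s in signal)

# Reconstruction error for increasing bit depth (i.e., increasing bit rate).
errors = {bits: max_error(bits) for bits in (2, 4, 8, 16)}
```

In this sketch the error never exceeds half a quantizer step, so each additional bit per sample halves the worst-case error, which is the behaviour the term "asymptotically exact" refers to.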
For most speech-coding paradigms implemented at bit rates below five kilobits per second, the reconstructed signal does not converge to the original signal when the quantization errors are set to zero. Hereinafter, such coders are referred to as parametric coders. Parametric coders are typically based on a model of the speech signal which is more sophisticated than the models used in waveform coders. However, since these coders lack the AE property of improved reconstructed signal quality with increased bit rates, slight shortcomings in the model may greatly affect the quality of the reconstructed speech signal. In relative terms, this effect on quality is most significant when high bit rate quantizers are used. Thus, the quality of the reconstructed speech signal cannot exceed a certain fixed maximum level which is primarily dependent on the particular model. Generally, this maximum quality level is below a "good" rating on the MOS scale.
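The quality ceiling of a parametric coder can be sketched with a deliberately crude model. The example assumes a hypothetical model that represents one pitch cycle by its fundamental sinusoid only, whereas the test signal also contains a second harmonic. Neither the model nor the signal is taken from any coder cited above. The model parameters are left unquantized, yet the reconstruction error does not vanish.

```python
import math

N = 40  # samples in one pitch cycle (assumed)

# Hypothetical "speech" cycle: a fundamental plus a weaker second harmonic.
signal = [math.sin(2 * math.pi * t / N) + 0.3 * math.sin(4 * math.pi * t / N)
          for t in range(N)]

# Parametric model (assumed): only the fundamental is represented.
# Its two parameters are obtained by projecting the signal onto cos and sin.
a = 2.0 / N * sum(s * math.cos(2 * math.pi * t / N) for t, s in enumerate(signal))
b = 2.0 / N * sum(s * math.sin(2 * math.pi * t / N) for t, s in enumerate(signal))

# Reconstruction from the exact (unquantized) parameters.
model = [a * math.cos(2 * math.pi * t / N) + b * math.sin(2 * math.pi * t / N)
         for t in range(N)]

# The residual is the part of the signal the model cannot express.
residual = max(abs(s - m) for s, m in zip(signal, model))
```

Here the residual equals the amplitude of the unmodelled harmonic, regardless of how finely `a` and `b` are quantized. Spending more bits on the parameters therefore cannot improve the result beyond what the model allows.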
It would therefore be advantageous to modify promising parametric coders to operate as AE coders. First, the use of the sophisticated speech-signal models associated with parametric coders results in efficient coding. Second, conversion to an AE coder removes the limitations on the quality of the reconstructed speech associated with parametric coders. To convert a parametric coder to an AE coder, however, the parametric coder needs to be amenable to such a modification. As will be described below, the waveform interpolation coder is indeed amenable to such a change. Furthermore, use of the present invention allows certain sinusoidal coders to be converted from parametric coders to AE coders as well.
Until recently, all versions of commonly known waveform interpolation coders (e.g., I. S. Burnett and D. H. Pham, "Multi-Prototype Waveform Coding Using Frame-by-Frame Analysis-by-Synthesis", Proc. International Conf. Acoust. Speech Sign. Process., 1997, pages 1567-1570, and Y. Shoham, "Very Low Complexity Interpolative Speech Coding at 1.2 to 2.4 kbps", Proc. International Conf. Acoust. Speech Sign. Process., 1997, pages 1599-1602) were parametric coders. Since the quality of the reconstructed speech signal is limited by the particular model, implementations of waveform interpolation coders have been designed at bit rates of approximately two thousand four hundred bits per second, where the shortcomings of the model are least apparent.
Recently, two AE versions of the waveform interpolation coder were proposed (W. B. Kleijn, H. Yang, and E. F. Deprettere, "Waveform Interpolation With Pitch-Spaced Subbands", Proc. International Conf. Speech and Language Process., 1998, pages 1795-1798). The basic coder operation is the same in both versions. In either version of the proposed waveform interpolation coder, a pitch period track of the speech signal is estimated by a pitch tracking unit which uses standard, commonly known techniques, with the pitch period track also continuing in regions of no discernible periodicity. Hereinafter, a speech signal is defined to be either the original speech signal or any signal derived from a speech signal, for example, a linear-prediction residual signal.
A digitized speech signal and the pitch-period track form an input to a time warping unit which outputs a speech signal having a fixed number of samples per pitch period. This constant-pitch-period speech signal forms an input to a nonadaptive filter bank. The coefficients coming out of the filter bank are quantized and the corresponding indices are encoded, with the quantization procedure potentially involving multiple steps. At the receiver, the quantized coefficients are reconstructed from the transmitted quantization indices. These coefficients form an input to a synthesis filter bank which produces the reconstructed signal as an output. The filter banks are perfect reconstruction filter banks (e.g., P. P. Vaidyanathan, "Multirate Systems and Filter Banks", Prentice Hall, 1993), which result in perfect reconstruction when the analysis and synthesis banks are concatenated, that is to say, when the quantization is turned off. Thus, the coder possesses the AE property if an appropriate unwarping procedure is used.
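The time warping step described above can be sketched as follows. The sketch assumes that the pitch-period track is available as a list of sample positions marking the start of each pitch cycle, and it uses linear interpolation to resample each cycle. Both choices are assumptions made for illustration; the cited coders may use a different warping procedure. The function name `warp_to_constant_pitch` and the synthetic signal are hypothetical.

```python
import math

def warp_to_constant_pitch(signal, pitch_marks, samples_per_period):
    """Resample every pitch cycle to a fixed number of samples.

    pitch_marks holds the start index of each cycle plus the end of the
    last one (assumed known). Linear interpolation is used for brevity.
    """
    warped = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        period = end - start
        for i in range(samples_per_period):
            pos = start + i * period / samples_per_period
            lo = int(math.floor(pos))
            hi = min(lo + 1, len(signal) - 1)  # guard the final sample
            frac = pos - lo
            warped.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return warped

# Synthetic signal: three cycles with pitch periods of 40, 50 and 64 samples.
periods = [40, 50, 64]
signal = [math.sin(2 * math.pi * j / p) for p in periods for j in range(p)]
pitch_marks = [0, 40, 90, 154]

FIXED = 32  # samples per pitch period after warping (assumed)
warped = warp_to_constant_pitch(signal, pitch_marks, FIXED)
```

After warping, every cycle occupies exactly `FIXED` samples, so a filter bank with a fixed block length stays aligned with the pitch cycles. Linear interpolation is not exactly invertible, so a coder intended to be AE would need a warping and unwarping pair chosen with more care than this sketch.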
In the two AE versions of the waveform interpolation coder described above, a Gabor transform and a Modulated Lapped Transform (MLT) were used as filter banks, respectively. Both procedures suffer from disadvantages which are difficult to overcome in practice. A primary disadvantage exhibited by both procedures is increased delay. In general, the Gabor-transform based waveform interpolation coder requires an over-sampled filter bank for good performance. This means that the number of coefficients to be quantized is larger than the number of samples in the original speech signal, which is a practical disadvantage for coding. When the MLT is used, the coder parameters are not easily converted into either a description of the speech waveforms or a description of the harmonics associated with voiced speech. This makes it more difficult to evaluate the effects of time-domain and frequency-domain masking.
In the Gabor-transform approach, the reconstructed signal is a summation of smoothly windowed complex exponential (sinusoid) functions (vectors). The scaling and summing of the functions is equivalent to the implementation of the synthesis filter bank. The coefficients for each of these windowed exponential functions form the representation to be quantized. In speech coding applications, the main purpose of the smooth window is to prevent any discontinuities in the energy contour of the reconstructed signal upon quantization of the coefficients. If such discontinuities are present, they become audible in voiced speech segments, which are the focus of the present invention. Furthermore, the commonly known Balian-Low theorem (e.g., S. Mallat, "A Wavelet Tour of Signal Processing", Academic Press, 1998) implies that a smooth window can be used only in combination with over-sampling. Therefore, over-sampling cannot be eliminated when the Gabor-transform based approach is used for a speech signal.
With a square window, the Gabor-transform filter bank can be critically sampled. This is convenient for coding since the output of the analysis filter bank has the same number of coefficients (samples) as the original signal has samples. Furthermore, in the case of a square window and critical sampling, the Gabor-transform filter bank reduces to the commonly known block Discrete Fourier Transform (DFT), which is attractive from both a computational and a delay viewpoint. Unfortunately, quantization of the coefficients results in discontinuities in the energy contour of the reconstructed signal.
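The behaviour of the block DFT described above can be sketched in a few lines. The sketch assumes a constant-pitch-period signal with eight samples per period, a slowly rising amplitude, and a deliberately coarse uniform quantizer applied to the real and imaginary parts of each coefficient. The names `N`, `BLOCKS`, `STEP` and the quantizer itself are hypothetical and chosen only to make the effect visible.

```python
import cmath
import math

N = 8        # samples per pitch period and per DFT block (assumed)
BLOCKS = 4   # number of pitch periods in the test signal (assumed)
STEP = 0.25  # coarse quantizer step size (assumed)

def dft(block):
    """Block DFT with a square window, scaled by 1/n."""
    n = len(block)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(block)) / n for k in range(n)]

def idft(coeffs):
    """Inverse of dft(); returns the real part of each output sample."""
    n = len(coeffs)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(coeffs)).real for t in range(n)]

def quantize(c, step):
    """Round real and imaginary parts to the nearest multiple of step."""
    return complex(round(c.real / step) * step, round(c.imag / step) * step)

# One cosine cycle per block, with an amplitude rising smoothly from 1 to 2.
signal = [(1 + t / (N * BLOCKS)) * math.cos(2 * math.pi * t / N)
          for t in range(N * BLOCKS)]
blocks = [signal[b * N:(b + 1) * N] for b in range(BLOCKS)]
coeffs = [dft(block) for block in blocks]

# Critical sampling: as many coefficients as input samples.
num_coeffs = sum(len(c) for c in coeffs)

# Quantization off: analysis followed by synthesis is exact.
exact = [x for c in coeffs for x in idft(c)]
max_exact_error = max(abs(p - q) for p, q in zip(signal, exact))

# Quantization on: each block is reconstructed with its own fixed amplitude.
coarse_blocks = [idft([quantize(x, STEP) for x in c]) for c in coeffs]
block_peaks = [max(abs(x) for x in block) for block in coarse_blocks]
```

With these assumed values, the peak amplitude of the original signal rises smoothly across the four blocks, whereas the peaks of the quantized reconstruction come out as 1.0, 1.5, 1.5 and 2.0. The amplitude, and hence the energy contour, therefore changes in steps located exactly at block boundaries.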
It would therefore be advantageous to devise a method and apparatus for pre-processing speech signals to create a pre-conditioned speech signal which eliminates the problems associated with the block-DFT based approach.