For many applications, e.g., mobile communications, voice mail, secure voice, etc., a speech codec that operates at 4.8 kbps or below and still delivers high-quality speech is needed. However, no previously known speech coding technique is able to produce near-toll-quality speech at this data rate. The government standard LPC-10, operating at 2.4 kbps, is not able to produce natural-sounding speech. Speech coding techniques successfully applied at higher data rates (>10 kbps) completely break down when tested at 4.8 kbps and below. To achieve the goal of near-toll-quality speech at 4.8 kbps, a new speech coding method is needed.
A key idea for high-quality speech coding at a low data rate is the use of the "analysis-by-synthesis" method. Based on this concept, an effective speech coding scheme, known as Code-Excited Linear Prediction (CELP), has been proposed by M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates", Proc. Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), pp. 937-940, 1985. CELP has proven to be effective in the areas of medium-band and narrow-band speech coding. Assuming there are L=4 excitation subframes in a speech frame of size N=160 samples (i.e., 40 samples per subframe), it has been shown that an excitation codebook of 1024 40-dimensional random Gaussian codewords is enough to produce speech which is indistinguishable from the original speech. For the actual realization of this scheme, however, several problems remain.
First, in the original scheme, most of the parameters to be transmitted, except the excitation signal, were left uncoded. Also, the parameter update rates were assumed to be high. Hence, for low-data-rate applications, where there are not enough data bits for accurate parameter coding and high update rates, the 1024 excitation codewords become inadequate. To achieve the same speech quality with a fully coded CELP codec, a data rate close to 10 kbps is required.
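The shortage of data bits can be illustrated with a short calculation. The following is a minimal sketch only; the 8 kHz sampling rate is an assumption (it is not stated above), and the function names are hypothetical. It counts the bits per frame at 4.8 kbps and the share consumed by the excitation codebook indices alone for N=160, L=4 and 1024 codewords.

```python
def frame_bit_budget(bit_rate_bps, frame_samples, sample_rate_hz):
    """Bits available per speech frame at the given channel rate."""
    return bit_rate_bps * frame_samples / sample_rate_hz


def excitation_index_bits(codebook_size, subframes_per_frame):
    """Bits per frame spent on codebook indices (one index per subframe)."""
    bits_per_index = (codebook_size - 1).bit_length()
    return subframes_per_frame * bits_per_index


SAMPLE_RATE_HZ = 8000  # assumption: telephone-band sampling, not given in the text

budget = frame_bit_budget(4800, 160, SAMPLE_RATE_HZ)    # 96.0 bits per frame
index_bits = excitation_index_bits(1024, 4)             # 40 bits per frame
remaining = budget - index_bits                         # 56.0 bits per frame
```

Under these assumptions, the excitation indices alone take 40 of the 96 bits in each frame, so the short-term filter, the long-term filter and all gains must share the remaining 56 bits.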
Secondly, typical CELP coders use random Gaussian, Laplacian, uniform, or pulse vectors, or a combination of them, to form the excitation codebook. A full-search, analysis-by-synthesis procedure is used to find the best excitation vector from the codebook. A major drawback of this approach is that the computational requirement in finding the best excitation vector is extremely high. As a result, for real-time operation, the size of the excitation codebook has to be limited (e.g., <1024) if minimal hardware is to be used.
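A full-search, analysis-by-synthesis procedure of the kind referred to above can be sketched as follows. This is a simplified illustration, not the procedure of any particular coder: it assumes a zero-state all-pole synthesis filter, a plain squared-error criterion with an optimal gain per codeword, and no perceptual weighting or long-term filter. All function names are hypothetical.

```python
import random


def synthesize(excitation, lpc):
    """Zero-state all-pole synthesis 1/A(z), with A(z) = 1 - sum_k lpc[k-1] z^-k."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out


def make_gaussian_codebook(size, dim, seed=0):
    """Codebook of `size` random Gaussian codewords of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]


def full_search(target, codebook, lpc):
    """Synthesize every codeword and keep the one with the least squared error.

    Returns (index, gain, error), where gain is the least-squares scale
    factor for the selected codeword.
    """
    target_energy = sum(t * t for t in target)
    best_index, best_gain, best_error = None, 0.0, float("inf")
    for i, codeword in enumerate(codebook):
        y = synthesize(codeword, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        error = target_energy - corr * corr / energy
        if error < best_error:
            best_index, best_gain, best_error = i, corr / energy, error
    return best_index, best_gain, best_error
```

Every codeword is filtered and compared against the target for every subframe, so in this sketch the cost grows linearly with the codebook size.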
Thirdly, an excitation codebook containing 1024 40-dimensional random Gaussian codewords requires a computer memory space of 1024×40 = 40,960 words. This memory space requirement for the excitation codebook alone already exceeds the storage capabilities of most commercially available DSP chips. Many CELP coders, hence, have to be designed with a smaller excitation codebook. The coder performance, therefore, is limited, especially for unvoiced sounds. To enhance the coder performance, an effective method to significantly increase the codebook size without a corresponding increase in the computational complexity (and the memory requirement) is needed.
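The storage figure and the order of magnitude of the search cost can be checked with a short calculation. The sketch below is an estimate under stated assumptions, not a measurement: a 10th-order short-term filter and an 8 kHz sampling rate are assumed (neither is given above), each codeword is filtered by a direct zero-state recursion, and only multiply-accumulate operations (MACs) are counted. The function names are hypothetical.

```python
def codebook_words(num_codewords, dim):
    """Memory words needed to store the codebook, one word per sample."""
    return num_codewords * dim


def search_macs_per_subframe(num_codewords, dim, filter_order):
    """Rough MAC count for one exhaustive codebook search."""
    # Zero-state recursion: output sample n uses min(n, filter_order) past outputs.
    filter_macs = sum(min(n, filter_order) for n in range(dim))
    # One correlation with the target and one energy term per codeword.
    match_macs = 2 * dim
    return num_codewords * (filter_macs + match_macs)


FILTER_ORDER = 10       # assumption: typical short-term predictor order
SUBFRAMES_PER_SEC = 200  # assumption: 8000 Hz / 40 samples per subframe

words = codebook_words(1024, 40)                               # 40960
macs = search_macs_per_subframe(1024, 40, FILTER_ORDER)        # 435200
macs_per_second = macs * SUBFRAMES_PER_SEC                     # 87040000
```

Under these assumptions, the exhaustive search costs on the order of tens of millions of MACs per second, in addition to the 40,960 words of storage.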
As described above, there are not enough data bits for accurate excitation representation at 4.8 kbps and below. Compared with the ideal excitation, which is the residual signal after both the short-term and the long-term filters, the CELP excitation still shows considerable discrepancy. Thus, several critical parts of a CELP coder must be designed carefully. For example, accurate encoding of the short-term filter is found to be important because the excitation cannot compensate for errors in it. Also, appropriate bit allocation between the long-term filter (in terms of the update rate) and the excitation (in terms of the codebook size) is found to be necessary for good coder performance. However, even with complicated coding schemes, toll quality is still difficult to achieve.
Multipulse excitation, as described by B. S. Atal and J. R. Remde, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates", Proc. ICASSP, pp. 614-617, 1982, has proven to be an effective excitation model for linear predictive coders. It is a flexible model for both voiced and unvoiced sounds, and it is also a considerably compressed representation of the ideal excitation signal. Hence, from the encoding point of view, multipulse excitation constitutes a good set of excitation signals. However, with typical scalar quantization schemes, the required data rate is usually beyond 10 kbps. To reduce the data rate, either the number of excitation pulses has to be reduced by better modelling of the LPC spectral filter, e.g., as described by I. M. Trancoso, L. B. Almeida and J. M. Tribolet, "Pole-Zero Multipulse Speech Representation Using Harmonic Modelling in the Frequency Domain", Proc. ICASSP, pp. 7.8.1-7.8.4, 1985, or more efficient coding methods have to be used, or both. Applying vector quantization, e.g., as described by A. Buzo, A. H. Gray, R. M. Gray, and J. D. Markel, "Speech Coding Based Upon Vector Quantization", IEEE Trans. Acoust., Speech, and Signal Processing, pp. 562-574, October 1980, directly to the multipulse vectors is one way to pursue the latter approach. However, several obstacles, e.g., the definition of an appropriate distortion measure and the computation of the centroid from a cluster of multipulse vectors, have hindered the application of multipulse excitation in the low-bit-rate area.
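The sense in which a multipulse excitation is a compressed representation can be illustrated with a simplified greedy pulse placement. The sketch below is in the spirit of multipulse analysis but is not claimed to reproduce the cited procedure: it assumes a known, finite impulse response for the synthesis filter, uses a plain squared-error criterion without perceptual weighting, and does not re-optimize earlier amplitudes. All names are hypothetical.

```python
def shifted_response(h, position, length):
    """Impulse response h placed at `position`, truncated to the frame length."""
    out = [0.0] * length
    for k, v in enumerate(h):
        if position + k < length:
            out[position + k] = v
    return out


def greedy_multipulse(target, h, num_pulses):
    """Place pulses one at a time, each minimizing the remaining squared error.

    Returns (pulses, residual), where pulses is a list of (position, amplitude).
    """
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None  # (error_reduction, position, amplitude, response)
        for pos in range(len(target)):
            resp = shifted_response(h, pos, len(target))
            energy = sum(v * v for v in resp)
            if energy == 0.0:
                continue
            corr = sum(r * v for r, v in zip(residual, resp))
            reduction = corr * corr / energy
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy, resp)
        if best is None:
            break
        _, pos, amp, resp = best
        pulses.append((pos, amp))
        residual = [r - amp * v for r, v in zip(residual, resp)]
    return pulses, residual
```

Each frame is then described by a few (position, amplitude) pairs rather than by every excitation sample; the cost of scalar-quantizing those pairs is what drives the data rate discussed above.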
Hence, applying the CELP codec structure to 4.8 kbps speech coding requires a carefully balanced system design and effective parameter coding techniques.