The present invention relates to a method and an apparatus for low bit rate speech signal coding.
Searching for an excitation sequence of a speech signal at short time intervals, such that the error of the signal reproduced from the sequence relative to the input signal is minimized, is a method known in the art which is capable of coding a speech signal at a transmission rate of 10 kilobits per second (kbps) or less. For example, the A-b-S (Analysis-by-Synthesis) method (prior art 1) proposed by B. S. Atal at Bell Telephone Laboratories of the United States is noteworthy in that the excitation sequence is represented by a plurality of pulses whose amplitudes and phases are determined on the coder side at short time intervals. For details of such a method, reference may be made to "A NEW MODEL OF LPC EXCITATION FOR PRODUCING NATURAL-SOUNDING SPEECH AT LOW BIT RATES," ICASSP, pp. 614-617, 1982 (reference 1). However, a problem with the prior art 1 is that the A-b-S method used to determine the pulse sequence requires a prohibitive amount of calculation. Another prior art approach (prior art 2), which determines a pulse sequence and is designed to reduce the amount of calculation, is described by T. Araseki, K. Ozawa, S. Ono and K. Ochiai in "MULTI-PULSE EXCITED SPEECH CODER BASED ON MAXIMUM CROSSCORRELATION SEARCH ALGORITHM," IEEE Global Telecommunications Conference, 23.3, Dec. 1987 (reference 2). Various pulse search algorithms (prior art 3) of the type using correlation functions have been proposed by K. Ozawa, S. Ono and T. Araseki in "A Study on Pulse Search Algorithms for Multipulse Excited Speech Coder Realization," IEEE Journal on Selected Areas in Communications, Vol. SAC-4, No. 1, Jan. 1986 (reference 3). In accordance with the prior art 3, sound is reproducible with high quality at transmission rates of 8 to 16 kbps.
The prior art method which uses correlation functions may be outlined as follows. The excitation sequence comprising K pulses within a frame is expressed as:

$$V(n) = \sum_{k=1}^{K} g_k \, \delta(n - m_k), \qquad n = 1, 2, \ldots, N \tag{1}$$

where $\delta(\cdot)$ is the Kronecker delta, $N$ is the frame length, and $g_k$ is the pulse amplitude at a location $m_k$.
LPC (Linear Predictive Coding) parameters for a synthesis filter are determined from the covariance of the speech signal $X(n)$ formed into a frame. The synthesis filter characteristic $H(z)$ is given, in the Z-transform notation, by:

$$H(z) = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}} \tag{2}$$

where $a_i$ are the filter coefficients of the LPC synthesis filter, and $P$ is the filter order.
Let $h(n)$ be the impulse response of the synthesis filter. Then, the reproduced signal $Y(n)$ obtained by inputting $V(n)$ to the synthesis filter can be written as:

$$Y(n) = V(n) * h(n) = \sum_{k=1}^{K} g_k \, h(n - m_k) \tag{3}$$

where $*$ denotes convolution.
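The relationship between the excitation sequence, the synthesis filter and the reproduced signal may be illustrated by the following minimal sketch. It assumes the all-pole form H(z) = 1/(1 - sum of a_i z^-i), zero initial filter state, and 0-based pulse locations; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
def excitation(pulses, frame_len):
    """Build V(n): amplitude g at each 0-based location m, zero elsewhere."""
    v = [0.0] * frame_len
    for m, g in pulses:
        v[m] += g
    return v


def synthesize(v, a):
    """All-pole recursion y(n) = v(n) + sum_i a[i-1] * y(n-i), zero initial state."""
    y = []
    for n, vn in enumerate(v):
        acc = vn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def impulse_response(a, length):
    """h(n): response of the synthesis filter to a unit pulse at n = 0."""
    return synthesize([1.0] + [0.0] * (length - 1), a)


def convolve_truncated(v, h):
    """Y(n) = sum_j v(j) * h(n - j), truncated to the frame length."""
    frame_len = len(v)
    y = [0.0] * frame_len
    for j, vj in enumerate(v):
        if vj == 0.0:
            continue
        for n in range(j, min(frame_len, j + len(h))):
            y[n] += vj * h[n - j]
    return y


# Hypothetical second-order filter (poles at 0.5 and 0.4) and two pulses.
a = [0.9, -0.2]
pulses = [(2, 1.0), (5, -0.5)]
v = excitation(pulses, 12)

y_filter = synthesize(v, a)                            # direct filtering
y_conv = convolve_truncated(v, impulse_response(a, 12))  # sum of shifted h(n)
```

Within one frame the two results agree to rounding error, which is why the reproduced signal can be treated as a sum of scaled, shifted copies of the impulse response.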
The weighted mean squared error between the input speech signal $X(n)$ and the reproduced signal $Y(n)$ within one frame is given by:

$$E = \sum_{n=1}^{N} \left[ \left( X(n) - Y(n) \right) * W(n) \right]^2 \tag{4}$$

where $W(n)$ is the weighting function. The weighting function $W(n)$ is introduced to reduce perceptual distortion in the reproduced speech. According to the auditory masking effect, noise tends to be masked in a zone where the speech energy is greater. The weighting function is determined based on these auditory characteristics. As regards the weighting function, there has been proposed a Z-transform function $W(z)$ which uses a real constant $\gamma$ and the predictive parameters $a_i$ of the synthesis filter under the condition of $0 \le \gamma \le 1$ (see the reference 1), i.e.,

$$W(z) = \frac{1 - \sum_{i=1}^{P} a_i z^{-i}}{1 - \sum_{i=1}^{P} a_i \gamma^i z^{-i}} \tag{5}$$

Eq. (4) may be rewritten as:

$$E = \sum_{n=1}^{N} \left[ X_w(n) - \sum_{k=1}^{K} g_k \, h_w(n - m_k) \right]^2 \tag{6}$$

where $X_w(n)$ and $h_w(n)$ stand for the weighted signals of $X(n)$ and $h(n)$, respectively.
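The weighting operation may be sketched as follows, assuming the weighting filter has the form W(z) = (1 - sum of a_i z^-i) / (1 - sum of a_i gamma^i z^-i) and the synthesis filter the form H(z) = 1/(1 - sum of a_i z^-i). Under those assumptions the cascade H(z)W(z) reduces to 1/(1 - sum of a_i gamma^i z^-i), so the weighted impulse response can be generated directly. All function names and numerical values are hypothetical.

```python
def fir(x, b):
    """y(n) = sum_i b[i] * x(n - i), zero initial state."""
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]


def allpole(x, a):
    """y(n) = x(n) + sum_i a[i-1] * y(n - i), zero initial state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def perceptual_weight(x, a, gamma):
    """Apply the assumed W(z): numerator as FIR, denominator as all-pole."""
    numerator = [1.0] + [-ai for ai in a]
    denominator = [ai * gamma ** i for i, ai in enumerate(a, start=1)]
    return allpole(fir(x, numerator), denominator)


def weighted_impulse_response(a, gamma, length):
    """h_w(n) from the cascade H(z)W(z) = 1 / (1 - sum a_i gamma^i z^-i)."""
    impulse = [1.0] + [0.0] * (length - 1)
    return allpole(impulse, [ai * gamma ** i for i, ai in enumerate(a, start=1)])


def weighted_error(x, y, a, gamma):
    """E: energy of the weighted difference between input and reproduction."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return sum(e * e for e in perceptual_weight(diff, a, gamma))


# Hypothetical coefficients and input samples.
a = [0.9, -0.2]
x = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0]
x_w = perceptual_weight(x, a, 0.8)
h_w = weighted_impulse_response(a, 0.8, 6)
```

With gamma = 1 the assumed W(z) equals unity and no weighting is applied; with gamma = 0 it reduces to the inverse filter 1 - sum of a_i z^-i.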
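Assuming that k-1 pulses have been determined, the amplitude $g_k$ of the k-th pulse at a location $m_k$ is given by setting the derivative of the error power $E$ with respect to the k-th amplitude $g_k$ to zero for $1 \le m_k \le N$. Hence, there holds an equation:

$$g_k(m_k) = \frac{\sum_{n=1}^{N} X_w(n) \, h_w(n - m_k) - \sum_{i=1}^{k-1} g_i \sum_{n=1}^{N} h_w(n - m_i) \, h_w(n - m_k)}{\sum_{n=1}^{N} h_w(n - m_k)^2} \tag{7}$$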
From the above Eqs. (6) and (7), it will be seen that the optimum pulse location is given at the point $m_k$ where the absolute value of $g_k$ is maximum. By properly processing the frame edge, the above equations can be further reduced to:

$$g_k(m_k) = \frac{R_{hx}(m_k) - \sum_{i=1}^{k-1} g_i \, R_{hh}(|m_k - m_i|)}{R_{hh}(0)} \tag{8}$$

where $R_{hx}(m_k)$ is the crosscorrelation function between the weighted speech $X_w(n)$ and the weighted impulse response $h_w(n)$, and $R_{hh}(|m_k - m_i|)$ is the autocorrelation function of the weighted impulse response $h_w(n)$.
Actual pulse search is performed by using an error criterion function $R(n)$. In the first stage (k=1), $R(n)$ is the same as the crosscorrelation $R_{hx}(n)$. The absolute maximum of $R(n)$ is searched for, and the optimum pulse location is determined. The amplitude is determined from Eq. (8) by using the obtained location $m_1$. $R(n)$ is then modified by subtracting $g_k R_{hh}(|n - m_k|)$ from $R(n)$. Then, after incrementing k, the next pulse search is executed based on the maximum crosscorrelation search, until the number of pulses reaches a predetermined number. $R(n)$ in the k-th stage, $R(n)^{(k)}$, is represented by:

$$R(n)^{(k)} = R_{hx}(n) - \sum_{i=1}^{k-1} g_i \, R_{hh}(|n - m_i|) \tag{9}$$
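The search procedure described above may be sketched as follows. This is an illustrative outline, not the implementation of any of the cited references. It assumes 0-based locations, that each amplitude is the current criterion value divided by the zero-lag autocorrelation, and that frame-edge effects are ignored by using a single autocorrelation sequence for all locations. The function names and test signal are hypothetical.

```python
def crosscorrelation(xw, hw, m):
    """Rhx(m) = sum_n xw(n) * hw(n - m), restricted to the frame."""
    return sum(xw[n] * hw[n - m] for n in range(m, min(len(xw), m + len(hw))))


def autocorrelation(hw, lag):
    """Rhh(lag) = sum_n hw(n) * hw(n + lag); zero once lag >= len(hw)."""
    return sum(hw[n] * hw[n + lag] for n in range(len(hw) - lag))


def search_pulses(xw, hw, num_pulses):
    """Sequential maximum-crosscorrelation pulse search without readjustment."""
    frame_len = len(xw)
    # Stage k = 1: the criterion R(n) equals the crosscorrelation Rhx(n).
    r = [crosscorrelation(xw, hw, m) for m in range(frame_len)]
    rhh = [autocorrelation(hw, lag) for lag in range(frame_len)]
    pulses = []
    for _ in range(num_pulses):
        # Location: the point where |R(n)| is largest.
        m_k = max(range(frame_len), key=lambda n: abs(r[n]))
        # Amplitude: criterion value normalised by Rhh(0).
        g_k = r[m_k] / rhh[0]
        pulses.append((m_k, g_k))
        # Remove the contribution of the new pulse from the criterion.
        for n in range(frame_len):
            r[n] -= g_k * rhh[abs(n - m_k)]
    return pulses


# Hypothetical weighted impulse response and a target built from two pulses.
hw = [1.0, 0.5, 0.25]
xw = [0.0] * 16
for m, g in [(3, 2.0), (10, -1.0)]:
    for j, hj in enumerate(hw):
        xw[m + j] += g * hj

found = search_pulses(xw, hw, 2)
```

In this example the two generating pulses do not overlap, so the search recovers their locations and amplitudes; when pulses overlap, the sequentially determined amplitudes are generally no longer jointly optimal, which motivates the amplitude adjustment variants.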
As regards the pulse search, four different methods have been proposed (prior art 3): a method 2 which, when the k-th pulse has been determined, adjusts its amplitude and the amplitudes of the k-1 pulses determined before; a method 2-2 which adjusts the amplitude of the k-th pulse and those of the two pulses nearest thereto; a method 2-1 which adjusts the amplitude of the k-th pulse and that of the one pulse nearest thereto; and a method 1 which does not perform any amplitude adjustment. The quality of sound reproduction becomes progressively higher in the order of the methods 1, 2-1, 2-2 and 2. However, as regards the amount of calculation necessary for pulse search, the methods 2-1, 2-2 and 2 require, respectively, substantially twice, three times and K/2 times as much as the method 1 and are, therefore, impractical.
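The text does not state how the amplitudes are adjusted. One plausible reading, assumed here purely for illustration, is a joint least-squares refit of all amplitudes at the already chosen locations, i.e. solving the normal equations sum_j g_j Rhh(|m_i - m_j|) = Rhx(m_i). The sketch below follows that assumption, with hypothetical function names, 0-based distinct locations, and frame-edge effects ignored.

```python
def correlations(xw, hw):
    """Return (Rhx, Rhh) as lists indexed by location and by lag."""
    frame_len = len(xw)
    rhx = [sum(xw[n] * hw[n - m] for n in range(m, min(frame_len, m + len(hw))))
           for m in range(frame_len)]
    rhh = [sum(hw[n] * hw[n + lag] for n in range(len(hw) - lag))
           for lag in range(frame_len)]
    return rhx, rhh


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def readjust_amplitudes(locations, rhx, rhh):
    """Jointly refit the amplitudes at fixed, distinct pulse locations."""
    matrix = [[rhh[abs(mi - mj)] for mj in locations] for mi in locations]
    rhs = [rhx[mi] for mi in locations]
    return solve_linear(matrix, rhs)


# Hypothetical example: two overlapping pulses at adjacent locations.
hw = [1.0, 0.5, 0.25]
xw = [0.0] * 12
for m, g in [(3, 2.0), (4, 1.0)]:
    for j, hj in enumerate(hw):
        xw[m + j] += g * hj

rhx, rhh = correlations(xw, hw)
sequential_first = rhx[3] / rhh[0]                 # no adjustment: about 2.48
adjusted = readjust_amplitudes([3, 4], rhx, rhh)   # joint refit: about [2.0, 1.0]
```

Solving a system whose size grows with the number of pulses at every stage is consistent with the larger amount of calculation attributed to the adjusting methods, whereas restricting the refit to the one or two nearest pulses keeps the system small.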