Speech and audio coding algorithms have a wide variety of applications in wireless communication, multimedia and voice storage systems. The development of the coding algorithms is driven by the need to save transmission and storage capacity while maintaining the quality of the synthesized signal at a high level. These requirements are often contradictory, and thus a compromise between capacity and quality must typically be made. The use of speech coding is particularly important in mobile telecommunication systems since the transmission of the full speech spectrum would require significant bandwidth in an environment where spectral resources are relatively limited. Therefore, signal compression through speech encoding and decoding is employed, which is essential for efficient speech transmission at low bit rates.
FIG. 1 shows an exemplary procedure for the transmission and/or storage of digital audio signals for subsequent reproduction at the output end. A speech signal y(k) is input into encoder 100, which encodes it into a coded digital representation of the original signal. The resulting bit stream is sent to a communication channel (e.g. a radio channel) or storage medium 110, such as a solid state memory or a magnetic or optical storage medium. From the channel/storage medium 110, the bit stream is input into a decoder 120 where it is decoded in order to reproduce the original signal y(k) in the form of output signal ŷ(k).
Speech coding algorithms and systems can be categorized in different ways depending on the criterion used. One classification distinguishes waveform coders, parametric coders, and hybrid coders. Waveform coders, as the name implies, try to preserve the waveform being coded as closely as possible without paying much attention to the characteristics of the speech signal. Waveform coders also have the advantage of being relatively simple, and they typically perform well in noisy environments. However, they generally require relatively high bit rates to produce high quality speech. Hybrid coders use a combination of waveform and parametric techniques in that they typically use parametric approaches to model, e.g., the vocal tract by an LPC filter. The input signal for the filter is then coded by using what could be classified as a waveform coding method. Currently, hybrid speech coders are widely used to produce near-wireline speech quality at bit rates in the range of 8–12 kbps.
In many current hybrid coders, the transmitted parameters are determined in an Analysis-by-Synthesis (AbS) fashion, where the selected distortion criterion between the original speech signal and the reconstructed speech is evaluated for each possible parameter value and the value giving the minimum distortion is chosen. These coders are thus often called AbS speech coders. By way of example, in a typical AbS coder, each excitation candidate is taken from a codebook and filtered through the LPC filter, the error between the filtered signal and the input signal is calculated, and the candidate providing the smallest error is chosen.
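By way of a non-limiting illustration, the closed-loop selection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the search used by any particular coder: the names `one_pole` and `abs_search`, the three-entry codebook, and the single-coefficient filter standing in for the LPC filter are all hypothetical, and plain squared error is assumed as the distortion criterion.

```python
def one_pole(excitation, a1=-0.9):
    """Hypothetical stand-in for the LPC synthesis filter: y[k] = x[k] - a1 * y[k-1]."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x - a1 * prev
        out.append(prev)
    return out


def abs_search(target, codebook, synth=one_pole):
    """Synthesize every candidate and return (index, error) of the closest match."""
    best_index, best_error = -1, float("inf")
    for index, candidate in enumerate(codebook):
        synthesized = synth(candidate)
        # Squared error between the input signal and the synthesized candidate.
        error = sum((t - y) ** 2 for t, y in zip(target, synthesized))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


# Toy codebook of unit pulses (assumed for illustration only).
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# The "input speech" here is built from entry 1, so the search should recover it.
target = one_pole([0.0, 1.0, 0.0, 0.0])
best_index, best_error = abs_search(target, codebook)
```

Because every candidate is passed through the same synthesis filter before comparison, the error is measured on the reconstructed signal rather than on the excitation, which is the defining property of the AbS approach.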
In a typical AbS speech coder, the input speech signal is processed in frames. Usually the frame length is 10–30 ms, and a look-ahead segment of 5–15 ms of the subsequent frame is also available. In every frame, a parametric representation of the speech signal is determined by an encoder. The parameters are quantized, and transmitted through a communication channel or stored in a storage medium in digital form. At the receiving end, a decoder constructs a synthesized speech signal representative of the original signal based on the received parameters.
One important class of analysis-by-synthesis speech coder is the Code Excited Linear Predictive (CELP) speech coder, which is widely used in many wireless digital communication systems. CELP is an efficient closed-loop analysis-by-synthesis coding method that has proven to work well for low bit rate systems in the range of 4–16 kbps. In CELP coders, speech is segmented into frames (e.g. 10–30 ms), and an optimum set of linear prediction and pitch filter parameters is determined and quantized for each frame. Each speech frame is further divided into a number of subframes (e.g. 5 ms) where, for each subframe, an excitation codebook is searched to find an input vector to the quantized predictor system that gives the best reproduction of the original speech signal.
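As an illustrative sketch of the frame and subframe segmentation described above, the following assumes a sampling frequency of 8000 Hz, a 20 ms frame, and 5 ms subframes. The function names and the choice to discard a trailing partial frame are assumptions made for the example only.

```python
SAMPLE_RATE_HZ = 8000  # assumed sampling frequency


def ms_to_samples(duration_ms, rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return rate_hz * duration_ms // 1000


def split_into_subframes(signal, frame_ms=20, subframe_ms=5):
    """Return a list of frames, each a list of equal-length subframes.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = ms_to_samples(frame_ms)
    sub_len = ms_to_samples(subframe_ms)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frames.append([frame[i:i + sub_len] for i in range(0, frame_len, sub_len)])
    return frames


signal = list(range(400))            # 50 ms of placeholder samples
frames = split_into_subframes(signal)
```

Under these assumptions each frame holds 160 samples and each subframe 40 samples, so the 400-sample input yields two complete frames of four subframes each.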
The basic underlying structure of most AbS coders is quite similar. Typically they employ a type of linear predictive coding (LPC) technique, for example, a cascade of time variant pitch predictor and an LPC filter. An all-pole LPC filter:
$$\frac{1}{A(q,s)} = \frac{1}{1 + a_1(s)\,q^{-1} + a_2(s)\,q^{-2} + \dots + a_{n_a}(s)\,q^{-n_a}}, \qquad (1)$$
where $q^{-1}$ is the unit delay operator and s is the subframe index, is used to model the short-time spectral envelope of the speech signal.
The order $n_a$ of the LPC filter is typically 8–12.
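For illustration, the all-pole filter of equation (1) corresponds to the recursion ŷ(k) = x(k) − a_1 ŷ(k−1) − … − a_{n_a} ŷ(k−n_a). The sketch below assumes that the coefficients are held constant over the samples being filtered (one subframe) and that the filter starts from a zero state; the function name `lpc_synthesis` is hypothetical.

```python
def lpc_synthesis(excitation, a):
    """Filter `excitation` through 1/A(q), A(q) = 1 + a[0] q^-1 + ... + a[n-1] q^-n.

    Coefficients are assumed fixed for the whole input and the filter
    memory is assumed to start at zero.
    """
    out = []
    for k, x in enumerate(excitation):
        acc = x
        for i, coeff in enumerate(a, start=1):
            if k - i >= 0:
                acc -= coeff * out[k - i]  # feedback from past output samples
        out.append(acc)
    return out


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
first_order = lpc_synthesis(impulse, [-0.5])        # A(q) = 1 - 0.5 q^-1
second_order = lpc_synthesis(impulse, [0.0, 0.25])  # A(q) = 1 + 0.25 q^-2
```

The first-order impulse response decays geometrically (1, 0.5, 0.25, …), which shows how the sign convention of equation (1) maps onto the subtraction in the recursion.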
A pitch predictor of the form:
$$\frac{1}{B(q,s)} = \frac{1}{1 - b(s)\,q^{-\tau(s)}} \qquad (2)$$
utilizes the pitch periodicity of speech to model the fine structure of the spectrum. Typically, the gain b(s) is bounded to the interval [0, 1.2], and the pitch lag τ(s) to the interval [20, 140] samples (assuming a sampling frequency of 8000 Hz). The pitch predictor is also referred to as a long-term predictor (LTP) filter.
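By way of example only, the pitch predictor of equation (2) corresponds to the recursion v(k) = x(k) + b·v(k−τ). The following sketch assumes a constant gain and an integer lag over the samples being filtered and enforces the typical bounds stated above; the function name `ltp_synthesis` and the optional `history` argument (past output samples, newest last) are assumptions of the example.

```python
def ltp_synthesis(excitation, gain, lag, history=None):
    """Filter `excitation` through 1/B(q) = 1 / (1 - gain * q^-lag)."""
    if not 0.0 <= gain <= 1.2:
        raise ValueError("gain outside the typical interval [0, 1.2]")
    if not 20 <= lag <= 140:
        raise ValueError("lag outside the typical interval [20, 140] samples")
    past = list(history) if history is not None else [0.0] * lag
    if len(past) < lag:
        raise ValueError("history must hold at least `lag` samples")
    out = []
    for k, x in enumerate(excitation):
        # Output delayed by `lag` samples: from this call if available,
        # otherwise from the supplied history.
        delayed = out[k - lag] if k >= lag else past[len(past) - lag + k]
        out.append(x + gain * delayed)
    return out


pulse = [1.0] + [0.0] * 59
periodic = ltp_synthesis(pulse, gain=0.5, lag=20)
```

A single input pulse is repeated every 20 samples with amplitude scaled by the gain at each repetition, which is how the filter reproduces the pitch periodicity of voiced speech.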
FIG. 2 shows a simplified functional block diagram of an exemplary AbS speech encoder. An excitation signal $u_c(k)$ is produced by an excitation generator 200, which is often referred to as an excitation codebook. The excitation signal is multiplied by a gain g(s) 205 to form an input signal to a filter cascade 225. A feedback loop consisting of the delay $q^{-\tau(s)}$ 215 and the gain b(s) 210 represents an LTP filter. The LTP filter models the periodicity of the signal, which is especially relevant in voiced speech: the prior periodic speech is used as an approximation of the speech in the current subframe, and the remaining error is coded using a fixed excitation such as an algebraic codebook. The output of the filter cascade 225 is a synthesized speech signal ŷ(k). In the encoder, an error signal e(k), evaluated as a mean squared weighted error, is computed by subtracting the synthesized speech signal ŷ(k) from the original speech signal y(k). An error minimizing procedure 235 is employed to choose the best excitation signal provided by the excitation generator 200. Typically, a perceptual weighting filter is applied to the error signal prior to the error minimization procedure in order to shape the spectrum of the error signal so that it is less audible.
Although AbS speech coders generally provide good performance at low bit rates, they are relatively computationally demanding. Another characteristic is that at low bit rates, e.g. below 4 kbps, the matching to the original speech waveform becomes a severe constraint on improving the coding efficiency further. This applies to the coding of speech in general, which includes voiced, unvoiced, and plosive speech. Although solutions have been put forth for improving the modeling of voiced speech, substantial improvements in modeling nonstationary speech such as plosives have so far not been presented. As known by those skilled in the art, plosives, such as the stop consonants /p/, /k/, and /t/, and unvoiced speech tend to be abrupt. These speech waveforms are particularly difficult to model accurately in prior-art low bit rate AbS coders since there is often a clear mismatch between the original and coded excitation signals due to the lack of bits to accurately model the original excitation. Because of the parameter estimation method, the differences in the overall waveform shape cause the energy of the coded excitation to be much smaller than that of the ideal excitation. This often results in synthesized speech that has a very low energy level and can sound unnatural.
FIG. 3 illustrates the resulting synthetic excitation of a CELP coder when using a codebook having a relatively high pulse population density (codebook 1) i.e. a dense pulse position grid. Also shown is the resulting synthetic excitation when using a codebook having a relatively lower pulse population density (codebook 2). In top graph A, the ideal excitation for the sound /p/ is shown. In both codebooks, two positive or negative pulses are used over a subframe of 40 samples. The example pulse locations and shifts for the individual codebooks are presented separately in Table 1 and Table 2 respectively. As can be seen by the bottom graph C, the excitation signal constructed by using the codebook of Table 2 has a much lower energy level than the ideal excitation (top) since the possible pulse locations do not match well with pulse locations in the ideal excitation. In contrast, when codebook 1 is used, the energy is significantly higher because the pulse locations more closely match the ideal excitation, as shown in the middle graph B. For both codebooks, only one pulse gain is used per subframe and adaptive codebooks are not used.
TABLE 1
Pulse    Positions
0        0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38
1        1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39
TABLE 2
Pulse    Positions
0        0, 4, 8, 12, 16, 20, 24, 28, 32, 36
1        2, 6, 10, 14, 18, 22, 26, 30, 34, 38
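The energy loss caused by a sparse position grid can be illustrated with the pulse positions of Table 1 and Table 2. The sketch below is a simplification under stated assumptions: the target excitation is a hypothetical three-sample signal (not the data of FIG. 3), the search is carried out directly in the excitation domain without the LPC filter or perceptual weighting, and the function name is illustrative. Two signed unit pulses share one least-squares gain, as in the example described above.

```python
from itertools import product

SUBFRAME = 40
DENSE = [list(range(0, 40, 2)), list(range(1, 40, 2))]   # Table 1 positions
SPARSE = [list(range(0, 40, 4)), list(range(2, 40, 4))]  # Table 2 positions


def best_two_pulse_excitation(target, tracks):
    """Pick one signed pulse per track and a single shared gain.

    For a fixed code vector c with two unit pulses (energy 2), the
    least-squares gain is corr / 2 and the squared error equals
    ||target||^2 - corr^2 / 2, so minimizing the error is the same as
    maximizing the coded energy corr^2 / 2.
    """
    best_excitation, best_energy = [0.0] * SUBFRAME, 0.0
    signs = (1.0, -1.0)
    for p0, p1, s0, s1 in product(tracks[0], tracks[1], signs, signs):
        corr = s0 * target[p0] + s1 * target[p1]
        if corr <= 0.0:
            continue  # keep the shared gain positive
        gain = corr / 2.0
        energy = 2.0 * gain * gain
        if energy > best_energy:
            excitation = [0.0] * SUBFRAME
            excitation[p0] = gain * s0
            excitation[p1] = gain * s1
            best_excitation, best_energy = excitation, energy
    return best_excitation, best_energy


# Hypothetical plosive-like target: main peaks at samples 13 and 20.
target = [0.0] * SUBFRAME
target[13], target[14], target[20] = 1.0, 0.2, -0.8
target_energy = sum(v * v for v in target)

_, dense_energy = best_two_pulse_excitation(target, DENSE)
_, sparse_energy = best_two_pulse_excitation(target, SPARSE)
```

With this target, the dense grid can place pulses on both main peaks (samples 13 and 20), whereas the sparse grid contains no odd positions and must settle for sample 14 on the second track, so the shared gain and hence the coded energy drop sharply.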
The energy disparity between the synthesized excitations is clearly evident: when a codebook having fewer pulse positions is used, the lower-energy excitation results in a sound that is unsatisfactory and barely audible. In view of the foregoing, an improved method is needed which enables AbS speech coders to more accurately reproduce speech signals containing nonstationary speech at high quality.