Communication of speech information often involves transmitting electrical signals which represent speech over a channel or network ("channel"). A problem commonly encountered in speech communication is how to transmit speech through a channel of limited capacity or bandwidth. (In modern digital communications systems, bandwidth is often expressed in terms of bit-rate.) The problem of limited channel bandwidth is usually addressed by the application of a speech coding system, which compresses a speech signal to meet channel bandwidth requirements. Speech coding systems include an encoder, which converts speech signals into code words for transmission over a channel, and a decoder, which reconstructs speech from received code words.
As a general matter, a goal of most speech coding systems, concomitant with that of signal compression, is the faithful reproduction of original speech sounds such as voiced speech. Voiced speech is produced when a speaker's vocal cords are tensed and vibrating quasi-periodically. In the time domain, a voiced speech signal appears as a succession of similar but slowly evolving waveforms referred to as pitch-cycles. Each pitch-cycle has a duration referred to as a pitch-period. Like the pitch-cycle waveform itself, the pitch-period generally varies slowly from one pitch-cycle to the next.
Many speech coding systems which operate at bit-rates around 8 kilobits per second (kb/s) code original speech waveforms by exploiting knowledge of the speech generation process. Illustrative of these so-called waveform coders are the code-excited linear prediction (CELP) speech coding systems, which code a speech waveform by filtering it with a time-varying linear prediction (LP) filter to produce a residual speech signal. During voiced speech, the residual signal comprises a series of pitch-cycles, each of which includes a major transient referred to as a pitch-pulse and a series of lower-amplitude vibrations surrounding it. The residual signal is represented by the CELP system as a concatenation of scaled fixed-length vectors from a codebook. To code voiced speech efficiently, most implementations of CELP also include a long-term predictor (or adaptive codebook) to facilitate reconstruction of a communicated signal with appropriate periodicity. Despite improvements over time, however, many waveform coding systems suffer from perceptually significant distortion when operating at rates below 6 kb/s. This distortion is typically characterized as noise.
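By way of illustration only, the LP filtering step that produces the residual signal may be sketched as follows. The sketch assumes the autocorrelation method with a Levinson-Durbin recursion, which is one common way of obtaining LP coefficients and is not stated in the text above. The function names (autocorrelation, levinson_durbin, lp_residual) are hypothetical, and the codebook search and long-term predictor of a CELP system are not shown.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation values r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] from autocorrelation r.

    Assumed convention: x_hat[n] = sum over k of a[k] * x[n - k].
    """
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy for order 0
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (e.g., all zeros)
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(x, coeffs):
    """Return e[n] = x[n] - sum_k a[k] * x[n - k].

    Samples before the start of x are taken to be zero.
    """
    residual = []
    for n in range(len(x)):
        pred = sum(c * x[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(x[n] - pred)
    return residual
```

For a strongly resonant voiced-like input, the residual returned by lp_residual is expected to carry far less energy than the input and to be dominated by one transient per pitch-cycle, corresponding to the pitch-pulse described above.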
Low bit-rate coding systems which operate, for example, at rates of 2.4 kb/s are generally parametric in nature. That is, they operate by transmitting parameters describing pitch-period and the spectral envelope (or formants) of the speech signal at regular intervals. Illustrative of these so-called parametric coders is the LP vocoder system. LP vocoders model a voiced speech signal with a single pulse per pitch period. This basic technique may be augmented to include the transmission of information about the spectral envelope, among other things. Although LP vocoders provide reasonable performance generally, they also may introduce perceptually significant distortion, typically characterized as buzziness.
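The single-pulse-per-pitch-period model may be sketched, under simplifying assumptions, as a pulse-train excitation driving an all-pole synthesis filter. The function names (pulse_train, all_pole_synthesis), the fixed integer pitch period, and the gain parameter are hypothetical choices made for illustration. A practical vocoder would update the pitch period and filter coefficients at regular intervals.

```python
def pulse_train(num_samples, pitch_period):
    """Excitation with one unit pulse at the start of each pitch period."""
    if pitch_period < 1:
        raise ValueError("pitch_period must be a positive integer")
    return [1.0 if n % pitch_period == 0 else 0.0
            for n in range(num_samples)]


def all_pole_synthesis(excitation, coeffs, gain=1.0):
    """Return y[n] = gain * e[n] + sum_k a[k] * y[n - k].

    coeffs holds a[1..p]; output samples before the start are zero.
    """
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * y[n - k]
        y.append(acc)
    return y
```

Because every pitch period receives an identical pulse, the output of this sketch repeats almost exactly from one pitch-cycle to the next, with none of the gradual cycle-to-cycle evolution found in natural voiced speech.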
The types of distortion discussed above, and another--reverberation--common in sinusoidal coding systems, are generally the result of a reconstructed speech signal which lacks (in whole or in significant part) the pitch-cycle dynamics found in original voiced speech. Naturally, these types of distortion are more pronounced at lower bit-rates, as the ability of speech coding systems to code information about speech dynamics decreases. These problems have been addressed, and significant progress has recently been achieved in low-rate speech coding, with the introduction of algorithms based on waveform interpolation and associated signal modeling techniques. The general idea behind these techniques is to try to synthesize a coded signal that mimics the natural evolution of the original speech, while sending as little information as possible about the original signal. This idea is based on the observation that speech usually carries slowly varying attributes that may be sampled and interpolated at low rates. A significant amount of information in the signal can be discarded, as long as certain key features are faithfully regenerated.
The main techniques used in accomplishing this task are waveform interpolation (WI) and signal decomposition (SD). WI is used in the synthesis process (i.e., in the decoder) to maintain the degree of smoothness usually observed in speech signals, particularly in voiced regions. Maintaining smoothness increases the robustness to coding distortions. As an example, larger errors in pitch can be perceptually tolerated if the pitch varies smoothly rather than abruptly (unnaturally). The same is true for other types of distortions. SD enables the coding system to focus on the more important signal domains, discarding information carried in less important ones. WI coders are described, for example, in Y. Shoham, "High-quality speech coding at 2.4 to 4.0 kbps based on time-frequency interpolation," Proc. ICASSP '93, pp. II167-170; Y. Shoham, "High-quality speech coding at 2.4 kbps based on time-frequency interpolation," Proc. Eurospeech '93, pp. 741-744; W. B. Kleijn et al., "A speech coder based on decomposition of characteristic waveforms," Proc. ICASSP '95, pp. 508-511; and W. B. Kleijn et al., "A low-complexity waveform interpolation coder," Proc. ICASSP '96, pp. 212-215. WI coders are also described in the above-referenced commonly assigned U.S. patent application "Method and Apparatus for Prototype Waveform Speech Coding," Ser. No. 08/667,295, and in commonly owned U.S. Pat. No. 5,517,595, entitled "Decomposition in Noise and Periodic Signal Waveforms in Waveform Interpolation," issued to W. B. Kleijn on May 14, 1996, which patent is hereby incorporated by reference as if fully set forth herein.
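The role of interpolation in the decoder may be illustrated by the following highly simplified sketch, which is not the method of any of the references cited above. It assumes that two prototype pitch-cycle waveforms of equal, length-normalized size are available, blends them linearly over a number of cycles, and resamples each blended cycle to a linearly interpolated pitch period. All function names are hypothetical. Practical WI coders interpolate continuously in time rather than cycle by cycle.

```python
def interpolate_waveforms(proto_a, proto_b, num_cycles):
    """Return num_cycles waveforms evolving linearly from proto_a to proto_b."""
    if len(proto_a) != len(proto_b):
        raise ValueError("prototypes must have the same normalized length")
    if num_cycles < 2:
        raise ValueError("num_cycles must be at least 2")
    cycles = []
    for i in range(num_cycles):
        w = i / (num_cycles - 1)  # 0.0 at the first cycle, 1.0 at the last
        cycles.append([(1.0 - w) * a + w * b
                       for a, b in zip(proto_a, proto_b)])
    return cycles


def resample_cycle(cycle, new_length):
    """Linearly resample one cycle, treated as periodic, to new_length samples."""
    old = len(cycle)
    out = []
    for n in range(new_length):
        pos = n * old / new_length
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * cycle[i % old]
                   + frac * cycle[(i + 1) % old])
    return out


def synthesize(proto_a, proto_b, pitch_a, pitch_b, num_cycles):
    """Concatenate interpolated cycles with a smoothly varying pitch period."""
    cycles = interpolate_waveforms(proto_a, proto_b, num_cycles)
    signal = []
    for i, cycle in enumerate(cycles):
        w = i / (num_cycles - 1)
        period = round((1.0 - w) * pitch_a + w * pitch_b)
        signal.extend(resample_cycle(cycle, period))
    return signal
```

In this sketch both the waveform shape and the pitch period change by a small, equal step from each cycle to the next, which is the kind of smooth evolution that the text above associates with robustness to coding distortions.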
Although WI coders generally produce reasonably good quality reconstructed speech at low bit rates, the complexity of these prior art coders is often too high to be commercially viable for use, for example, in low-cost terminals. Therefore, it would be desirable if a WI coder were available having substantially less complexity than that of prior art WI coders, while maintaining an adequate level of performance (i.e., with respect to the quality of the reconstructed speech).