In today's highly verbal and highly interactive technical climate, it is often necessary or desirable to transmit human voice electronically from one point to another, sometimes over great distance, and often over channels of limited bandwidth. For example, conversations via cell phone links or via the Internet or other digital electronic networks are now commonplace. Likewise, it is often useful to digitally store human voice, such as on the hard drive of a computer, or in the volatile or nonvolatile memory of a digital recording device. For example, digitally stored human voice may be replayed as part of a telephone answering protocol or an audio presentation.
Channels and media usable for the transmission and/or storage of digital voice are often of limited capacity, and that capacity is increasingly strained. For example, the advent of quality video for use in conjunction with real time or recorded voice has created a demand for audio/video conferencing over digital networks in real time as well as for non-real time high quality audio/video presentations, such as those receivable in streaming format and those downloadable for storage in their entirety. As video content consumes bandwidth and storage capacity in various transmission channels and storage media, the need to efficiently and properly compress both voice and video becomes imperative. Other scenarios also create a need for extreme and effective compression of voice. For example, increasingly congested cell phone links must be able to accommodate a greater number of users, often over channels whose capacity has not kept pace with the number of users.
Whatever the motivation, the compression of voice has been and remains an important area of communication technology. Available digital voice coding techniques span a spectrum from inefficient techniques that employ no compression to efficient techniques that achieve compression ratios of four or greater. Generally, existing coders may be classified as either waveform coders or voice coders. Waveform coders actually attempt to describe the sound wave itself and typically do not achieve high rates of compression. Voice coders, or vocoders, take into account the source and peculiarities of human speech rather than simply attempting to map the resultant sound wave, and accordingly may achieve much higher compression rates, albeit at the expense of increased computational complexity. Waveform coders are generally more robust to peculiar human voices, non-speech sounds and high levels of background noise.
The most prevalent voice coders employ techniques based on linear predictive coding. The linear predictive coding technique assumes that for each portion of the speech signal there exists a digital filter that, when excited by a certain signal, will produce a signal much like the original speech signal portion. In particular, a coder implementing a linear predictive technique will typically first derive a set of coefficients that describe the spectral envelope, or formants, of the speech signal. A filter corresponding to these coefficients is established and used to reduce the input speech signal to a predictive residual. In general terms, the above-described filter is an inverse synthesis filter, such that inputting the residual signal into a corresponding synthesis filter will produce a signal that closely approximates the original speech signal.
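The analysis and synthesis steps described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, a predictor order of 10, and a synthetic 240-sample frame; none of these choices is taken from the text, and all function names are hypothetical.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Return r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Derive predictor coefficients from autocorrelation values.

    Returns a list a with a[0] == 1.0, so that the inverse synthesis
    (analysis) filter is A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy; assumed nonzero (non-silent frame)
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def analysis_filter(x, a):
    """Apply A(z): speech in, predictive residual out (zero initial state)."""
    p = len(a) - 1
    return [sum(a[j] * x[n - j] for j in range(min(p, n) + 1))
            for n in range(len(x))]

def synthesis_filter(e, a):
    """Apply 1/A(z): residual in, reconstructed speech out (zero initial state)."""
    p = len(a) - 1
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[j] * y[n - j] for j in range(1, min(p, n) + 1)))
    return y

# Hypothetical stand-in for one speech frame: two tones plus a little noise.
rng = random.Random(0)
frame = [math.sin(2 * math.pi * 0.05 * n)
         + 0.5 * math.sin(2 * math.pi * 0.12 * n)
         + 0.05 * rng.gauss(0.0, 1.0)
         for n in range(240)]

coeffs = levinson_durbin(autocorrelation(frame, 10), 10)
residual = analysis_filter(frame, coeffs)     # what a coder would go on to compress
rebuilt = synthesis_filter(residual, coeffs)  # matches frame up to rounding error
```

With an unquantized residual the synthesis filter recovers the frame essentially exactly. In a real coder the approximation arises because the residual and the coefficients are both quantized before transmission or storage.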
Typically, the filter coefficients and the residual are transmitted or stored for later and/or distant re-synthesis of the speech signal. While the filter coefficients require little space for storage or little bandwidth, e.g. 1.5 kbps, for transmittal, the predictive residual is a high-bandwidth signal similar to the original speech signal in complexity. Thus, in order to effectively compress the speech signal, the predictive residual must be compressed. The technique of Codebook Excited Linear Prediction (CELP) is used to achieve this compression. CELP utilizes one or more codebook indexes, each of which is usable to select a particular vector from one of a set of “codebooks”. Each codebook is a collection of vectors. The selected vectors are chosen such that, when scaled and summed, they produce a response from the synthesis filter that best approximates the response of the filter to the residual itself. The CELP decoder has access to the same codebooks as the CELP encoder, and thus the simple indexes are usable to identify the same vectors in the encoder and decoder codebooks.
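The codebook search described above can be sketched as follows. This is a simplified illustration, not a full CELP coder. It assumes a single codebook of seeded pseudo-random vectors, a one-pole synthesis filter, and an exhaustive search with a closed-form optimal gain. The text describes one or more codebooks whose scaled vectors are summed, which this sketch does not attempt. All names and parameter values are hypothetical.

```python
import random

def synthesize(excitation, a):
    """Run an excitation through the synthesis filter 1/A(z), zero initial state."""
    p = len(a) - 1
    y = []
    for n in range(len(excitation)):
        y.append(excitation[n]
                 - sum(a[j] * y[n - j] for j in range(1, min(p, n) + 1)))
    return y

def make_codebook(size, dim, seed=1234):
    """Build a codebook; encoder and decoder share it by using the same seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]

def search_codebook(residual, codebook, a):
    """Return (index, gain, error) for the best single scaled codebook vector.

    The target is the synthesis filter's response to the residual itself;
    each candidate vector is filtered and compared against that target.
    """
    target = synthesize(residual, a)
    target_energy = sum(t * t for t in target)
    best = None
    for index, vector in enumerate(codebook):
        response = synthesize(vector, a)
        energy = sum(v * v for v in response)
        if energy == 0.0:
            continue  # an all-zero vector cannot match anything
        cross = sum(t * v for t, v in zip(target, response))
        gain = cross / energy                        # least-squares optimal scale
        error = target_energy - cross * cross / energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

def decode(index, gain, codebook, a):
    """Decoder side: look up the vector by index, scale it, and synthesize."""
    return synthesize([gain * v for v in codebook[index]], a)

# Assumed values for illustration only.
filter_coeffs = [1.0, -0.9]
shared_codebook = make_codebook(size=64, dim=40)

# A residual that happens to be a scaled codebook entry is found exactly.
toy_residual = [0.5 * v for v in shared_codebook[37]]
found_index, found_gain, found_error = search_codebook(
    toy_residual, shared_codebook, filter_coeffs)
decoded = decode(found_index, found_gain, shared_codebook, filter_coeffs)
```

Only `found_index` and a quantized form of `found_gain` would need to be sent, which is why the indexes are so much cheaper than the residual they stand in for.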
When the available capacity or bandwidth is ample, it is not difficult to have codebooks that are rich enough to allow for a close approximation to the original residual, no matter how complex. As the available capacity or bandwidth decreases, however, the richness of the CELP codebooks necessarily decreases.
One way to decrease the number of bits needed to mimic the residual signal is to increase its periodicity. That is, redundancies in the original signal are more compactly representable than are non-redundant features. One technique that takes advantage of this principle is Relaxation Codebook Excited Linear Predictive coding (RCELP). An example of this technique is discussed in the article “The RCELP Speech Coding Algorithm,” Eur. Trans. on Communications, vol. 4, no. 5, pp. 573-82 (1994), authored by W. B. Kleijn et al., which is incorporated herein by reference in its entirety for all that it discloses. In particular, this article describes a method of uniformly advancing or delaying whole segments of a residual signal such that its modified pitch-period contour matches a synthetic pitch-period contour. One problem with this approach is that, as an artifact of the particular warping methodology, certain portions of the original signal may be omitted or repeated. In particular, if two adjacent segments of the signal experience a cumulative compressive shift, portions of the original signal near the overlap may be omitted in the modified signal. Likewise, if two adjacent segments experience a cumulative expansive shift, portions of the original signal near the overlap may be repeated in the modified signal. These artifacts produce an audible distortion in the final reproduced speech.
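The omission and repetition artifacts can be seen in a toy model. This is not the algorithm of the cited article. It assumes integer shifts, fixed-length segments, and a modified signal built by reading each whole segment from the original at its own offset. All names and values are hypothetical.

```python
from collections import Counter

def shifted_source_indices(num_samples, segment_length, shifts):
    """For each output sample, return the original-signal index it is read from.

    Every sample in segment k is read at the same offset shifts[k], which
    models shifting the whole segment uniformly.
    """
    return [n + shifts[n // segment_length] for n in range(num_samples)]

def omissions_and_repeats(indices):
    """List original samples never read (omitted) and read more than once (repeated)."""
    counts = Counter(indices)
    low, high = min(indices), max(indices)
    omitted = [i for i in range(low, high + 1) if counts[i] == 0]
    repeated = sorted(i for i, c in counts.items() if c > 1)
    return omitted, repeated

# Three 8-sample segments; only the middle one is shifted, by 2 samples.
source_map = shifted_source_indices(num_samples=24, segment_length=8,
                                    shifts=[0, 2, 0])
omitted, repeated = omissions_and_repeats(source_map)
# omitted  -> [8, 9]    (skipped at the boundary where the offset jumps up)
# repeated -> [16, 17]  (read twice at the boundary where the offset drops back)
```

In this model the artifacts appear only at segment boundaries where adjacent shifts differ, and the number of affected samples equals the difference between the two shifts.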
Other art has suggested a similar approach. See, for example, the article “Interpolation of the Pitch-Predictor Parameters in Analysis-by-Synthesis Speech Coders,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, part I (January, 1994), authored by W. B. Kleijn et al., which is incorporated herein by reference in its entirety for all that it discloses.
All pitch-warping approaches suggested in the past have suffered similar shortcomings, including a reduction in quality due to the shifting of segment edges, which causes omissions and repeats of the original signal. It is desirable to provide a frame warping method that reduces the transmission bit rate for a speech signal without introducing signal repeats and omissions, and without increasing the complexity or delay of the coding calculations to the point where real-time communications are not possible.