Audio coding refers to the application of data compression to audio signals such as music and speech signals. In audio coding, a “coder” encodes an input audio signal into a digital bit stream for transmission or storage, and a “decoder” decodes the bit stream into an output audio signal. The combination of the coder and the decoder is called a “codec.” The goal of audio coding is usually to reduce the encoding bit rate while maintaining a certain degree of perceptual audio quality. For this reason, audio coding is sometimes referred to as “audio compression.” When audio coding is applied specifically to speech signals, it is often referred to as speech coding.
One type of speech coding known in the art is termed Continuously Variable Slope Delta Modulation (CVSD). CVSD is a delta modulation technique with a variable step size that was first proposed by J. A. Greefkes and K. Riemens in “Code Modulation with Digitally Controlled Companding for Speech Transmission,” Philips Tech. Rev., pp. 335-353 (1970), the entirety of which is incorporated by reference herein. CVSD is a sample-by-sample source coding method that encodes at 1 bit per sample. Thus, in accordance with CVSD, audio that is sampled at 64 kilohertz (kHz) is encoded at 64 kilobits/second (kbit/s).
In CVSD, the encoder maintains a reference sample and a step size. Each input sample is compared to the reference sample. If the input sample is equal to or larger than the reference sample, the encoder emits a “0” bit and adds the step size to the reference sample. If the input sample is smaller, the encoder emits a “1” bit and subtracts the step size from the reference sample. The CVSD encoder also keeps the previous K bits of output (K=3 or K=4 are very common) to determine adjustments to the step size; if J of the previous K bits are all “1”s or all “0”s (J=3 or J=4 are also common), the step size is increased by a fixed amount. Otherwise, the step size remains the same (although it may be multiplied by a decay factor which is slightly less than 1). The step size is adjusted for every input sample processed.
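By way of illustration only, the encoding rules described above may be sketched in Python as follows. The function name `cvsd_encode`, the initial reference value of zero, the step-size limits, the increment, the decay factor, and the reading of the J-of-K rule as "the J most recent bits are identical" are all assumptions made for this sketch and are not taken from any particular CVSD standard.

```python
from collections import deque


def cvsd_encode(samples, k=4, j=4, step_min=10.0, step_max=1280.0,
                step_inc=10.0, decay=0.999):
    """Encode a sequence of samples into CVSD bits, one bit per sample.

    All numeric defaults are illustrative assumptions.
    """
    reference = 0.0             # assumed initial reference sample
    step = step_min             # assumed initial step size
    history = deque(maxlen=k)   # previous K output bits
    bits = []
    for x in samples:
        # "0" when the input is equal to or larger than the reference.
        bit = 0 if x >= reference else 1
        bits.append(bit)
        # Move the reference toward the input by one step.
        reference += step if bit == 0 else -step
        # Adapt the step size for the next sample.
        history.append(bit)
        recent = list(history)[-j:]
        if len(recent) == j and len(set(recent)) == 1:
            step = min(step + step_inc, step_max)   # run of identical bits
        else:
            step = max(step * decay, step_min)      # slow decay, floored
    return bits
```

With these assumed defaults, a constant input yields alternating bits, while an input that rises faster than the reference can follow yields a run of “0” bits that causes the step size to grow.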
A CVSD decoder starts with the reference sample, and adds or subtracts the step size according to the bit stream. The sequence of adjusted reference samples constitutes the reconstructed audio waveform, and the step size is increased or maintained in accordance with the same all-1s-or-0s logic as in the CVSD encoder.
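For illustration, a decoder that mirrors this description may be sketched in Python as follows. The function name `cvsd_decode` and all numeric defaults are assumptions for this sketch; the only requirement reflected here is that the decoder applies the same step-size logic as the encoder so that both track the same reference sample.

```python
from collections import deque


def cvsd_decode(bits, k=4, j=4, step_min=10.0, step_max=1280.0,
                step_inc=10.0, decay=0.999):
    """Reconstruct a waveform from CVSD bits.

    All numeric defaults are illustrative assumptions and must match
    the values used by the encoder.
    """
    reference = 0.0             # assumed initial reference sample
    step = step_min             # assumed initial step size
    history = deque(maxlen=k)   # previous K received bits
    output = []
    for bit in bits:
        # A "0" bit raises the reference; a "1" bit lowers it.
        reference += step if bit == 0 else -step
        output.append(reference)
        # Same all-1s-or-all-0s adaptation logic as in the encoder.
        history.append(bit)
        recent = list(history)[-j:]
        if len(recent) == j and len(set(recent)) == 1:
            step = min(step + step_inc, step_max)
        else:
            step = max(step * decay, step_min)
    return output
```

Under these assumptions, `cvsd_decode([0, 0, 0, 0, 0])` steps upward by 10 for the first four samples and by 20 for the fifth, because the fourth consecutive “0” bit triggers a step-size increase.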
In CVSD, the adaptation of the step size helps to minimize the occurrence of coding noise in the form of slope overload and granular noise. Slope overload occurs when the slope of the audio signal is so steep that the encoder cannot keep up. Adaptation of the step size in CVSD helps to minimize or prevent this effect by enlarging the step size sufficiently. Granular noise occurs when the audio signal is not in the slope overload condition. A CVSD system has no symbols to represent steady state, so a constant input is represented by alternating ones and zeros, and the reconstructed signal oscillates about the input by as much as one step size. Accordingly, the effect of granular noise is minimized when the step size is sufficiently small.
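The trade-off between the two noise types can be illustrated with a simplified, fixed-step delta modulator, sketched in Python below. This is deliberately not CVSD: the step size never adapts, which makes the two failure modes easy to see. The function name, the test signals, and the step values are assumptions chosen for this illustration.

```python
def delta_modulate(samples, step):
    """Fixed-step delta modulation; returns the tracked reference values."""
    reference = 0.0
    track = []
    for x in samples:
        reference += step if x >= reference else -step
        track.append(reference)
    return track


def peak_error(samples, step):
    """Largest absolute difference between the input and its reconstruction."""
    track = delta_modulate(samples, step)
    return max(abs(x - r) for x, r in zip(samples, track))


ramp = [50.0 * n for n in range(20)]   # steep slope: 50 units per sample
flat = [0.0] * 20                      # steady-state input

# A small step cannot follow the ramp (slope overload) ...
overload_error = peak_error(ramp, step=5.0)
# ... but keeps the oscillation about a constant input small.
granular_small = peak_error(flat, step=5.0)
# A large step follows the ramp but oscillates widely on a constant input.
granular_large = peak_error(flat, step=50.0)

print(overload_error, granular_small, granular_large)  # 850.0 5.0 50.0
```

No single fixed step size serves both signals well, which is the motivation for adapting the step size from sample to sample.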
CVSD has been referred to as a compromise between simplicity, low bit rate, and quality. Different forms of CVSD are currently used in a variety of applications. For example, a 12 kbit/s version of CVSD is used in the SECURENET® line of digitally encrypted two-way radio products produced by Motorola, Inc. of Schaumburg, Ill. A 16 kbit/s version of CVSD is used by military digital telephones (referred to as Digital Non-Secure Voice Terminals (DNVT) and Digital Secure Voice Terminals (DSVT)) for use in deployed areas to provide voice-recognition-quality audio. The BLUETOOTH® specification for wireless personal area networks (PANs) specifies a 64 kbit/s version of CVSD that may be used to encode voice signals in telephony-related BLUETOOTH® service profiles, e.g., between mobile phones and wireless headsets.
The 64 kbit/s version of CVSD defined by the BLUETOOTH® specification is used to encode an 8 kHz input speech signal. Since CVSD encodes at 1 bit per sample, the 8 kHz input speech signal must be up-sampled to 64 kHz prior to encoding. Furthermore, the 64 kHz decoded speech signal produced by the CVSD decoder must be down-sampled to produce an 8 kHz output speech signal. Thus, a conventional implementation of CVSD for BLUETOOTH® typically includes an up-sampling stage that precedes the encoder and a down-sampling stage that follows the decoder. The BLUETOOTH® specification does not specify how such sampling rate conversion (SRC) stages should be implemented. However, the BLUETOOTH® specification does require that the attenuation of the stopband (approximately 4-32 kHz) be greater than 20 dB relative to the passband (approximately 0-4 kHz). It has been observed in practice that the requirement of greater than 20 dB stopband attenuation is too loose, and a CVSD implementation for BLUETOOTH® having 20-30 dB stopband attenuation may still produce fairly audible distortion.
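Because the manner of sampling rate conversion is left to the implementer, the following Python sketch shows only one hypothetical arrangement: zero-stuffing by a factor of 8 followed by a low-pass FIR filter for up-sampling, and the same filter followed by decimation for down-sampling. The Hamming-windowed sinc design, the 129-tap length, and all function names are assumptions made for illustration; no particular stopband attenuation is claimed for this filter.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass FIR filter.

    cutoff is expressed as a fraction of the sampling rate (0 to 0.5);
    num_taps is assumed to be odd.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]   # normalize to unity gain at DC


def fir(signal, taps):
    """Direct-form FIR filtering with zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def upsample(signal, factor, taps):
    """Insert factor-1 zeros after each sample, then low-pass filter."""
    stuffed = []
    for x in signal:
        stuffed.append(x * factor)   # gain compensates for the inserted zeros
        stuffed.extend([0.0] * (factor - 1))
    return fir(stuffed, taps)


def downsample(signal, factor, taps):
    """Low-pass filter, then keep every factor-th sample."""
    return fir(signal, taps)[::factor]


FACTOR = 8                               # 8 kHz <-> 64 kHz
TAPS = lowpass_taps(129, 4000 / 64000)   # assumed 4 kHz cutoff at 64 kHz
```

In such an arrangement, `upsample(speech_8k, FACTOR, TAPS)` would feed the CVSD encoder and `downsample(decoded_64k, FACTOR, TAPS)` would follow the CVSD decoder, where `speech_8k` and `decoded_64k` are placeholder names for the respective sample sequences. A practical implementation would more likely use a polyphase structure to avoid multiplying by the inserted zeros.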
As compared to other sample-by-sample codecs, CVSD is more robust to random bit errors. However, as compared to other toll-quality codecs, the overall quality of the speech signal produced by CVSD leaves something to be desired. Thus, there exists a desire to improve CVSD speech quality.
One approach to improving CVSD speech quality for a BLUETOOTH® implementation involves optimizing certain filters applied in the previously-discussed up-sampling and down-sampling stages in order to achieve increased stopband attenuation. While such an approach can produce an improvement in speech quality, such an approach alone may not be sufficient to achieve the same speech quality as that achieved by other toll-quality codecs.
Another approach to improving CVSD speech quality could entail modifying the CVSD encoding rules. However, such a modification would affect bit-stream compatibility with the BLUETOOTH® CVSD standard, and codecs that implemented such an approach would not be interoperable with the large installed base of existing CVSD codecs.
Yet another approach to improving CVSD speech quality could involve introducing an adaptive post-filter after the CVSD decoder to reduce the perceived level of granular coding noise. However, such an adaptive post-filter would distort the speech itself. Thus, if distortion of the speech signal is to be avoided, this solution is not an attractive one.
What is needed then is a system and method for improving the speech quality of a CVSD codec. The desired system and method should not entail modifying the CVSD encoding rules or require the use of an adaptive post-filter that may distort the speech signal. It would be beneficial if the desired system and method were also applicable to other delta modulation codecs as well as to any sample-by-sample audio codec.
The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.