The present invention relates to audio signal processing and is directed more particularly to a system and method for scalable and embedded coding and transmission of speech and audio signals.
In conventional telephone services, speech is sampled at 8,000 samples per second (8 kHz), and each speech sample is represented by 8 bits using the ITU-T G.711 Pulse Code Modulation (PCM), resulting in a transmission bit-rate of 64,000 bits/second, or 64 kb/s, for each voice conversation channel. The Plain Old Telephone Service (POTS) is built upon the so-called Public Switched Telephone Networks (PSTN), which are circuit-switched networks designed to route millions of such 64 kb/s speech signals. Since telephone speech is sampled at 8 kHz, theoretically such a 64 kb/s speech signal cannot carry any frequency component that is above 4 kHz. In practice, the speech signal is typically band-limited to the frequency range of 300 to 3,400 Hz by the ITU-T P.48 Intermediate Reference System (IRS) filter before its transmission through the PSTN. Such a limited bandwidth of 300 to 3,400 Hz is the main reason why telephone speech sounds thin, unnatural, and less intelligible compared with the full-bandwidth speech as experienced in face-to-face conversation.
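The 64 kb/s figure and the 4 kHz ceiling both follow from simple arithmetic. The following is a minimal sketch; the function names are illustrative and do not come from the text.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM, in bits per second."""
    return sample_rate_hz * bits_per_sample


def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a signal sampled at this rate can carry."""
    return sample_rate_hz / 2


print(pcm_bit_rate(8000, 8))   # 64000
print(nyquist_limit_hz(8000))  # 4000.0
```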
In the last several years, there has been tremendous interest in so-called "IP telephony", i.e., telephone calls transmitted through packet-switched data networks employing the Internet Protocol (IP). Currently, the common approach is to use a speech encoder to compress 8 kHz sampled speech to a low bit rate, package the compressed bit-stream into packets, and then transmit the packets over IP networks. At the receiving end, the compressed bit-stream is extracted from the received packets, and a speech decoder is used to decode the compressed bit-stream back to 8 kHz sampled speech. The term "codec" (coder and decoder) is commonly used to denote the combination of the encoder and the decoder. The current generation of IP telephony products typically uses existing speech codecs that were designed to compress 8 kHz telephone speech to very low bit rates. Examples of such codecs include the ITU-T G.723.1 at 6.3 kb/s, G.729 at 8 kb/s, and G.729A at 8 kb/s. All of these codecs have somewhat degraded speech quality when compared with the ITU-T 64 kb/s G.711 PCM and, of course, they all still have the same 300 to 3,400 Hz bandwidth limitation.
In many IP telephony applications, there is plenty of transmission capacity, so there is no need to compress the speech to a very low bit rate. Such applications include "toll bypass" using high-speed optical fiber IP network backbones, and "LAN phones" that connect to and communicate through Local Area Networks such as 100 Mb/s Fast Ethernet. In many such applications, the transmission bit rate of each channel can be as high as 64 kb/s. Further, it is often desirable to have a sampling rate higher than 8 kHz, so the output quality of the codec can be much higher than POTS quality, and ideally approaches CD quality, for both speech and non-speech signals, such as music. It is also desirable to have a codec complexity as low as possible in order to achieve high port density and low hardware cost per channel. Furthermore, it is desirable to have a coding delay as low as possible, so that users will not experience significant delay in two-way conversations. In addition, depending on the application, it is sometimes necessary to transmit the decoder output through the PSTN. Therefore, the decoder output should be easy to down-sample to 8 kHz for transcoding to 8 kHz G.711. Clearly, there is a need to address the requirements presented by these and other applications.
The present invention is designed to meet these and other practical requirements by using an adaptive transform coding approach. Most prior art audio codecs based on adaptive transform coding use a single large transform (1024 to 2048 data points) in each processing frame. In some cases, switching to smaller transform sizes is used, but typically only during transient regions of the signal. As known in the art, a large transform size leads to relatively high computational complexity and high coding delay which, as pointed out above, are undesirable in many applications. On the other hand, if a single small transform is used in each frame, the complexity and coding delay go down, but the coding efficiency also goes down, partially because the transmission of side information (such as quantizer step sizes and adaptive bit allocation) takes a significantly higher percentage of the total bit rate.
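To give a rough sense of why transform size drives coding delay, the sketch below computes the time span covered by one transform block. The 32 kHz sampling rate and the 256-point "small" size are assumptions chosen for illustration only; the text does not state a sampling rate for the prior art codecs, and the span is only a lower bound on the buffering delay.

```python
def transform_span_ms(transform_size, sample_rate_hz):
    """Duration, in milliseconds, of the samples covered by one transform."""
    return 1000 * transform_size / sample_rate_hz


# 2048 and 1024 are the prior-art sizes quoted in the text;
# 256 is a hypothetical small transform used for comparison.
for size in (2048, 1024, 256):
    print(size, transform_span_ms(size, 32000))
```

Under these assumptions a 2048-point block spans 64 ms of signal, while a 256-point block spans 8 ms.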
By contrast, the present invention uses multiple small-size transforms in each frame to achieve low complexity, low coding delay, and a good compromise in efficiently coding the side information. Many low-complexity techniques are used in accordance with the present invention to ensure that the overall codec complexity is as low as possible. In a preferred embodiment, the transform used is the Modified Discrete Cosine Transform (MDCT), as proposed by Princen et al., Proceedings of 1987 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2161-2164, the content of which is incorporated by reference.
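For readers unfamiliar with the MDCT, the following is a minimal direct-form sketch of the textbook transform with 50% overlapped sine windows. It is not the low-complexity implementation the invention refers to: the sine window, the block size N = 8, the 2/N scaling of the inverse, and all function names are assumptions made for illustration. It demonstrates the property that makes small overlapping transforms usable, namely that overlap-adding consecutive inverse transforms cancels the aliasing and recovers the input.

```python
import math


def sine_window(N):
    """Length-2N sine window; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(block):
    """Direct O(N^2) MDCT: 2N input samples -> N coefficients."""
    N = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs):
    """Direct inverse MDCT: N coefficients -> 2N aliased samples."""
    N = len(coeffs)
    return [(2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]


def analyze_synthesize(signal, N):
    """Windowed MDCT followed by windowed IMDCT and overlap-add.

    len(signal) must be a multiple of N. N zeros are padded on each side
    so that every input sample is covered by two overlapping blocks.
    """
    w = sine_window(N)
    padded = [0.0] * N + list(signal) + [0.0] * N
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        block = [padded[start + n] * w[n] for n in range(2 * N)]
        y = imdct(mdct(block))
        for n in range(2 * N):
            out[start + n] += y[n] * w[n]
    return out[N:N + len(signal)]


N = 8
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(4 * N)]
rebuilt = analyze_synthesize(signal, N)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print("max reconstruction error:", max_error)
```

Without quantization the reconstruction error should be at the level of floating-point rounding. A practical codec would replace the direct sums with a fast algorithm.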
In IP-based voice or audio communications, it is often desirable to support multiple sampling rates and multiple bit rates when different end points have different requirements on sampling rates and bit rates. A conventional (although not so elegant) solution is to use several different codecs, each capable of operating at only a fixed bit-rate and a fixed sampling rate. A serious disadvantage of this approach is that several completely different codecs have to be implemented on the same platform, thus increasing the total storage requirement for storing the programs for all codecs. Furthermore, if the application requires multiple output bit-streams at multiple bit-rates, the system needs to run several different speech codecs in parallel, thus increasing the overall computational complexity.
A solution to this problem in accordance with the present invention is to use scalable and embedded coding. The concept of scalable and embedded coding itself is known in the art. For example, the ITU-T has a G.727 standard, which specifies a scalable and embedded ADPCM codec at 16, 24 and 32 kb/s. Also available is the Philips proposal of a scalable and embedded CELP (Code Excited Linear Prediction) codec architecture for 14 to 24 kb/s [1997 IEEE Speech Coding Workshop]. However, both the ITU-T standard and the Philips proposal deal with a single fixed sampling rate of 8 kHz. In practical applications this can be a serious limitation.
In particular, due to the large variety of terminal devices and communication links used for IP-based voice communications, it is generally desirable, and sometimes even necessary, to link communication devices with widely different operating characteristics. For example, it may be necessary to provide high-quality, high-bandwidth speech (at sampling rates higher than 8 kHz and bandwidths wider than the typical 3.4 kHz telephone bandwidth) for devices connected to a LAN, and at the same time provide telephone-bandwidth speech over PSTN to remote locations. Such needs may arise, for example, in tele-conferencing applications. Addressing such needs, the present invention is able to handle several sampling rates rather than a single fixed sampling rate. In terms of scalability in sampling rate and bit rate, the present invention is similar to co-pending application Ser. No. 60/059,610 filed Sep. 23, 1997, the content of which is incorporated by reference. However, the actual implementation methods are very different.
It should be noted that although the present invention is described primarily with reference to a scalable and embedded codec for IP-based voice or audio communications, it is by no means limited to such applications, as will be appreciated by those skilled in the art.
In a preferred embodiment, the system of the present invention is an adaptive transform codec based on the MDCT transform. The codec is characterized by low complexity and low coding delay and as such is particularly suitable for IP-based communications. Specifically, in accordance with a basic-configuration embodiment, the encoder of the present invention takes a digitized input speech or general audio signal and divides it into (preferably short-duration) signal frames. For each signal frame, two or more transform computations are performed on overlapping analysis windows. The resulting output is stored in a multi-dimensional coefficient array. Next, the coefficients thus obtained are quantized using a novel processing method, which is based on calculations of the log-gains for different frequency bands. A number of techniques are disclosed to make the quantization as efficient as possible for a low encoder complexity. In particular, a novel adaptive bit-allocation approach is proposed, which is characterized by very low complexity. The stream of quantized transform coefficients and log-gain parameters is finally converted to a bit-stream. In a specific embodiment, a 32 kHz input signal and a 64 kb/s output bit-stream are used.
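The log-gain step can be pictured with a small sketch. This passage does not define the gain computation, so the base-2 logarithm of the mean coefficient power, the pooling over all transforms in a frame, the band edges, and the function name below are all assumptions made for illustration.

```python
import math


def band_log_gains(coeffs, band_edges, floor=1e-10):
    """Per-band log-gains for one frame (hypothetical definition).

    coeffs[t][k] is coefficient k of transform t within the frame.
    Band b covers indices band_edges[b] .. band_edges[b + 1] - 1.
    Returns log2 of the mean power in each band, pooled over all
    transforms in the frame; `floor` avoids log2(0) on silence.
    """
    gains = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        power = sum(row[k] ** 2 for row in coeffs for k in range(lo, hi))
        mean_power = power / (len(coeffs) * (hi - lo))
        gains.append(math.log2(max(mean_power, floor)))
    return gains


# Two transforms per frame, eight coefficients each, three bands.
frame = [[2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5],
         [2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]]
print(band_log_gains(frame, [0, 2, 4, 8]))  # [2.0, 0.0, -2.0]
```

A bit allocator could then assign more bits to bands with larger log-gains, although the allocation rule itself is not described in this passage.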
The decoder implemented in accordance with the present invention is capable of decoding this bit-stream directly, without the conventional downsampling, into one or more output signals having sampling rate(s) of 32 kHz, 16 kHz, or 8 kHz in this illustrative embodiment. The lower bit-rate output is decoded in a simple and elegant manner, which has low complexity. Further, the decoder features a novel adaptive frame loss concealment processor that reduces the effect of missing or delayed packets on the quality of the output signal.
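One plausible way to decode a lower sampling rate directly from transform coefficients, assumed here rather than stated in this passage, is to keep only the coefficients that lie below the new Nyquist frequency and feed them to a proportionally smaller inverse transform. The sketch below covers only the counting step; the 64-coefficient transform size and the function name are hypothetical.

```python
def coefficients_for_rate(num_coeffs, full_rate_hz, target_rate_hz):
    """Number of low-frequency coefficients below the target Nyquist limit.

    Assumes num_coeffs coefficients uniformly span 0 .. full_rate_hz / 2.
    """
    if not 0 < target_rate_hz <= full_rate_hz:
        raise ValueError("target rate must be positive and not exceed the full rate")
    return num_coeffs * target_rate_hz // full_rate_hz


# Hypothetical 64-coefficient transform at the 32 kHz rate of the embodiment.
for rate in (32000, 16000, 8000):
    print(rate, coefficients_for_rate(64, 32000, rate))
```

Under this assumption the 16 kHz output uses the lower half of the coefficients and the 8 kHz output the lowest quarter, so no separate downsampling filter is needed.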
Importantly, in accordance with the present invention, the proposed system and method can be extended to implementations featuring embedded coding over a set of sampling rates. Embedded coding in the present invention is based on the concept of starting from a simplified model of the signal with a small number of parameters, and then improving the model at each successive bit-rate stage by adding new signal parameters (i.e., different transform coefficients) and/or increasing the accuracy of their representation, so that the reconstructed signal reaches progressively higher fidelity.
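The embedded principle, in which any prefix of the bit-stream is decodable and each additional bit refines the result, can be illustrated with a generic successive-refinement quantizer for a single value. This is a textbook illustration under assumed names, not the coding scheme of the invention.

```python
def encode_embedded(x, num_bits):
    """Binary expansion of x in [0, 1), most significant bit first."""
    bits = []
    for _ in range(num_bits):
        x *= 2
        bit = int(x >= 1.0)
        bits.append(bit)
        x -= bit
    return bits


def decode_prefix(bits):
    """Reconstruct from any prefix of the stream.

    The received bits pin the value to an interval of width 2^-len(bits);
    the midpoint of that interval is returned.
    """
    value = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    return value + 2.0 ** -(len(bits) + 1)


x = 0.8125
stream = encode_embedded(x, 8)
for kept in (2, 4, 8):
    print(kept, abs(decode_prefix(stream[:kept]) - x))
```

A decoder that receives only the first few bits still produces a usable estimate, and the error bound halves with every extra bit. Truncating the stream therefore lowers the bit rate without re-encoding.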
More specifically, a system for processing audio signals is disclosed, comprising: (a) a frame extractor for dividing an input audio signal into a plurality of signal frames corresponding to successive time intervals; (b) a transform processor for performing transform computation of a signal in at least one signal frame, said transform processor generating a transform signal having one or more bands; (c) a quantizer providing an output bit stream corresponding to quantized values of the transform signal in said one or more bands; and (d) a decoder capable of reconstructing from the output bit stream at least two replicas of the input signal, each replica having a different sampling rate. In another embodiment, the system of the present invention further comprises an adaptive bit allocator for determining an optimum bit-allocation for encoding at least one of said one or more bands of the transform signal.