Many types of systems use audio signal processing to create audio signals or to reproduce sound from such signals. Typically, signal processing converts audio signals to digital data and encodes that data for transmission over a network. Then, additional signal processing decodes the transmitted data and converts it back to analog signals for reproduction as acoustic waves.
Various techniques exist for encoding or decoding audio signals. (A processor or a processing module that encodes and decodes a signal is generally referred to as a codec.) Audio codecs are used in conferencing to reduce the amount of data that must be transmitted from a near-end to a far-end to represent the audio. For example, audio codecs for audio and video conferencing compress high-fidelity audio input so that the resulting signal for transmission retains as much quality as possible while requiring as few bits as possible. In this way, conferencing equipment having the audio codec needs less storage capacity, and the communication channel used by the equipment to transmit the audio signal requires less bandwidth.
Audio codecs can use various techniques to encode and decode audio for transmission from one endpoint to another in a conference. Some commonly used audio codecs use transform coding techniques to encode and decode audio data transmitted over a network. One type of audio codec is Polycom's Siren codec. One version of Polycom's Siren codec is the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation G.722.1 (Polycom Siren 7). Siren 7 is a wideband codec that codes the signal up to 7 kHz. Another version is ITU-T G.722.1.C (Polycom Siren 14). Siren 14 is a super wideband codec that codes the signal up to 14 kHz.
The Siren codecs are Modulated Lapped Transform (MLT)-based audio codecs. As such, the Siren codecs transform an audio signal from the time domain into an MLT domain. As is known, the MLT is a form of cosine-modulated filter bank used for transform coding of various types of signals. In general, a lapped transform takes an audio block of length L and transforms that block into M coefficients, with the condition that L>M. For this to work, consecutive blocks must overlap by L-M samples so that a synthesized signal can be obtained using consecutive blocks of transformed coefficients.
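The block-overlap condition can be illustrated with a minimal sketch, assuming the common case L = 2M and a sine-windowed, MDCT-style lapped transform. This is an illustration of the general principle only, not the transform as specified for Siren; the function names and the block size are hypothetical.

```python
import math

def sine_window(M):
    # Sine window of length L = 2M; it satisfies w[n]^2 + w[n+M]^2 = 1.
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

def _basis(n, k, M):
    # Cosine modulation shared by the forward and inverse transforms.
    return math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))

def forward_lapped(block, M):
    # L = 2M windowed time samples in, M coefficients out (so L > M).
    w = sine_window(M)
    return [sum(w[n] * block[n] * _basis(n, k, M) for n in range(2 * M))
            for k in range(M)]

def inverse_lapped(coeffs, M):
    # M coefficients in, L = 2M windowed (time-aliased) samples out.
    w = sine_window(M)
    return [(2.0 / M) * w[n] * sum(coeffs[k] * _basis(n, k, M) for k in range(M))
            for n in range(2 * M)]

def analyze_synthesize(signal, M):
    # Hypothetical helper: len(signal) is assumed to be a multiple of M.
    # Blocks advance by M samples, so consecutive blocks overlap by L - M = M.
    padded = [0.0] * M + list(signal) + [0.0] * M
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * M + 1, M):
        coeffs = forward_lapped(padded[start:start + 2 * M], M)
        for n, value in enumerate(inverse_lapped(coeffs, M)):
            out[start + n] += value  # overlap-add cancels the aliasing
    return out[M:M + len(signal)]

if __name__ == "__main__":
    M = 8
    signal = [math.sin(0.3 * i) + 0.1 * i for i in range(4 * M)]
    rebuilt = analyze_synthesize(signal, M)
    print(max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

A single block of M coefficients cannot reproduce its 2M input samples on its own. Only the sum of the overlapping outputs of consecutive blocks recovers the signal, which is why the overlap of L-M samples is required.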
FIGS. 1A-1B briefly show features of a transform coding codec, such as a Siren codec. Actual details of a particular audio codec depend on the implementation and the type of codec used. For example, known details for Siren 14 can be found in ITU-T Recommendation G.722.1 Annex C, and known details for Siren 7 can be found in ITU-T Recommendation G.722.1, which are incorporated herein by reference. Additional details related to transform coding of audio signals can also be found in U.S. patent application Ser. Nos. 11/550,629 and 11/550,682, which are incorporated herein by reference.
An encoder 10 for the transform coding codec (e.g., Siren codec) is illustrated in FIG. 1A. The encoder 10 receives a digital signal 12 that has been converted from an analog audio signal. The amplitude of the analog audio signal has been sampled at a certain frequency, and each sample has been converted to a number that represents the amplitude. Typical sampling frequencies are approximately 8 kHz (i.e., sampling 8,000 times per second), 16 kHz to 196 kHz, or something in between. In one example, this digital signal 12 may have been sampled at 48 kHz or another rate in about 20-ms blocks or frames.
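As a quick check of the framing arithmetic, the number of samples in one frame is the sampling rate multiplied by the frame duration. The helper below is a hypothetical illustration, not part of any codec API.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    # Integer arithmetic: samples per second times frame length in seconds.
    return sample_rate_hz * frame_ms // 1000

if __name__ == "__main__":
    print(samples_per_frame(48000, 20))  # prints 960
    print(samples_per_frame(16000, 20))  # prints 320
```

At 48 kHz, a 20-ms frame therefore holds 960 samples.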
A transform 20, which can be a Discrete Cosine Transform (DCT), converts the digital signal 12 from the time domain into a frequency domain having transform coefficients. For example, the transform 20 can produce a spectrum of 960 transform coefficients for each audio block or frame. The encoder 10 finds average energy levels (norms) for the coefficients in a normalization process 22. Then, the encoder 10 quantizes the coefficients with a Fast Lattice Vector Quantization (FLVQ) algorithm 24 or the like to encode an output signal 14 for packetization and transmission.
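The normalization and quantization steps can be sketched as follows. This is a simplified illustration under stated assumptions: the bands have a fixed, hypothetical width, the norm is taken as the root-mean-square level of each band, and a uniform scalar quantizer stands in for the FLVQ algorithm 24, which is not reproduced here.

```python
import math

def band_norms(coeffs, band_size):
    # One norm (RMS level) per band of `band_size` coefficients.
    norms = []
    for start in range(0, len(coeffs), band_size):
        band = coeffs[start:start + band_size]
        norms.append(math.sqrt(sum(c * c for c in band) / len(band)))
    return norms

def normalize_and_quantize(coeffs, band_size, step):
    # Scale each coefficient by its band norm, then round to an integer index.
    # The rounding is a stand-in for lattice vector quantization (assumption).
    norms = band_norms(coeffs, band_size)
    indices = []
    for i, c in enumerate(coeffs):
        norm = norms[i // band_size]
        scaled = c / norm if norm > 0 else 0.0
        indices.append(round(scaled / step))
    return norms, indices

def dequantize(norms, indices, band_size, step):
    # Decoder side: rebuild coefficients from indices and restore band energy.
    return [idx * step * norms[i // band_size] for i, idx in enumerate(indices)]

if __name__ == "__main__":
    coeffs = [3.0, -4.0, 0.0, 0.0, 0.5, 0.5, -0.5, 0.5]
    norms, indices = normalize_and_quantize(coeffs, band_size=4, step=0.25)
    print(norms, indices)
    print(dequantize(norms, indices, band_size=4, step=0.25))
```

Sending the norms separately lets the quantizer work on coefficients of comparable scale in every band, and the decoder uses the same norms to restore the energy level of each band.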
A decoder 50 for the transform coding codec (e.g., Siren codec) is illustrated in FIG. 1B. The decoder 50 takes the incoming bit stream of the input signal 52 received from a network and recreates a best estimate of the original signal from it. To do this, the decoder 50 performs a lattice decoding (reverse FLVQ) 60 on the input signal 52 and de-quantizes the decoded transform coefficients using a de-quantization process 62. In addition, the energy levels of the transform coefficients may then be corrected in the various frequency bands. Finally, an inverse transform 64 operates as a reverse DCT and converts the signal from the frequency domain back into the time domain for transmission as an output signal 54.
Although such audio codecs are effective, increasing needs and complexity in audio conferencing applications call for more versatile and enhanced audio coding techniques. For example, audio codecs must operate over networks, and various conditions (bandwidth, different connection speeds of receivers, etc.) can vary dynamically. A wireless network is one example where a channel's bit rate varies over time. Thus, an endpoint in a wireless network has to send out a bit stream at different bit rates to accommodate the network conditions.
Use of an MCU (Multi-way Control Unit), such as Polycom's RMX series and MGC series products, is another example where more versatile and enhanced audio coding techniques may be useful. For example, an MCU in a conference first receives a bit stream from a first endpoint A and then needs to send bit streams at different lengths to a number of other endpoints B, C, D, E, F . . . . The different bit streams to be sent can depend on how much network bandwidth each of the endpoints has, on the decoding capabilities of the endpoint, or on other factors. For example, one endpoint B may be connected to the network at 64 kbps (kilobits per second) for audio, while another endpoint C may be connected at only 8 kbps.
Accordingly, the MCU sends the bit stream at 64 kbps to the one endpoint B, sends the bit stream at 8 kbps to the other endpoint C, and so on for each of the endpoints. Currently, the MCU decodes the bit stream from the first endpoint A, i.e., converts it back to the time domain. Then, the MCU encodes a separate stream for every single endpoint B, C, D, E, F . . . so the appropriate bit streams can be sent to them. This approach requires substantial computational resources, introduces signal latency, and degrades signal quality due to the transcoding performed.
Dealing with lost packets is another area where more versatile and enhanced audio coding techniques may be useful. In videoconferencing or VoIP calls, for example, coded audio information is sent in packets that typically have 20 milliseconds of audio per packet. Packets can be lost during transmission, and the lost audio packets lead to gaps in the received audio. One way to combat the packet loss in the network is to transmit the packet (i.e., bit stream) multiple times, say 4 times. The chance of losing all four of these packets is much lower so the chance of having gaps is lessened.
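The benefit of repeated transmission can be estimated with a short calculation. The sketch below assumes that each copy of a packet is lost independently with the same probability, which is a simplifying assumption; losses on real networks are often bursty, so the actual gain may be smaller. The function name and the 10% loss rate are hypothetical.

```python
def all_copies_lost(loss_probability, copies):
    # Probability that every copy of a packet is lost, assuming each copy
    # is lost independently with the same probability (an assumption).
    return loss_probability ** copies

if __name__ == "__main__":
    p = 0.10  # hypothetical per-packet loss rate
    print(all_copies_lost(p, 1))
    print(all_copies_lost(p, 4))
```

Under this assumption, a 10% per-packet loss rate leaves a gap in roughly one frame out of ten with a single transmission, but in only about one frame out of ten thousand with four transmissions.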
Transmitting the packet four times, however, requires four times the network bandwidth. To minimize the costs, usually the same 20 ms time-domain signal is encoded at a higher bit rate (in a normal mode, say 48 kbps) and encoded at a lower bit rate (say, 8 kbps). The lower (8 kbps) bit stream is the one transmitted multiple times. This way, the total required bandwidth is 48+8*3=72 kbps, instead of 48*4=192 kbps if the original were sent multiple times. Due to the masking effect, the 48+8*3 scheme performs nearly as well as the 48*4 scheme in terms of speech quality when the network has packet loss. Yet, this traditional solution of encoding the same 20 ms time-domain data independently at different bit rates requires additional computational resources.
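The bandwidth comparison above can be reproduced with a small helper. The function name and parameters are hypothetical; the rates are the ones given in the text.

```python
def redundant_bandwidth_kbps(primary_kbps, redundant_kbps, copies):
    # One primary stream plus (copies - 1) redundant streams.
    return primary_kbps + redundant_kbps * (copies - 1)

if __name__ == "__main__":
    # 48 kbps primary stream plus three 8 kbps copies.
    print(redundant_bandwidth_kbps(48, 8, 4))   # prints 72
    # The 48 kbps stream itself sent four times.
    print(redundant_bandwidth_kbps(48, 48, 4))  # prints 192
```

With these rates, the mixed-rate scheme uses 72/192 = 37.5% of the bandwidth needed to send the full-rate stream four times.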
Lastly, some endpoints may not have enough computational resources to do a full decoding. For example, an endpoint may have a slower signal processor, or the signal processor may be busy doing other tasks. If this is the case, decoding only part of the bit stream that the endpoint receives may not produce useful audio. As is known, audio quality typically depends, at least in part, on how many bits the decoder receives and decodes.
For these reasons, a need exists for an audio codec that is scalable for use in audio and video conferencing.