The need to offer telecommunication services over packet-switched networks has increased dramatically and is today stronger than ever. In parallel, there is a growing diversity in the media content to be transmitted, including different bandwidths, mono and stereo sound, and both speech and music signals. Considerable effort is being mobilized at various standardization bodies to define flexible and efficient solutions for delivering mixed content to users. Notably, two major challenges still await solutions. First, the diversity of deployed networking technologies and user devices implies that the same service offered to different users may have different user-perceived quality because of the differing properties of the transport networks. Hence, improved quality mechanisms are necessary to adapt services to the actual transport characteristics. Second, the communication service must accommodate a wide range of media content. Currently, speech and music transmission still belong to different paradigms, and there is a gap to be filled by a service that can provide good quality for all types of audio signals.
Today, scalable codecs for audiovisual and, more generally, media content are available; in fact, scalability was one of the early design guidelines of MPEG. However, although these codecs are attractive due to their functionality, they lack the efficiency to operate at low bitrates and therefore do not map well to current mass-market wireless devices. With the high penetration of wireless communications, more sophisticated scalable codecs are needed. This fact has already been recognized, and new codecs are expected to appear in the near future.
Despite the tremendous effort being put into adaptive services and scalable codecs, scalable services will not happen unless more attention is given to transport issues. Therefore, besides efficient codecs, an appropriate network architecture and transport framework must be considered as an enabling technology to fully utilize scalability in service delivery. Basically, three scenarios can be considered:
- Adaptation at the end-points. That is, if a lower transmission rate must be chosen, the sending side is informed and it performs scaling or codec changes.
- Adaptation at intermediate gateways. If a part of the network becomes congested, or has a different service capability, a dedicated network entity, as illustrated in FIG. 1, performs the transcoding of the service. With a scalable codec this could be as simple as dropping or truncating media frames.
- Adaptation inside the network. If a router or wireless interface becomes congested, adaptation is performed right at the place of the problem by dropping or truncating packets. This is a desirable solution for transient problems such as the handling of severe traffic bursts or the channel quality variations of wireless links.
Scalable Audio Coding
Non-Conversational, Streaming/Download
In general, the current audio research trend is to improve the compression efficiency at low rates (providing good enough stereo quality at bit rates below 32 kbps). Recent low rate audio improvements are the finalization of the Parametric Stereo (PS) tool development in MPEG and the standardization of a mixed CELP/transform codec, Extended AMR-WB (a.k.a. AMR-WB+), in 3GPP. There is also an ongoing MPEG standardization activity around Spatial Audio Coding (Surround/5.1 content), where a first reference model (RM0) has been selected.
With respect to scalable audio coding, recent standardization efforts in MPEG have resulted in a scalable-to-lossless extension tool, MPEG4-SLS. MPEG4-SLS provides progressive enhancements to the core AAC/BSAC all the way up to lossless, with granularity steps down to 0.4 kbps. An Audio Object Type (AOT) for SLS is yet to be defined. Further, within MPEG a Call for Information (CfI) was issued in January 2005 [1], targeting the area of scalable speech and audio coding. The key issues addressed in the CfI are scalability, consistent performance across content types (e.g. speech and music) and encoding quality at low bit rates (<24 kbps).
Speech Coding (Conversational Mono)
General
In general speech compression, the latest standardization effort is an extension of the 3GPP2/VMR-WB codec to also support operation at a maximum rate of 8.55 kbps. In ITU-T, the Multirate G.722.1 audio/video conferencing codec has previously been updated with two new modes providing super wideband (14 kHz audio bandwidth, 32 kHz sampling) capability operating at 24, 32 and 48 kbps. An additional mode is currently under standardization that will extend the bandwidth to 48 kHz full-band coding.
With respect to scalable conversational speech coding, the main standardization effort is taking place in ITU-T (Working Party 3, Study Group 16). There, the requirements for a scalable extension of G.729 were defined recently (November 2004), and the qualification process ended in July 2005. This new G.729 extension will be scalable from 8 to 32 kbps, with granularity steps of at least 2 kbps from 12 kbps. The main target application for the G.729 scalable extension is conversational speech over shared and bandwidth-limited xDSL links, i.e. the scaling is likely to take place in a Digital Residential Gateway that passes the VoIP packets through specific controlled Voice channels (VCs). ITU-T is also in the process of defining the requirements for a completely new scalable conversational codec in SG16/WP3/Question 9. The requirements for the Q.9/Embedded Variable rate (EV) codec were finalized in July 2006; currently the Q.9/EV requirements state a core rate of 8.0 kbps and a maximum rate of 32 kbps. A specific requirement for Q.9/EV fine grain scalability has not yet been introduced; instead, certain operation points are likely to be evaluated, but fine grain scalability is still an objective. The Q.9/EV core is not restricted to narrowband (8 kHz sampling) as the G.729 extension will be, i.e. Q.9/EV may provide wideband (16 kHz sampling) from the core layer onwards. Further, the requirements for an extension of the forthcoming Q.9/EV codec that will give it super wideband and stereo capabilities (32 kHz sampling/2 channels) were defined in November 2006.
SNR Scalability
There are a number of scalable conversational codecs that can increase SNR with an increasing number of bits/layers. For example, MPEG4-CELP [8] and G.727 (Embedded ADPCM) are SNR-scalable: each additional layer increases the fidelity of the reconstructed signal. Recently, Kövesi et al. have proposed a flexible SNR and bandwidth scalable codec [9] that achieves fine grain scalability from a certain core rate, enabling a fine granular optimization of the transport bandwidth, applicable to speech/audio conferencing servers or open loop network congestion control.
Bandwidth Scalability
There are also codecs that can increase bandwidth with an increasing number of bits. Examples include G.722 (Sub band ADPCM), the TI candidate to the 3GPP WB speech codec competition [3] and the academic AMR-BWS [2] codec. For these codecs, the addition of a specific bandwidth layer increases the audio bandwidth of the synthesized signal from ˜4 kHz to ˜7 kHz. Another example of a bandwidth scalable coder is the 16 kbps bandwidth scalable audio coder based on G.729 described by Koishida in [4]. In addition to being SNR-scalable, MPEG4-CELP specifies a bandwidth scalable coding system for 8 and 16 kHz sampled input signals [9].
Channel Robustness Technology
With regard to improving the channel robustness of conversational codecs, this has been done in various ways for existing standards and codecs. For example:
- EVRC (1995) transmits a delta delay parameter, a partially redundant coded parameter that makes it possible to reconstruct the adaptive codebook state after a channel erasure, thus enhancing error recovery. A detailed overview of EVRC is found in [11].
- AMR-NB [12], a speech service specified for GSM networks, operates on a maximum source rate adaptation principle. The trade-off between channel coding and source coding for a given gross bit rate is continuously monitored and adjusted by the GSM system, and the encoder source rate is adapted to provide the best quality possible. The source rate may be varied from 4.75 kbps to 12.2 kbps, and the channel gross rate is either 22.8 kbps or 11.4 kbps.
- In addition to the maximum source rate adaptation capabilities described in the previous item, the AMR RTP payload format [5] allows for the retransmission of whole past frames, significantly increasing the robustness to random frame errors. In [10], a multimode adaptive AMR system that uses the full and partial redundancy concepts adaptively is described. Further, the RTP payload allows for interleaving of packets, thus enhancing the robustness for non-conversational applications.
- Multiple Description Coding in combination with AMR-WB is described in [6]. Further, an adaptive codec mode selection scheme is proposed where AMR-WB is used for low error conditions and the described channel robust MD-AMR (WB) coder is used during severe error conditions.
- A variation on the technique of transmitting redundant data is to adjust the encoder analysis to reduce the dependency on states; this is done in the AMR 4.75 encoding mode. The application of a similar encoder side analysis technique for AMR-WB was described by Lefebvre et al. in [7].
In [13], Chen et al. describe a multimedia application that uses multi-rate audio capabilities to adapt the total rate and also the compression schemes actually used, based on information from a slow (1 sec) feedback channel. In addition, Chen et al. extend the audio application with a very low rate base layer that uses text as a redundant parameter, to be able to provide speech synthesis under really severe error conditions.
Audio Scalability
Basically, audio scalability can be achieved by:
- Changing the quantization of the signal, i.e. SNR-like scalability.
- Extending or tightening the bandwidth of the signal.
- Dropping audio channels, i.e. spatial scalability (e.g., mono consists of 1 channel, stereo of 2 channels, surround of 5 channels).
A currently available fine-grained scalable audio codec is AAC-BSAC (Advanced Audio Coding-Bit-Sliced Arithmetic Coding). It can be used for both audio and speech coding, and it allows for bit-rate scalability in small increments.
It produces a bit-stream that can be decoded even if certain parts of the stream are missing. There is a minimum requirement on the amount of data that must be available to permit decoding of the stream. This is referred to as the base layer. The remaining set of bits corresponds to quality enhancements and is therefore referred to as enhancement layers. AAC-BSAC supports enhancement layers of around 1 kbit/s/channel or smaller for audio signals.
“To obtain such fine grain scalability, a bit-slicing scheme is applied to the quantized spectral data. First the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in their binary representation. Then the bits of the group are processed in slices according to their significance and spectral content. Thus, first all most significant bits (MSB) of the quantized values in the group are processed and the bits are processed from lower to higher frequencies within a given slice. These bit-slices are then encoded using a binary arithmetic coding scheme to obtain entropy coding with minimal redundancy.” [1]
“With an increasing number of enhancement layers utilized by the decoder, providing more LSB information refines quantized spectral data. At the same time, providing bit-slices of spectral data in higher frequency bands increases the audio bandwidth. In this way, quasi-continuous scalability is achievable.” [1]
In other words, scalability can be achieved in a two-dimensional space. Quality, corresponding to a certain signal bandwidth, can be enhanced by transmitting more LSBs, or the bandwidth of the signal can be extended by providing more bit-slices to the receiver. Moreover, a third dimension of scalability is available by adapting the number of channels available for decoding. For example, surround audio (5 channels) could be scaled down to stereo (2 channels), which in turn can be scaled down to mono (1 channel) if, e.g., transport conditions make it necessary.
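The two-dimensional scalability described above can be illustrated with a minimal sketch. It is not the AAC-BSAC algorithm: it assumes non-negative integer spectral values, one value per frequency band, and it omits the arithmetic coding stage entirely. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def bit_slices(values, num_bits):
    """Split non-negative quantized values into bit-planes, MSB first.
    Within each slice the bits run from lower to higher frequency."""
    return [[(v >> b) & 1 for v in values]
            for b in range(num_bits - 1, -1, -1)]


def truncate(slices, keep_slices, keep_bands):
    """Drop the least significant slices (SNR scaling) and the
    highest-frequency entries of each slice (bandwidth scaling)."""
    return [plane[:keep_bands] for plane in slices[:keep_slices]]


def reconstruct(slices, num_bits, num_values):
    """Rebuild values from whatever MSB-first slices were received.
    Missing bits are treated as zero."""
    values = [0] * num_values
    for i, plane in enumerate(slices):
        shift = num_bits - 1 - i
        for k, bit in enumerate(plane):
            values[k] |= bit << shift
    return values


# Hypothetical 4-bit quantized spectrum, lowest frequency first.
quantized = [13, 6, 9, 3]
slices = bit_slices(quantized, 4)

full = reconstruct(slices, 4, 4)                          # all layers
coarse = reconstruct(truncate(slices, 2, 4), 4, 4)        # fewer LSBs
narrow = reconstruct(truncate(slices, 4, 2), 4, 4)        # fewer bands
print(full, coarse, narrow)
```

Keeping fewer slices coarsens every value, while keeping fewer entries per slice zeroes the upper bands, which corresponds to the quality and bandwidth dimensions respectively.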
Perceptual Models for Audio Coding
To achieve the best perceived quality at a given bitrate for an audio coding system, one must consider the properties of the human auditory system. The purpose is to focus the resources to the parts of the sound which will be scrutinized, while saving resources where auditory perception is dull. The properties of the human auditory system have been documented in various listening tests, whose results have been used in the derivation of perceptual models.
The application of perceptual models in audio coding can be implemented in different ways. One method is to perform the bit allocation of the coding parameters in a way that corresponds to perceptual importance. In a transform domain codec, such as MPEG-1/2 Layer III, this is implemented by allocating bits in the frequency domain to different sub-bands according to their perceptual importance. Another method is to perform a perceptual weighting, or filtering, in order to emphasize the perceptually important frequencies of the signal. The emphasis ensures that a standard minimum mean square error (MMSE) encoding technique allocates more resources to those frequencies. Yet another way is to perform perceptual weighting on the residual error signal after the coding. By minimizing the perceptually weighted error, the perceptual quality is maximized with respect to the model. This method is commonly used in, e.g., CELP speech codecs.
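The first method, bit allocation according to perceptual importance, can be sketched as a greedy loop. This is an illustrative simplification, not the allocation procedure of any particular standard: it assumes that a perceptual model has already produced a signal-to-mask ratio per sub-band, and that each added bit lowers the quantization noise of a uniform quantizer by about 6.02 dB. The function name and the input values are hypothetical.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.02):
    """Greedy perceptual bit allocation (illustrative assumption).

    smr_db: signal-to-mask ratio per sub-band in dB, as delivered by
            some perceptual model. With zero bits, the noise-to-mask
            ratio of a band is taken to equal its signal-to-mask ratio.
    Each bit goes to the band whose noise is currently most audible.
    """
    bits = [0] * len(smr_db)
    nmr_db = list(smr_db)  # current noise-to-mask ratio per band
    for _ in range(total_bits):
        worst = max(range(len(nmr_db)), key=lambda b: nmr_db[b])
        bits[worst] += 1
        nmr_db[worst] -= db_per_bit
    return bits


# Hypothetical three-band example: the first band is far above its
# masking threshold, the last band is already masked.
print(allocate_bits([20.0, 5.0, -3.0], total_bits=4))
```

In this example the band that is already below its masking threshold receives no bits, which reflects the idea of saving resources where auditory perception is dull.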
Stereo Coding or Multi-Channel Coding
A general example of an audio transmission system using multi-channel (i.e. at least two input channels) coding and decoding is schematically illustrated in FIG. 2. The overall system basically comprises a multi-channel audio encoder 100 and a transmission module 10 on the transmitting side, and a receiving module 20 and a multi-channel audio decoder 200 on the receiving side.
The simplest way of stereophonic or multi-channel coding of audio signals is to encode the signals of the different channels separately as individual and independent signals, as illustrated in FIG. 3. However, this means that the redundancy among the plurality of channels is not removed, and that the bit-rate requirement will be proportional to the number of channels.
Another basic approach, used in stereo FM radio transmission and ensuring compatibility with legacy mono radio receivers, is to transmit a sum signal (mono) and a difference signal (side) of the two involved channels.
State-of-the-art audio codecs such as MPEG-1/2 Layer III and MPEG-2/4 AAC make use of so-called joint stereo coding. According to this technique, the signals of the different channels are processed jointly rather than separately and individually. The two most commonly used joint stereo coding techniques are known as ‘Mid/Side’ (M/S) stereo and intensity stereo coding, which are usually applied to sub-bands of the stereo or multi-channel signals to be encoded.
M/S stereo coding is similar to the described procedure in stereo FM radio, in the sense that it encodes and transmits the sum and difference signals of the channel sub-bands and thereby exploits redundancy between the channel sub-bands. The structure and operation of a coder based on M/S stereo coding is described, e.g., in U.S. Pat. No. 5,285,498 by J. D. Johnston.
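The sum/difference principle can be sketched as follows. The sketch operates on plain sample lists rather than on codec sub-bands, and the scaling by one half is an assumption made so that the mid signal stays in the same range as the inputs; it is not taken from the cited patent.

```python
def ms_encode(left, right):
    """Form mid (scaled sum) and side (scaled difference) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left and right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical, highly correlated channel pair.
left = [1.0, 0.5, -0.25]
right = [0.75, 0.5, -0.5]
mid, side = ms_encode(left, right)
print(mid, side)
print(ms_decode(mid, side))
```

When the two channels are strongly correlated, the side signal is small compared to the mid signal, so it can be coded with fewer bits; a mono receiver can simply use the mid signal.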
Intensity stereo, on the other hand, is able to make use of stereo irrelevancy. It transmits the joint intensity of the channels (of the different sub-bands) along with some location information indicating how the intensity is distributed among the channels. Intensity stereo provides only spectral magnitude information of the channels, while phase information is not conveyed. For this reason, and since temporal inter-channel information (more specifically the inter-channel time difference) is of major psycho-acoustical relevance particularly at lower frequencies, intensity stereo can only be used at high frequencies, above e.g. 2 kHz. An intensity stereo coding method is described, e.g., in European Patent 0497413 by R. Veldhuis et al.
A recently developed stereo coding method is described, e.g., in a conference paper with title ‘Binaural cue coding applied to stereo and multi-channel audio compression’, 112th AES convention, May 2002, Munich (Germany) by C. Faller et al. This method is a parametric multi-channel audio coding method. The basic principle of such parametric techniques is that at the encoding side the input signals from the N channels c1, c2, . . . cN are combined to one mono signal m. The mono signal is audio encoded using any conventional monophonic audio codec. In parallel, parameters are derived from the channel signals, which describe the multi-channel image. The parameters are encoded and transmitted to the decoder, along with the audio bit stream. The decoder first decodes the mono signal m′ and then regenerates the channel signals c1′, c2′, . . . cN′, based on the parametric description of the multi-channel image.
The principle of the binaural cue coding (BCC [14]) method is that it transmits the encoded mono signal and so-called BCC parameters. The BCC parameters comprise coded inter-channel level differences and inter-channel time differences for sub-bands of the original multi-channel input signal. The decoder regenerates the different channel signals by applying sub-band-wise level and phase adjustments of the mono signal based on the BCC parameters. The advantage over e.g. M/S or intensity stereo is that stereo information comprising temporal inter-channel information is transmitted at much lower bit rates.
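A heavily simplified sketch of the parametric idea is given below. It is not the BCC method itself: it uses a single full-band level difference instead of sub-band parameters, it ignores inter-channel time differences, and the gain normalization in the decoder is an assumption chosen so that the two regenerated channels average back to the mono signal. All function names are hypothetical.

```python
import math


def ps_encode(left, right):
    """Down-mix to mono and compute one inter-channel level
    difference in dB (assumes both channels have nonzero energy)."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    ild_db = 10 * math.log10(energy_left / energy_right)
    return mono, ild_db


def ps_decode(mono, ild_db):
    """Regenerate two channels by level adjustment of the mono signal."""
    ratio = 10 ** (ild_db / 20)      # amplitude ratio left/right
    gain_right = 2 / (1 + ratio)     # so that the two gains average to 1
    gain_left = ratio * gain_right
    return ([gain_left * m for m in mono],
            [gain_right * m for m in mono])


# Hypothetical source panned so that left is twice as loud as right.
left = [1.0, -2.0, 4.0]
right = [0.5, -1.0, 2.0]
mono, ild_db = ps_encode(left, right)
print(round(ild_db, 2))
print(ps_decode(mono, ild_db))
```

For a source that differs between the channels only in level, the level parameter alone is enough to regenerate both channels; real stereo content additionally needs the sub-band resolution and the temporal cues described in the text.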
Another technique, described in U.S. Pat. No. 5,434,948 by C. E. Holt et al. uses the same principle of encoding of the mono signal and side information. In this case, side information consists of predictor filters and optionally a residual signal. The predictor filters, estimated by the LMS algorithm, when applied to the mono signal allow the prediction of the multi-channel audio signals. With this technique one is able to reach very low bit rate encoding of multi-channel audio sources, however at the expense of a quality drop.
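The role of an LMS-estimated predictor filter can be illustrated with a textbook LMS loop that learns to predict one channel from the mono signal. This is a generic sketch and not the procedure of the cited patent; the filter length, the step size `mu`, and the synthetic test signal are all assumptions made for illustration.

```python
import random


def lms_predictor(mono, target, num_taps=2, mu=0.05):
    """Estimate FIR predictor weights with the standard LMS update
    w <- w + mu * error * input, so that the filtered mono signal
    approximates the target channel."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps       # most recent mono sample first
    for x, d in zip(mono, target):
        history = [x] + history[:-1]
        prediction = sum(w * h for w, h in zip(weights, history))
        error = d - prediction
        weights = [w + mu * error * h for w, h in zip(weights, history)]
    return weights


# Synthetic example: the "left" channel is a known 2-tap filtering of
# the mono signal, so the learned weights should approach [0.8, 0.2].
rng = random.Random(0)
mono = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
left = [0.8 * mono[n] + 0.2 * (mono[n - 1] if n > 0 else 0.0)
        for n in range(len(mono))]
print([round(w, 3) for w in lms_predictor(mono, left)])
```

Only the few filter weights, and optionally a residual, would then need to be conveyed as side information, which is why such a scheme can reach very low bit rates.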
The basic principles of parametric stereo coding are illustrated in FIG. 4, which displays a layout of a stereo codec, comprising a down-mixing module 120, a core mono codec 130, 230 and a parametric stereo side information encoder/decoder 140, 240. The down-mixing transforms the multi-channel (in this case stereo) signal into a mono signal. The objective of the parametric stereo codec is to reproduce a stereo signal at the decoder given the reconstructed mono signal and additional stereo parameters.
In International Patent Application, published as WO 2006/091139, a technique for adaptive bit allocation for multi-channel encoding is described. It utilises at least two encoders, where the second encoder is a multistage encoder. Encoding bits are adaptively allocated among the different stages of the second multi-stage encoder based on multi-channel audio signal characteristics.
Finally, for completeness, a technique is to be mentioned that is used in 3D audio. This technique synthesizes the right and left channel signals by filtering sound source signals with so-called head-related filters. However, this technique requires the different sound source signals to be separated and can thus not generally be applied for stereo or multi-channel coding.
Traditional parametric multi-channel or stereo encoding solutions aim to reconstruct a stereo or multi-channel signal from a mono down-mix signal using a parametric representation of the channel relations. If the quality of the coded down-mix signal is low this will also be reflected in the end result, regardless of the amount of resources spent on the stereo signal parameters.