The need to offer telecommunication services over packet-switched networks has been increasing dramatically and is today stronger than ever. In parallel, there is a growing diversity in the media content to be transmitted, including different bandwidths, mono and stereo sound, and both speech and music signals. Considerable effort is being mobilized at various standardization bodies to define flexible and efficient solutions for the delivery of mixed content to the users. Notably, two major challenges still await solutions. First, the diversity of deployed networking technologies and user devices implies that the same service offered to different users may have different user-perceived quality due to the different properties of the transport networks. Hence, improved quality mechanisms are necessary to adapt services to the actual transport characteristics. Second, the communication service must accommodate a wide range of media content. Currently, speech and music transmission still belong to different paradigms, and there is a gap to be filled by a service that can provide good quality for all types of audio signals.
Today, scalable codecs for audiovisual content, and for media content in general, are available; in fact, scalability was one of the early design guidelines of MPEG. However, although these codecs are attractive due to their functionality, they lack the efficiency to operate at low bitrates, and so they do not map well to the current mass-market wireless devices. With the high penetration of wireless communications, more sophisticated scalable codecs are needed. This fact has already been recognized, and new codecs can be expected to appear in the near future.
Despite the tremendous efforts being put into adaptive services and scalable codecs, scalable services will not happen unless more attention is given to the transport issues. Therefore, besides efficient codecs, an appropriate network architecture and transport framework must be considered as an enabling technology to fully utilize scalability in service delivery. Basically, three scenarios can be considered:
Adaptation at the end-points. That is, if a lower transmission rate must be chosen, the sending side is informed and it performs scaling or codec changes.
Adaptation at intermediate gateways. If a part of the network becomes congested, or has a different service capability, a dedicated network entity, as illustrated in FIG. 1, performs the transcoding of the service. With a scalable codec this could be as simple as dropping or truncating media frames.
Adaptation inside the network. If a router or wireless interface becomes congested, adaptation is performed right at the place of the problem by dropping or truncating packets. This is a desirable solution for transient problems such as severe traffic bursts or the channel quality variations of wireless links.
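The gateway scenario above can be illustrated with a minimal sketch, assuming a media frame is available as an ordered list of layers (base layer first) and that the gateway has a per-frame byte budget. The function name, the frame layout and the budget parameter are hypothetical and are not taken from any particular codec or transport specification.

```python
def truncate_frame(layers, budget_bytes):
    """Keep the base layer and as many whole enhancement layers as fit the budget.

    `layers` is an ordered list of byte strings, base layer first (assumed layout).
    An empty result means that even the base layer did not fit, so the frame is dropped.
    """
    kept = []
    used = 0
    for layer in layers:
        if used + len(layer) > budget_bytes:
            break  # this layer and every layer above it is discarded
        kept.append(layer)
        used += len(layer)
    return kept


# Hypothetical frame: a 20-byte base layer and three enhancement layers.
frame = [b"B" * 20, b"E" * 10, b"E" * 10, b"E" * 20]
truncated = truncate_frame(frame, budget_bytes=45)  # base plus two enhancement layers
dropped = truncate_frame(frame, budget_bytes=10)    # nothing fits, frame is dropped
```

No transcoding is involved in this sketch: the gateway only needs to know where the layer boundaries are.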
Below, an overview of scalable codecs for speech and audio according to the prior art is given. We also give a general background on stereo coding concepts.
Scalable Audio Coding
Non-Conversational, Streaming/Download
In general, the current audio research trend is to improve the compression efficiency at low rates (i.e., to provide good enough stereo quality at bit rates below 32 kbps). Recent low-rate audio improvements are the finalization of the Parametric Stereo (PS) tool development in MPEG and the standardization of a mixed CELP/transform codec, Extended AMR-WB (a.k.a. AMR-WB+), in 3GPP. There is also an ongoing MPEG standardization activity around Spatial Audio Coding (Surround/5.1 content), where a first reference model (RM0) has been selected [4].
With respect to scalable audio coding, recent standardization efforts in MPEG have resulted in a scalable-to-lossless extension tool, MPEG4-SLS. MPEG4-SLS provides progressive enhancements to the core AAC/BSAC all the way up to lossless, with granularity steps down to 0.4 kbps. An Audio Object Type (AOT) for SLS is yet to be defined. Further, within MPEG a Call for Information (CfI) was issued in January 2005 [1] targeting the area of scalable speech and audio coding. The key issues addressed in the CfI are scalability, consistent performance across content types (e.g. speech and music) and encoding quality at low bit rates (<24 kbps). Later, the scalable part was dropped, and the work is now targeting a codec running at a variety of bitrates without embedded scalability.
Speech Coding (Conversational Mono)
General
In general speech compression, the latest standardization effort is an extension of the 3GPP2/VMR-WB codec to also support operation at a maximum rate of 8.55 kbps. In ITU-T, the Multirate G.722.1 audio/video conferencing codec has previously been updated with two new modes providing super wideband (14 kHz audio bandwidth, 32 kHz sampling) capability operating at 24, 32 and 48 kbps. Further standardization efforts were aiming to add an additional mode that would extend the bandwidth to 48 kHz full-band coding. The end result was the new stand-alone codec G.719, which provides low-complexity full-band coding from 32 to 128 kbps in steps of 16 kbps.
With respect to scalable conversational speech coding, the main standardization effort is taking place in ITU-T (Working Party 3, Study Group 16). There, a scalable extension of G.729 was standardized in May 2006, called G.729.1. This extension is scalable from 8 to 32 kbps with 2 kbps granularity steps from 12 kbps. The main target application for G.729.1 is conversational speech over shared and bandwidth-limited xDSL links, i.e. the scaling is likely to take place in a Digital Residential Gateway that passes the VoIP packets through specific controlled Voice channels (Vc's). ITU-T has also recently (September 2008) approved the recommendation for a completely new scalable conversational codec, G.718. The codec comprises a core rate of 8.0 kbps and a maximum rate of 32 kbps, with scaling steps at 12.0, 16.0 and 24.0 kbps. The G.718 core is a WB speech codec inherited from VMR-WB, but it also handles NB input signals by upsampling to the core sample rate. Further, a joint extension of the G.718 and G.729.1 codecs that will bring super wideband and stereo capabilities (32 kHz sampling/2 channels) is currently under standardization in ITU-T (Working Party 3, Study Group 16, Question 23). The qualification period ended July 2008.
SNR Scalability
The principle of SNR scalability is to increase the SNR with an increasing number of bits or layers. The two previously mentioned speech codecs G.729.1 and G.718 have this feature. Typically this is achieved by stepwise re-encoding of the coding residual from the previous layer. The embedded layered structure is attractive since lower bitrates can be decoded by simply discarding the upper layers. However, the embedded layering may not be optimal at the higher bitrates, and a layered codec usually performs worse than a fixed-bitrate codec at the same bitrate. Other codecs that can be mentioned here are the SNR-scalable MPEG4-CELP and G.727 (Embedded ADPCM).
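The stepwise re-encoding of the residual can be illustrated with a minimal sketch. It assumes a plain uniform scalar quantizer per layer and made-up step sizes and sample values; real embedded codecs such as G.729.1 or G.718 use far more elaborate layers, so this only shows the layering principle.

```python
def quantize(values, step):
    """Uniform scalar quantizer: round each value to the nearest multiple of step."""
    return [step * round(v / step) for v in values]


def encode_layers(signal, steps):
    """Code the signal as layers; each layer codes the residual left by the layers below."""
    layers = []
    residual = list(signal)
    for step in steps:
        layer = quantize(residual, step)
        layers.append(layer)
        residual = [r - q for r, q in zip(residual, layer)]
    return layers


def decode_layers(layers, count):
    """Reconstruct from the first `count` layers; the upper layers are simply discarded."""
    out = [0.0] * len(layers[0])
    for layer in layers[:count]:
        out = [o + q for o, q in zip(out, layer)]
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical input samples and step sizes (coarse base layer, finer enhancements).
signal = [0.83, -0.41, 0.27, -0.96, 0.12, 0.58]
layers = encode_layers(signal, steps=[0.5, 0.125, 0.03125])
errors = [mse(signal, decode_layers(layers, n)) for n in (1, 2, 3)]
```

With these inputs the error shrinks with every decoded layer, and dropping the top layers never requires re-encoding.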
Bandwidth Scalability
There are also codecs that can increase bandwidth with an increasing number of bits, e.g. G.722 (Sub-band ADPCM), but also G.729.1 and G.718. G.729.1 operates with a cascaded CELP codec for the bitrates 8 and 12 kbps, but provides WB signals at 14 kbps using a bandwidth extension to fill the range from 4 kHz to 7 kHz. The bandwidth extension typically creates an excitation signal from the lower band by spectral folding or other mappings, which is further gain adjusted and shaped with a spectral envelope to simulate the higher-end frequency spectrum. Although the solution might sound good, the extended spectrum does not generally match the input signal in an MSE sense. For codecs that are also SNR scalable, the bandwidth extension used at lower rates is typically replaced with coded content in higher layers. This is the case for G.729.1, where the spectrum is gradually replaced with coded spectrum on a subband basis. G.718 exhibits the same feature and uses bandwidth extension from 6.4 kHz to 7.0 kHz for rates 8, 12 and 16 kbps. For the rates 24 and 32 kbps, the bandwidth extension is disabled and replaced with coded spectrum. In addition to being SNR-scalable, MPEG4-CELP also specifies a bandwidth-scalable coding system for 8 and 16 kHz sampled input signals.
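Spectral folding can be illustrated with a minimal sketch, assuming the simplest possible mapping: upsampling by two through zero insertion, which mirrors the low-band spectrum into the new high band. The test tone, the block length and the helper names are hypothetical; the gain adjustment and spectral envelope shaping mentioned above are deliberately left out, and the standardized codecs do not necessarily use this exact mapping.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(n^2), adequate for a short example)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fold_upsample(lowband):
    """Upsample by 2 with zero insertion; the low-band spectrum is mirrored upwards."""
    out = []
    for v in lowband:
        out.extend([v, 0.0])
    return out


n = 16
# Hypothetical low-band signal: one tone at bin 3 of a 16-sample block.
lowband = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
spectrum = [abs(c) for c in dft(fold_upsample(lowband))]
# In the 32-sample block the old Nyquist frequency sits at bin 8.
# The tone remains at bin 3 and a mirror image appears at bin 13 (= 16 - 3).
```

The mirrored component has the right kind of spectral fine structure to serve as a high-band excitation, but it is not a waveform match to the original high band, which is the MSE limitation described above.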
Audio Scalability
Basically, audio scalability can be achieved by:
Changing the quantization of the signal, i.e. SNR-like scalability.
Extending or tightening the bandwidth of the signal.
Dropping audio channels, i.e. spatial scalability (e.g., mono consists of 1 channel, stereo of 2 channels, surround of 5 channels).
A currently available fine-grained scalable audio codec is AAC-BSAC (Advanced Audio Coding, Bit-Sliced Arithmetic Coding). It can be used for both audio and speech coding, and it allows bit-rate scalability in small increments.
It produces a bit-stream that can be decoded even if certain parts of the stream are missing. There is a minimum requirement on the amount of data that must be available to permit decoding of the stream. This is referred to as the base layer. The remaining set of bits corresponds to quality enhancements and is therefore referred to as enhancement layers. AAC-BSAC supports enhancement layers of around 1 kbit/s/channel or smaller for audio signals.
“To obtain such fine grain scalability, a bit-slicing scheme is applied to the quantized spectral data. First the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in their binary representation. Then the bits of the group are processed in slices according to their significance and spectral content. Thus, first all most significant bits (MSB) of the quantized values in the group are processed and the bits are processed from lower to higher frequencies within a given slice. These bit-slices are then encoded using a binary arithmetic coding scheme to obtain entropy coding with minimal redundancy.” [1]
“With an increasing number of enhancement layers utilized by the decoder, providing more least significant bit (LSB) information refines quantized spectral data. At the same time, providing bit-slices of spectral data in higher frequency bands increases the audio bandwidth. In this way, quasi-continuous scalability is achievable.” [1]
In other words, scalability can be achieved in a two-dimensional space.
Quality, corresponding to a certain signal bandwidth, can be enhanced by transmitting more LSBs, or the bandwidth of the signal can be extended by providing more bit-slices to the receiver. Moreover, a third dimension of scalability is available by adapting the number of channels available for decoding. For example, surround audio (5 channels) could be scaled down to stereo (2 channels) which, in turn, can be scaled down to mono (1 channel) if, e.g., transport conditions make it necessary.
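The quality dimension of bit-slicing can be illustrated with a minimal sketch. It assumes non-negative integer magnitudes with a fixed word length and made-up values; sign handling, the grouping into frequency bands and the arithmetic coding stage of AAC-BSAC are all omitted, so this is not the BSAC bitstream format.

```python
def to_bit_slices(magnitudes, num_bits):
    """Split non-negative integers into bit-slices, most significant slice first.

    Within a slice, bits are ordered from lower to higher frequency.
    """
    return [[(m >> bit) & 1 for m in magnitudes]
            for bit in range(num_bits - 1, -1, -1)]


def from_bit_slices(slices, num_bits):
    """Rebuild magnitudes from the slices received; missing LSB slices count as zero."""
    values = [0] * len(slices[0])
    for index, bits in enumerate(slices):
        shift = num_bits - 1 - index
        values = [v | (b << shift) for v, b in zip(values, bits)]
    return values


# Hypothetical quantized spectral magnitudes, ordered from low to high frequency.
quantized = [13, 6, 9, 2]
slices = to_bit_slices(quantized, num_bits=4)
coarse = from_bit_slices(slices[:2], num_bits=4)  # only the two MSB slices arrive
full = from_bit_slices(slices, num_bits=4)        # all slices arrive
```

Here `coarse` evaluates to `[12, 4, 8, 0]` and `full` reproduces the input, so each additional slice refines every spectral value by one bit.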
Stereo Coding or Multi-Channel Coding
A general example of an audio transmission system using multi-channel (i.e. at least two input channels) coding and decoding is schematically illustrated in FIG. 2. The overall system basically comprises a multi-channel audio encoder 100 and a transmission module 10 on the transmitting side, and a receiving module 20 and a multi-channel audio decoder 200 on the receiving side.
The simplest way of stereophonic or multi-channel coding of audio signals is to encode the signals of the different channels separately as individual and independent signals, as illustrated in FIG. 3. However, this means that the redundancy among the plurality of channels is not removed, and that the bit-rate requirement will be proportional to the number of channels.
Another basic approach, which is used in stereo FM radio transmission and ensures compatibility with legacy mono radio receivers, is to transmit a sum signal (mono) and a difference signal (side) of the two involved channels.
State-of-the-art audio codecs such as MPEG-1/2 Layer III and MPEG-2/4 AAC make use of so-called joint stereo coding. According to this technique, the signals of the different channels are processed jointly rather than separately and individually. The two most commonly used joint stereo coding techniques are known as ‘Mid/Side’ (M/S) stereo and intensity stereo coding, which usually are applied on sub-bands of the stereo or multi-channel signals to be encoded.
M/S stereo coding is similar to the described procedure in stereo FM radio, in the sense that it encodes and transmits the sum and difference signals of the channel sub-bands and thereby exploits redundancy between the channel sub-bands. The structure and operation of a coder based on M/S stereo coding is described, e.g., in U.S. Pat. No. 5,285,498 by J. D. Johnston.
Intensity stereo, on the other hand, is able to make use of stereo irrelevancy. It transmits the joint intensity of the channels (of the different sub-bands) along with some location information indicating how the intensity is distributed among the channels. Intensity stereo provides only spectral magnitude information of the channels, while phase information is not conveyed. For this reason, and since temporal inter-channel information (more specifically the inter-channel time difference) is of major psycho-acoustical relevancy particularly at lower frequencies, intensity stereo can only be used at high frequencies, above e.g. 2 kHz. An intensity stereo coding method is described, e.g., in European Patent 0497413 by R. Veldhuis et al.
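The sum/difference transform can be illustrated with a minimal sketch operating on raw sample lists. The factor 1/2 in the forward transform is one common normalization and is an assumption here; real codecs apply the transform per sub-band and then quantize the mid and side signals, which is omitted.

```python
def ms_encode(left, right):
    """Forward M/S transform: mid is the (halved) sum, side is the (halved) difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse M/S transform: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical, strongly correlated stereo samples.
left = [0.5, 0.25, -0.75]
right = [0.5, 0.0, -0.5]
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)
side_energy = sum(s * s for s in side)
right_energy = sum(r * r for r in right)
```

When the channels are similar, the side signal carries much less energy than either input channel, which is the redundancy that M/S coding exploits.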
A recently developed stereo coding method is described, e.g., in a conference paper with the title ‘Binaural cue coding applied to stereo and multi-channel audio compression’, 112th AES convention, May 2002, Munich (Germany) by C. Faller et al. This method is a parametric multi-channel audio coding method. The basic principle of such parametric techniques is that at the encoding side the input signals from the N channels c1, c2, . . . , cN are combined to one mono signal m. The mono signal is audio encoded using any conventional monophonic audio codec. In parallel, parameters are derived from the channel signals, which describe the multi-channel image. The parameters are encoded and transmitted to the decoder, along with the audio bit stream. The decoder first decodes the mono signal m′ and then regenerates the channel signals c1′, c2′, . . . , cN′, based on the parametric description of the multi-channel image.
The principle of the binaural cue coding (BCC[2]) method is that it transmits the encoded mono signal and so-called BCC parameters. The BCC parameters comprise coded inter-channel level differences and inter-channel time differences for sub-bands of the original multi-channel input signal. The decoder regenerates the different channel signals by applying sub-band-wise level and phase adjustments of the mono signal based on the BCC parameters. The advantage over e.g. M/S or intensity stereo is that stereo information comprising temporal inter-channel information is transmitted at much lower bit rates.
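The level-difference part of such a parametric scheme can be illustrated with a minimal sketch. It assumes a single full-band parameter instead of per-sub-band parameters, ignores inter-channel time differences, skips the mono codec, and picks the decoder gains so that the regenerated channels average back to the mono signal. These choices and all names are assumptions made for illustration and are not taken from the BCC paper.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def encode(left, right):
    """Downmix to mono and compute one inter-channel level difference in dB.

    Assumes both channels have non-zero energy.
    """
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    ild_db = 10 * math.log10(energy(left) / energy(right))
    return mono, ild_db


def decode(mono, ild_db):
    """Regenerate two channels from mono by applying level-difference gains."""
    ratio = 10 ** (ild_db / 20)   # amplitude ratio left/right
    g_right = 2 / (1 + ratio)     # chosen so that (left + right) / 2 equals mono
    g_left = ratio * g_right
    return [g_left * m for m in mono], [g_right * m for m in mono]


# Hypothetical amplitude-panned source: the right channel is the left at half amplitude.
left = [0.8, -0.4, 0.2, 0.6]
right = [0.4, -0.2, 0.1, 0.3]
mono, ild_db = encode(left, right)
left_out, right_out = decode(mono, ild_db)
```

For a purely amplitude-panned source like this one the two channels are recovered up to rounding; for general stereo material a level difference alone cannot restore phase or time differences, which is why BCC also transmits inter-channel time differences.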
Another technique, described in U.S. Pat. No. 5,434,948 by C. E. Holt et al. uses the same principle of encoding of the mono signal and side information. In this case, side information consists of predictor filters and optionally a residual signal. The predictor filters, estimated by the LMS algorithm, when applied to the mono signal allow the prediction of the multi-channel audio signals. With this technique one is able to reach very low bit rate encoding of multi-channel audio sources, however at the expense of a quality drop.
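The idea of predicting a channel from the mono signal with an LMS-estimated filter can be illustrated with a minimal sketch. The filter length, step size, number of passes and the synthetic test signal are all assumptions made for illustration and are not taken from the cited patent; the optional residual signal is not coded.

```python
import random


def lms_predictor(mono, target, taps=2, mu=0.05, passes=50):
    """Estimate FIR weights with LMS so that filtering `mono` predicts `target`."""
    weights = [0.0] * taps
    for _ in range(passes):
        for n in range(taps - 1, len(mono)):
            window = [mono[n - k] for k in range(taps)]  # mono[n], mono[n-1], ...
            prediction = sum(w * x for w, x in zip(weights, window))
            error = target[n] - prediction
            # LMS update: move the weights along the instantaneous error gradient.
            weights = [w + mu * error * x for w, x in zip(weights, window)]
    return weights


# Synthetic example: the channel is an exact 2-tap filtered version of the mono signal.
rng = random.Random(1)
mono = [rng.uniform(-1.0, 1.0) for _ in range(64)]
channel = [0.0] + [0.7 * mono[n] + 0.2 * mono[n - 1] for n in range(1, len(mono))]
weights = lms_predictor(mono, channel)
```

In this noise-free setup the weights approach 0.7 and 0.2. For real stereo material the prediction is imperfect, and the remaining error is what the optional residual signal would carry.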
The basic principles of parametric stereo coding are illustrated in FIG. 4, which displays a layout of a stereo codec, comprising a down-mixing module 120, a core mono codec 130, 230, a bitstream multiplexer/demultiplexer 150, 250 and a parametric stereo side information encoder/decoder 140, 240. The down-mixing transforms the multi-channel (in this case stereo) signal into a mono signal. The objective of the parametric stereo codec is to reproduce a stereo signal at the decoder given the reconstructed mono signal and additional stereo parameters.
In the International Patent Application published as WO 2006/091139, a technique for adaptive bit allocation for multi-channel encoding is described. It utilizes at least two encoders, where the second encoder is a multi-stage encoder. Encoding bits are adaptively allocated among the different stages of the second multi-stage encoder based on multi-channel audio signal characteristics.
A downmixing technique employed in MPEG Parametric Stereo is explained in [3]. Here the potential energy loss from channel cancellation in the downmix procedure is compensated with a scaling factor.
MPEG Surround [4][5] divides the audio coding into two partitions: one predictive/parametric part called the Dry component and a non-predictable/diffuse part called the Wet component. The Dry component is obtained using channel prediction from a down-mix signal which has been encoded and decoded separately. The Wet component may be any one of the following three: a synthesized diffuse sound signal generated from the prediction and decorrelating filters, a gain-adjusted version of the predicted part, or simply the encoded prediction residual.
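Energy compensation in a downmix can be illustrated with a minimal sketch. The target used here, making the mono energy equal to the average of the two channel energies, is an assumption chosen for illustration and is not claimed to be the scaling rule defined in [3]; the function names are likewise hypothetical.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def compensated_downmix(left, right):
    """Average the channels, then rescale to undo energy lost through cancellation."""
    raw = [(l + r) / 2 for l, r in zip(left, right)]
    raw_energy = energy(raw)
    if raw_energy == 0.0:
        return raw, 1.0  # fully cancelled or silent input: nothing to rescale
    target = (energy(left) + energy(right)) / 2  # assumed target energy
    gain = math.sqrt(target / raw_energy)
    return [gain * v for v in raw], gain


# Hypothetical channels that partly cancel when summed.
left = [1.0, 0.0, -1.0, 0.0]
right = [0.0, 1.0, 0.0, -1.0]
mono, gain = compensated_downmix(left, right)

# Identical channels lose no energy in the downmix, so no compensation is applied.
same_mono, same_gain = compensated_downmix([0.5, -0.25], [0.5, -0.25])
```

For the partly cancelling pair the gain comes out as the square root of two, while identical channels give a gain of one.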