Audio signals, such as speech or music, are encoded, for example, to enable efficient transmission or storage of the audio signals.
Audio encoders and decoders are used to represent audio-based signals, such as music and background noise. These types of coders typically do not utilise a speech model for the coding process; rather, they use processes for representing all types of audio signals, including speech.
Speech encoders and decoders (codecs) are usually optimised for speech signals, and can operate at either a fixed or variable bit rate.
An audio codec can also be configured to operate with varying bit rates. At lower bit rates, such an audio codec may work with speech signals at a coding rate equivalent to a pure speech codec. At higher bit rates, the audio codec may code any signal including music, background noise and speech, with higher quality and performance.
In some audio codecs the input signal is divided into a limited number of frequency bands, and each band signal may be quantized separately. Psychoacoustic theory indicates that the highest frequencies in the spectrum are perceptually less important than the low frequencies. Some audio codecs reflect this in their bit allocation, assigning fewer bits to high-frequency signals than to low-frequency signals.
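The band-wise bit allocation described above can be illustrated with a minimal sketch. It assumes band values normalised to the range [-1, 1], a uniform quantizer, and a linear allocation rule from a maximum to a minimum bit count; the function names and the rule itself are hypothetical and are not taken from any particular codec.

```python
def allocate_bits(num_bands, max_bits, min_bits):
    """Hypothetical rule: the lowest band gets max_bits, the highest gets
    min_bits, and the bands in between are interpolated linearly."""
    if num_bands == 1:
        return [max_bits]
    step = (max_bits - min_bits) / (num_bands - 1)
    return [round(max_bits - i * step) for i in range(num_bands)]


def quantize(value, bits):
    """Uniform quantizer: map a value in [-1, 1] to one of 2**bits indices."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, value))
    return min(levels - 1, int((clipped + 1.0) / 2.0 * levels))


def dequantize(index, bits):
    """Return the midpoint of the quantization cell selected by index."""
    levels = 2 ** bits
    return (index + 0.5) / levels * 2.0 - 1.0


# One illustrative value per band, ordered from low to high frequency.
band_values = [0.62, -0.31, 0.18, -0.07]
bits_per_band = allocate_bits(len(band_values), max_bits=8, min_bits=2)
decoded = [dequantize(quantize(v, b), b)
           for v, b in zip(band_values, bits_per_band)]
```

With this rule the quantization error bound of each band is `1 / 2**bits`, so the high bands, which receive fewer bits, are represented more coarsely than the low bands.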
One emerging trend in the field of media coding is the so-called layered codec, for example the ITU-T Embedded Variable Bit-Rate (EV-VBR) speech/audio codec and the ITU-T Scalable Video Codec (SVC). The scalable media data consists of a core layer, which is always needed to enable reconstruction at the receiving end, and one or more enhancement layers that can be used to add value to the reconstructed media (for example, improved media quality or increased robustness against transmission errors).
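The core-plus-enhancement structure can be sketched with successive refinement: the core layer quantizes the signal coarsely, and each enhancement layer quantizes the residual left by the layers below it. This is only one simple way to build layers, chosen here for illustration; the step sizes and function names are assumptions and do not describe how EV-VBR or SVC actually work.

```python
def encode_layers(samples, steps):
    """Encode samples into len(steps) layers.

    steps[0] is the (coarse) core-layer step size; each later step is the
    finer step size of one enhancement layer. All values are hypothetical.
    """
    layers = []
    residual = list(samples)
    for step in steps:
        indices = [round(r / step) for r in residual]
        layers.append(indices)
        # What remains after this layer is passed on to the next layer.
        residual = [r - i * step for r, i in zip(residual, indices)]
    return layers


def decode_layers(layers, steps):
    """Reconstruct from the core layer plus however many enhancement
    layers were received, in order."""
    if not layers:
        raise ValueError("the core layer is required for reconstruction")
    out = [0.0] * len(layers[0])
    for indices, step in zip(layers, steps):
        out = [o + i * step for o, i in zip(out, indices)]
    return out


samples = [0.3, -0.72, 0.05]
steps = [0.5, 0.1, 0.02]            # core, enhancement 1, enhancement 2
layers = encode_layers(samples, steps)
core_only = decode_layers(layers[:1], steps)
full = decode_layers(layers, steps)
```

Decoding `layers[:1]` yields a usable but coarse reconstruction, and each additional layer tightens the error bound to half of that layer's step size.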
The scalability of these codecs may be used at the transmission level, for example to control the network capacity or to shape a multicast media stream so that it can serve participants behind access links of different bandwidths. At the application level the scalability may be used to control variables such as computational complexity, encoding delay, or desired quality level. Whilst in some scenarios the scalability can be applied at the transmitting end-point, in other operating scenarios it is more suitable for an intermediate network element to perform the scaling.
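As a rough illustration of scaling at an intermediate network element, the sketch below decides how many layers to forward to a receiver given the capacity of its access link. The function name, the layer bit rates, and the policy of always forwarding the core layer are assumptions made for this example only.

```python
def layers_to_forward(layer_bitrates, link_capacity):
    """Return how many layers (core first) to forward on a link.

    layer_bitrates[0] is the core layer; the rest are enhancement layers
    in order. The core layer is always forwarded (an assumption of this
    sketch); enhancement layers are added in order while they still fit.
    """
    if not layer_bitrates:
        raise ValueError("a scalable stream needs at least a core layer")
    kept = 1
    total = layer_bitrates[0]
    for rate in layer_bitrates[1:]:
        if total + rate > link_capacity:
            break  # higher layers depend on this one, so stop here
        total += rate
        kept += 1
    return kept


# Hypothetical layer bit rates in kbit/s: core, then three enhancements.
stream = [8, 4, 4, 16]
narrow_link = layers_to_forward(stream, link_capacity=8)
medium_link = layers_to_forward(stream, link_capacity=20)
wide_link = layers_to_forward(stream, link_capacity=64)
```

Because the element only drops whole layers, it does not need to decode or re-encode the media, which is what makes scaling inside the network practical.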
Most real-time speech coding deals with mono signals, but some high-end video and audio teleconferencing systems use stereo encoding to give the listener a better speech reproduction experience. Traditional stereo speech encoding encodes separate left and right channels, which place the source at a location in the auditory scene. A commonly used form of stereo encoding for speech is binaural encoding, where the audio source (such as the voice of a speaker) is captured by two microphones located at the left and right ear positions of a simulated reference head.
Encoding and transmitting (or storing) the signals generated by the left and right microphones requires more transmission bandwidth and computation than a conventional mono audio source recording, since there are more signals to encode and decode. One approach to reducing the transmission (or storage) bandwidth used in stereo encoding is for the encoder to mix the left and right channels together and encode the resulting combined mono signal as a core layer. The information on the differences between the left and right channels may then be encoded as a separate bit stream or enhancement layer. However, this type of encoding produces a mono signal at the decoder whose sound quality is worse than that of a traditionally encoded mono signal from a single microphone (located, for example, near the mouth), because the two combined microphone signals pick up much more background or environmental noise than a single microphone located near the audio source. As a result, the backwards-compatible 'mono' output played on legacy equipment is of worse quality than that of the original mono recording and mono playback process.
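The downmix approach described above can be sketched with a mid/side transform, which is one common way to form a combined mono signal and a difference signal. It is used here as an assumed example; the text does not specify which mixing rule is used, and the function names are hypothetical.

```python
def downmix(left, right):
    """Split a stereo pair into a mono (mid) signal for the core layer and
    a difference (side) signal for the enhancement layer."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def upmix(mid, side):
    """Recover the left and right channels from the mid and side signals."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [0.5, -0.25, 1.0]
right = [0.25, 0.25, -1.0]
mid, side = downmix(left, right)

# A legacy mono decoder would play `mid` alone; a stereo decoder that also
# receives the enhancement layer can rebuild both channels.
rebuilt_left, rebuilt_right = upmix(mid, side)
```

Note that `mid` is an average of both microphone signals, so any noise picked up by either microphone ends up in the mono output, which is the quality problem described above.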
Furthermore, the binaural stereo microphone placement, where the microphones are located at simulated ear positions on a simulated head, may produce an audio signal that is disturbing for the listener, especially when the audio source moves rapidly or suddenly. For example, in an arrangement where the microphones are placed near the source (a speaker), a poor listening experience may result simply from the speaker rotating their head, which causes a dramatic and wrenching switch between the left and right output signals.