Audio signals, such as speech or music, are encoded, for example, to enable efficient transmission or storage of the audio signals.
Audio encoders and decoders are used to represent audio-based signals, such as music and background noise. These types of coders typically do not utilise a speech model for the coding process; rather, they use processes for representing all types of audio signals, including speech.
Speech encoders and decoders (codecs) are usually optimised for speech signals, and often operate at a fixed bit rate.
An audio codec can also be configured to operate with varying bit rates. At lower bit rates, such an audio codec may work with speech signals at a coding rate equivalent to that of a pure speech codec. At higher bit rates, the audio codec may code any signal, including music, background noise and speech, with higher quality and performance.
A further audio coding option is an embedded variable rate speech or audio coding scheme, which is also referred to as a layered coding scheme. Embedded variable rate audio or speech coding denotes an audio or speech coding scheme in which the bit stream resulting from the coding operation is distributed into successive layers. A base or core layer, which comprises primary coded data generated by a core encoder, is formed of the binary elements essential for the decoding of the binary stream, and determines a minimum quality of decoding. Subsequent layers make it possible to progressively improve the quality of the signal arising from the decoding operation, where each new layer brings new information. One of the particular features of layered coding is the possibility of intervening at any level of the transmission or storage chain so as to delete a part of the binary stream without having to include any particular indication to the decoder.
The decoder uses the binary information that it receives and produces a signal of corresponding quality. For instance, International Telecommunication Union Telecommunication Standardization Sector (ITU-T) standardisation aims at an embedded variable bit rate codec covering the frequency band from 50 to 7000 Hz with bit rates from 8 to 32 kbps. The codec core layer will work at either 8 kbps or 12 kbps, and additional layers with quite small granularity will increase the observed speech and audio quality. The proposed layers target, as a minimum, five bit rates of 8, 12, 16, 24 and 32 kbps available from the same embedded bit stream. Further, the codec may optionally operate with higher bit rates and layers to include a super wideband extension mode, in which the frequency band of the codec is extended from 7000 Hz to 14000 Hz. In addition, the higher layers may also incorporate a stereo extension mode in which information relating to the stereo image may be encoded and distributed to the bit stream.
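The truncation property described above can be illustrated with a minimal sketch. It assumes the five cumulative bit rates named in the text, a core layer at 8 kbps, and a 20 ms frame; the frame length and all function names are hypothetical and are not taken from any standard. An intermediate network node drops whole trailing layers to meet a target rate, and the decoder infers how many layers it received purely from the number of bits that arrive.

```python
# Cumulative layer bit rates (kbps) named in the text, core layer first.
LAYER_RATES_KBPS = [8, 12, 16, 24, 32]
FRAME_MS = 20  # assumed frame duration, for illustration only


def layer_sizes_bits(rates_kbps=LAYER_RATES_KBPS, frame_ms=FRAME_MS):
    """Bits contributed by each layer to one frame (kbps * ms = bits)."""
    sizes = []
    previous = 0
    for rate in rates_kbps:
        sizes.append((rate - previous) * frame_ms)
        previous = rate
    return sizes


def truncate_frame(frame_bits, target_kbps, frame_ms=FRAME_MS):
    """Keep only the whole layers that fit within target_kbps."""
    budget = target_kbps * frame_ms
    kept = 0
    for size in layer_sizes_bits(frame_ms=frame_ms):
        if kept + size > budget:
            break
        kept += size
    if kept == 0:
        raise ValueError("target rate is below the core layer rate")
    return frame_bits[:kept]


def layers_received(num_bits, frame_ms=FRAME_MS):
    """Number of complete layers present in a frame of num_bits bits."""
    count = 0
    total = 0
    for size in layer_sizes_bits(frame_ms=frame_ms):
        total += size
        if total > num_bits:
            break
        count += 1
    return count
```

Under these assumptions a full frame holds 640 bits. Truncating it to 16 kbps leaves the first 320 bits, which the decoder recognises as three complete layers without any side information.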
By their very nature, layered (or scalable) coding schemes tend to have a hierarchical codec structure consisting of multiple coding stages. Typically, different coding techniques are used for the core (or base) layer and the additional layers. The coding methods used in the additional layers are then used either to code those parts of the signal which have not been coded by previous layers, or to code a residual signal from the previous stage. The residual signal is formed by subtracting a synthetic signal, i.e. a signal generated as a result of the previous stage, from the original. By adopting this hierarchical approach, a combination of coding methods makes it possible to reduce the output to relatively low bit rates while retaining sufficient quality, whilst also producing good quality audio reproduction at higher bit rates.
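The residual principle can be sketched as follows. As a deliberate simplification, each coding stage here is a uniform quantiser with a progressively finer step size, standing in for the different coding techniques a real layered codec would use; the function names and step sizes are hypothetical.

```python
def quantize(signal, step):
    """Stand-in for one coding stage: uniform quantisation with the given step."""
    return [round(x / step) * step for x in signal]


def encode_layers(signal, steps):
    """Code the signal in stages; each stage codes the previous stage's residual."""
    layers = []
    residual = list(signal)
    for step in steps:
        synthetic = quantize(residual, step)  # signal generated by this stage
        layers.append(synthetic)
        # Residual = input to this stage minus the synthetic signal it produced.
        residual = [r - s for r, s in zip(residual, synthetic)]
    return layers


def decode_layers(layers):
    """Sum whichever layers were received, core layer first."""
    output = [0.0] * len(layers[0])
    for layer in layers:
        output = [o + l for o, l in zip(output, layer)]
    return output


def max_error(original, decoded):
    """Largest absolute sample error between original and decoded signals."""
    return max(abs(a - b) for a, b in zip(original, decoded))
```

Decoding only `layers[:1]` gives the coarse core-layer reconstruction, and each further layer included in the sum reduces the error bound to half of that stage's step size.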
Some of the foreseen applications for embedded variable bit rate coding and its super wideband and stereo extension technologies include high quality audio conferencing and audio streaming services.
A further enhancement to an audio coder is to incorporate an audio signal classifier in order to characterise the signal. The classifier typically categorises the audio signal in terms of its statistical properties. The output from the classifier may be used to switch the mode of encoding such that the codec is better able to adapt to the input signal characteristics. Alternatively, the output from an audio signal classifier may be used to determine the encoding bit rate of an audio coder. One of the most commonly used audio signal classifiers is a voice activity detector for a cellular speech codec. This classifier is typically used in conjunction with a discontinuous transmission (DTX) system, whereby the classifier is used to detect silence regions in conversational speech.
However, in some audio coding systems it is desirable to distinguish between different types of audio signal, such as music and speech, by deploying an audio signal classifier.
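As an illustration of the voice activity detection case, the sketch below classifies frames by short-term energy and keeps the decision active for a few frames after speech ends (a hangover period). The energy threshold, the hangover length and the function names are assumptions chosen for the example; detectors used in deployed speech codecs rely on considerably more elaborate features.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def voice_activity(frames, threshold=0.01, hangover=2):
    """Return one True/False activity decision per frame.

    threshold and hangover are illustrative values, not taken from any codec.
    """
    decisions = []
    countdown = 0
    for frame in frames:
        if frame_energy(frame) > threshold:
            countdown = hangover  # restart the hangover after active speech
            decisions.append(True)
        elif countdown > 0:
            countdown -= 1  # stay active briefly to avoid clipping word endings
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


def frames_to_transmit(decisions):
    """Indices of frames a DTX-style system would send; the rest are suppressed."""
    return [i for i, active in enumerate(decisions) if active]
```

With six frames of which only the second contains energy, the detector marks frames 1 to 3 as active and the remaining frames as silence, so a discontinuous transmission scheme built on it would send only those three frames.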
Audio signal classification consists of extracting physical and perceptual features from a sound, and using these features to identify into which of a set of classes the sound is most likely to fit. An audio signal classification system may consist of a number of processing stages, where each stage can comprise one or more relatively complex algorithms. For instance, a typical audio signal classification system may deploy a feature extraction stage which is used to reduce and extract the physical data upon which the classification is to be based. This is usually followed by a clustering stage using, for example, a k-means clustering algorithm in order to determine the mapping of feature values to corresponding categories. Incorporated into most classification systems is a duration analysis stage which is performed over the length of the feature in order to improve the performance of the system. This analysis is usually implemented in the form of a hidden Markov model.
A typical audio signal classification system will therefore require a considerable amount of computational processing power in order to operate effectively.
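The first two stages of such a pipeline can be sketched in a few lines. The features chosen here (zero-crossing rate and log energy), the deterministic centroid initialisation and all function names are assumptions made for illustration, and the duration analysis stage is omitted. Even this reduced form loops over every sample of every frame and then iterates over all feature points, which hints at the computational cost of a complete system.

```python
import math


def extract_features(frame):
    """Reduce one frame to (zero-crossing rate, log10 energy)."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (len(frame) - 1)
    energy = sum(x * x for x in frame) / len(frame)
    return (zcr, math.log10(energy + 1e-12))  # small offset avoids log(0)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain k-means; initial centroids are evenly spaced input points."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                updated.append(tuple(sum(c) / len(cluster)
                                     for c in zip(*cluster)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def classify(frames, k=2):
    """Map each frame to a cluster index via its extracted features."""
    points = [extract_features(f) for f in frames]
    centroids = kmeans(points, k)
    return [nearest(p, centroids) for p in points]
```

Applied to a mix of slowly varying tonal frames and rapidly alternating frames, the clustering separates the two groups mainly on zero-crossing rate. Note that the cluster indices carry no meaning by themselves; assigning labels such as speech or music to clusters would need a further, labelled training step.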