An audio signal can be a speech signal or another type of audio signal, like music, and for different types of audio signals different coding models might be appropriate.
A widely used technique for coding speech signals is Algebraic Code-Excited Linear Prediction (ACELP) coding. ACELP models the human speech production system, and it is very well suited for coding the periodicity of a speech signal. As a result, a high speech quality can be achieved at very low bit rates. Adaptive Multi-Rate Wideband (AMR-WB), for example, is a speech codec which is based on the ACELP technology. AMR-WB has been described, for instance, in the technical specification 3GPP TS 26.190: “Speech Codec speech processing functions; AMR Wideband speech codec; Transcoding functions”, V5.1.0 (2001-12). Speech codecs which are based on the human speech production system, however, usually perform rather poorly for other types of audio signals, like music.
A widely used technique for coding audio signals other than speech is transform coding (TCX). The superiority of transform coding for audio signals is based on perceptual masking and frequency domain coding. The quality of the resulting audio signal can be further improved by selecting a suitable coding frame length for the transform coding. But while transform coding techniques result in a high quality for audio signals other than speech, their performance is not good for periodic speech signals when operating at low bit rates. Therefore, the quality of transform coded speech is usually rather low, especially with long TCX frame lengths.
The extended AMR-WB (AMR-WB+) codec encodes a stereo audio signal as a high bit-rate mono signal and provides some side information for a stereo extension. The AMR-WB+ codec utilizes both ACELP and TCX models to encode the core mono signal in a frequency band of 0 Hz to 6400 Hz. For the TCX model, a coding frame length of 20 ms, 40 ms or 80 ms is utilized.
Since an ACELP model can degrade the audio quality and transform coding usually performs poorly for speech, especially when long coding frames are employed, the best coding model in each case has to be selected depending on the properties of the signal which is to be coded. The selection of the coding model which is actually to be employed can be carried out in various ways.
In systems requiring low-complexity techniques, like mobile multimedia services (MMS), music/speech classification algorithms are usually exploited for selecting the optimal coding model. These algorithms classify the entire source signal either as music or as speech based on an analysis of the energy and the frequency properties of the audio signal.
If an audio signal consists only of speech or only of music, it will be satisfactory to use the same coding model for the entire signal based on such a music/speech classification. In many other cases, however, the audio signal which is to be encoded is a mixed type of audio signal. For example, speech may be present at the same time as music and/or be temporally alternating with music in the audio signal.
In these cases, classifying entire source signals into a music category or a speech category is too limited an approach. The overall audio quality can then only be maximized by temporally switching between the coding models when coding the audio signal. That is, the ACELP model is also partly used for coding a source signal classified as an audio signal other than speech, while the TCX model is also partly used for a source signal classified as a speech signal.
The extended AMR-WB (AMR-WB+) codec is also designed for coding such mixed types of audio signals with mixed coding models on a frame-by-frame basis.
The selection, that is, the classification, of coding models in AMR-WB+ can be carried out in several ways.
In the most complex approach, the signal is first encoded with all possible combinations of ACELP and TCX models. Next, the signal is synthesized again for each combination. The best excitation is then selected based on the quality of the synthesized speech signals. The quality of the synthesized speech resulting with a specific combination can be measured for example by determining its signal-to-noise ratio (SNR). This analysis-by-synthesis type of approach will provide good results. In some applications, however, it is not practicable, because of its very high complexity. Such applications include, for example, mobile applications. The complexity results largely from the ACELP coding, which is the most complex part of an encoder.
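The closed-loop selection described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the AMR-WB+ procedure itself: `coarse` and `fine` are hypothetical stand-in coders (plain quantizers) that take the place of real ACELP and TCX encode/decode chains, and all function names are invented for the example.

```python
import math


def snr_db(original, synthesized):
    """Signal-to-noise ratio in dB between a frame and its synthesized version."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, synthesized))
    if noise == 0.0:
        return math.inf   # perfect reconstruction
    if signal == 0.0:
        return -math.inf  # silent input, noisy output
    return 10.0 * math.log10(signal / noise)


def select_closed_loop(frame, candidates):
    """Encode and re-synthesize the frame with every candidate, keep the best SNR.

    candidates maps a label to a function frame -> synthesized frame.
    """
    scores = {name: snr_db(frame, codec(frame)) for name, codec in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical stand-ins for real coders: two quantizers of different resolution.
def coarse(frame):
    return [round(x * 2) / 2 for x in frame]


def fine(frame):
    return [round(x * 8) / 8 for x in frame]


if __name__ == "__main__":
    best, scores = select_closed_loop([0.1, 0.3, -0.2, 0.45],
                                      {"coarse": coarse, "fine": fine})
    print(best, scores)
```

Every candidate has to run a complete encode and synthesis pass before the comparison can be made, which is why the cost of this approach grows with the number of model combinations.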
In systems like MMS, for example, the above-mentioned full closed-loop analysis-by-synthesis approach is far too complex to perform. In an MMS encoder, therefore, lower-complexity open-loop methods may be employed in the classification for determining whether an ACELP coding model or a TCX model is to be used for encoding a particular frame.
AMR-WB+ may use various low-complexity open-loop approaches for selecting the respective coding model for each frame. The selection logic employed in such approaches aims at evaluating the source signal characteristics and encoding parameters in more detail for selecting a respective coding model.
One proposed selection logic within a classification procedure involves first splitting up an audio signal within each frame into several frequency bands, and analyzing the relation between the energy in the lower frequency bands and the energy in the higher frequency bands, as well as analyzing the energy level variations in those bands. The audio content in each frame of the audio signal is then classified as a music-like content or a speech-like content based on both of the performed measurements or on different combinations of these measurements using different analysis windows and decision threshold values.
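The two measurements named above, the relation between low-band and high-band energy and the variation of the band energies over time, can be sketched as follows. This is a minimal sketch under assumptions: the number of bands, the naive DFT, and all function names are chosen for illustration, and no decision thresholds are shown because the text does not specify them.

```python
import cmath
import math
import statistics


def band_energies(frame, n_bands=4):
    """Energy per frequency band, using a naive DFT over the lower half spectrum."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def low_high_ratio(frame, n_bands=4):
    """Energy in the lower half of the bands divided by energy in the upper half."""
    bands = band_energies(frame, n_bands)
    low = sum(bands[: n_bands // 2])
    high = sum(bands[n_bands // 2:])
    return low / (high + 1e-12)  # small constant avoids division by zero


def band_energy_variation(frames, n_bands=4):
    """Standard deviation of each band's energy across consecutive frames."""
    per_frame = [band_energies(f, n_bands) for f in frames]
    return [statistics.pstdev(e[b] for e in per_frame) for b in range(n_bands)]


if __name__ == "__main__":
    n = 64
    low_tone = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
    high_tone = [math.sin(2 * math.pi * 28 * t / n) for t in range(n)]
    print(low_high_ratio(low_tone), low_high_ratio(high_tone))
    print(band_energy_variation([low_tone, high_tone]))
```

A classifier of the kind described would compare such features against decision thresholds over one or more analysis windows; those thresholds would have to be tuned and are deliberately left out here.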
In another proposed selection logic aiding the classification, which can be used in particular in addition to the first selection logic and which is therefore also referred to as model classification refinement, the coding model selection is based on an evaluation of the periodicity and the stationary properties of the audio content in a respective frame of the audio signal. Periodicity and stationary properties are evaluated more specifically by determining correlation, Long Term Prediction (LTP) parameters and spectral distance measurements.
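One way to picture the periodicity measurement is an open-loop search for the lag that maximizes the normalized correlation of a frame with a delayed copy of itself, which is the basic idea behind a long-term prediction lag. The sketch below is an assumption-laden illustration rather than the codec's actual procedure: the lag range and the function names are hypothetical, and the spectral distance measurement is not covered.

```python
import math


def normalized_correlation(frame, lag):
    """Correlation between the frame and itself delayed by `lag`, in [-1, 1]."""
    num = sum(frame[t] * frame[t - lag] for t in range(lag, len(frame)))
    e_cur = sum(frame[t] ** 2 for t in range(lag, len(frame)))
    e_del = sum(frame[t - lag] ** 2 for t in range(lag, len(frame)))
    den = math.sqrt(e_cur * e_del)
    return num / den if den > 0.0 else 0.0


def best_lag(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest normalized correlation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: normalized_correlation(frame, lag))


if __name__ == "__main__":
    # Synthetic frame with a period of 20 samples.
    frame = [math.sin(2 * math.pi * t / 20) for t in range(160)]
    lag = best_lag(frame, 10, 30)
    print(lag, normalized_correlation(frame, lag))
```

A correlation close to 1 at the best lag indicates strongly periodic content, while a low maximum suggests content without a clear pitch structure.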
The AMR-WB+ codec additionally allows switching, during the coding of an audio stream, between AMR-WB modes, which employ exclusively an ACELP coding model, and extension modes, which employ either an ACELP coding model or a TCX model, provided that the sampling frequency does not change. The sampling frequency can be, for example, 16 kHz.
The extension modes output a higher bit rate than the AMR-WB modes. A switch from an extension mode to an AMR-WB mode can thus be of advantage when transmission conditions in the network connecting the encoding end and the decoding end require a change from a higher bit-rate mode to a lower bit-rate mode to reduce congestion in the network. A change from a higher bit-rate mode to a lower bit-rate mode might also be required for incorporating new low-end receivers in a Mobile Broadcast/Multicast Service (MBMS).
A switch from an AMR-WB mode to an extension mode, on the other hand, can be of advantage when a change in the transmission conditions in the network allows a change from a lower bit-rate mode to a higher bit-rate mode. Using a higher bit-rate mode enables a better audio quality.
Since the core codec uses the same sampling rate of 6.4 kHz for the AMR-WB modes and the AMR-WB+ extension modes and employs at least partially similar coding techniques, a change from an extension mode to an AMR-WB mode, or vice versa, at this frequency band can be handled smoothly. As the ACELP core-band coding process is slightly different for an AMR-WB mode and an extension mode, care has to be taken, however, that all required state variables and buffers are stored and copied from one algorithm to the other when switching between the coder modes.
Further, it has to be taken into account that a transform model can only be used in the extension modes.
For encoding a specific coding frame, the TCX model makes use of overlapping windows. This is illustrated in FIG. 1. FIG. 1 is a diagram presenting a time line with a plurality of coding frames and a plurality of overlapping analysis windows. For coding a TCX frame, a window covering the current TCX frame and a preceding TCX frame is used. Such a TCX frame 11 and a corresponding overlapping window 12 are indicated in the diagram with solid bold lines. The next TCX frame 13 and a corresponding window 14 are indicated in the diagram with dashed bold lines. In the presented example, the analysis windows overlap by 50%, even though in practice the overlap is usually smaller.
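The window layout of the 50% example can be checked with a few lines of index arithmetic. The sketch assumes, as in the presented example, that each window spans the whole preceding frame plus the current frame; the frame length of 256 samples and the function names are hypothetical.

```python
def window_span(frame_index, frame_len):
    """Sample range [start, end) of the analysis window for one TCX frame.

    Assumes the window covers the entire preceding frame and the current frame.
    """
    start = (frame_index - 1) * frame_len
    end = (frame_index + 1) * frame_len
    return start, end


def overlap_fraction(span_a, span_b):
    """Share of window A that is also covered by window B."""
    overlap = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    return overlap / (span_a[1] - span_a[0])


if __name__ == "__main__":
    first = window_span(1, 256)
    second = window_span(2, 256)
    print(first, second, overlap_fraction(first, second))
```

With a smaller overlap, as is usual in practice, only the start of each window would reach back into the preceding frame, and `window_span` would have to be adapted accordingly.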
In a typical operation within the AMR-WB extension mode, an overlapping signal for the respective next frame is generated based on information on the current frame after the current frame has been encoded.
When the transform coding model is used for a current coding frame, the overlapping signal for a next coding frame is generated by definition, since the analysis windows for the transform are overlapping.
The ACELP coding model, in contrast, relies only on information from the current coding frame, that is, it does not use overlapping windows. If an ACELP coding frame is followed by a TCX frame, the ACELP algorithm is therefore required to generate an overlap signal artificially, that is, in addition to the actual ACELP related processing.
FIG. 2 presents a typical situation in an extension mode, in which an artificial overlap signal has to be generated for a TCX frame because it follows an ACELP frame. The ACELP coding frame 21 and the artificial overlap signal 22 for the TCX frame 23 are indicated with dashed bold lines. The TCX frame 23 and the overlap signal 24 from and for the TCX frame 23 are indicated with solid bold lines. Since ACELP coding does not require any overlapping signal from the previous coding frame, no overlapping signal is generated if an ACELP frame is followed by a further ACELP frame.
In the AMR-WB extension modes, the artificial overlap signal generation in the ACELP mode is a built-in feature. Hence, the switching between ACELP coding and TCX is smooth.
There remains a problem, however, when switching in an AMR-WB+ codec from a standard AMR-WB mode to an extension mode. The standard AMR-WB mode does not provide any artificial overlap signal generation, since an overlap signal is not needed in this coder mode. Hence, if the audio signal frame after a switch from an AMR-WB mode to an extension mode is selected to be a TCX frame, the coding cannot be performed properly. As a result, the missing overlapping signal part will cause audible artifacts in the synthesis of the audio signal.
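The overlap rules described above can be summarized in a small sketch that reports whether an overlap signal is available for a frame about to be coded with TCX. The enumerations and function names are hypothetical and serve only to restate the described behavior; they do not correspond to any codec interface.

```python
from enum import Enum


class Mode(Enum):
    AMR_WB = "amr-wb"        # standard mode, ACELP only
    EXTENSION = "extension"  # extension mode, ACELP or TCX


class Model(Enum):
    ACELP = "acelp"
    TCX = "tcx"


def overlap_available(prev_mode, prev_model):
    """True if the previous frame left an overlap signal for a following TCX frame."""
    if prev_mode is Mode.AMR_WB:
        if prev_model is Model.TCX:
            raise ValueError("TCX can only be used in the extension modes")
        return False  # standard AMR-WB ACELP generates no overlap signal
    # Extension mode: TCX yields the overlap through its overlapping windows,
    # and ACELP generates an artificial overlap signal.
    return True


def tcx_frame_codable(prev_mode, prev_model):
    """A TCX frame can be coded properly only if an overlap signal exists."""
    return overlap_available(prev_mode, prev_model)


if __name__ == "__main__":
    for prev in [(Mode.EXTENSION, Model.TCX),
                 (Mode.EXTENSION, Model.ACELP),
                 (Mode.AMR_WB, Model.ACELP)]:
        print(prev[0].value, prev[1].value, tcx_frame_codable(*prev))
```

The last case, a TCX frame directly after a frame coded in a standard AMR-WB mode, is the one in which the overlap signal is missing.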