It is known to encode audio signals for enabling an efficient transmission and/or storage of audio signals.
An audio signal can be a speech signal or another type of audio signal, like music, and for different types of audio signals different coding models might be appropriate.
A widely used technique for coding speech signals is the Algebraic Code-Excited Linear Prediction (ACELP) coding. ACELP models the human speech production system, and it is very well suited for coding the periodicity of a speech signal. As a result, a high speech quality can be achieved with very low bit rates. Adaptive Multi-Rate Wideband (AMR-WB), for example, is a speech codec that is based on the ACELP technology. AMR-WB has been described for instance in the technical specification 3GPP TS 26.190: “Speech Codec speech processing functions; AMR Wideband speech codec; Transcoding functions”, V5.1.0 (2001-12). Speech codecs which are based on the human speech production system, however, perform usually rather badly for other types of audio signals, like music.
A widely used technique for coding other audio signals than speech is transform coding (TCX). The superiority of transform coding for audio signal is based on perceptual masking and frequency domain coding. The quality of the resulting audio signal can be further improved by selecting a suitable coding frame length for the transform coding. But while transform coding techniques result in a high quality for audio signals other than speech, their performance is not good for periodic speech signals. Therefore, the quality of transform-coded speech is usually rather low, especially with long TCX frame lengths.
The extended AMR-WB (AMR-WB+) codec encodes a stereo audio signal as a high bitrate mono signal and provides some side information for a stereo extension. The AMR-WB+ codec utilizes both, ACELP coding and TCX models to encode the core mono signal in a frequency band of 0 Hz to 6400 Hz. For the TCX model, a coding frame length of 20 ms, 40 ms or 80 ms is utilized.
Since an ACELP model can degrade the audio quality and transform coding performs usually poorly for speech, especially when long coding frames are employed, the respectively best coding model has to be selected. The selection of the coding model that is actually to be employed can be carried out in various ways.
In systems requiring low complex techniques, like mobile multimedia services (MMS), usually music/speech classification algorithms are exploited for selecting the optimal coding model. These algorithms classify the entire source signal either as music or as speech based on an analysis of the energy and the frequency of the audio signal.
If an audio signal consists only of speech or only of music, it will be satisfactory to use the same coding model for the entire signal based on such a music/speech classification. In many other cases, however, the audio signal that is to be encoded is a mixed type of audio signal. For example, speech may be present at the same time as music and/or be alternating with music in the audio signal.
In these cases, a classification of entire source signals into music or a speech category is a too limited approach. Switching between the coding models when coding the audio signal can then only maximize the overall audio quality. That is, the ACELP model is partly used as well for coding a source signal classified as an audio signal other than speech, while the TCX model is partly used as well for a source signal classified as a speech signal.
The extended AMR-WB (AMR-WB+) codec is designed as well for coding such mixed types of audio signals with mixed coding models on a frame-by-frame basis.
The selection of coding models in AMR-WB+ can be carried out in several ways.
In the most complex approach, the signal is first encoded with all possible combinations of ACELP and TCX models. Next, the signal is synthesized again for each combination. The best excitation is then selected based on the quality of the synthesized speech signals. The quality of the synthesized speech resulting with a specific combination can be measured for example by determining its signal-to-noise ratio (SNR). This analysis-by-synthesis type of approach will provide good results. In some applications, however, it is not practicable, because of its very high complexity. The complexity results largely from the ACELP coding, which is the most complex part of an encoder.
In systems like MMS, for example, the full closed-loop analysis-by-synthesis approach is far too complex to perform. In an MMS encoder, therefore, a low complex open-loop method is employed for determining whether an ACELP coding model or a TCX model is selected for encoding a particular frame.
AMR-WB+ offers two different low-complex open-loop approaches for selecting the respective coding model for each frame. Both open-loop approaches evaluate source signal characteristics and encoding parameters for selecting a respective coding model.
In the first open-loop approach, an audio signal is first split up within each frame into several frequency bands, and the relation between the energy in the lower frequency bands and the energy in the higher frequency bands is analyzed, as well as the energy level variations in those bands. The audio content in each frame of the audio signal is then classified as a music-like content or a speech-like content based on both of the performed measurements or on different combinations of these measurements using different analysis windows and decision threshold values.
In the second open-loop approach, which is also referred to as model classification refinement, the coding model selection is based on an evaluation of the periodicity and the stationary properties of the audio content in a respective frame of the audio signal. Periodicity and stationary properties are evaluated more specifically by determining correlation, Long Term Prediction (LTP) parameters and spectral distance measurements.
If the signal properties are analyzed with an open-loop approach for selecting either ACELP or TCX, and TCX is selected for encoding, it is still necessary to define the to be used TCX frame length one of 20 ms, 40 ms or 80 ms. The optimal frame length for TCX, however, is very difficult to select based on signal characteristics in an open-loop approach.
It would thus be possible to select only the TCX frame lengths in the above-mentioned analysis-by-synthesis approach. In systems requiring low complex techniques, however, the analysis-by-synthesis approach is too complex, even if it is only used for the selection of TCX frame lengths.