Cellular communication systems are commonplace today and typically operate in accordance with a given standard or specification. For example, the standard or specification may define the communication protocols and/or parameters that shall be used for a connection. Examples of the different standards and/or specifications include, without limitation, GSM (Global System for Mobile communications), GSM/EDGE (Enhanced Data rates for GSM Evolution), AMPS (Advanced Mobile Phone System), WCDMA (Wideband Code Division Multiple Access) or 3rd generation (3G) UMTS (Universal Mobile Telecommunications System), IMT 2000 (International Mobile Telecommunications 2000) and so on.
In a cellular communication system, and in signal processing applications generally, a signal is often compressed to reduce the amount of information needed to represent it. For example, an audio signal is typically captured as an analogue signal, digitised in an analogue to digital (A/D) converter and then encoded. In a cellular communication system, the encoded signal can be transmitted over the wireless air interface between a user equipment, such as a mobile terminal, and a base station. Alternatively, as in more general signal processing systems, the encoded audio signal can be stored in a storage medium for later use or reproduction of the audio signal.
The encoding compresses the signal so that, as in a cellular communication system, it can be transmitted over the air interface with the minimum amount of data whilst maintaining an acceptable signal quality level. This is particularly important because radio channel capacity over the wireless air interface is limited in a cellular communication system.
An ideal encoding method will encode the audio signal in as few bits as possible thereby optimising channel capacity, while producing a decoded signal that sounds as close to the original audio as possible. In practice there is usually a trade-off between the bit rate of the compression method and the quality of the decoded speech.
The compression or encoding can be lossy or lossless. In lossy compression some information is discarded during compression, so the original signal cannot be fully reconstructed from the compressed signal. In lossless compression no information is lost and the original signal can be fully reconstructed from the compressed signal.
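The distinction can be illustrated with a minimal sketch. It is not drawn from any of the codecs discussed here: the lossless path uses the standard zlib (DEFLATE) compressor, and the lossy path is a deliberately crude stand-in that discards the low-order bits of each 8-bit sample. The function names and the sample values are assumptions made for illustration only.

```python
import zlib


def lossless_roundtrip(samples: bytes) -> bytes:
    """Compress and decompress with zlib; the output equals the input."""
    return zlib.decompress(zlib.compress(samples))


def lossy_roundtrip(samples: bytes, dropped_bits: int = 4) -> bytes:
    """Zero the low-order bits of each 8-bit sample (coarse requantisation).

    The discarded bits cannot be recovered, so the original samples
    cannot in general be reconstructed from the result.
    """
    mask = (0xFF << dropped_bits) & 0xFF
    return bytes(s & mask for s in samples)


# Illustrative 8-bit samples (hypothetical data).
original = bytes([0, 17, 34, 51, 68, 85, 102, 119, 136, 153])
exact = lossless_roundtrip(original)
approximate = lossy_roundtrip(original)
```

Here `exact` is identical to `original`, whereas `approximate` differs from it wherever a sample had non-zero low-order bits.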
An audio signal can be considered as a signal containing speech, music (or non-speech) or both. The different characteristics of speech and music make it difficult to design a single encoding method that works well for both speech and music. Often an encoding method that is optimal for speech signals is not optimal for music or non-speech signals. Therefore, to solve this problem, different encoding methods have been developed for encoding speech and music. However, the audio signal must be classified as speech or music before an appropriate encoding method can be selected.
Classifying an audio signal as either a speech signal or a music/non-speech signal is a difficult task. The required accuracy of the classification depends on the application using the signal. In some applications, such as speech recognition or archiving for storage and retrieval purposes, the accuracy is more critical.
However, it is possible that an encoding method for parts of the audio signal consisting mainly of speech is also very efficient for parts consisting mainly of music. Indeed, it is possible that an encoding method for music with strong tonal components may be very suitable for speech. Therefore, methods that classify an audio signal based purely on whether the signal is made up of speech or music do not necessarily result in the selection of the optimal compression method for the audio signal.
The adaptive multi-rate (AMR) codec is an encoding method developed by the 3rd Generation Partnership Project (3GPP) for GSM/EDGE and WCDMA communication networks. In addition, it has been envisaged that AMR will be used in future packet switched networks. AMR is based on Algebraic Code Excited Linear Prediction (ACELP) excitation encoding. The AMR and adaptive multi-rate wideband (AMR-WB) codecs have 8 and 9 active bit rates respectively and also include voice activity detection (VAD) and discontinuous transmission (DTX) functionality. The sampling rate in the AMR codec is 8 kHz. In the AMR-WB codec the sampling rate is 16 kHz.
Details of the AMR and AMR-WB codecs can be found in the 3GPP TS 26.090 and 3GPP TS 26.190 technical specifications. Further details of the AMR-WB codec and VAD can be found in the 3GPP TS 26.194 technical specification.
In another encoding method, the extended AMR-WB (AMR-WB+) codec, the encoding is based on two different excitation methods: ACELP pulse-like excitation and transform coded (TCX) excitation. The ACELP excitation is the same as that used already in the original AMR-WB codec. TCX excitation is an AMR-WB+ specific modification.
ACELP excitation encoding operates using a model of how a signal is generated at the source, and extracts from the signal the parameters of the model. More specifically, ACELP encoding is based on a model of the human vocal system, where the throat and mouth are modelled as a linear filter and a signal is generated by a periodic vibration of air exciting the filter. The signal is analysed on a frame by frame basis by the encoder and for each frame a set of parameters representing the modelled signal is generated and output by the encoder. The set of parameters may include excitation parameters and the coefficients for the filter as well as other parameters. The output from an encoder of this type is often referred to as a parametric representation of the input signal. The set of parameters is used by a suitably configured decoder to regenerate the input signal.
In the AMR-WB+ codec, linear prediction coding (LPC) is calculated in each frame of the signal to model the spectral envelope of the signal as a linear filter. The result of the LPC, known as the LPC excitation, is then encoded using ACELP excitation or TCX excitation.
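As a rough illustration of this step, the following minimal sketch computes linear prediction coefficients for one frame and the resulting prediction residual, which corresponds to what the text calls the LPC excitation. It assumes the autocorrelation method with the Levinson-Durbin recursion, a common textbook approach; it is not taken from the codec specification, and the function names, the prediction order and the test frame are hypothetical. Windowing and the other pre-processing a real codec would apply are omitted.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion.

    Returns (coeffs, error) where coeffs[k-1] is a_k in the predictor
    x_hat[n] = sum_k a_k * x[n-k], and error is the final prediction
    error energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predicted frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_residual(frame, coeffs):
    """Prediction residual e[n] = x[n] - sum_k a_k * x[n-k]."""
    residual = []
    for n, x in enumerate(frame):
        predicted = sum(a * frame[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(x - predicted)
    return residual


# Hypothetical test frame: a decaying exponential, which a first-order
# predictor with coefficient close to 0.9 models almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, _ = lpc_coefficients(frame, order=2)
residual = lpc_residual(frame, coeffs)
```

For this frame the residual carries far less energy than the frame itself, which is why encoding the residual together with the filter coefficients can be more compact than encoding the samples directly.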
Typically, ACELP excitation utilises long term predictors and fixed codebook parameters, whereas TCX excitation utilises Fast Fourier Transforms (FFTs). Furthermore, in the AMR-WB+ codec the TCX excitation can be performed using one of three different frame lengths (20, 40 and 80 ms).
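To give a sense of scale, the three frame lengths can be converted to sample counts. The sketch below assumes a 16 kHz sampling rate, borrowed from the AMR-WB figure quoted earlier; the rate at which AMR-WB+ actually processes TCX frames is not stated here and may differ. The function and constant names are hypothetical.

```python
def samples_per_frame(frame_ms: int, sample_rate_hz: int = 16000) -> int:
    """Number of samples in a frame of the given duration."""
    return frame_ms * sample_rate_hz // 1000


# Frame lengths named in the text, in milliseconds.
TCX_FRAME_LENGTHS_MS = (20, 40, 80)

# Assumed 16 kHz rate; pass a different sample_rate_hz to change it.
frame_sizes = {ms: samples_per_frame(ms) for ms in TCX_FRAME_LENGTHS_MS}
```

Under this assumption the frames hold 320, 640 and 1280 samples respectively, so a longer frame gives the transform finer frequency resolution at the cost of coarser time resolution.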
TCX excitation is widely used in non-speech audio encoding. The superiority of TCX excitation based encoding for non-speech signals is due to the use of perceptual masking and frequency domain coding. Even though TCX techniques provide superior quality for music signals, the quality is not as good for periodic speech signals. Conversely, codecs based on a model of the human speech production system, such as ACELP, provide superior quality for speech signals but poor quality for music signals.
Therefore, in general, ACELP excitation is mostly used for encoding speech signals and TCX excitation is mostly used for encoding music and other non-speech signals. However, this is not always the case, as sometimes a speech signal has parts that are music-like and a music signal has parts that are speech-like. There also exist audio signals that contain both music and speech, for which an encoding method based solely on one of ACELP excitation or TCX excitation may not be optimal.
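One reason frequency domain coding suits tonal material can be seen in a small sketch: after a transform, a steady tone concentrates almost all of its energy in a few coefficients, whereas a transient spreads its energy across all of them. The code below uses a naive discrete Fourier transform for clarity rather than an FFT, and is not the TCX transform itself; the function names and the two test frames are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of a real frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def energy_compaction(frame, keep):
    """Fraction of spectral energy held by the `keep` strongest bins."""
    power = sorted((abs(c) ** 2 for c in dft(frame)), reverse=True)
    total = sum(power)
    return sum(power[:keep]) / total if total else 0.0


N = 64
# A pure tone completing exactly 5 cycles in the frame (hypothetical data).
tone = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
# A single-sample click, whose spectrum is flat.
click = [1.0] + [0.0] * (N - 1)

tone_compaction = energy_compaction(tone, keep=2)
click_compaction = energy_compaction(click, keep=2)
```

For the tone, two bins (the positive and negative frequency of the sinusoid) hold essentially all of the energy, while for the click the same two bins hold only 2/64 of it.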
The selection of excitation in AMR-WB+ can be done in several ways.
The first and simplest method is to analyse the signal properties once before encoding the signal, thereby classifying the signal as speech or music/non-speech and selecting the better of ACELP and TCX excitation for that type of signal. This is known as a “pre-selection” method. However, such a method is not suited to a signal whose characteristics vary between speech and music, and results in an encoded signal that is optimised for neither speech nor music.
The more complex method is to encode the audio signal using both ACELP and TCX excitation and then select the excitation whose synthesised audio signal is of better quality. The signal quality can be measured using a signal-to-noise type of algorithm. This “analysis-by-synthesis” method, also known as the “brute-force” method because all the different excitations are calculated and the best one selected, provides good results but is not practical because of the computational complexity of performing multiple encodings.
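The structure of pre-selection, and its limitation, can be shown in a minimal sketch. The classifier is passed in as a hypothetical callable because no particular classification algorithm is specified here; the toy classifier and the frame data below are placeholders for illustration, not a real speech/music detector.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def preselect_excitation(
    frames: List[Frame],
    is_speech_like: Callable[[List[Frame]], bool],
) -> List[str]:
    """Classify the whole signal once and reuse the decision for every frame."""
    choice = "ACELP" if is_speech_like(frames) else "TCX"
    return [choice] * len(frames)


def toy_classifier(frames: List[Frame]) -> bool:
    """Placeholder only: treats a signal as speech-like if its first frame
    has a non-zero first sample. A real classifier would analyse signal
    properties; this stand-in exists just to make the sketch runnable."""
    return bool(frames) and frames[0][0] != 0.0


# Hypothetical signal of three frames whose character changes over time.
frames = [[0.5, 0.1], [0.0, 0.0], [0.2, -0.3]]
decisions = preselect_excitation(frames, toy_classifier)
```

Whatever the classifier decides, every frame receives the same excitation, so a signal that changes character partway through cannot be tracked.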
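The brute-force selection can be sketched as follows, under stated assumptions. Each candidate is represented by a hypothetical encode-then-decode round-trip function, and quality is measured with a plain signal-to-noise ratio in decibels. The two candidates below are simple quantisers used as stand-ins, not ACELP or TCX, and all names are chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Roundtrip = Callable[[Sequence[float]], List[float]]


def snr_db(original: Sequence[float], decoded: Sequence[float]) -> float:
    """Signal-to-noise ratio of a decoded frame against the original, in dB."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def select_by_synthesis(
    frame: Sequence[float], candidates: Dict[str, Roundtrip]
) -> Tuple[str, float]:
    """Run every candidate round trip and keep the one with the highest SNR.

    The cost grows with the number of candidates because each one must
    fully encode and synthesise the frame before it can be compared.
    """
    scores = {name: snr_db(frame, run(frame)) for name, run in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Stand-in candidates (hypothetical): two quantisers of different step size.
candidates: Dict[str, Roundtrip] = {
    "coarse": lambda f: [float(round(x)) for x in f],
    "fine": lambda f: [round(x, 1) for x in f],
}

frame = [0.11, -0.52, 0.33, 0.74]
best_name, best_snr = select_by_synthesis(frame, candidates)
```

In this toy case the finer quantiser wins. In a real encoder the same loop would have to run for every frame and every excitation variant, which is the source of the computational cost noted above.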
It is the aim of embodiments of the present invention to provide an improved method for selecting an excitation method for encoding a signal that at least partly mitigates some of the above problems.