In many audio signal processing applications, audio signals are compressed to reduce the processing power required when processing the audio signal. For example, in digital communication systems an audio signal is typically captured as an analogue signal, digitised in an analogue to digital (A/D) converter and then encoded before transmission over a wireless air interface between a user equipment, such as a mobile station, and a base station. The purpose of the encoding is to compress the digitised signal and transmit it over the air interface with the minimum amount of data whilst maintaining an acceptable signal quality level. This is particularly important because radio channel capacity over the wireless air interface is limited in a cellular communication network. There are also applications in which a digitised audio signal is stored on a storage medium for later reproduction of the audio signal.
The compression can be lossy or lossless. In lossy compression some information is lost during the compression, so that it is not possible to fully reconstruct the original signal from the compressed signal. In lossless compression no information is normally lost, and hence the original signal can usually be completely reconstructed from the compressed signal.
The term audio signal is normally understood to mean a signal containing speech, music (non-speech) or both. The different nature of speech and music makes it rather difficult to design one compression algorithm which works well enough for both. Therefore, the problem is often solved by designing different algorithms for speech and music, using some kind of recognition method to determine whether the audio signal is speech-like or music-like, and selecting the appropriate algorithm according to the recognition.
Overall, classifying purely between speech and music or non-speech signals is a difficult task. The required accuracy depends heavily on the application. In some applications, such as speech recognition or accurate archiving for storage and retrieval purposes, the accuracy is more critical. However, the situation is somewhat different if the classification is used to select an optimal compression method for the input signal. In this case, there may not exist one compression method that is always optimal for speech and another that is always optimal for music or non-speech signals. In practice, a compression method designed for speech transients may also be very efficient for music transients, and a music compression method designed for strong tonal components may be good for voiced speech segments. So, in these instances, methods that classify purely between speech and music do not yield the most optimal algorithm for selecting the best compression method.
Often speech can be considered as band-limited to between approximately 200 Hz and 3400 Hz. The typical sampling rate used by an A/D converter to convert an analogue speech signal into a digital signal is either 8 kHz or 16 kHz. Music or other non-speech signals may contain frequency components well above the normal speech bandwidth. In some applications the audio system should be able to handle a frequency band between about 20 Hz and 20 000 Hz. The sampling rate for that kind of signal should be at least 40 000 Hz to avoid aliasing. It should be noted here that the above-mentioned values are just non-limiting examples. For example, in some systems the upper limit for music signals may be about 10 000 Hz or even lower.
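The sampling-rate requirement above follows directly from the Nyquist criterion: the sampling rate must be at least twice the highest frequency component present in the signal, or aliasing results. A minimal sketch of this relationship (the function name is illustrative, not part of any codec specification):

```python
def min_sampling_rate(max_frequency_hz: float) -> float:
    """Nyquist criterion: sample at no less than twice the highest
    frequency component of the signal to avoid aliasing."""
    return 2.0 * max_frequency_hz

# A speech signal band-limited to about 3400 Hz needs at least 6800 Hz,
# so the common 8 kHz sampling rate comfortably suffices.
print(min_sampling_rate(3400))    # 6800.0
# A full audio band extending to 20 000 Hz needs at least 40 000 Hz.
print(min_sampling_rate(20000))   # 40000.0
```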
The sampled digital signal is then encoded, usually on a frame-by-frame basis, resulting in a digital data stream with a bit rate that is determined by the codec used for encoding. The higher the bit rate, the more data is encoded, which results in a more accurate representation of the input frame. The encoded audio signal can then be decoded and passed through a digital to analogue (D/A) converter to reconstruct a signal which is as near to the original signal as possible.
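As a concrete illustration of frame-by-frame operation, the sketch below splits a sampled signal into fixed-length frames. The 20 ms frame length is an assumption chosen for illustration (it is a common choice in speech codecs), not a value stated in this document:

```python
def split_into_frames(samples, frame_len):
    """Split a sampled signal into consecutive frames of frame_len
    samples each, dropping any incomplete frame at the end."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

# Illustrative numbers: 8 kHz sampling with assumed 20 ms frames
# gives 160 samples per frame.
sample_rate = 8000
frame_len = sample_rate * 20 // 1000   # 160 samples
frames = split_into_frames(list(range(1000)), frame_len)
print(len(frames))        # 6 complete frames from 1000 samples
print(len(frames[0]))     # 160
```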
An ideal codec will encode the audio signal with as few bits as possible, thereby optimising channel capacity, while producing a decoded audio signal that sounds as close to the original audio signal as possible. In practice there is usually a trade-off between the bit rate of the codec and the quality of the decoded audio.
At present there are numerous different codecs, such as the adaptive multi-rate (AMR) codec and the adaptive multi-rate wideband (AMR-WB) codec, which have been developed for compressing and encoding audio signals. AMR was developed by the 3rd Generation Partnership Project (3GPP) for GSM/EDGE and WCDMA communication networks. In addition, it has also been envisaged that AMR will be used in packet switched networks. AMR is based on Algebraic Code Excited Linear Prediction (ACELP) coding. The AMR and AMR-WB codecs consist of 8 and 9 active bit rates respectively, and both also include voice activity detection (VAD) and discontinuous transmission (DTX) functionality. At the moment, the sampling rate in the AMR codec is 8 kHz and in the AMR-WB codec the sampling rate is 16 kHz. It is obvious that the codecs and sampling rates mentioned above are just non-limiting examples.
ACELP coding operates using a model of how the signal source is generated, and extracts from the signal the parameters of the model. More specifically, ACELP coding is based on a model of the human vocal system, where the throat and mouth are modelled as a linear filter and speech is generated by a periodic vibration of air exciting the filter. The speech is analysed on a frame-by-frame basis by the encoder, and for each frame a set of parameters representing the modelled speech is generated and output by the encoder. The set of parameters may include excitation parameters and the coefficients for the filter, as well as other parameters. The output from a speech encoder is often referred to as a parametric representation of the input speech signal. The set of parameters is then used by a suitably configured decoder to regenerate the input speech signal.
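The source-filter model described above can be sketched in a few lines: a periodic excitation (modelling the vibration of air) is passed through an all-pole linear filter (modelling the throat and mouth). This is a minimal illustration only; the pitch period and the second-order LPC coefficients below are hypothetical values chosen for demonstration, not parameters of any actual codec:

```python
def lpc_synthesize(excitation, lpc_coeffs):
    """Pass an excitation signal through an all-pole LPC filter:
    y[n] = e[n] - sum_k a[k] * y[n-k]  (direct-form IIR)."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out

# Voiced speech modelled as a periodic pulse train
# (hypothetical pitch period of 40 samples).
excitation = [1.0 if n % 40 == 0 else 0.0 for n in range(160)]
# Hypothetical stable 2nd-order LPC coefficients (poles at radius 0.8).
synthesized = lpc_synthesize(excitation, [-1.3, 0.64])
```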
For some input signals the pulse-like ACELP excitation produces higher quality, and for some input signals transform coded excitation (TCX) is more optimal. It is assumed here that ACELP excitation is mostly used for typical speech content as an input signal, and TCX excitation is mostly used for typical music as an input signal. However, this is not always the case; sometimes a speech signal has parts which are music-like, and a music signal has parts which are speech-like. The definition of a speech-like signal in this application is that most speech belongs to this category, and some music may also belong to it. For music-like signals the definition is the other way around. Additionally, there are some speech signal parts and music signal parts that are neutral in the sense that they can belong to both classes.
The selection of excitation can be done in several ways. The most complex, and quite good, method is to encode both the ACELP and TCX excitations and then select the best excitation based on the synthesised speech signal. This analysis-by-synthesis type of method provides good results, but in some applications it is not practical because of its high complexity. In this method, for example, an SNR-type algorithm can be used to measure the quality produced by both excitations. This method can be called a “brute-force” method because it tries all the combinations of different excitations and selects the best one afterwards. A less complex method would perform the synthesis only once, by analysing the signal properties beforehand and then selecting the best excitation. The method can also be a combination of pre-selection and “brute-force” to make a compromise between quality and complexity.
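The “brute-force” selection can be sketched as follows: synthesise the frame with both excitations, measure the SNR of each synthesis against the original frame, and keep the winner. The function names and the use of plain per-frame SNR are illustrative assumptions; a real codec would use its own quality measure:

```python
import math

def snr_db(original, synthesized):
    """Signal-to-noise ratio in dB between an original frame and a
    synthesised version of it (higher means a closer match)."""
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((x - y) ** 2 for x, y in zip(original, synthesized))
    if noise_energy == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_energy / noise_energy)

def select_excitation(original, acelp_synth, tcx_synth):
    """'Brute-force' analysis-by-synthesis selection: try both
    excitations and keep whichever reproduces the frame better."""
    if snr_db(original, acelp_synth) >= snr_db(original, tcx_synth):
        return "ACELP"
    return "TCX"

# Toy frames: the ACELP synthesis is closer to the original here.
frame = [1.0, 2.0, 3.0]
print(select_excitation(frame, [1.0, 2.0, 3.1], [1.5, 2.0, 3.0]))  # ACELP
```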
FIG. 1 presents a simplified encoder 100 with prior-art high-complexity classification. An audio signal is input to the input signal block 101, in which the signal is digitised and filtered. The input signal block 101 also forms frames from the digitised and filtered signal. The frames are input to a linear prediction coding (LPC) analysis block 102, which performs an LPC analysis on the digitised input signal on a frame-by-frame basis to find the parameter set which best matches the input signal. The determined parameters (LPC parameters) are quantised and output 109 from the encoder 100. The encoder 100 also generates two output signals with LPC synthesis blocks 103, 104. The first LPC synthesis block 103 uses a signal generated by the TCX excitation block 105 to synthesise the audio signal in order to find the code vector producing the best result for the TCX excitation. The second LPC synthesis block 104 uses a signal generated by the ACELP excitation block 106 to synthesise the audio signal in order to find the code vector producing the best result for the ACELP excitation. In the excitation selection block 107 the signals generated by the LPC synthesis blocks 103, 104 are compared to determine which of the excitation methods gives the best (optimal) excitation. Information about the selected excitation method and the parameters of the selected excitation signal are, for example, quantised and channel coded 108 before the signals are output 109 from the encoder 100 for transmission.