A mobile communications system presents a challenging environment for voice transmission services. A voice call can take place virtually anywhere, and the surrounding background noise and acoustic conditions will have an impact on the quality and intelligibility of the transmitted speech. At the same time, there is strong motivation for limiting the transmission resources consumed by each communication device. Mobile communications services therefore employ compression technologies in order to reduce the transmission bandwidth consumed by the voice signals. Lower bandwidth consumption yields lower power consumption in both the mobile device and the base station. This translates to energy and cost savings for the mobile operator, while the end user will experience prolonged battery life and increased talk-time. Furthermore, with less consumed bandwidth per user, a mobile network can service a larger number of users at the same time.
Today, the dominant compression technology for mobile voice services is Code Excited Linear Prediction (CELP), described for example in “Code-Excited Linear Prediction (CELP): high-quality speech at very low bit rates”, M. R. Schroeder and B. Atal, IEEE ICASSP 1985.
CELP is an encoding method operating according to an analysis-by-synthesis procedure. In CELP for voice coding, linear prediction analysis is used to determine, based on an audio signal to be encoded, a slowly varying linear prediction (LP) filter A(z) representing the human vocal tract. The audio signal is divided into signal segments, and a signal segment is filtered using the determined A(z), the filtering resulting in a filtered signal segment, often referred to as the LP residual. A target signal x(n) in the weighted domain is then formed, typically by filtering the LP residual through a weighted synthesis filter W(z)/Â(z). The target signal x(n) is used as a reference signal for an analysis-by-synthesis procedure wherein an adaptive code book is searched for a sequence of past excitation samples which, when filtered through the weighted synthesis filter, would give a good approximation of the target signal. A secondary target signal x2(n) is then derived by subtracting the selected adaptive code book signal from the filtered signal segment. The secondary target signal is in turn used as a reference signal for a further analysis-by-synthesis procedure, wherein a fixed code book is searched for a vector of pulses which, when filtered through the weighted synthesis filter, would give a good approximation of the secondary target signal. The adaptive code book is then updated with a linear combination of the selected adaptive code book vector and the selected fixed code book vector.
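The first step described above, determining A(z) and filtering a segment through it to obtain the LP residual, can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way of performing linear prediction analysis and is not stated in the text. The function names, the filter order and the synthetic test segment are hypothetical, and windowing, perceptual weighting and the code book searches are omitted.

```python
import random

def autocorrelation(x, order):
    """Return the autocorrelation values r[0..order] of the segment x."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]

def lp_analysis(x, order):
    """Levinson-Durbin recursion (assumed method, not taken from the text).

    Returns a[0..order] with a[0] = 1, so that
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:
        return a  # silent segment: nothing to predict
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def lp_residual(x, a):
    """Filter x through the FIR filter A(z), assuming zero history before the segment."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]

def energy(x):
    return sum(v * v for v in x)

# Hypothetical test segment: a second-order autoregressive process,
# standing in for a strongly correlated speech segment.
rng = random.Random(1)
segment = [0.0, 0.0]
for _ in range(2000):
    segment.append(1.3 * segment[-1] - 0.6 * segment[-2] + rng.gauss(0.0, 1.0))
segment = segment[2:]

a = lp_analysis(segment, order=2)
residual = lp_residual(segment, a)
```

For a correlated input such as this one, the residual carries less energy than the segment itself, which is why the subsequent code book searches operate on the residual rather than on the waveform.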
By use of CELP, good speech quality at moderately low bandwidth is typically achieved, and the method is widely used in deployed codecs such as GSM-EFR, AMR and AMR-WB. At very low bit rates, however, the limitations of the CELP coding technique begin to show. While the segments of voiced speech remain well represented, the more noise-like consonants such as fricatives start to sound worse. Degradation can also be perceived in the background noise.
As seen above, the CELP technique uses a pulse-based excitation signal. For voiced signal segments, the filtered signal segment (target excitation signal) is concentrated around so-called glottal pulses, occurring at regular intervals corresponding to the fundamental frequency of the speech segment. This structure can be well modeled with a vector of pulses. For a noise-like segment, on the other hand, the target excitation signal is less structured in the sense that the energy is more spread over the entire vector. Such an energy distribution is not well captured with a vector of pulses, particularly not at low bit rates. When the bit rate is low, the pulses simply become too few to adequately capture the energy distribution of the noise-like signals, and the resulting synthesized speech will have a buzzing distortion, often referred to as the sparseness artefact of CELP codecs.
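The effect described above can be made concrete with a small numerical sketch. As a simplifying assumption, the weighted synthesis filter is ignored, so the best approximation with a given number of pulses is simply the largest-magnitude samples of the target. The function names, the vector length of 64 and the two synthetic targets are hypothetical and chosen only for illustration.

```python
import random

def pulse_approximation(target, num_pulses):
    """Keep the num_pulses largest-magnitude samples and zero all others."""
    ranked = sorted(range(len(target)), key=lambda i: abs(target[i]), reverse=True)
    keep = set(ranked[:num_pulses])
    return [v if i in keep else 0.0 for i, v in enumerate(target)]

def captured_energy_fraction(target, num_pulses):
    """Fraction of the target energy represented by the pulse approximation."""
    approx = pulse_approximation(target, num_pulses)
    total = sum(v * v for v in target)
    return sum(v * v for v in approx) / total

rng = random.Random(0)
length = 64

# Voiced-like target: two strong pulses one pitch period apart, plus weak noise.
voiced_like = [0.05 * rng.gauss(0.0, 1.0) for _ in range(length)]
voiced_like[10] += 1.0
voiced_like[42] += 0.9

# Noise-like target: energy spread over the entire vector.
noise_like = [rng.gauss(0.0, 1.0) for _ in range(length)]

voiced_fraction = captured_energy_fraction(voiced_like, 4)
noise_fraction = captured_energy_fraction(noise_like, 4)
```

With only four pulses, most of the energy of the voiced-like target is captured, whereas the noise-like target leaves most of its energy unrepresented. That gap is the source of the sparseness artefact at low bit rates.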
Hence, at very low bit rates, which could for example be advantageous when the transmission channel conditions are poor, an alternative to CELP is required in order to arrive at a good-sounding synthesized signal. Several technologies have been developed in order to deal with the CELP sparseness artefact at low bit rates. WO99/12156 discloses a method of decoding an encoded signal, wherein an anti-sparseness filter is applied as a post-processing step in the decoding of the speech signal. Such anti-sparseness processing reduces the sparseness artefact, but the end result can still sound somewhat unnatural.
Another method of mitigating the sparseness artefact which is well known in the art is often referred to as Noise Excited Linear Prediction (NELP). In NELP, signal segments are processed using a noise signal as the excitation signal. The noise excitation is only suitable for representation of noise-like sounds. Therefore, a system using NELP often uses a different excitation method, e.g. CELP, for the tonal or voiced segments. Thus, the NELP technology relies on a classification of the speech segment, using different encoding strategies for unvoiced and voiced parts of an audio signal. The difference between these coding strategies gives rise to switching artefacts upon switching between the voiced and unvoiced coding strategies. Furthermore, the noise excitation will typically not be able to successfully model the excitation of complex noise-like signals, and parts of the sparseness artefacts will therefore typically remain.
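The idea of a noise excitation can be sketched as follows. The sketch assumes a simple gain-matched scheme in which the encoder keeps only the RMS level of the excitation for a segment and the decoder scales locally generated noise to that level. This scheme, the function names and the seed parameter are assumptions made for illustration and are not taken from the text; practical NELP coders typically add further shaping.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def nelp_encode_gain(residual):
    """Encoder side (assumed scheme): keep only the RMS level of the segment."""
    return rms(residual)

def nelp_decode_excitation(gain, length, seed=0):
    """Decoder side (assumed scheme): generate noise and scale it to the gain."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    scale = gain / rms(noise)
    return [scale * v for v in noise]

# Hypothetical unvoiced excitation segment.
source_rng = random.Random(7)
target_excitation = [0.3 * source_rng.gauss(0.0, 1.0) for _ in range(64)]

gain = nelp_encode_gain(target_excitation)
decoded_excitation = nelp_decode_excitation(gain, len(target_excitation))
```

The decoded excitation matches the level of the target but not its waveform or any finer structure, which is consistent with the observation that a plain noise excitation only suits noise-like sounds and struggles with more complex noise-like signals.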
As can be seen from the above, there is a need for an improved codec by which a high quality synthesized audio signal can be obtained even when the encoded signal is encoded for low bit rate transmission.