With the growing distribution of speech and audio content over communication networks, efficient use of the available bandwidth is an important issue for network operators, while, at the same time, the quality perceived by the end user has to remain high. This raises a demand for efficient processing schemes in the codecs of both the transmitting and the receiving entities.
In order to obtain efficient transmission of speech and audio over a communication network, bandwidth extension (BWE) and noise-fill schemes are commonly used in speech and audio codecs and, due to increasing bandwidth requirements, the use of such schemes will be even more important in the future. The main idea of the BWE concept is to quantize only the low-frequency (LF) regions of a signal on the transmitting (encoder) side, to transmit these regions to a receiver, and then to reconstruct the high-frequency (HF) regions on the receiver (decoder) side.
The HF reconstruction can be based on the signal residual of the LF signal, i.e. the signal with the spectrum envelope removed, together with some additional transmitted information that represents the HF spectrum envelope, such as a set of energy gains, or a set of linear-prediction coefficients and a global energy gain. As a result, BWE causes a special type of degradation that is localized in the residual of the HF bands of the signal. Similar artifacts are also caused by noise-fill schemes when used in speech or audio coding. The basic concept of noise-filling is that some low-energy LF bands are not encoded at the encoder of the transmitter. At the decoder of the receiver, the signal residual in these bands is then replaced with white Gaussian noise (WGN), or reconstructed from neighboring LF bands.
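The decoder-side noise-fill step described above can be illustrated with a minimal sketch. The function name `noise_fill`, the band layout and the unit-variance noise are assumptions made for illustration only; they are not taken from any particular codec.

```python
import random

def noise_fill(residual, band_edges, coded, seed=0):
    """Replace the residual of every uncoded band with white Gaussian noise.

    residual   -- list of residual coefficients (spectrum envelope removed)
    band_edges -- list of (start, stop) index pairs, one pair per band
    coded      -- list of booleans; False marks a band the encoder skipped
    seed       -- seed for the noise generator (assumption: reproducible noise)
    """
    rng = random.Random(seed)
    out = list(residual)
    for (start, stop), is_coded in zip(band_edges, coded):
        if not is_coded:
            for k in range(start, stop):
                # Assumption: unit-variance noise, since the envelope
                # (energy) is carried separately from the residual.
                out[k] = rng.gauss(0.0, 1.0)
    return out

# Two bands of four coefficients each; the second band was not encoded.
decoded = [0.9, -1.1, 1.0, -0.8, 0.0, 0.0, 0.0, 0.0]
filled = noise_fill(decoded, [(0, 4), (4, 8)], [True, False])
```

The coded band is passed through unchanged, while the skipped band receives noise that carries no information about the original residual, which is the source of the artifacts discussed in the text.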
A spectrum envelope and a compressed residual for a speech frame can be exemplified with the illustration of FIG. 1.
For a signal having a spectrum envelope 100, an LF residual 101 and an HF residual 102, the spectrum envelope 100 and the LF residual 101 may typically be quantized and compressed in the encoder before they are transmitted to a receiver/decoder, where the HF residual 102 may be reconstructed by translating or flipping the LF residual 101, according to any prior art reconstruction procedure.
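The two reconstruction options mentioned above, translating or flipping the LF residual, can be sketched as follows. This is only an illustration under simplifying assumptions: the residual is a plain list of spectral coefficients, the HF region is as wide as the LF region, and the name `reconstruct_hf` is hypothetical.

```python
def reconstruct_hf(lf_residual, mode="translate"):
    """Build an HF residual of the same length from the decoded LF residual.

    mode -- "translate": copy the LF coefficients upward unchanged
            "flip":      mirror the LF coefficients around the LF/HF boundary
    """
    if mode == "translate":
        return list(lf_residual)
    if mode == "flip":
        return list(reversed(lf_residual))
    raise ValueError("mode must be 'translate' or 'flip'")

lf = [0.5, -1.0, 0.25, 2.0]

# Full-band residual = decoded LF part followed by the reconstructed HF part.
full_translated = lf + reconstruct_hf(lf, "translate")
full_flipped = lf + reconstruct_hf(lf, "flip")
```

In both cases the HF residual is a rearranged copy of the LF residual rather than the true HF residual, so any difference between the two is a distortion confined to the HF bands.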
A typical configuration for estimating a quality degradation originating from the signal processing of a codec can be described with reference to the schematic illustration of FIG. 2. An apparatus configured to estimate a quality measure, here referred to as a quality assessment device 200, receives a signal, in the present context typically a speech or audio signal, that has been transmitted from a signal source 201 via a communication network 202. This signal, which has been encoded, transmitted via the communication network 202, and decoded before it is provided to the quality assessment device 200, is typically referred to as the processed signal 203. The quality assessment device 200 also has access to a reference signal 204, which represents the unprocessed signal of the signal source 201.
On the basis of both the reference signal 204 and the processed signal 203, the quality assessment device 200 may estimate the speech or audio quality of a signal that has been affected by coding distortion, using an algorithm that is suitable for such a measure. Such algorithms are known e.g. from ITU-T Rec. P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment in narrow-band telephone networks and speech codecs”, 2001-02; ITU-T Rec. P.862.2, “Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs”, 2005-11; and from ITU-R Rec. BS.1387-1, “Method for objective measurements of perceived audio quality”, 2001.
One problem with existing solutions, such as those mentioned above, is that they are quite insensitive to the so-called BWE effects, i.e. the distortions that the codec introduces during the encoding process into the signal residual of the higher bands of the processed signal. At the same time, these distortions are audible and, thus, normally lead to an overall quality degradation. One reason why BWE distortions are not captured by the state-of-the-art quality measures lies in the specifics of the perceptual transform used by these measures. This is particularly relevant for the well-known frequency transform to the Bark or Mel scale, where the higher frequency bands have a large bandwidth and, thus, mask any effects of the signal residual that may reside inside these bands.
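The widening of perceptual bands toward high frequencies can be made concrete with a short sketch. It uses one common form of the Hz-to-mel mapping, m = 2595 log10(1 + f/700); the upper frequency of 8 kHz, the choice of 20 bands and all function names are illustrative assumptions and do not describe the transform of any specific quality measure.

```python
import math

def hz_to_mel(f_hz):
    """One common form of the Hz-to-mel mapping (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_max_hz, n_bands):
    """Edges in Hz of n_bands bands that are equally wide on the mel scale."""
    m_max = hz_to_mel(f_max_hz)
    return [mel_to_hz(m_max * i / n_bands) for i in range(n_bands + 1)]

# Assumption: 20 bands covering 0 to 8 kHz.
edges = mel_band_edges(8000.0, 20)
widths_hz = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi, w in zip(edges, edges[1:], widths_hz):
    print(f"{lo:8.1f} .. {hi:8.1f} Hz   width {w:7.1f} Hz")
```

Because the inverse mapping grows exponentially, each band is wider in Hz than the one below it. A measure that integrates energy over such a band therefore averages over many HF coefficients, which hides fine structure of the residual within that band.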
Consequently, despite the fact that BWE is widely used in today's codecs, and that this type of scheme will most likely be even more important in future codecs, there is at present no clear known method for obtaining a representative measure of the degradation caused by using a BWE or noise-fill scheme. This applies even to the best known algorithms for speech/audio quality estimation of coding distortions.