Speech coding is the process of obtaining a compact representation of voice signals for efficient transmission over band-limited wired and wireless channels and/or storage. Today, speech coders have become essential components in telecommunications and in the multimedia infrastructure. Commercial systems that rely on efficient speech coding include cellular communication, voice over internet protocol (VOIP), videoconferencing, electronic toys, archiving, and digital simultaneous voice and data (DSVD), as well as numerous PC-based games and multimedia applications.
Being a continuous-time signal, speech may be represented digitally through a process of sampling and quantization. Speech samples are typically quantized using either 16-bit or 8-bit quantization. Like many other signals, a speech signal contains a great deal of information that is either redundant (nonzero mutual information between successive samples in the signal) or perceptually irrelevant (information that is unperceivable by human listeners). Most telecommunication coders are lossy, meaning that the synthesized speech is perceptually similar to the original but may be physically dissimilar.
A speech coder converts a digitized speech signal into a coded representation, which is usually transmitted in frames. Correspondingly, a speech decoder receives coded frames and synthesizes reconstructed speech. Many modern speech coders belong to a large class of speech coders known as linear predictive coders (LPC). Examples of such coders are the 3GPP FR, EFR, AMR and AMR-WB speech codecs, the 3GPP2 EVRC, SMV and EVRC-WB speech codecs, and various ITU-T codecs such as G.728, G.723, G.729, etc.
These coders all utilize a synthesis filter concept in the signal generation process. The filter is used to model the short-time spectrum of the signal that is to be reproduced, whereas the input to the filter is assumed to handle all other signal variations.
A common feature of these synthesis filter models is that the signal to be reproduced is represented by parameters defining the filter. The term “linear predictive” refers to a class of methods often used for estimating the filter parameters. Thus, the signal to be reproduced is partially represented by a set of filter parameters and partly by the excitation signal driving the filter.
The gain of such a coding concept arises from the fact that both the filter and its driving excitation signal can be described efficiently with relatively few bits.
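As a minimal sketch of the synthesis filter concept, the following Python fragment drives an all-pole filter 1/A(z) with an excitation signal. The function name `synthesize`, the sign convention of the coefficients, and the example values are illustrative assumptions and are not taken from any of the codecs cited above.

```python
def synthesize(excitation, a):
    """All-pole synthesis: s[n] = e[n] - sum_{k=1..p} a[k-1] * s[n-k].

    `a` holds the coefficients a_1..a_p of
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (sign convention assumed here).
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out


# A single impulse through a one-pole filter decays geometrically.
impulse_response = synthesize([1.0, 0.0, 0.0, 0.0], [-0.5])
```

In this picture the list `a` carries the short-time spectral shape, while everything else about the signal has to be carried by `excitation`.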
One particular class of LPC-based codecs is based on the analysis-by-synthesis (AbS) principle. These codecs incorporate a local copy of the decoder in the encoder and find the driving excitation signal of the synthesis filter by selecting, from a set of candidate excitation signals, the one that maximizes the similarity of the synthesized output signal with the original speech signal.
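The selection step can be sketched as an exhaustive search over a small candidate set. The sketch below assumes that "similarity" is measured as squared error after applying the optimal gain to each candidate; practical codecs typically use a perceptually weighted criterion instead. The names `all_pole` and `abs_search` and the toy codebook are hypothetical.

```python
def all_pole(excitation, a):
    """Local decoder copy: all-pole synthesis filter 1/A(z)."""
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(ak * out[n - k] for k, ak in enumerate(a, 1) if n - k >= 0)
        out.append(e - feedback)
    return out


def abs_search(target, codebook, a):
    """Return (index, gain, error) of the candidate that best matches target."""
    best = None
    for index, candidate in enumerate(codebook):
        synth = all_pole(candidate, a)
        energy = sum(x * x for x in synth)
        if energy == 0.0:
            continue  # an all-zero candidate cannot be scaled to fit
        gain = sum(t * x for t, x in zip(target, synth)) / energy
        error = sum((t - gain * x) ** 2 for t, x in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Toy example: the target is candidate 1, synthesized and scaled by 2.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
a = [-0.5]
target = [2.0 * x for x in all_pole(codebook[1], a)]
index, gain, error = abs_search(target, codebook, a)
```

Only the winning index and a quantized gain would need to be transmitted, since the decoder holds the same codebook and filter.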
The concept of utilizing such linear predictive coding, and particularly AbS coding, has proven to work relatively well for speech signals, even at low bit rates of e.g. 4-12 kbps. However, when the user of a mobile telephone using such a coding technique is silent and the input signal comprises the surrounding sounds, the presently known coders have difficulty coping with this situation, since they are optimized for speech signals. A listener on the other side may easily get annoyed when familiar background sounds cannot be recognized because they have been “mistreated” by the coder.
So-called swirling causes one of the most severe quality degradations in the reproduced background sounds. This phenomenon occurs in scenarios with relatively stationary background sounds, such as car noise, and is caused by unnatural temporal fluctuations of the power and the spectrum of the decoded signal. These fluctuations in turn are caused by inadequate estimation and quantization of the synthesis filter coefficients and of the filter's excitation signal. Swirling usually diminishes as the codec bit rate increases.
Swirling has previously been identified as a problem, and numerous solutions to it have been proposed in the literature. One proposed solution is disclosed in U.S. Pat. No. 5,632,004 [1]. According to this patent, during speech inactivity the filter parameters are modified by means of low pass filtering or bandwidth expansion such that spectral variations of the synthesized background sound are reduced. This method was further refined in U.S. Pat. No. 5,579,432 [2] such that the described anti-swirling technique is applied only upon detected stationarity of the background noise.
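For illustration, one widely used form of bandwidth expansion scales the k-th predictor coefficient by the k-th power of a factor between 0 and 1, which pulls the filter poles toward the origin and flattens spectral peaks. Whether [1] uses exactly this form is not stated here; the function name `bandwidth_expand` and the value of `gamma` are assumptions.

```python
def bandwidth_expand(a, gamma=0.9):
    """Replace A(z) by A(z / gamma): a_k -> a_k * gamma**k, with 0 < gamma <= 1."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return [ak * gamma ** k for k, ak in enumerate(a, start=1)]


# Hypothetical second-order predictor, expanded with a strong factor.
expanded = bandwidth_expand([-1.6, 0.8], gamma=0.5)
```

With `gamma=1.0` the coefficients are returned unchanged; smaller values give a progressively flatter spectral envelope.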
U.S. Pat. No. 5,487,087 [3] discloses a further method addressing the swirling problem. This method makes use of a modified signal quantization scheme, which matches both the signal itself and its temporal variations. In particular, it is envisioned to use such a reduced-fluctuation quantizer for LPC filter parameters and signal gain parameters during periods of inactive speech.
Signal quality degradations caused by undesired power fluctuations of the synthesized signal are addressed by another set of methods. One of them is described in U.S. Pat. No. 6,275,798 [4] and is also a part of the AMR speech codec algorithm described in 3GPP TS 26.090 [5]. According to this disclosure, the gain of at least one component of the synthesis filter excitation signal, the fixed codebook contribution, is adaptively smoothed depending on the stationarity of the LPC short-term spectrum. This method is further explored in the disclosures of patent EP 1096476 [6] and patent application EP 1688920 [7], where the smoothing operation further involves a limitation of the gain to be used in the signal synthesis. A related method to be used in LPC vocoders is described in U.S. Pat. No. 5,953,697 [8]. According to this disclosure, the gain of the excitation signal of the synthesis filter is controlled such that the maximum amplitude of the synthesized speech just reaches the input speech waveform envelope.
Another class of methods addressing the swirling problem operates as a post processor after a speech decoder. Patent EP 0665530 [9] describes a method that during detected speech inactivity replaces a portion of the speech decoder output signal by a low-pass filtered white noise or comfort noise signal. Similar approaches are taken in various publications that disclose related methods replacing part of the speech decoder output signal with filtered noise.
Scalable or embedded coding, with reference to FIG. 1, is a coding paradigm in which the coding is done in layers. A base or core layer encodes the signal at a low bit rate, while additional layers, each on top of the other, provide some enhancement relative to the coding that is achieved with all layers from the core up to the respective previous layer. Each layer adds some additional bit rate. The generated bit stream is embedded, meaning that the bit stream of lower-layer encoding is embedded into bit streams of higher layers. This property makes it possible to drop the bits belonging to higher layers anywhere in the transmission or in the receiver. Such a stripped bit stream can still be decoded up to the highest layer whose bits are retained.
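The embedding property can be sketched with a frame stored as an ordered list of layer payloads, core layer first. The layer rates and payloads below are hypothetical and do not correspond to any particular codec.

```python
def strip_layers(frame_layers, keep):
    """Keep the core layer and the next keep-1 enhancement layers."""
    if keep < 1:
        raise ValueError("the core layer cannot be dropped")
    return frame_layers[:keep]


def bit_rate_kbps(layer_rates_kbps, keep):
    """Total rate of the core layer plus the retained enhancement layers."""
    return sum(layer_rates_kbps[:keep])


# Hypothetical three-layer stream: 8 kbps core, then +4 and +2 kbps layers.
layer_rates = [8, 4, 2]
frame = [b"core", b"enh1", b"enh2"]

stripped = strip_layers(frame, keep=2)
stripped_rate = bit_rate_kbps(layer_rates, keep=2)
```

Because stripping is a plain truncation, a network node can perform it without decoding or re-encoding the signal.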
The most widely used scalable speech compression algorithm today is the 64 kbps G.711 A/U-law logarithmic PCM codec. The 8 kHz sampled G.711 codec converts 12-bit or 13-bit linear PCM samples to 8-bit logarithmic samples. The ordered bit representation of the logarithmic samples allows for stealing the Least Significant Bits (LSBs) in a G.711 bit stream, making the G.711 coder practically SNR-scalable between 48, 56 and 64 kbps. This scalability property of the G.711 codec is used in circuit-switched communication networks for in-band control signaling purposes. A recent example of the use of this G.711 scaling property is the 3GPP TFO protocol, which enables wideband speech setup and transport over legacy 64 kbps PCM links. Eight kbps of the original 64 kbps G.711 stream is used initially to allow for a call setup of the wideband speech service without considerably affecting the narrowband service quality. After call setup, the wideband speech will use 16 kbps of the 64 kbps G.711 stream. Other older speech coding standards supporting open-loop scalability are G.727 (embedded ADPCM) and to some extent G.722 (sub-band ADPCM).
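The 48, 56 and 64 kbps figures follow directly from the 8 kHz sampling rate: each stolen bit per sample frees 8 kbps. The sketch below works through that arithmetic and shows bit stealing as a mask on a generic 8-bit code word; it ignores the bit inversions that A-law and U-law apply on the wire, and the helper names are assumptions.

```python
SAMPLE_RATE_HZ = 8000


def rate_kbps(bits_per_sample):
    """Bit rate of a stream carrying bits_per_sample bits at 8 kHz."""
    return SAMPLE_RATE_HZ * bits_per_sample // 1000


def steal_lsbs(code, stolen_bits):
    """Zero the stolen_bits least significant bits of an 8-bit code word."""
    mask = (0xFF << stolen_bits) & 0xFF
    return code & mask


# Speech rate remaining after stealing 0, 1 or 2 LSBs per sample.
speech_rates = [rate_kbps(8 - stolen) for stolen in (0, 1, 2)]
# Rate freed for other use by stealing 1 or 2 LSBs per sample.
freed_rates = [rate_kbps(stolen) for stolen in (1, 2)]
truncated = steal_lsbs(0b10110111, 2)
```

The freed rates of 8 and 16 kbps match the two TFO figures mentioned in the text.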
A more recent advance in scalable speech coding technology is the MPEG-4 standard, which provides scalability extensions for MPEG4-CELP. The MPE base layer may be enhanced by transmission of additional filter parameter information or additional innovation parameter information. The International Telecommunications Union-Standardization Sector (ITU-T) has recently completed the standardization of a new scalable codec, G.729.1, nicknamed G.729.EV. The bit rate range of this scalable speech codec is from 8 kbps to 32 kbps. The major use case for this codec is to allow efficient sharing of a limited bandwidth resource in home or office gateways, e.g. a shared xDSL 64/128 kbps uplink between several VOIP calls.
One recent trend in scalable speech coding is to provide higher layers with support for the coding of non-speech audio signals such as music. In such codecs the lower layers employ conventional speech coding, e.g. according to the analysis-by-synthesis paradigm, of which CELP is a prominent example. Because such coding is well suited to speech but much less so to non-speech audio signals such as music, the upper layers work according to a coding paradigm that is used in audio codecs. Here, the upper-layer encoding typically works on the coding error of the lower-layer coding.
Another relevant method concerning speech codecs is the so-called spectral tilt compensation, which is done in the context of adaptive post filtering of decoded speech. The problem solved by this is to compensate for the spectral tilt introduced by short-term or formant post filters. Such techniques are a part of e.g. the AMR codec and the SMV codec and primarily target the performance of the codec during speech rather than its background noise performance. The SMV codec applies this tilt compensation in the weighted residual domain before synthesis filtering though not in response to an LPC analysis of the residual.
Common to all of the above-described techniques addressing the swirling problem is that it is essential to apply them such that they provide the best possible enhancement effect on the swirling without negatively affecting the quality of the speech reproduction. All these methods hence provide benefits only if proper rules are implemented according to which they are activated or deactivated depending on the properties of the signal to be reconstructed. In the following, state-of-the-art anti-swirling techniques are discussed with particular regard to how they are controlled.
One prior art publication [10] discloses a particular noise smoothing method and its specific control. The control is based on an estimate of the background noise ratio in the decoded signal, which in turn steers certain gain factors in that specific smoothing method. It is worth highlighting that, unlike in other methods, the activation of this smoothing method is not controlled in response to a VAD flag or e.g. some stationarity metric.
In contrast to the above described prior art, another publication [11] describes a smoothing operation in response to some stationary noise detector. No dedicated VAD is used and rather a hard decision is made depending on measurements of LPC parameters (LSF) and energy fluctuations as well as on pitch information. In order to mitigate problems with misclassifications of speech frames as stationary noise frames a hangover period is added to bursts of speech.
Another prior art disclosure [9] describes a control function of a background noise smoothing method which operates in response to a VAD flag. In order to prevent speech frames from being declared inactive, a hangover period is added to signal bursts declared active speech, during which the noise smoothing remains inactive. To ensure smooth transitions from periods with background noise smoothing deactivated to periods with smoothing activated, the smoothing is gradually activated up to some fixed maximum degree of smoothing operation. The power and spectral characteristics (degree of high pass filtering) of the noise signal replacing parts of the decoded speech signal are made adaptive to a background noise level estimate in the decoded speech signal. However, the degree of smoothing operation, i.e. the amount by which the decoded speech signal is replaced with noise, depends merely on the VAD decision and by no means on an analysis of the properties (such as stationarity) of the background noise.
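The control behavior described here (hangover after active speech, then a gradual ramp to a fixed maximum) can be sketched as a small per-frame state machine. The hangover length, ramp step, and function name are hypothetical and are not taken from [9].

```python
def smoothing_degrees(vad_flags, hangover=2, step=0.5, max_degree=1.0):
    """Per-frame degree of smoothing, driven only by the VAD decision."""
    degrees = []
    hang = 0        # remaining hangover frames after the last active frame
    degree = 0.0    # current degree of smoothing, 0.0 means disabled
    for active in vad_flags:
        if active:
            hang = hangover
            degree = 0.0
        elif hang > 0:
            hang -= 1          # still inside the hangover period
            degree = 0.0
        else:
            degree = min(max_degree, degree + step)  # gradual activation
        degrees.append(degree)
    return degrees


# One active frame, five inactive frames, then speech resumes.
degrees = smoothing_degrees([1, 0, 0, 0, 0, 0, 1, 0])
```

Note that the only input is the VAD flag: nothing about the character of the background noise influences the resulting degree, which is the limitation pointed out in the text.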
The previously mentioned disclosure of [4] describes a parameter smoothing method for a decoder that allows for gradual (gain) parameter smoothing in response to a mix factor. The mix factor is indicative of the stationarity of the signal to be reconstructed and controls the parameter smoothing such that more smoothing is performed the larger the detected stationarity is.
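A minimal sketch of such mix-factor-controlled smoothing is a weighted combination of the current gain and an average of recent gains. The convention that a mix factor of 1.0 means fully stationary, the plain mean over past gains, and the function name are assumptions made for illustration; [4] may define these differently.

```python
def smooth_gain(current_gain, past_gains, mix):
    """Blend the current gain with the mean of past gains.

    mix = 0.0 leaves the gain untouched; mix = 1.0 (assumed to mean a fully
    stationary signal) replaces it with the mean of the past gains.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    if not past_gains:
        return current_gain
    average = sum(past_gains) / len(past_gains)
    return (1.0 - mix) * current_gain + mix * average


unsmoothed = smooth_gain(2.0, [1.0, 1.0], mix=0.0)
half = smooth_gain(2.0, [1.0, 1.0], mix=0.5)
full = smooth_gain(2.0, [1.0, 1.0], mix=1.0)
```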
The main problem with the smoothing operation control algorithm according to the above [10] is that it is specifically tailored to the particular noise smoother described therein. It is hence not obvious if (and how) it could be used in connection with any other noise smoothing method. The fact that no VAD is used causes the particular problem that the method even performs signal modifications during active speech parts, which potentially degrade the speech or at least affect the naturalness of its reproduction.
The main problem with the smoothing algorithms according to [11] and [9] is that the degree of background noise smoothing is not gradually dependent on the properties of the background noise that is to be approximated. Prior art [11], for instance, makes use of a stationary noise frame detection, depending on which the smoothing operation is fully enabled or disabled. Similarly, the method disclosed in [9] does not have the ability to steer the smoothing method such that it is used to a lesser degree, depending on the background noise characteristics. This means that the methods may suffer from unnatural noise reproduction for those background noise types that are classified as stationary noise or as inactive speech, yet exhibit properties that cannot adequately be modeled by the employed noise smoothing method.
The main problem of the method disclosed in [4] is that it strongly relies on a stationarity estimate that takes into account at least a current parameter of the current frame and a corresponding previous parameter. During investigations related to the present invention it was however found that stationarity even though useful does not always provide a good indication whether background noise smoothing is desirable or not. Merely relying on a stationarity measure may again lead to situations where certain noise types are classified as stationary noise even though they exhibit properties that cannot adequately be modeled by the employed noise smoothing method.
A particular problem limiting all described methods arises from the fact that they are decoder-only methods. Because of this, they have conceptual difficulty assessing background noise properties with the accuracy that would be required if the noise smoothing operation were to be controlled with a gradual resolution. This, however, would be necessary for natural noise reproduction.
A general problem with all methods relying on a stationarity measure is that stationarity itself is a property indicative of how much statistical signal properties like energy or spectrum remain unchanged over time. For this reason, stationarity measures are often calculated by comparing the statistical properties of a given frame, or sub-frame, with the properties of a preceding frame or sub-frame. However, stationarity measures provide only a limited indication of the actual perceptual properties of the background signal. In particular, stationarity measures are not indicative of how noise-like a signal is, which, according to studies by the inventors, is an essential parameter for a good anti-swirling method.
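The distinction can be illustrated with a toy comparison of a pure tone and white noise. The sketch below uses a frame-to-frame energy ratio as the stationarity measure and a normalized lag-1 autocorrelation as a crude stand-in for "noise-likeness"; both measures, the frame length, and all names are assumptions chosen for illustration and are not the measures used by the inventors.

```python
import math
import random


def frame_energy(frame):
    return sum(x * x for x in frame)


def energy_stationarity(prev_frame, cur_frame):
    """1.0 when both frames have equal energy, toward 0.0 as they differ."""
    e_prev, e_cur = frame_energy(prev_frame), frame_energy(cur_frame)
    if max(e_prev, e_cur) == 0.0:
        return 1.0
    return min(e_prev, e_cur) / max(e_prev, e_cur)


def lag1_correlation(frame):
    """Normalized lag-1 autocorrelation; close to 0 for white noise."""
    energy = frame_energy(frame)
    if energy == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(frame, frame[1:])) / energy


N = 2000  # samples per frame (hypothetical)
tone = [math.sin(2 * math.pi * n / 8) for n in range(2 * N)]
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(2 * N)]

tone_stationarity = energy_stationarity(tone[:N], tone[N:])
noise_stationarity = energy_stationarity(noise[:N], noise[N:])
tone_correlation = lag1_correlation(tone[N:])
noise_correlation = lag1_correlation(noise[N:])
```

Both signals score high on the energy-based stationarity measure, yet only the noise has a lag-1 correlation near zero. A smoother that replaces signal content with filtered noise would therefore treat the two alike if it were steered by stationarity alone.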
Therefore, there is a demand for methods and arrangements for controlling background noise smoothing operations in speech sessions in telecommunication systems.