The invention relates generally to audio signal compression and, more particularly, to speech/noise classification during audio compression.
Speech coders and decoders are conventionally provided in radio transmitters and radio receivers, respectively, and are cooperable to permit speech (voice) communications between a given transmitter and receiver over a radio link. The combination of a speech coder and a speech decoder is often referred to as a speech codec. A mobile radiotelephone (e.g., a cellular telephone) is an example of a conventional communication device that typically includes a radio transmitter having a speech coder, and a radio receiver having a speech decoder.
In conventional block-based speech coders, the incoming speech signal is divided into blocks called frames. For common 4 kHz telephony bandwidth applications (sampled at 8 kHz), a typical frame length is 20 ms, or 160 samples. The frames are further divided into subframes, typically of length 5 ms, or 40 samples.
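The frame and subframe sizes above follow directly from the sampling rate. The following is a minimal sketch of that arithmetic and of the block division, assuming an 8 kHz sampling rate (implied by 160 samples per 20 ms); the function names are hypothetical and are not taken from any particular coder.

```python
SAMPLE_RATE_HZ = 8000  # assumption: 8 kHz sampling for 4 kHz telephony bandwidth


def samples_per_block(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in a block of the given duration."""
    return sample_rate_hz * duration_ms // 1000


def split_into_frames(signal, frame_len, subframe_len):
    """Split a signal into whole frames, each a list of subframes.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        subframes = [frame[i:i + subframe_len]
                     for i in range(0, frame_len, subframe_len)]
        frames.append(subframes)
    return frames


frame_len = samples_per_block(20)     # 160 samples
subframe_len = samples_per_block(5)   # 40 samples
frames = split_into_frames(list(range(400)), frame_len, subframe_len)
```

With 400 input samples, this yields two complete frames of four subframes each; the remaining 80 samples would wait for the next frame boundary.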
In compressing the incoming audio signal, speech encoders conventionally use advanced lossy compression techniques. The compressed (or coded) signal information is transmitted to the decoder via a communication channel such as a radio link. The decoder then attempts to reproduce the input audio signal from the compressed signal information. If certain characteristics of the incoming audio signal are known, then the bit rate in the communication channel can be maintained as low as possible. If the audio signal contains relevant information for the listener, then this information should be retained. However, if the audio signal contains only irrelevant information (for example background noise), then bandwidth can be saved by only transmitting a limited amount of information about the signal. For many signals which contain only irrelevant information, a very low bit rate can often provide high quality compression. In extreme cases, the incoming signal may be synthesized in the decoder without any information updates via the communication channel until the input audio signal is again determined to include relevant information.
Typical signals which can be conventionally reproduced quite accurately with very low bit rates include stationary noise, car noise and also, to some extent, babble noise. More complex non-speech signals like music, or speech and music combined, require higher bit rates to be reproduced accurately by the decoder.
For many common types of background noise a much lower bit rate than is needed for speech provides a good enough model of the signal. Existing mobile systems make use of this fact by downwardly adjusting the transmitted bit rate during background noise. For example, in conventional systems using continuous transmission techniques, a variable rate (VR) speech coder may use its lowest bit rate.
In conventional Discontinuous Transmission (DTX) schemes, the transmitter stops sending coded speech frames when the speaker is inactive. At regular or irregular intervals (for example, every 100 to 500 ms), the transmitter sends speech parameters suitable for conventional generation of comfort noise in the decoder. These parameters for comfort noise generation (CNG) are conventionally coded into what are sometimes called Silence Descriptor (SID) frames. At the receiver, the decoder uses the comfort noise parameters received in the SID frames to synthesize artificial noise by means of a conventional comfort noise injection (CNI) algorithm.
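The transmitter behavior described above can be illustrated with a simplified sketch. It assumes a per-frame activity flag is already available and that SID frames are sent at a fixed interval during inactivity; the function name, the labels, and the interval value are hypothetical, and practical DTX schemes include further details not modeled here.

```python
def dtx_schedule(active_flags, sid_interval_frames=8):
    """Decide, per frame, what a simplified DTX transmitter sends.

    active_flags: iterable of booleans, True when the speaker is active.
    sid_interval_frames: assumed spacing between SID frames during
        inactivity (8 frames of 20 ms is 160 ms).
    Returns a list of "SPEECH", "SID", or "NO_TX" labels.
    """
    decisions = []
    frames_since_sid = None  # None: no SID sent yet in this inactive period
    for active in active_flags:
        if active:
            decisions.append("SPEECH")
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid + 1 >= sid_interval_frames:
            # First inactive frame, or the SID interval has elapsed.
            decisions.append("SID")
            frames_since_sid = 0
        else:
            decisions.append("NO_TX")
            frames_since_sid += 1
    return decisions


example = dtx_schedule([True, False, False, False, False, True],
                       sid_interval_frames=3)
```

In this example the transmitter sends a SID frame on the first inactive frame, stays silent for two frames, sends another SID frame, and resumes coded speech when activity returns.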
When comfort noise is generated in the decoder in a conventional DTX system, the noise is often perceived as being very static and much different from the background noise generated in active (non-DTX) mode. The reason for this perception is that DTX SID frames are not sent to the receiver as often as normal speech frames. In conventional linear prediction analysis-by-synthesis (LPAS) codecs having a DTX mode, the spectrum and energy of the background noise are typically estimated over several frames (for example, averaged), and the estimated parameters are then quantized and transmitted in SID frames over the channel to the decoder.
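As an illustration of estimating a noise parameter over several frames, the sketch below averages the frame energy and expresses the result in decibels. It covers energy only (not the spectrum), omits quantization, and uses hypothetical function names; it is not the estimation procedure of any specific codec.

```python
import math


def frame_energy(frame):
    """Mean squared sample value of one frame."""
    return sum(x * x for x in frame) / len(frame)


def averaged_energy_db(frames, floor=1e-10):
    """Average the energy over several frames and return it in dB.

    The floor avoids taking the logarithm of zero for silent input.
    """
    mean_energy = sum(frame_energy(f) for f in frames) / len(frames)
    return 10.0 * math.log10(max(mean_energy, floor))


level_db = averaged_energy_db([[1.0, 1.0], [3.0, 3.0]])
```

Because a single averaged value stands in for several frames, frame-to-frame variation in the original noise is lost, which is consistent with the static character of comfort noise noted above.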
The benefit of sending the SID frames with their relatively low update rate instead of sending regular speech frames is twofold. The battery life in, for example, a mobile radio transceiver, is extended due to lower power consumption, and the interference created by the transmitter is lowered, thereby providing higher system capacity.
If a complex signal like music is compressed using a compression model that is too simple, and a corresponding bit rate that is too low, the reproduced signal at the decoder will differ dramatically from the result that would be obtained using a better (higher quality) compression technique. The use of an overly simple compression scheme can be caused by misclassifying the complex signal as noise. When such misclassification occurs, not only does the decoder output a poorly reproduced signal, but the misclassification itself disadvantageously results in a switch from a higher quality compression scheme to a lower quality compression scheme. To correct the misclassification, another switch back to the higher quality scheme is needed. If such switching between compression schemes occurs frequently, it is typically very audible and can be irritating to the listener.
It can be seen from the foregoing that it is desirable to reduce the misclassification of subjectively relevant signals, while still maintaining a low bit rate (high compression) where appropriate, for example when compressing background noise while the speaker is silent. Very strong compression techniques can be used, provided they are not perceived as irritating. The use of comfort noise parameters as described above with respect to DTX systems is an example of a strong compression technique, as is conventional low rate linear predictive coding (LPC) using random excitation methods. Coding techniques such as these, which utilize strong compression, can typically reproduce accurately only perceptually simple noise types such as stationary car noise, street noise, restaurant noise (babble) and other similar signals.
Conventional classification techniques for determining whether or not an input audio signal contains relevant information are primarily based on a relatively simple stationarity analysis of the input audio signal. If the input signal is determined to be stationary, then it is assumed to be a noise-like signal. However, this conventional stationarity analysis alone can cause complex signals that are fairly stationary but actually contain perceptually relevant information to be misclassified as noise. Such a misclassification disadvantageously results in the problems described above.
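The misclassification described above can be demonstrated with a deliberately simple stationarity test. The sketch below assumes that stationarity is judged only by the spread of frame energies over a window; the threshold, the function names, and the test signals are hypothetical and do not represent any particular conventional classifier.

```python
import math


def frame_energy_db(frame, floor=1e-10):
    """Frame energy (mean squared value) in dB, floored to avoid log(0)."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(energy, floor))


def is_stationary(frames, max_spread_db=3.0):
    """Treat the signal as stationary if frame energies vary little."""
    levels = [frame_energy_db(f) for f in frames]
    return max(levels) - min(levels) <= max_spread_db


def sine_frames(freq_hz, amplitudes, frame_len=160, sample_rate_hz=8000):
    """Generate one frame of a sine tone per amplitude value."""
    frames = []
    for k, amp in enumerate(amplitudes):
        start = k * frame_len
        frames.append([
            amp * math.sin(2 * math.pi * freq_hz * (start + n) / sample_rate_hz)
            for n in range(frame_len)
        ])
    return frames


steady_tone = sine_frames(400, [1.0] * 10)      # tonal, constant level
bursty_tone = sine_frames(400, [1.0, 0.1] * 5)  # level alternates by 20 dB

tone_called_noise = is_stationary(steady_tone)   # stationary, so treated as noise
burst_called_noise = is_stationary(bursty_tone)  # not stationary
```

The steady tone passes the stationarity test and would therefore be assumed noise-like under such a scheme, even though a sustained tonal sound is the kind of signal a listener may find perceptually relevant.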
It is therefore desirable to provide a classification technique that reliably detects the presence of perceptually relevant information in complex signals of the type described above.
According to the present invention, complex signal activity detection is provided for reliably detecting complex non-speech signals that include relevant information that is perceptually important to the listener. Examples of complex non-speech signals that can be reliably detected include music, music on hold, speech and music combined, music in the background, and other tonal or harmonic sounds.