The present invention generally relates to the field of noise reduction based on speech activity recognition, in particular to an audio-visual user interface of a telecommunication device running an application that can advantageously be used, e.g., for a near-speaker detection algorithm in an environment where a speaker's voice is interfered with by a statistically distributed background noise including environmental noise as well as surrounding persons' voices.
Discontinuous transmission of speech signals based on speech/pause detection represents a valid solution to improve the spectral efficiency of new-generation wireless communication systems. In this context, robust voice activity detection algorithms are required, as conventional solutions according to the state of the art present a high misclassification rate in the presence of the background noise typical of mobile environments.
A voice activity detector (VAD) aims to distinguish between a speech signal and several types of acoustic background noise even with low signal-to-noise ratios (SNRs). Therefore, in a typical telephone conversation, such a VAD, together with a comfort noise generator (CNG), is used to achieve silence compression. In the field of multimedia communications, silence compression allows a speech channel to be shared with other types of information, thus guaranteeing simultaneous voice and data applications. In cellular radio systems which are based on the Discontinuous Transmission (DTX) mode, such as GSM, VADs are applied to reduce co-channel interference and power consumption of the portable equipment. Furthermore, a VAD is vital to reduce the average data bit rate in future generations of digital cellular networks such as the UMTS, which provide for variable bit-rate (VBR) speech coding. Most of the capacity gain is due to the distinction between speech activity and inactivity. The performance of a speech coding approach which is based on phonetic classification, however, strongly depends on the classifier, which must be robust to every type of background noise. As is well known, the performance of a VAD is critical for the overall speech quality, in particular with low SNRs. If speech frames are detected as noise, intelligibility is seriously impaired owing to speech clipping in the conversation. If, on the other hand, the percentage of noise detected as speech is high, the potential advantages of silence compression are not obtained. In the presence of background noise it may be difficult to distinguish between speech and silence. Hence, more efficient algorithms are needed for voice activity detection in wireless environments.
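The speech/pause decision that drives silence compression can be illustrated by a minimal sketch. It assumes a plain frame-energy threshold, which is not the detector of any particular standard; the function names, the 6 dB margin, and the synthetic frames are hypothetical and chosen only for illustration.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (the floor avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))

def energy_vad(frames, noise_floor_db, margin_db=6.0):
    """Return one flag per frame: True = speech, False = pause.

    margin_db is an assumed safety margin above the noise floor.
    """
    return [frame_energy_db(f) > noise_floor_db + margin_db for f in frames]

def activity_factor(decisions):
    """Fraction of frames that would actually be transmitted in DTX mode."""
    return sum(decisions) / len(decisions)

# Synthetic 160-sample frames: low-level background and a louder talker.
quiet = [0.01 * math.sin(0.3 * n) for n in range(160)]
loud = [0.5 * math.sin(0.3 * n) for n in range(160)]
frames = [quiet, loud, loud, quiet]

decisions = energy_vad(frames, noise_floor_db=frame_energy_db(quiet))
print(decisions, activity_factor(decisions))
```

A frame wrongly flagged as a pause is clipped from the conversation, while a frame wrongly flagged as speech raises the activity factor and reduces the gain from silence compression. A fixed energy margin of this kind becomes unreliable once the noise level approaches the speech level.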
Although the Fuzzy Voice Activity Detector (FVAD) proposed in “Improved VAD G.729 Annex B for Mobile Communications Using Soft Computing” (Contribution ITU-T, Study Group 16, Question 19/16, Washington, Sep. 2-5, 1997) by F. Beritelli, S. Casale, and A. Cavallaro performs better than other solutions presented in the literature, it exhibits an activity increase, above all in the presence of non-stationary noise. The functional scheme of the FVAD is based on a traditional pattern recognition approach wherein the four differential parameters used for speech activity/inactivity classification are the full-band energy difference, the low-band energy difference, the zero-crossing difference, and the spectral distortion. The matching phase is performed by a set of fuzzy rules obtained automatically by means of a new hybrid learning tool as described in “FuGeNeSys: Fuzzy Genetic Neural System for Fuzzy Modeling” by M. Russo (to appear in IEEE Transactions on Fuzzy Systems). As is well known, a fuzzy system allows a gradual, continuous transition rather than a sharp change between two values. Thus, the Fuzzy VAD returns a continuous output signal ranging from 0 (non-activity) to 1 (activity), which does not depend on whether single input signals have exceeded a predefined threshold or not, but on an overall evaluation of the values they have assumed (“defuzzification process”). The final decision is made by comparing the output of the fuzzy system, which varies in a range between 0 and 1, with a fixed threshold experimentally chosen as described in “Voice Control of the Pan-European Digital Mobile Radio System” (ICC '89, pp. 1070-1074) by C. B. Southcott et al.
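The difference between an overall soft evaluation and per-parameter thresholding can be sketched as follows. This is not the FVAD rule base, which is learned automatically and is not reproduced here. The ramp-shaped membership functions, their limits, the plain averaging used in place of rule evaluation and defuzzification, and the threshold of 0.5 are all assumptions made for illustration.

```python
def ramp(x, lo, hi):
    """Piecewise-linear membership degree: 0 below lo, 1 above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

# Hypothetical ramp limits for the four differential parameters.
RAMPS = {
    "full_band_energy_diff": (0.0, 10.0),
    "low_band_energy_diff": (0.0, 10.0),
    "zero_crossing_diff": (0.0, 0.2),
    "spectral_distortion": (0.0, 1.0),
}

def soft_activity(features):
    """Continuous activity score in [0, 1].

    Averaging the membership degrees is an assumed stand-in for the
    learned fuzzy rules and the defuzzification step.
    """
    degrees = [ramp(features[name], lo, hi) for name, (lo, hi) in RAMPS.items()]
    return sum(degrees) / len(degrees)

def decide(features, threshold=0.5):
    """Final hard decision: compare the soft score with a fixed threshold."""
    return soft_activity(features) > threshold

# One frame whose zero-crossing difference is small on its own.
frame = {
    "full_band_energy_diff": 8.0,
    "low_band_energy_diff": 7.0,
    "zero_crossing_diff": 0.02,
    "spectral_distortion": 0.6,
}
print(soft_activity(frame), decide(frame))
```

In this example the frame is classified as active although one of its parameters is close to zero, because the decision depends on the combined score rather than on each input exceeding its own threshold.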
Just like voice activity detectors, conventional automatic speech recognition (ASR) systems experience difficulties when operated in noisy environments, since the accuracy of conventional ASR algorithms decreases considerably under such conditions. When a speaker is talking in a noisy environment including both ambient noise and surrounding persons' interfering voices, a microphone picks up not only the speaker's voice but also these background sounds. Consequently, an audio signal which encompasses the speaker's voice superimposed by said background sounds is processed. The louder the interfering sounds, the more the acoustic comprehensibility of the speaker is reduced. To overcome this problem, noise reduction circuitries are applied that make use of the different frequency regions of environmental noise and the respective speaker's voice.
A typical noise reduction circuitry for a telephony-based application based on a speech activity estimation algorithm according to the state of the art that implements a method for correlating the discrete signal spectrum S(k·Δf) of an analog-to-digital-converted audio signal s(t) with an audio speech activity estimate is shown in FIG. 2a. Said audio speech activity estimate is obtained by an amplitude detection of the digital audio signal s(nT). The circuit outputs a noise-reduced audio signal ŝi(nT), which is calculated by subjecting the difference of the discrete signal spectrum S(k·Δf) and a sampled version Φ̃nn(k·Δf) of the estimated noise power density spectrum Φ̃nn(f) of a statistically distributed background noise ñ(t) to an Inverse Fast Fourier Transform (IFFT).
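The subtraction of a noise estimate in the frequency domain followed by an inverse transform can be sketched as below. The sketch assumes power spectral subtraction with the noisy phase retained and negative results clamped to zero, which is one common reading of the "difference" described above and is not confirmed by FIG. 2a. A naive DFT replaces the FFT to keep the example self-contained, and all names and the synthetic signals are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def spectral_subtract(frame, noise_psd):
    """Subtract an estimated noise power spectrum bin by bin.

    Assumption: power subtraction, clamping at zero, noisy phase kept.
    """
    cleaned = []
    for s_k, phi_k in zip(dft(frame), noise_psd):
        power = max(abs(s_k) ** 2 - phi_k, 0.0)
        cleaned.append(cmath.rect(math.sqrt(power), cmath.phase(s_k)))
    return idft(cleaned)

# Synthetic frame: a tone in bin 2 plus a weaker interferer in bin 10.
N = 32
speech = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
noise = [0.2 * math.sin(2 * math.pi * 10 * t / N) for t in range(N)]
noisy = [s + v for s, v in zip(speech, noise)]

# Noise power spectrum estimated from a noise-only frame.
noise_psd = [abs(v_k) ** 2 for v_k in dft(noise)]

enhanced = spectral_subtract(noisy, noise_psd)
residual = max(abs(e - s) for e, s in zip(enhanced, speech))
print(residual)
```

The interferer is removed almost completely here only because the noise estimate matches the actual noise in the frame exactly. With a statistically distributed background noise the estimate is an average, so some residual noise and distortion remain after the inverse transform.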