Speech signals are not continuous. Typically, in between words and sentences, there are silence periods which contain background noise only. Algorithms that identify these silence periods are called voice-activity detection (VAD) algorithms and are widely used in speech applications. VADs are generally used in speech recognition systems, voice over Internet protocol (VoIP) systems, speech coders, noise suppression and/or enhancement systems, or any other suitable speech applications or algorithms.
VAD is becoming increasingly important and relevant in modern telecommunication and speech enhancement systems. Conventional voice-based communication typically uses the public switched telephone network (PSTN). Such systems are expensive when the distance between the calling and called subscribers is large, because a dedicated connection is required.
Data networks, on the other hand, currently rely on best-effort delivery and on resource sharing through statistical multiplexing. Therefore, the cost of such data services is considerably less than that of PSTN-based services. Data networks, however, do not guarantee faithful voice transmission.
VoIP systems have to ensure that voice quality does not significantly deteriorate due to network conditions such as packet-loss and delays. Therefore, providing toll grade voice quality through VoIP is a challenge given that designers often prefer to lower the average bit-rate of speech communication systems. The VAD is used to selectively encode and transmit data. Apart from data savings, VAD also results in power savings in mobile devices and decreased co-channel interference in mobile telephony.
VAD is also used in non-real-time systems such as voice recognition systems. VAD is generally critical to meeting the performance demands of noise suppression systems. In addition, because VAD-based systems need only operate when speech is present, the complexity of noise suppression systems is generally reduced.
Some conventional approaches include relatively robust applications of VAD for discontinuous transmission (DTX) operation of speech coders such as, for example, IS-641, GSM-FR and GSM-EFR based systems. In addition, DTX operation can be essential for longer battery life.
Conventional VAD algorithms are typically based on heuristics or fuzzy rules and, in some cases, general speech properties. Such design methodologies make it difficult to optimize relevant parameters and obtain consistent results. Conventional attempts have been made to develop a statistical-model-based VAD using, for example, a likelihood-ratio test (LRT). Other conventional algorithms suggest using a smoothed LRT or algorithms based on the Kullback-Leibler distance. Still other conventional models use statistical methods that compare second-order statistics of the signals to models.
Most conventional voice activity detection is performed on a block-by-block basis. Generally, the block size is chosen such that speech can be considered stationary within a block. Speech is generally stationary for about 10 ms to 20 ms. As an example, for a sampling rate of 8 kHz, a 20 ms block contains 160 samples. Noise is considered to be stationary over a longer period, typically 1 s to 2 s. For a given block, a statistic (Λ) is typically derived. Based on the statistic (Λ), conventional algorithms assess whether speech is present.
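The block-size arithmetic above can be illustrated with a minimal sketch. The function names `block_size` and `split_into_blocks` are hypothetical, and the use of non-overlapping blocks with the incomplete tail dropped is an assumption made only for illustration.

```python
def block_size(sample_rate_hz: int, block_ms: int) -> int:
    """Number of samples in one analysis block of block_ms milliseconds."""
    return sample_rate_hz * block_ms // 1000


def split_into_blocks(samples, size):
    """Split samples into consecutive, non-overlapping blocks of `size` samples.

    Assumption for illustration: any incomplete tail block is dropped.
    """
    last_start = len(samples) - size
    return [samples[i:i + size] for i in range(0, last_start + 1, size)]


# 8 kHz sampling with 20 ms blocks gives 160 samples per block.
samples_per_block = block_size(8000, 20)

# One second of signal at 8 kHz therefore yields 50 such blocks.
one_second = [0.0] * 8000
blocks = split_into_blocks(one_second, samples_per_block)
```

Under these assumptions, a noise stationarity period of 1 s to 2 s spans roughly 50 to 100 blocks of 20 ms each.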
Consider two hypotheses, H1 and H0. H1 is the hypothesis that speech is present, while H0 is the hypothesis that speech is absent. The observed signal under each hypothesis is given by Equations 1a and 1b below.
$$H_1:\; x_k(n) = s_k(n) + n_k(n), \quad n = 0, \ldots, N-1 \qquad \text{(Eqn. 1a)}$$
$$H_0:\; x_k(n) = n_k(n), \quad n = 0, \ldots, N-1 \qquad \text{(Eqn. 1b)}$$
In Equations 1a and 1b, x_k(n) is the observed signal in block k at time instant n. Also, in Equations 1a and 1b, N is the observation length, s_k(n) is the speech and n_k(n) is the background noise.
The background noise, n_k(n), is generally a colored noise process. Deciding between hypotheses H1 and H0 is generally a problem in detection theory. The detection criteria shown in Equations 2a and 2b below are typically used.
$$H_1:\; \Lambda > T \qquad \text{(Eqn. 2a)}$$
$$H_0:\; \Lambda < T \qquad \text{(Eqn. 2b)}$$
In Equations 2a and 2b, T is generally a threshold.
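The threshold test of Equations 2a and 2b can be sketched as follows. The function names are hypothetical. Equations 2a and 2b do not say what happens when Λ equals T, so assigning that case to H0 is an assumption of this sketch.

```python
def vad_decision(statistic: float, threshold: float) -> int:
    """Return 1 (H1, speech present) if statistic > threshold, else 0 (H0).

    Assumption: the boundary case statistic == threshold is treated as H0.
    """
    return 1 if statistic > threshold else 0


def vad_track(statistics, threshold):
    """Apply the threshold test to a sequence of per-block statistics."""
    return [vad_decision(s, threshold) for s in statistics]


# Four illustrative per-block statistics tested against T = 1.0.
decisions = vad_track([0.1, 2.5, 0.4, 3.0], 1.0)
```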
FIG. 1 generally illustrates the relationship between clean speech 100a, noisy speech 100b and the VAD output. In FIG. 1, the VAD outputs a ‘1’ (H1) when speech is present (e.g., points 102 and 104) and a ‘0’ (H0) when speech is absent (e.g., point 106).
The probability of detection (PD) is generally the probability of detecting speech (H1), given that speech is present (i.e., condition H1 is true). The probability of a false alarm (PF) is generally the probability of detecting speech (H1) when speech is absent (i.e., condition H0 is true).
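As an illustration of these two definitions, PD and PF can be estimated empirically when ground-truth labels are available for each block. This is a minimal sketch. The function name `detection_rates` and the label convention (1 for speech, 0 for noise) are assumptions.

```python
def detection_rates(decisions, truth):
    """Estimate (PD, PF) from per-block VAD decisions and ground-truth labels.

    decisions: VAD outputs, 1 for H1 and 0 for H0.
    truth:     ground-truth labels, 1 if speech is present, 0 if absent.
    """
    on_speech = [d for d, t in zip(decisions, truth) if t == 1]
    on_noise = [d for d, t in zip(decisions, truth) if t == 0]
    # PD: fraction of speech blocks declared H1.
    pd = sum(on_speech) / len(on_speech) if on_speech else float("nan")
    # PF: fraction of noise-only blocks declared H1.
    pf = sum(on_noise) / len(on_noise) if on_noise else float("nan")
    return pd, pf


# Three of four speech blocks are detected; one of four noise blocks is a false alarm.
pd, pf = detection_rates([1, 1, 0, 1, 0, 0, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0])
```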
Accordingly, PD and PF depend upon noise as well as speech statistics. However, in some cases only noise statistics are considered. In such cases, the system is typically designed for a given false-alarm probability PF, and hence there is no control over PD.
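One way such a design for a given PF might look is sketched below. The threshold is taken from the sorted statistics of noise-only blocks so that at most the target fraction of them exceeds it. The function name and the empirical-quantile rule are assumptions, not a method described in the source. No speech data enters the calculation, which is why PD is left uncontrolled.

```python
import math


def threshold_for_false_alarm(noise_stats, target_pf):
    """Pick T so that at most a fraction target_pf of noise_stats exceeds T."""
    ordered = sorted(noise_stats)
    count = len(ordered)
    # Number of noise-only blocks allowed to exceed the threshold.
    allowed = math.floor(target_pf * count)
    if allowed >= count:
        return float("-inf")
    return ordered[count - allowed - 1]


# Eight illustrative noise-only statistics and a target PF of 0.25.
noise_stats = [3.0, 1.0, 7.0, 5.0, 2.0, 8.0, 4.0, 6.0]
threshold = threshold_for_false_alarm(noise_stats, 0.25)
false_alarms = sum(1 for s in noise_stats if s > threshold)
```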
Other conventional methods are based on the principle that the expected value of the periodogram is equal to the power spectral density (psd). The periodogram is typically the squared magnitude of the fast Fourier transform (FFT) of a block. The psd depends on the statistical properties of the signal. If the periodograms of many blocks of the signal are averaged, the average tends toward the psd.
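The averaging property can be illustrated with a small sketch. White Gaussian noise is used because its psd is flat and, under the normalization assumed here, equal to the noise variance. The 1/N normalization, the naive DFT (which gives the same values as an FFT), and all names and parameter values are assumptions chosen for illustration.

```python
import cmath
import random


def periodogram(block):
    """Return |DFT|^2 / N for every frequency bin of one block (naive DFT)."""
    n = len(block)
    values = []
    for bin_index in range(n):
        acc = 0j
        for m, x in enumerate(block):
            acc += x * cmath.exp(-2j * cmath.pi * bin_index * m / n)
        values.append(abs(acc) ** 2 / n)
    return values


def averaged_periodogram(blocks):
    """Average the periodograms of many blocks, bin by bin."""
    sums = [0.0] * len(blocks[0])
    for block in blocks:
        for bin_index, value in enumerate(periodogram(block)):
            sums[bin_index] += value
    return [s / len(blocks) for s in sums]


rng = random.Random(0)
sigma = 2.0  # noise standard deviation, so the flat psd level is sigma**2
blocks = [[rng.gauss(0.0, sigma) for _ in range(16)] for _ in range(2000)]
avg = averaged_periodogram(blocks)
```

Each entry of `avg` should lie close to `sigma**2`, whereas the periodogram of any single block fluctuates widely around that level.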
The decision statistic is typically given by the relationship seen in Equation 3 below.
$$\Lambda_k = \sum_{l} \psi_k(f_l) \qquad \text{(Eqn. 3)}$$
In Equation 3, the term ψ_k(f_l) is the decision statistic for frequency bin f_l and block k and is defined by the relationship shown by Equation 4 below.
$$\psi_k(f_l) = \frac{\mathrm{pgm}_k(f_l)}{\mathrm{psd}(f_l)} - 1 \qquad \text{(Eqn. 4)}$$
In Equation 4, pgm_k(f_l) is the periodogram of the f_l frequency bin obtained on the kth block of observed samples. Also in Equation 4, psd(f_l) is the psd estimate of the f_l frequency bin of the background noise. The term psd(f_l) is obtained over the silence periods present in the training period at the beginning of the phone call (when, invariably, only noise is present). Accordingly, the relationships shown in Equations 5 and 6 below hold, where the summation over k runs over noise blocks.
$$\sum_{k} \psi_k(f_l) \approx 0 \qquad \text{(Eqn. 5)}$$
$$\sum_{k} \Lambda_k \approx 0 \qquad \text{(Eqn. 6)}$$
The decision statistic is approximately 0 when averaged over many blocks containing only noise (hypothesis H0). Over each individual noise block, it is assumed to take low values. In the presence of speech, the decision statistic varies and is generally greater than the values obtained when speech is absent (noise blocks). There is, however, an overlap between these two sets of values. The statistic is based on background noise only, and no speech information is used. Hence, the design, or threshold, can only be chosen for a given false-alarm probability.
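Equations 3 through 6 can be sketched as follows. The noise psd is estimated by averaging periodograms over noise-only training blocks, and the statistic Λ_k is then the sum over bins of the periodogram-to-psd ratio minus one. The function names, the 1/N periodogram normalization, the white Gaussian training noise, and the sinusoid standing in for speech are all assumptions made for illustration.

```python
import cmath
import math
import random


def periodogram(block):
    """Return |DFT|^2 / N for every frequency bin of one block (naive DFT)."""
    n = len(block)
    values = []
    for bin_index in range(n):
        acc = 0j
        for m, x in enumerate(block):
            acc += x * cmath.exp(-2j * cmath.pi * bin_index * m / n)
        values.append(abs(acc) ** 2 / n)
    return values


def noise_psd(noise_blocks):
    """Estimate psd(f_l) as the average periodogram over noise-only blocks."""
    pgms = [periodogram(block) for block in noise_blocks]
    bins = len(pgms[0])
    return [sum(p[b] for p in pgms) / len(pgms) for b in range(bins)]


def decision_statistic(block, psd):
    """Lambda_k = sum over bins of (pgm_k(f_l) / psd(f_l) - 1), per Eqns. 3 and 4."""
    return sum(p / q - 1.0 for p, q in zip(periodogram(block), psd))


rng = random.Random(1)
N = 32
noise_blocks = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(200)]
psd = noise_psd(noise_blocks)
noise_stats = [decision_statistic(block, psd) for block in noise_blocks]

# Assumption: a strong sinusoid plus noise stands in for a speech block.
tone = [5.0 * math.sin(2 * math.pi * 4 * n / N) for n in range(N)]
speech_block = [t + rng.gauss(0.0, 1.0) for t in tone]
speech_stat = decision_statistic(speech_block, psd)
```

Because `psd` is estimated from the same blocks over which `noise_stats` is summed, the sum in Equation 6 is zero here up to rounding error. On fresh noise blocks it would only be approximately zero. The strong tone makes `speech_stat` clearly larger than every noise value. With weaker speech-like components the two ranges would overlap, as noted above.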
There is therefore a need for improved voice activity detection (VAD) in noise suppression systems.