This invention relates in general to systems for transmission of speech and, more specifically, to detecting speech activity in a transmission.
The purpose of speech activity detection algorithms, also known as voice activity detection (VAD) algorithms, in transmission systems is to detect periods of speech inactivity during a transmission. During these periods a substantially lower transmission rate can be used without reducing quality, thereby lowering the overall transmission rate. A key issue in detecting speech activity is to use speech features that behave distinctly differently for speech and for noise. A number of different features have been proposed in the prior art.
Time Domain Measures
In a low background noise environment, the signal level difference between active and inactive speech is significant. One approach is therefore to use the short-term energy and to track energy variations in the signal. If the energy increases rapidly, that may correspond to the onset of voice activity; however, it may also correspond to a change in the background noise. Thus, although that method is very simple to implement, it is not very reliable in relatively noisy environments, such as in a motor vehicle, for example. Various adaptation techniques, and complementing the level indicator with other time-domain measures, e.g. the zero-crossing rate and the envelope slope, may improve performance in higher noise environments.
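As an illustrative sketch (not any particular prior-art system), the two time-domain cues mentioned above, short-term energy and zero-crossing rate, can be computed per frame as follows; the threshold values are hypothetical and would in practice be adapted to the noise level:

```python
import numpy as np

def frame_measures(frame):
    """Compute two simple time-domain VAD cues for one audio frame:
    short-term (mean-square) energy and zero-crossing rate."""
    frame = np.asarray(frame, dtype=float)
    energy = np.sum(frame ** 2) / len(frame)
    signs = np.sign(frame)
    signs[signs == 0] = 1               # treat exact zeros as positive
    zcr = np.mean(signs[1:] != signs[:-1])   # fraction of sign changes
    return energy, zcr

def simple_vad(frame, energy_thresh, zcr_thresh):
    """Flag a frame as active if its energy exceeds a threshold, or if a
    high zero-crossing rate suggests unvoiced (fricative-like) speech."""
    energy, zcr = frame_measures(frame)
    return energy > energy_thresh or zcr > zcr_thresh
```

As the passage above notes, an energy rise alone cannot distinguish voice onset from a noise change, which is why fixed thresholds like these are unreliable in noisy environments.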
Spectrum Measures
In many environments, the main noise sources occur in defined areas of the frequency spectrum. For example, in a moving car most of the noise is concentrated in the low frequency regions of the spectrum. Where such knowledge of the spectral position of noise is available, it is desirable to base the decision as to whether speech is present or absent upon measurements taken from that portion of the spectrum containing relatively little noise.
Numerous techniques based on spectral cues have been developed. Some techniques compute a Fourier transform of the audio signal to measure the spectral distance between it and an averaged noise spectrum that is updated in the absence of any voice activity. Other methods use sub-band analysis of the signal, which is closely related to the Fourier methods. The same applies to methods that make use of cepstrum analysis.
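A minimal sketch of the Fourier-based approach described above: each frame's log-magnitude spectrum is compared against a running noise-spectrum average, and that average is updated only while no voice activity is detected. The smoothing factor `alpha` and the spectral floor constant are illustrative assumptions:

```python
import numpy as np

def log_spectrum(frame):
    """Log-magnitude spectrum of a frame (floored to avoid log of zero)."""
    return np.log(np.abs(np.fft.rfft(frame)) + 1e-10)

def spectral_distance(frame, noise_spectrum):
    """RMS distance between the frame's log spectrum and the averaged
    noise spectrum; large values suggest speech activity."""
    diff = log_spectrum(frame) - noise_spectrum
    return np.sqrt(np.mean(diff ** 2))

def update_noise(noise_spectrum, frame, alpha=0.95):
    """Exponentially average the noise spectrum during detected inactivity."""
    return alpha * noise_spectrum + (1 - alpha) * log_spectrum(frame)
```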
The time-domain measure of zero-crossing rate is a simple spectral cue that essentially measures the relation between high- and low-frequency content in the spectrum. Techniques are also known that take advantage of the periodic aspects of speech: all voiced sounds have a determined periodicity, whereas noise is usually aperiodic. For this purpose, autocorrelation coefficients of the audio signal are generally computed in order to determine the second maximum of those coefficients, the first maximum representing the energy.
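The autocorrelation cue can be sketched as follows: the coefficient at lag zero is the energy (the first maximum), and the normalized second maximum, searched over a plausible pitch-lag range, indicates periodicity. The lag range below assumes an 8 kHz sampling rate and is illustrative:

```python
import numpy as np

def periodicity(frame, min_lag=20, max_lag=200):
    """Return (peak, lag): the largest normalized autocorrelation value in
    [min_lag, max_lag) and its lag. Lag 0 (the energy term) is excluded;
    values near 1 indicate strongly periodic (voiced) content."""
    frame = np.asarray(frame, dtype=float)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0, 0
    ac = ac / ac[0]                 # normalize by the energy (first maximum)
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag]))
    return ac[lag], lag
```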
Some voice activity detection (VAD) algorithms are designed for specific speech coding applications and have access to speech coding parameters from those applications. An example is the G.729 coder, whose VAD employs four different measurements on the speech segment to be classified. The measured parameters are the zero-crossing rate, the full-band speech energy, the low-band speech energy, and 10 line spectral frequencies from a linear prediction analysis.
Problems with Conventional Solutions
Most VAD features are good at separating voiced speech from unvoiced speech. The classification scenario is therefore to distinguish between three classes, namely voiced speech, unvoiced speech, and inactivity. When the background noise becomes loud, it can be difficult to distinguish between active unvoiced speech and inactive background noise. Virtually all VAD algorithms have problems with the situation where a single person is talking over background noise that consists of other people talking (often referred to as babble noise) or an interfering talker.
Likelihood Ratio Detection
A classic detection problem is to determine whether a received entity belongs to one of two signal classes. Two hypotheses are then possible. Let the received entity be denoted r; then the hypotheses can be expressed:

$$H_1 : r \in S_1 \qquad H_0 : r \in S_0$$

where S1 and S0 are the signal classes. A Bayes decision rule, also called a likelihood ratio test, is used to form a ratio between the probabilities that the hypotheses are true given the received entity r. A decision is made according to a threshold τB:

$$L_B(r) = \frac{\Pr(r \mid H_1)}{\Pr(r \mid H_0)} \;
\begin{cases} \geq \tau_B & \text{choose } H_1 \\ < \tau_B & \text{choose } H_0 \end{cases}$$

The threshold τB is determined by the a priori probabilities of the hypotheses and the costs of the four classification outcomes.
If we have uniform costs and equal prior probabilities, then τB = 1 and the detection is called maximum likelihood detection. A common variant, used for numerical convenience, is to take logarithms of the probabilities. If the probability density functions for the hypotheses are known, the log likelihood ratio test becomes:

$$L(r) = \log\!\left(\frac{\Pr(r \mid H_1)}{\Pr(r \mid H_0)}\right)
= \log\!\left(\frac{f_{H_1}(r)}{f_{H_0}(r)}\right) \;
\begin{cases} \geq \tau & \text{choose } H_1 \\ < \tau & \text{choose } H_0 \end{cases}$$
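The log likelihood ratio test above can be illustrated with univariate Gaussian hypothesis densities (a simplifying assumption chosen here for clarity; as the next section notes, real-world densities are generally unknown):

```python
import math

def log_likelihood_ratio(r, mean1, var1, mean0, var0):
    """Log of f_H1(r) / f_H0(r) for two univariate Gaussian hypothesis
    densities (an illustrative choice of density model)."""
    def log_gauss(x, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return log_gauss(r, mean1, var1) - log_gauss(r, mean0, var0)

def decide(r, mean1, var1, mean0, var0, tau=0.0):
    """Choose H1 when L(r) >= tau, else H0. With uniform costs and equal
    priors tau = 0 (i.e. tau_B = 1), giving maximum likelihood detection."""
    return 'H1' if log_likelihood_ratio(r, mean1, var1, mean0, var0) >= tau else 'H0'
```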
Gaussian Mixture Modeling
Likelihood ratio detection is based on knowledge of parameter distributions. The density functions are mostly unknown for real world signals, but can be assumed to be of a simple, e.g. Gaussian, distribution. More complex distributions can be estimated with more general probability density function (PDF) models. In speech processing, Gaussian mixture (GM) models have been successfully employed in speech recognition and in speaker identification.
A Gaussian mixture PDF for d-dimensional random vectors x is a weighted sum of densities:

$$f_x(x) = \sum_{k=1}^{M} \rho_k \, f_{\mu_k, \Sigma_k}(x)$$

where ρk are the component weights, and the component densities ƒμk,Σk(x) are Gaussian with mean vectors μk and covariance matrices Σk. The component weights are constrained by

$$\rho_k > 0 \quad \text{and} \quad \sum_{k=1}^{M} \rho_k = 1.$$
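As a sketch, evaluating such a mixture density amounts to summing the weighted multivariate Gaussian component densities; the component parameters used in any example would be illustrative:

```python
import numpy as np

def gm_pdf(x, weights, means, covs):
    """Evaluate a d-dimensional Gaussian mixture PDF:
    f_x(x) = sum_k rho_k * N(x; mu_k, Sigma_k)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    total = 0.0
    for rho, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
        expo = -0.5 * diff @ np.linalg.solve(sigma, diff)
        total += rho * norm * np.exp(expo)
    return total
```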
Adaptive Algorithms
The GM parameters are often estimated using an iterative algorithm known as the expectation-maximization (EM) algorithm. In classification applications, such as speaker recognition, fixed PDF models are often estimated by applying the EM algorithm to a large set of training data offline. The results are then used as fixed classifiers in the application. This approach can be successful if the application conditions (recording equipment, background noise, etc.) are similar to the training conditions. In an environment where the conditions change over time, however, a better approach utilizes adaptive techniques. A common adaptive strategy in signal processing is the family of gradient methods, in which parameters are updated so that a distortion criterion decreases. This is achieved by adding small values to the parameters in the negative direction of the first derivative of the distortion criterion with respect to the parameters.
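A toy illustration of a gradient method (not the EM algorithm itself): the parameter here is the mean of a single Gaussian, the distortion criterion is the negative log-likelihood of each incoming sample, and the step size is an illustrative choice:

```python
def gradient_step(param, grad, step=0.01):
    """One gradient-method update: move the parameter a small step in the
    negative direction of the distortion criterion's derivative."""
    return param - step * grad

def adapt_mean(mu, x, var=1.0, step=0.05):
    """Adapt a Gaussian mean to one sample x. For the criterion
    D(mu) = -log N(x; mu, var), the derivative is dD/dmu = (mu - x) / var."""
    grad = (mu - x) / var
    return gradient_step(mu, grad, step)
```

Repeated over a stream of samples, the mean drifts toward the data, which is the behavior that makes gradient methods suitable for the slowly changing conditions described above.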
In the appended figures, similar components and/or features may have the same reference label.