The present invention relates, in general, to data processing and, in particular, to speech signal processing for identifying voice activity.
A voice activity detector is useful for discriminating between speech and non-speech (e.g., fax, modem, music, static, dial tones). Such discrimination is useful for detecting speech in a noisy environment, compressing a signal by discarding non-speech, controlling communication devices that only allow one person at a time to speak (i.e., half-duplex mode), and so on.
A voice activity detector may be optimized for accuracy, speed, or some compromise between the two. Accuracy often means maximizing the rate at which speech is identified as speech and minimizing the rate at which non-speech is identified as speech. Speed is how much time it takes a voice activity detector to determine if a signal is speech or non-speech. Accuracy and speed work against each other. The most accurate voice activity detectors are often the slowest because they analyze a large number of features of the signal using computationally complex methods. The fastest voice activity detectors are often the least accurate because they analyze a small number of features of the signal using computationally simple methods. The primary goal of the present invention is accuracy.
Many prior art voice activity detectors only do a good job of distinguishing speech from one type of non-speech using one type of discriminator and do not do as well if a different type of non-speech is present. For example, the variance of the delta spectrum magnitude is an excellent discriminator of speech vs. music but is not a very good discriminator of speech vs. modem signals or speech vs. tones. Blind combination of specific discriminators does not lead to a general solution of speech vs. non-speech. A dimension reduction technique such as principal components reduction may be used when a large number of discriminators are analyzed in an attempt to compress the data according to signal variance. Unfortunately, maximizing variance may not provide good discrimination.
Over the past few years, several voice activity detectors have been in use. The first of these is a simple energy detection method, which detects increases in signal energy in voice grade channels. When the energy exceeds a threshold, a signal is declared to be present. By requiring that the variance of the energy distribution also exceed a threshold, the method may be used to distinguish speech from several types of non-speech.
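The energy detection method described above can be illustrated with a minimal sketch. The frame length, both thresholds, and the function names are assumptions introduced for illustration only; the text above does not specify them.

```python
from statistics import mean, pvariance

def frame_energies(samples, frame_len):
    """Mean energy of each non-overlapping frame of frame_len samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def energy_detector(samples, frame_len, energy_threshold, variance_threshold):
    """Return (signal_present, speech_like) for a block of samples.

    signal_present: the average frame energy exceeds energy_threshold.
    speech_like: additionally, the variance of the frame energies
    exceeds variance_threshold (both thresholds are hypothetical).
    """
    energies = frame_energies(samples, frame_len)
    signal_present = mean(energies) > energy_threshold
    speech_like = signal_present and pvariance(energies) > variance_threshold
    return signal_present, speech_like

# A steady signal has energy but almost no energy variance.
steady = [1.0] * 400
# A bursty signal has energy that changes from frame to frame.
bursty = [1.0] * 100 + [0.0] * 100 + [2.0] * 100 + [0.0] * 100
```

In this sketch the steady signal is declared present but not speech-like, while the bursty signal passes both tests.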
FIG. 1 is an illustration of a voice activity detection method called the readability method 1. It is a variation of the energy method. A signal is filtered 2 by a pre-whitening filter. An autocorrelation 3 is performed on the pre-whitened signal. The peak in the autocorrelated signal is then detected 4. The peak is then determined to be within the expected pitch range 5 (i.e., speech) or not 6 (i.e., non-speech). Speech is declared to be present if a bulge occurs in the correlation function within the expected periodicity range for the pitch excitation function of speech. The readability method is similar to the energy method since detection is based on energy exceeding a threshold. The readability method 1 performs better than the energy method because the readability method 1 exploits the periodicity of speech. However, the readability method does not perform well if there are changes in the gain, or dynamic range, of the signal. Also, the readability method identifies non-speech as speech when non-speech exhibits periodicity in the expected pitch range (i.e., 75 to 400 Hz). The pre-whitening filter removes un-modulated tones (i.e., non-speech) to prevent such tones from being identified as speech. However, such a filter does not remove other non-speech signals (e.g., modulated tones and FM signals) which may be present in a channel carrying speech. Such non-speech signals may be falsely identified as speech.
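The autocorrelation and peak-location steps of the readability method can be sketched as follows. This is an illustrative sketch only: the pre-whitening filter is omitted (the input is assumed to be already pre-whitened), and the peak threshold of 0.5, the 8000 Hz sample rate, and the function names are assumptions rather than values taken from the method itself.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation normalized by the zero-lag value, for lags 0..max_lag."""
    r0 = sum(v * v for v in x)
    if r0 == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(x[n] * x[n + k] for n in range(len(x) - k)) / r0
        for k in range(max_lag + 1)
    ]

def has_pitch_peak(x, sample_rate, peak_threshold=0.5, f_lo=75.0, f_hi=400.0):
    """True if the largest autocorrelation peak lies in the pitch lag range."""
    min_lag = int(sample_rate / f_hi)   # shortest expected pitch period
    max_lag = int(sample_rate / f_lo)   # longest expected pitch period
    r = autocorrelation(x, max_lag + 1)
    # Local maxima at non-zero lags (the zero-lag value is excluded).
    peaks = [k for k in range(1, max_lag + 1)
             if r[k] > r[k - 1] and r[k] >= r[k + 1]]
    if not peaks:
        return False
    best = max(peaks, key=lambda k: r[k])
    return min_lag <= best <= max_lag and r[best] > peak_threshold

RATE = 8000  # assumed sample rate in Hz
# A 100 Hz periodic signal: its period falls inside the 75 to 400 Hz range.
voiced = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(800)]
# A 1000 Hz tone: its strongest peak sits at a lag shorter than the pitch range.
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(800)]
```

Note that the sketch rejects the 1000 Hz tone only because its strongest peak falls outside the pitch range; a periodic non-speech signal inside that range would still be accepted, which is the weakness described above.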
FIG. 2 is an illustration of the NP method 20 which detects voice activity by estimating the signal-to-noise ratio (SNR) for each frame of the signal. A Fast Fourier Transform (FFT) is performed on the signal and the absolute value of the result is squared 21. The result of the last step is then filtered to remove un-modulated tones using a pre-whitening filter 22. The variance in the result of the last step is then determined 23. The result of the last step is then limited to a band of frequencies in which speech may occur 24. The power spectrum of each frame is computed and sorted 25 into either high energy components or low energy components. High energy components are assumed to be signal (speech which may include non-speech) or interference (non-speech) while low energy components are assumed to be noise (all non-speech). The highest energy components are discarded. The signal power is then estimated from the remaining high energy components 26. The noise power is estimated by averaging the low-energy components 27. The signal power is then divided by the noise power 28 to produce the SNR. The SNR is then compared to a user-definable threshold to determine whether the frame of the signal is speech or non-speech. Signal detection in the NP method is based on a power ratio measurement and is, therefore, not sensitive to the gain of the receiver. The fundamental assumption in the NP method is that spectral components of speech are sparse.
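The sorting and ratio steps 25 through 28 of the NP method can be sketched as follows, starting from an already computed power spectrum. The fraction of components treated as high energy, the number of top components discarded, and the use of a plain average for the signal power are assumptions made for illustration; the description above does not fix these details.

```python
import math

def np_snr(power_spectrum, high_fraction=0.25, discard_count=1):
    """Estimate SNR from a frame's power spectrum (hypothetical parameters).

    The components are sorted by energy, the top discard_count components
    are dropped, the remaining high energy components are averaged to give
    the signal power, and the low energy components are averaged to give
    the noise power.
    """
    ordered = sorted(power_spectrum, reverse=True)
    n_high = max(discard_count + 1, int(len(ordered) * high_fraction))
    high = ordered[discard_count:n_high]
    low = ordered[n_high:]
    if not high or not low:
        raise ValueError("power spectrum is too short to split")
    signal_power = sum(high) / len(high)
    noise_power = sum(low) / len(low)
    if noise_power == 0:
        return math.inf
    return signal_power / noise_power

def np_is_speech(power_spectrum, snr_threshold):
    """Compare the estimated SNR to a user-definable threshold."""
    return np_snr(power_spectrum) > snr_threshold

sparse_spectrum = [100.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
flat_spectrum = [1.0] * 8
```

Because the result is a ratio of two powers, scaling every component by the same gain leaves it unchanged, which reflects the gain insensitivity noted above.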
FIG. 3 illustrates a voice activity detector method named TALKATIVE 30 which detects speech by estimating the correlation properties of cepstral vectors. The assumption is that non-stationarity (a good discriminator of speech) is reflected in cepstral coefficients. Vectors of cepstral coefficients are computed in a frame of the signal 31. Squared Euclidean distances between cepstral vectors are computed 32. The squared Euclidean distances are time-averaged 33 within the frame in order to estimate the stationarity of the signal. A large time-averaged value indicates speech while a small time-averaged value indicates a stationary signal (i.e., non-speech). The time-averaged value is compared to a user-definable threshold 34 to determine whether the signal is speech or non-speech. The TALKATIVE method performs well for most signals, but does not perform well for music or impulsive signals. Also, considerable temporal smoothing occurs in the TALKATIVE method.
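Steps 32 through 34 of the TALKATIVE method can be sketched as follows, taking the cepstral vectors as given. The sketch assumes that distances are taken between successive vectors, which the description above does not state explicitly; the function names and the example vectors are likewise illustrative.

```python
def mean_squared_distance(vectors):
    """Average squared Euclidean distance between successive vectors."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in zip(vectors, vectors[1:])
    ]
    if not distances:
        raise ValueError("at least two vectors are required")
    return sum(distances) / len(distances)

def talkative_is_speech(cepstral_vectors, threshold):
    """A large time-averaged distance indicates a non-stationary signal."""
    return mean_squared_distance(cepstral_vectors) > threshold

# Made-up vectors: one sequence never changes, the other changes every step.
stationary = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
changing = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
```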
U.S. Pat. No. 4,351,983, entitled “SPEECH DETECTOR WITH VARIABLE THRESHOLD,” discloses a device for and method of detecting speech by adjusting the threshold for determining speech on a frame by frame basis. U.S. Pat. No. 4,351,983 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 4,672,669, entitled “VOICE ACTIVITY DETECTION PROCESS AND MEANS FOR IMPLEMENTING SAID PROCESS,” discloses a device for and method of detecting voice activity by comparing the energy of a signal to a threshold. The signal is determined to be voice if its power is above the threshold. If its power is below the threshold then the rate of change of the spectral parameters is tested. U.S. Pat. No. 4,672,669 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,255,340, entitled “METHOD FOR DETECTING VOICE PRESENCE ON A COMMUNICATION LINE,” discloses a method of detecting voice activity by determining the stationary or non-stationary state of a block of the signal and comparing the result to the results of the last M blocks. U.S. Pat. No. 5,255,340 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,276,765, entitled “VOICE ACTIVITY DETECTION,” discloses a device for and a method of detecting voice activity by performing an autocorrelation on weighted and combined coefficients of the input signal to provide a measure that depends on the power of the signal. The measure is then compared against a variable threshold to determine voice activity. U.S. Pat. No. 5,276,765 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. Nos. 5,459,814 and 5,649,055, both entitled “VOICE ACTIVITY DETECTOR FOR SPEECH SIGNALS IN VARIABLE BACKGROUND NOISE,” disclose a device for and method of detecting voice activity by measuring short term time domain characteristics of the input signal, including the average signal level and the absolute value of any change in average signal level. U.S. Pat. Nos. 5,459,814 and 5,649,055 are hereby incorporated by reference into the specification of the present invention.
U.S. Pat. Nos. 5,533,118 and 5,619,565, both entitled “VOICE ACTIVITY DETECTION METHOD AND APPARATUS USING THE SAME,” disclose a device for and method of detecting voice activity by dividing the square of the maximum value of the received signal by its energy and comparing this ratio to three different thresholds. U.S. Pat. Nos. 5,533,118 and 5,619,565 are hereby incorporated by reference into the specification of the present invention.
U.S. Pat. Nos. 5,598,466 and 5,737,407, both entitled “VOICE ACTIVITY DETECTOR FOR HALF-DUPLEX AUDIO COMMUNICATION SYSTEM,” disclose a device for and method of detecting voice activity by determining an average peak value and a standard deviation, updating a power density function, and detecting voice activity if the average peak value exceeds the power density function. U.S. Pat. Nos. 5,598,466 and 5,737,407 are hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,619,566, entitled “VOICE ACTIVITY DETECTOR FOR AN ECHO SUPPRESSOR AND AN ECHO SUPPRESSOR,” discloses a device for detecting voice activity that includes a whitening filter and a means for measuring energy, and that uses the energy level to determine the presence of voice activity. U.S. Pat. No. 5,619,566 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,732,141, entitled “DETECTING VOICE ACTIVITY,” discloses a device for and method of detecting voice activity by computing the autocorrelation coefficients of a signal, identifying a first autocorrelation vector, identifying a second autocorrelation vector, subtracting the first autocorrelation vector from the second autocorrelation vector, and computing a norm of the differentiation vector which indicates whether or not voice activity is present. U.S. Pat. No. 5,732,141 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,749,067, entitled “VOICE ACTIVITY DETECTOR,” discloses a device for and method of detecting voice activity by comparing the spectrum of a signal to a noise estimate, updating the noise estimate, computing a linear predictive coding prediction gain, and suppressing updating of the noise estimate if the gain exceeds a threshold. U.S. Pat. No. 5,749,067 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,867,574, entitled “VOICE ACTIVITY DETECTION SYSTEM AND METHOD,” discloses a device for and method of detecting voice activity by computing an energy term based on an integral of the absolute value of a derivative of a speech signal, computing a ratio of the energy to a noise level, and comparing the ratio to a voice activity threshold. U.S. Pat. No. 5,867,574 is hereby incorporated by reference into the specification of the present invention.
It is an object of the present invention to detect voice activity in a signal.
It is another object of the present invention to detect voice activity in a signal by squaring the absolute value of a signal, finding the low frequency components of the signal known as an AM envelope, subtracting the mean of the AM envelope from the AM envelope, padding the result with zeros if the length of the result is not a power of two, transforming the result using a Discrete Fast Fourier Transform, normalizing the result, computing a feature vector, and determining the presence of voice activity using Quadratic Discriminant Analysis.
It is another object of the present invention to remove music signals by observing threshold crossings of the AM envelope of the signal.
The present invention is a device for and method of detecting voice activity. A segment of a signal is received at an absolute value squarer, which computes the absolute value of the segment and then squares it.
The absolute value squarer is connected to a low pass filter, which blocks high frequency components of the output of the absolute value squarer and passes low frequency components of the output of the absolute value squarer.
The low pass filter is connected to a mean subtractor, which receives the AM envelope of the segment, computes the mean of the AM envelope, and subtracts the mean of the AM envelope from the AM envelope.
The mean subtractor is connected to a zero padder, which pads the result of the mean subtractor with zeros so that its length is a power of two.
The zero padder is connected to a Digital Fast Fourier Transformer (DFFT), which performs a Digital Fast Fourier Transform on the output of the zero padder.
The DFFT is connected to a normalizer, which computes a normalized magnitude vector of the DFFT of the AM envelope, computes the mean of the normalized magnitude vector, computes the variance of the normalized magnitude vector, and computes the power ratio of the normalized magnitude vector.
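The chain from the absolute value squarer through the normalizer can be sketched as follows. Several details are assumptions made only so that the sketch runs: a moving average stands in for the low pass filter, the magnitude vector is normalized by its largest element, and the window length and function names are hypothetical. The power ratio is left out because its definition is not given in this summary.

```python
import cmath
import math

def am_envelope(segment, window=8):
    """Square the absolute value, then smooth with a moving average
    (an assumed stand-in for the low pass filter)."""
    squared = [abs(x) ** 2 for x in segment]
    envelope, acc = [], 0.0
    for i, value in enumerate(squared):
        acc += value
        if i >= window:
            acc -= squared[i - window]
        envelope.append(acc / min(i + 1, window))
    return envelope

def subtract_mean(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def zero_pad_to_power_of_two(values):
    n = 1
    while n < len(values):
        n *= 2
    return list(values) + [0.0] * (n - len(values))

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def envelope_features(segment, window=8):
    """Return (mean, variance) of the peak-normalized magnitude vector."""
    env = subtract_mean(am_envelope(segment, window))
    mags = [abs(c) for c in fft(zero_pad_to_power_of_two(env))]
    peak = max(mags)
    if peak == 0:
        raise ValueError("envelope is constant; nothing to normalize")
    norm = [m / peak for m in mags]  # assumed normalization
    mean_value = sum(norm) / len(norm)
    variance = sum((v - mean_value) ** 2 for v in norm) / len(norm)
    return mean_value, variance

# Example input: a carrier whose amplitude varies slowly.
segment = [math.sin(2 * math.pi * 0.3 * n)
           * (1 + 0.5 * math.sin(2 * math.pi * 0.01 * n)) for n in range(200)]
features = envelope_features(segment)
```

The 200-sample segment is padded to 256 samples before the transform, which is the role of the zero padder.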
The normalizer is connected to a classifier, which receives the mean, variance, and power ratio of the normalized magnitude vector and compares these features to models of similar features precomputed for known speech and known non-speech to determine whether the unknown segment received is speech or non-speech.
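The comparison of a feature vector against precomputed speech and non-speech models can be sketched with a simplified quadratic discriminant. This sketch models each class as a Gaussian with a diagonal covariance and equal priors, which is a simplification of full Quadratic Discriminant Analysis (which uses a full covariance matrix per class). The training values are invented toy numbers, not measurements, and all names are hypothetical.

```python
import math

def fit_class_model(feature_vectors):
    """Per-dimension mean and variance of one class's training vectors."""
    n = len(feature_vectors)
    dims = len(feature_vectors[0])
    means = [sum(v[d] for v in feature_vectors) / n for d in range(dims)]
    variances = [
        # A small floor avoids division by zero for degenerate data.
        max(sum((v[d] - means[d]) ** 2 for v in feature_vectors) / n, 1e-12)
        for d in range(dims)
    ]
    return means, variances

def log_likelihood(x, model):
    """Gaussian log-likelihood with a diagonal covariance (assumed)."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )

def classify(x, speech_model, nonspeech_model):
    """Pick the class whose model gives the higher log-likelihood."""
    if log_likelihood(x, speech_model) > log_likelihood(x, nonspeech_model):
        return "speech"
    return "non-speech"

# Invented two-feature training data, for illustration only.
speech_model = fit_class_model([[0.20, 0.05], [0.30, 0.07], [0.25, 0.06]])
nonspeech_model = fit_class_model([[0.80, 0.010], [0.90, 0.020], [0.85, 0.015]])
```

Because each class keeps its own variances, the decision boundary between the two classes is quadratic in the features rather than linear.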
Alternate embodiments of the present invention may be realized by adding a threshold-crossing detector between the low pass filter and the mean subtractor to identify music as non-speech.
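Counting threshold crossings of an AM envelope can be sketched as follows. The sketch counts crossings in both directions, which is an assumption, and it stops at the count because this summary does not state how the count is mapped to a music decision.

```python
def count_threshold_crossings(envelope, threshold):
    """Number of times successive envelope samples fall on opposite
    sides of the threshold (both directions are counted)."""
    above = [value > threshold for value in envelope]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)

# Made-up envelopes: one oscillates around the threshold, one never reaches it.
oscillating = [0.0, 2.0, 0.0, 2.0, 0.0]
quiet = [0.0, 0.0, 0.0]
```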