The present invention relates to voice activity detection (VAD). Embodiments of the invention relate to low-complexity VAD devices, systems, and methods.
Voice Activity Detection (VAD) is a technique used in speech processing in which the presence or absence of the human voice is detected. Known applications include the following:
    Storage compression: combined with lossy or lossless compression; this task may be done off-line.
    Channel bandwidth reduction: e.g., GSM, G.729, combined with a comfort noise generator (CNG); this task must be done in real-time, where the hangover scheme is critical.
    Near-end voice detection: as a means to control acoustic echo cancellation (AEC) model training; this task must be done in real-time.
Recently, VAD has been employed as a wake-up trigger, serving as a front-end to more sophisticated keyword speech detection (e.g., provided by speech recognition vendors). This task must be done in real-time so that the processor that performs the further speech processing can be switched to a high power mode in time.
Most commercially available low-complexity VAD devices, systems, and methods apply an energy-based approach to detect voice activity. Given very little computing resource, an energy-based VAD faithfully detects voices in substantially quiet or rather noisy environments. However, such a VAD tends to respond to any sudden energy change, including footsteps, keystrokes, paper friction, chair creaking, and a spoon clinking in a bowl or mug. Frequent wake-ups due to false alarms increase power consumption, which is undesirable for portable devices with limited battery life.
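By way of illustration only, the energy-based approach described above may be sketched as follows. The function names, the frame length of 160 samples, the threshold ratio, and the smoothing factor are assumptions chosen for this sketch and are not taken from any particular commercial device. The sketch flags a frame as active when its energy exceeds a multiple of a slowly tracked noise floor.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(samples, frame_len=160, ratio=4.0, alpha=0.95):
    """Return one True/False decision per non-overlapping frame.

    A frame is flagged active when its energy exceeds `ratio` times a
    running noise-floor estimate. All default values are illustrative
    assumptions, not values from the source text.
    """
    decisions = []
    noise_floor = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy = frame_energy(samples[start:start + frame_len])
        if noise_floor is None:
            # Seed the noise floor from the first frame, which is
            # assumed here to contain background noise only.
            noise_floor = energy
            decisions.append(False)
            continue
        # The small constant guards against a zero noise floor.
        active = energy > ratio * max(noise_floor, 1e-12)
        if not active:
            # Track the background level only during inactive frames.
            noise_floor = alpha * noise_floor + (1.0 - alpha) * energy
        decisions.append(active)
    return decisions
```

Because the decision depends on energy alone, a short click and a sustained voiced sound of sufficient level are both flagged in this sketch, which is the false alarm behavior noted above.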
To discriminate voices from other sounds with sudden energy changes, one skilled in the art often applies frequency analysis. However, a Fourier transform or the like requires a significant amount of computation, which is undesirable for an always-on portable device. Zero crossing rate is widely used and relatively inexpensive. It may be useful for screening out very low frequency machine noises, but not other noises that contain high frequency content (which may coincide with some consonants). Another distinguishing characteristic is pitch, which can be extracted via an autocorrelation method. High correlation suggests that the incoming sounds may be vowels. But some non-voice sounds, e.g., tones, have high correlation, too. The high computational complexity also makes autocorrelation-based pitch extraction unsuitable for low power applications.
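By way of illustration only, the two features discussed above may be sketched as follows. The function names and the lag range are assumptions made for this sketch. The zero crossing rate requires only one sign comparison per sample, whereas the normalized autocorrelation requires on the order of the frame length multiplied by the number of candidate lags in multiply-accumulate operations per frame.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def peak_normalized_autocorr(frame, min_lag, max_lag):
    """Largest normalized autocorrelation over lags min_lag..max_lag.

    A value near 1.0 indicates a strongly periodic frame. The lag
    range is an assumption of this sketch; it would typically be
    chosen to cover the expected pitch periods at the sampling rate.
    """
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        head = frame[:-lag]   # samples x[n]
        tail = frame[lag:]    # samples x[n + lag]
        num = sum(x * y for x, y in zip(head, tail))
        den = math.sqrt(
            sum(x * x for x in head) * sum(y * y for y in tail)
        )
        if den > 0.0:
            best = max(best, num / den)
    return best
```

In this sketch, a pure tone yields a peak correlation close to 1.0 just as a vowel would, so correlation alone does not separate tones from voice, consistent with the limitation noted above.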
Therefore, it would be desirable to have a low-complexity method for reducing false alarms and preventing the system from being unnecessarily activated to a high power mode.