A voice activity detector (VAD) is a device that analyzes an input electrical signal representing audio information to determine whether or not speech is present. Usually, a VAD delivers an output signal that takes one of two possible values, respectively indicating that speech is detected to be present or speech is detected not to be present. In general, the value of the output signal will change with time according to whether or not speech is detected to be present in each frame of the analyzed signal.
A VAD is often incorporated in a speech communication device such as a fixed or mobile telephone, a radio communication unit or a like device. Use of a VAD is an important enabling technology for a variety of speech based applications such as speech recognition, speech encoding, speech compression and hands free telephony. The primary function of a VAD is to provide an ongoing indication of speech presence as well to identify the beginning and end of each segment of speech, e.g. separately uttered words or syllables. Devices such as automatic gain controllers employ a VAD to detect when they should operate in a speech present mode.
While VADs operate quite effectively in a relatively quiet environment, e.g. a conference room, they tend to be less accurate in noisy environments such as in road vehicles and, in consequence, they may generate detection errors. These detection errors include ‘false alarms’ which produce a signal indicating speech when none is present and ‘mis-detects’ which do not produce a signal to indicate speech when speech is present in noise.
There are many known algorithms employed in VADs to detect speech. Each of the known algorithms has advantages and disadvantages. In consequence, some VADs may tend to produce false alarms and others may tend to produce mis-detects. Some VADs may tend to produce both false alarms and mis-detects in noisy environments.
Many of the known VAD algorithms have an operational relationship to a particular speech codec and are adapted to operate in combination with the particular speech codec. This leads to difficulty and expense needed to modify the VAD when the speech codec has to be modified or upgraded.
A common feature of many VADs is that they utilize an adaptive noise threshold based on an estimation of absolute signal level. The absolute signal level can vary rapidly. As a result, a significant problem occurs when there is a transition in the form of a relatively steep increase in noise level. The noise threshold tracking may fail even if speech is absent. In this case, the VAD may interpret the steep increase in noise level as an onset of speech. One known way to alleviate the effect of such a transition is to measure the short-term power stationarity (extent of being stationary) of the input signal over a long enough test interval. This approach requires a period of time to detect the noise transition from one level to another plus the time interval required to apply the stationarity test, typically a total delay period of from about one to about three seconds.
In addition, the power stationarity test known in the art does not address the problem of noise level increases which occur during and between closely spaced speech utterances unless there are relatively long gaps between the utterances (longer than the test interval) and the noise level is stationary within those gaps.
In another known method which is a development of the power stationarity test, the lower envelope or minimum of the signal energy is tracked so that an adaptive noise threshold can be properly updated to a new level at the end of a speech utterance. However, in practice this method is likely to require a longer delay than the conventional power stationarity test. The reason is that the rate of increase (slope) of the lower envelope of the signal energy has to be transformed to match, on average, the expected increase of a speech signal.
Some known VADs may mistakenly classify strong radio noise in an initial period of typically 1.5 to 2 seconds as speech, or speech and noise intermittently, by producing a VAD decision every frame, e.g. typically every 10 milliseconds (msec), within the initial period. Where the VAD is coupled to control a radio transmitter of a first terminal, the erroneous speech detection by the VAD can trigger an erroneous radio transmission by the first terminal. Where the radio signal transmitted erroneously by the first terminal is received by a second terminal which is also coupled to a VAD, a similar effect can occur at the second terminal causing a further erroneous radio signal to be sent back to the first terminal. An infinite loop of erroneous commands and radio transmissions can be created in this way. The radio transmissions contain only noise which users of the first and second terminals may find to be very unsatisfactory. Only after the initial period of typically 1.5 to 2 seconds has elapsed, does the VAD coupled to the first terminal become stabilized to provide a correct decision of noise, thereby allowing the loop of erroneous commands and transmissions to be cut. The initial period required for stabilization in known VADs when strong noise is detected is considered to be too long.
Thus, there exists a need for a VAD and method of operation which addresses at least some of the shortcomings of known VADs and methods.
Skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the drawings may be exaggerated relative to other elements to help to improve understanding of various embodiments. In addition, the description and drawings do not necessarily require the order illustrated. Apparatus and method components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the various embodiments so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein. Thus, it will be appreciated that for simplicity and clarity of illustration, common and well-understood elements that are useful or necessary in a commercially feasible embodiment may not be depicted in order to facilitate a less obstructed view of these various embodiments.