This invention relates generally to the field of voice activated systems and, more particularly, to an improved voice activity detection scheme for speakerphones and the like.
As is generally known in the art, speakerphones are used to provide “hands-free” operation of telephone sets. A typical speakerphone includes a microphone coupled to an input of a transmit path and a speaker coupled to an output of a receive path. In half-duplex speakerphones, only one of the paths is active while the other is disabled or suppressed. Accordingly, most half-duplex speakerphones have three modes of operation, including a silence mode, a transmit mode, and a receive mode.
Prior art speakerphones generally rely on algorithms implemented in a voice activity detector (VAD) to determine the presence of speech and the direction of speech so that appropriate switching decisions can be made with respect to the various modes of operation. To facilitate the detection of speech, voice energy measurements of speech signals are provided as input to the algorithms. As is known to those skilled in the art, voice energy measurements typically involve summation and multiplication operations using the root mean square (rms) voltage levels of the speech signals. Consequently, prior art techniques for sampling and converting amplitudes of voltage levels of the speech signals into voice energy measurements can be processor intensive. As such, prior art systems suffer the disadvantage of using valuable processing time for the measurement of voice energy in a system.
The direction of speech is also identified in a speakerphone so that appropriate switching decisions can be made between the various modes of operation. Speakerphones typically use algorithms to identify the direction having the highest voice energy level. The direction having the highest voice energy level would then be given a clear talking path while the signals in the opposite direction would be suppressed, such as by attenuating the signal by inserting loss into the path. If voice energy is not detected in either path, most speakerphones enter the silence mode. In the course of identifying and adapting to the direction of speech, speakerphones typically experience a clipping problem that normally occurs as a result of switching delays associated with transitions between the various modes of operation.
Acoustical feedback is also a known problem in speakerphone applications as a result of, among other factors, the proximity of the speaker to the microphone. For a typical speakerphone in the prior art, acoustic coupling occurs when signals from the receive path are coupled from the speaker to the transmit path via the microphone. The adverse effects of acoustical feedback can be manifested in the form of “singing” or “ringing.” Signals from the transmit path can also be undesirably coupled to the receive path as a result of sidetones that occur at the hybrid interface of the speakerphone and a telephone line. As a result of acoustic coupling and sidetones, the effects of acoustical feedback may be present during any of the various modes of operation. For example, in the silence mode, background noise can be acoustically coupled from the receive path to the transmit path. To counter the effects of acoustic coupling, signals in both the transmit and receive paths are usually suppressed at some level during the silence mode. In the transmit and receive modes, a typical half-duplex speakerphone completely suppresses the inactive path (e.g., the transmit path in the receive mode) to guard against the acoustic coupling effects. However, in all of the operating modes, the suppression method itself contributes to the clipping problem during the transitions between the various modes. More specifically, the delays associated with the application and removal of suppression of the signals from the transmit and receive paths result in clipping the initial portion of the speech signal during the transition between modes. Consequently, prior art speakerphones suffer the disadvantage of not being able to rapidly switch between modes without clipping some portion of the signal.
Another problem in typical speakerphones is directional loss that occurs when the speaking party is not speaking directly into the microphone. Directional loss affects intelligibility of speech signals and, as a result, causes problems in detecting speech. Directional loss is more problematic at the higher frequencies, because higher frequency speech sounds (e.g., consonants) are more directional and thus more susceptible to directional loss.
Accordingly, there is a need for an improved speakerphone that efficiently utilizes processing resources to overcome the shortcomings of the prior art speakerphones.
These and other aspects of the invention may be obtained generally in a voice activity detection scheme that can be used in a half-duplex speakerphone which operates in a transmit mode, a receive mode, and a silence mode. To facilitate smooth and efficient switching between the various modes, the voice activity detection scheme utilizes a novel voice energy term which is derived from an integral of the absolute value of a derivative of a speech signal.
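By way of illustration only, the voice energy term described above may be approximated on sampled data by summing the absolute differences between consecutive samples of a frame, which is a discrete stand-in for the integral of the absolute value of the derivative. The following Python sketch assumes a frame of numeric samples; the function name voice_energy and the framing are hypothetical and are not taken from the specification. Note that the computation uses only subtraction, absolute value, and addition, in contrast to the multiplications involved in an rms measurement.

```python
def voice_energy(samples):
    """Approximate the integral of |d/dt s(t)| over one frame.

    Hypothetical helper: 'samples' is a sequence of numeric amplitudes.
    The derivative is approximated by the first difference between
    consecutive samples, and the integral by a running sum.
    """
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    # A flat (DC) frame has no sample-to-sample change.
    print(voice_energy([5, 5, 5]))
    # A frame that swings between values accumulates the size of each swing.
    print(voice_energy([0, 1, -1, 2]))
```

In such a sketch, a constant signal yields a zero energy value regardless of its level, while rapidly varying signals yield larger values.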
In one embodiment of the present invention, voice activity is detected in a transmit path and a receive path of the speakerphone during the silence mode by comparing a first ratio of a current voice energy value to a background noise value with a voice activity threshold value. Upon detecting voice activity above the voice activity threshold, the speakerphone transitions from the silence mode to either a transmit or receive mode, depending on the location of the voice activity. A change in direction of the speech signal is identified during the transmit mode or receive mode by comparing a second ratio of a transmit path voice energy value to a receive path voice energy value with a transmit threshold value and a receive threshold value. The voice activity detector initiates the appropriate transition (i.e., change in direction) between the transmit and receive modes according to values of the second ratio with respect to the transmit and receive threshold values.
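The mode decisions described above may be sketched, under stated assumptions, as a small state function. In the following Python sketch, the threshold values, the function name next_mode, the tie-breaking rule when both paths show activity, and the direction of each comparison (a second ratio above the transmit threshold selects the transmit mode, and below the receive threshold selects the receive mode) are all assumptions made for illustration; the specification does not fix them here. Return to the silence mode is not modeled.

```python
SILENCE, TRANSMIT, RECEIVE = "silence", "transmit", "receive"


def next_mode(mode, tx_energy, rx_energy, tx_noise, rx_noise,
              vad_threshold=4.0, tx_threshold=2.0, rx_threshold=0.5):
    """Return the next operating mode (all thresholds are hypothetical)."""
    eps = 1e-12  # guards against division by zero

    if mode == SILENCE:
        # First ratio: current voice energy over background noise, per path.
        tx_ratio = tx_energy / max(tx_noise, eps)
        rx_ratio = rx_energy / max(rx_noise, eps)
        tx_active = tx_ratio > vad_threshold
        rx_active = rx_ratio > vad_threshold
        # Assumed tie-break: if both paths are active, the larger ratio wins.
        if tx_active and (not rx_active or tx_ratio >= rx_ratio):
            return TRANSMIT
        if rx_active:
            return RECEIVE
        return SILENCE

    # Second ratio: transmit path energy over receive path energy.
    direction = tx_energy / max(rx_energy, eps)
    if direction > tx_threshold:
        return TRANSMIT
    if direction < rx_threshold:
        return RECEIVE
    return mode  # between the two thresholds: keep the current mode
```

Using two separate thresholds for the second ratio leaves a band in which the current mode is retained, so that small fluctuations in the ratio do not by themselves cause a change in direction.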
Following the detection of voice activity in one of the directions, the speakerphone begins transitioning to the applicable mode by gradually suppressing the signal in the other direction. The first ratio for detecting the start of voice activity is used to enable the initial transition from the silence mode to the receive mode or from the silence mode to the transmit mode according to the source of the signal. The second ratio is used to either maintain the speakerphone in its current mode (i.e., the transmit or receive mode) or to begin transitioning towards the other direction (i.e., from the transmit to the receive mode or from the receive to the transmit mode). The transition between the transmit and receive mode is accomplished by gradually suppressing one of the paths (i.e., the transmit or receive path) while gradually removing the suppression from the other path. The gradual application and removal of suppression, when used in conjunction with other aspects of the invention, minimizes the amount of clipping that occurs during the transitions between the various modes of operation.
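The gradual application and removal of suppression may be illustrated by a per-frame gain ramp. The following Python sketch is a minimal example under assumed parameters: the linear ramp, the step size, the gain range of 0.0 to 1.0, and the name step_gains are hypothetical, and the specification does not state the shape or rate of the ramp. Gains are left unchanged in any mode other than transmit or receive.

```python
def step_gains(mode, tx_gain, rx_gain, step=0.25, floor=0.0):
    """Move each path gain one step toward its target for the given mode.

    Hypothetical linear ramp: the active path moves toward unity gain
    while the other path moves toward 'floor' (full suppression).
    """
    def toward(gain, target):
        if gain < target:
            return min(gain + step, target)
        return max(gain - step, target)

    if mode == "transmit":
        return toward(tx_gain, 1.0), toward(rx_gain, floor)
    if mode == "receive":
        return toward(tx_gain, floor), toward(rx_gain, 1.0)
    return tx_gain, rx_gain


if __name__ == "__main__":
    # Ramp from a fully receive-oriented state toward transmit.
    tx, rx = 0.0, 1.0
    for _ in range(4):
        tx, rx = step_gains("transmit", tx, rx)
        print(tx, rx)
```

Because each call changes a gain by at most one step, the newly active path is opened over several frames rather than switched abruptly, which is the behavior the passage above relies on to reduce clipping.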
In another exemplary embodiment, further processor and power savings are achieved by using the steps for detecting the start of voice activity and for detecting the change in direction of the speech signals to control gain insertion and vocoder operations in a half-duplex speakerphone.
Advantageously, valuable processing time is conserved in the voice detection scheme of the present invention because of the efficiencies achieved with the measurement of voice energy using the new voice energy term. Moreover, the voice activity detection scheme of the present invention mitigates the adverse effects of clipping and acoustical coupling by the gradual application and removal of suppression according to ratios of voice energy values.