The present invention relates to a method and a device for detecting the source of a voice comprising microphone means for receiving a voice signal and detection means for the detection of the voice in the received voice signal.
A telephone conversation is often disturbed by echo. This concerns in particular full-duplex telephones which have four different speech states: idle, near-end speech, far-end speech and double-talk. The echo occurs usually when speech is coming from the far end, when the received far end signal is reproduced in a loudspeaker and is returned to the far end through a microphone. The echo problem occurs in particular in such hands-free solutions, in which a loudspeaker reproduces the voice with high volume to the surroundings and the voice from the loudspeaker thus is easily returned to the microphone.
Adaptive signal processing is used in order to remove the echo. In a hands-free application of a mobile telephone it is possible to effectively eliminate the very disturbing acoustic feedback from the loudspeaker to the microphone (the acoustic echo) by using prior known echo cancellers and echo suppressors. An echo canceller can be realized using an adaptive digital filter which usually suppresses the echo signal, i.e. the signal which has come from the far end, from an outgoing signal when a far-end signal is present at the reception. In this way the aim is to prevent a far-end signal from returning to the far end. The parameters of the adaptive filter are usually updated whenever far-end speech occurs in order to take into account the prevailing conditions as accurately as possible. An echo suppressor, for its part, is used to attenuate the near-end signal to be transmitted.
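The operation of such an adaptive echo canceller can be illustrated with a minimal sketch. The normalized LMS (NLMS) update rule, the function name nlms_echo_cancel, the step size and the three-tap echo path used here are assumptions made for illustration only; the text above does not specify which adaptation algorithm a prior known echo canceller uses.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, weights) where residual[n] = mic[n] - estimated echo."""
    w = [0.0] * taps            # adaptive filter coefficients (echo path model)
    x = [0.0] * taps            # most recent far-end samples, newest first
    residual = []
    for far, m in zip(far_end, mic):
        x = [far] + x[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = m - echo_estimate   # signal that would be sent to the far end
        norm = sum(xi * xi for xi in x) + eps
        # NLMS update (assumed algorithm): step is scaled by far-end power
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w

# Synthetic check with a hypothetical 3-tap echo path and no near-end speech.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(2000)]
path = [0.6, -0.3, 0.1]
mic = [sum(path[k] * far[n - k] for k in range(len(path)) if n - k >= 0)
       for n in range(len(far))]
residual, weights = nlms_echo_cancel(far, mic)
```

With only far-end speech present, the coefficients approach the simulated echo path and the residual decays towards zero.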
A situation in which near-end and far-end speech occur simultaneously is called a double-talk situation. During double-talk an echo canceller is not capable of effectively removing an echo signal. This is due to the fact that the echo signal is summed with the near-end speech in the signal to be transmitted, in which case the echo canceller is not capable of forming an accurate model of the echo signal to be removed. In such a case the adaptive filter of the echo canceller is not capable of adapting correctly to the acoustic response of the space between the loudspeaker and the microphone and accordingly is not capable of removing the acoustic echo from the signal to be transmitted if the near-end speech signal is present. For this reason a double-talk detector is often used in order to eliminate the disturbing effect of double-talk on the echo canceller. A double-talk situation is usually detected by detecting whether there is near-end speech simultaneously with far-end speech. During double-talk the parameters of the adaptive filter of the echo canceller are not updated; the updating has to be interrupted while the near-end person speaks. An echo suppressor also requires information about the speech activity of the near-end speaker in order not to attenuate the signal to be transmitted too much while the near-end person is speaking.
In addition to echo cancelling and echo suppressing, the information about near-end speech activity is needed for the interruptible transmission (discontinuous transmission, DTX) used in GSM mobile telephones. The idea of the interruptible transmission is to transmit a speech signal only during speech activity, i.e. when the near-end speaker is quiet the near-end signal is not transmitted, in order to save power. In order to avoid excessive variations of background noise level due to the interruptible transmission, it is possible to transmit some comfort noise in the idle state and still save bits needed in the transmission. In order that the interruptible transmission of the GSM would not reduce the quality of the transmitted speech, the near-end speech activity must be detected accurately, quickly and reliably.
FIG. 1 presents prior known arrangement 1 for echo cancelling and double-talk detection. Near-end signal 3 comes from microphone 2 and it is detected using near-end speech activity detector 4, a VAD (Voice Activity Detector). Far-end signal 5 comes from input connection I (which can be the input connector of hands-free equipment, the wire connector of a fixed telephone or, in mobile telephones, the path from an antenna to the reception branch of the telephone), it is detected in far-end speech activity detector 6, a VAD, and finally it is reproduced with loudspeaker 7. Both near-end signal 3 and far-end signal 5 are fed to double-talk detector 8 for the detection of double-talk and to adaptive filter 9 for adapting to the acoustic response of echo path 13. Adaptive filter 9 also gets as an input the output of double-talk detector 8, so that the filter is not adapted (its parameters are not updated) during double-talk. Model 10 formed by the adaptive filter is subtracted from near-end signal 3 in summing/subtracting unit 11 in order to perform the echo cancelling. Echo canceller output signal 12, from which some of the echo has been cancelled, is brought to output connection O (which can be the output connector of hands-free equipment, the wire connector of a fixed telephone or, in mobile telephones, the path through the transmission branch to the antenna). It is possible to realize the echo canceller presented in FIG. 1 integrated in a telephone (comprising for example a loudspeaker and a microphone for a hands-free loudspeaker call) or in separate hands-free equipment.
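The interaction between double-talk detector 8, adaptive filter 9 and summing/subtracting unit 11 can be sketched for a single sample as follows. The function name canceller_step and the NLMS update rule are assumptions made for illustration; the arrangement of FIG. 1 only requires that the parameters are left unchanged while double-talk is indicated.

```python
def canceller_step(w, x, mic_sample, double_talk, mu=0.5, eps=1e-8):
    """Process one sample and return (output_sample, updated_weights).

    w           -- current coefficients of adaptive filter 9
    x           -- recent samples of far-end signal 5, newest first, len(x) == len(w)
    mic_sample  -- current sample of near-end signal 3
    double_talk -- decision of double-talk detector 8
    """
    echo_model = sum(wi * xi for wi, xi in zip(w, x))    # model 10
    out = mic_sample - echo_model                         # unit 11 gives signal 12
    if double_talk:
        return out, list(w)                               # adaptation is frozen
    norm = sum(xi * xi for xi in x) + eps
    # Assumed NLMS update, applied only when no double-talk is indicated
    return out, [wi + mu * out * xi / norm for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
x = [1.0, 0.5]
out_frozen, w_frozen = canceller_step(w0, x, 0.8, double_talk=True)
out_adapt, w_adapt = canceller_step(w0, x, 0.8, double_talk=False)
```

In both calls the output sample is formed in the same way; only the coefficient update differs.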
Several methods for the detection of double-talk have been presented. Many of these, however, are very simple and partly unreliable. Most double-talk detectors are based upon the power ratios between the loudspeaker signal and/or the microphone signal and/or the signal after an echo canceller. The advantages of these detectors are simplicity and quickness; their disadvantage is unreliability.
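A detector of the power-ratio type can be sketched as below. The frame-based comparison, the function names and the threshold values are hypothetical and chosen only to illustrate the principle; they are not taken from any particular prior known detector.

```python
def frame_power(frame):
    """Mean power of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def power_ratio_double_talk(far_frame, mic_frame, ratio_threshold=0.5, floor=1e-6):
    """Flag double-talk when microphone power is high relative to loudspeaker power."""
    p_far = frame_power(far_frame)
    p_mic = frame_power(mic_frame)
    if p_far < floor:
        return False            # no far-end speech, hence no double-talk
    return p_mic / p_far > ratio_threshold

far = [0.5, -0.5, 0.5, -0.5]
echo_only = [0.3 * s for s in far]                       # assumed echo gain 0.3
near = [0.4, 0.4, -0.4, -0.4]                            # synthetic near-end speech
echo_plus_near = [e + n for e, n in zip(echo_only, near)]

flag_echo_only = power_ratio_double_talk(far, echo_only)
flag_double_talk = power_ratio_double_talk(far, echo_plus_near)
```

The sketch also shows the source of the unreliability: a change in the echo gain (for example a higher loudspeaker volume) shifts the ratio in the same way as near-end speech does.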
Detectors based upon the correlation between the loudspeaker signal and/or the microphone signal and/or the signal after an echo canceller are also prior known. These detectors are based upon the idea that a loudspeaker signal and a pure echo signal in a microphone (the signal after an echo canceller) are strongly correlated, but when a near-end signal is summed in the microphone signal the correlation is reduced. The disadvantages of these detectors are slowness, the (partly incorrect) assumption that the near-end and far-end signals are uncorrelated, and the changes which the echo path causes in the loudspeaker signal, which reduce the correlation even when no near-end signal is present.
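The correlation principle can be sketched in the following way. The zero-lag normalized correlation, the function names and the threshold are assumptions made for illustration; a practical detector would also have to search over the delay of the echo path, which is one reason for the slowness mentioned above.

```python
import math

def normalized_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equally long frames."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def correlation_double_talk(far_frame, mic_frame, threshold=0.8):
    """Flag double-talk when the correlation falls below the threshold.

    Assumes far-end speech is present in far_frame.
    """
    return abs(normalized_correlation(far_frame, mic_frame)) < threshold

far = [0.5, -0.5, 0.5, -0.5]
echo_only = [0.3 * s for s in far]                       # assumed echo gain 0.3
near = [0.4, 0.4, -0.4, -0.4]                            # synthetic near-end speech
echo_plus_near = [e + n for e, n in zip(echo_only, near)]

flag_echo_only = correlation_double_talk(far, echo_only)
flag_double_talk = correlation_double_talk(far, echo_plus_near)
```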
A double-talk detector based upon the comparison of the autocorrelation of the same signals is also prior known; this detector recognizes the voice in a near-end signal and thus can detect the presence of the near-end signal. Such a detector requires less calculation power, but it suffers from the same problems as the detectors based upon correlation.
In the publication Kuo S. M., Pan Z., "Acoustic Echo Cancellation Microphone System for Large-Scale Video Conferencing", Proceedings of ICSPAT '94, pp. 7-12, 1994, two microphones directed in opposite directions have been utilized for removing noise and acoustic echo and for recognizing the different speech situations mentioned in the beginning. The method in question does not, however, bring any particular improvement in the recognition of double-talk, which is performed merely according to the output power of the echo canceller.
In the publication Affes S., Grenier Y., "A Source Subspace Tracking Array of Microphones for Double-talk Situations", Proceedings of ICSPAT '96, Vol. 2, pp. 909-912, 1996, an echo and background-noise canceller with a microphone vector structure has been presented. The presented echo canceller filters out signals coming from a spatially chosen direction while maintaining the signals coming from a desired direction. The echo canceller in question is capable of operating also during double-talk situations. However, the publication presents neither near-end speech activity detection nor double-talk detection using a multi-microphone solution (also called a microphone vector).
Now a method and a device have been invented for the detection of near-end speech activity and the recognition of double-talk situations. The invention is based upon detecting a near-end speech signal according to the direction it comes from. In hands-free applications, in which a loudspeaker signal comes from a direction clearly different from the direction of the speech signal of a near-end speaker, the near-end speech signal can be distinguished from the loudspeaker signal based upon their angles of arrival. In the invention the detection is performed using several microphones (a microphone vector), which pick up the voice from different directions and/or different points.
The outputs of the microphone vector are first band-pass filtered into narrow-band signals, and a direction-of-arrival estimation is performed on the signal matrix formed by the filtered signals. The estimation yields the spatial spectrum, from which the arrival directions are tracked based upon peaks occurring in the spectrum. The assumed arrival directions of the near-end speech signal and of the loudspeaker signal are updated based upon the obtained arrival directions. These assumed values of the arrival directions make the final VAD decision easier. If the arrival direction estimator detects a sufficiently strong spectrum peak in an arrival direction which is close enough to the assumed arrival direction of the near-end speech signal, the near-end speaker is regarded as speaking, i.e. near-end speech activity is detected.
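The chain from narrow-band signals to the VAD decision can be sketched as follows. Several assumptions are made for illustration only: the conventional (Bartlett, delay-and-sum) spatial spectrum is used as the estimator, the microphones form a uniform linear array with half-wavelength spacing, the narrow-band signals are simulated directly as complex snapshots instead of being band-pass filtered, and all function names, angles and thresholds are hypothetical. The text above does not fix any of these choices.

```python
import cmath
import math

def steering(theta_deg, n_mics, spacing=0.5):
    """Phase vector of a uniform linear array (spacing in wavelengths)."""
    phase = 2 * math.pi * spacing * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) for m in range(n_mics)]

def simulate(n_mics, n_snap, sources):
    """Signal matrix for hypothetical sources given as (angle_deg, phase_rate)."""
    snaps = []
    for k in range(n_snap):
        snap = [0j] * n_mics
        for theta, rate in sources:
            s = cmath.exp(1j * rate * k)
            snap = [x + s * a for x, a in zip(snap, steering(theta, n_mics))]
        snaps.append(snap)
    return snaps

def spatial_spectrum(snaps, angles):
    """Bartlett power per candidate angle, normalized so a unit source gives 1."""
    n_mics = len(snaps[0])
    spectrum = []
    for theta in angles:
        a = steering(theta, n_mics)
        power = sum(abs(sum(ai.conjugate() * xi for ai, xi in zip(a, snap))) ** 2
                    for snap in snaps)
        spectrum.append(power / (len(snaps) * n_mics ** 2))
    return spectrum

def find_peaks(spectrum, angles, min_power):
    """Local maxima of the spatial spectrum that are sufficiently strong."""
    return [(angles[i], spectrum[i]) for i in range(1, len(spectrum) - 1)
            if spectrum[i] >= min_power
            and spectrum[i] > spectrum[i - 1] and spectrum[i] >= spectrum[i + 1]]

def near_end_active(peaks, assumed_near_deg, tolerance_deg):
    """VAD decision: is any peak close enough to the assumed near-end direction?"""
    return any(abs(theta - assumed_near_deg) <= tolerance_deg for theta, _ in peaks)

ANGLES = list(range(-90, 91))
both = simulate(8, 200, [(-40, 0.7), (20, 1.9)])    # loudspeaker and near-end talker
speaker_only = simulate(8, 200, [(-40, 0.7)])       # loudspeaker alone
peaks_both = find_peaks(spatial_spectrum(both, ANGLES), ANGLES, min_power=0.5)
peaks_speaker = find_peaks(spatial_spectrum(speaker_only, ANGLES), ANGLES, min_power=0.5)
vad_both = near_end_active(peaks_both, assumed_near_deg=20, tolerance_deg=5)
vad_speaker = near_end_active(peaks_speaker, assumed_near_deg=20, tolerance_deg=5)
```

In this simulation the loudspeaker is placed at -40 degrees and the near-end speaker at 20 degrees, so that the two directions are clearly different, as presumed for hands-free applications.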
A double-talk decision requires, in addition to near-end speech activity, information about far-end speech activity, which can be detected by using a prior known voice activity detector, for example a voice activity detector based upon power levels (see FIG. 1).
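The combination of the two activity decisions can be sketched as below. The function names and the power threshold of the far-end detector are hypothetical; the near-end decision is assumed to be supplied by the direction-based detection.

```python
def far_end_active(far_frame, power_threshold=1e-3):
    """Power-level voice activity decision for the far-end signal."""
    power = sum(s * s for s in far_frame) / len(far_frame)
    return power > power_threshold

def speech_state(near_active, far_active):
    """Map the two activity decisions to one of the four speech states."""
    if near_active and far_active:
        return "double-talk"
    if near_active:
        return "near-end speech"
    if far_active:
        return "far-end speech"
    return "idle"

state_a = speech_state(True, far_end_active([0.5, -0.5, 0.5, -0.5]))
state_b = speech_state(False, far_end_active([0.5, -0.5, 0.5, -0.5]))
state_c = speech_state(True, far_end_active([0.0, 0.0, 0.0, 0.0]))
state_d = speech_state(False, far_end_active([0.0, 0.0, 0.0, 0.0]))
```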
A device according to the invention is characterized in that it comprises means for determining the direction of arrival of a received signal, means for storing the assumed direction of arrival of the voice of a certain source, means for comparing the direction of arrival of said received signal and said assumed direction of arrival, and means for indicating that the voice has originated from said certain source when said comparison indicates that the direction of arrival of said received signal matches said assumed direction of arrival within a certain tolerance.