This invention relates to a signal source direction or position estimation method and apparatus for estimating the direction or the position in or at which a signal source such as a sound source is present, and more particularly to a signal source direction or position estimation method and apparatus wherein the direction or the position of a signal source or of each of a plurality of signal sources is estimated based on cross correlation functions between signals received by a plurality of reception apparatus.
The present invention further relates to a signal emphasis method and apparatus wherein a signal of a signal source or a plurality of signals of different signal sources are emphasized based on cross correlation functions between signals received by a plurality of reception apparatus.
A technique of estimating the direction or the position of a signal source in an environment in which much noise is present or a plurality of signal sources generate signals simultaneously is utilized for a system which adapts a reception apparatus so as to receive a signal from a signal source better, another system which automatically directs a video camera toward a signal source to monitor the signal source, a television conference system which automatically directs a video camera toward a speaker to transmit a video of the speaker, and like systems.
In a technical field of signal processing of the type mentioned, it is a conventionally used technique to use a plurality of reception apparatus to receive an originated signal and estimate the direction or the position of the signal source based on cross correlation functions between the received signals of the reception apparatus.
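By way of illustration only, the estimation of a time difference from a cross correlation function between two received signals may be sketched as follows. The sketch assumes discrete-time signals of equal sampling rate and an integer delay; the function names `cross_correlation`, `estimate_delay` and `make_delayed_pair` are hypothetical and do not appear in any of the cited documents.

```python
import random


def cross_correlation(x, y, lag):
    """Sum of x[k] * y[k + lag] over all k for which both indices are valid."""
    lo = max(0, -lag)
    hi = min(len(x), len(y) - lag)
    return sum(x[k] * y[k + lag] for k in range(lo, hi))


def estimate_delay(x, y, max_lag):
    """Return the lag (in samples) at which the cross correlation peaks.

    A positive result means that y is a delayed copy of x.
    """
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(x, y, lag))


def make_delayed_pair(n, delay, seed=0):
    """Hypothetical test data: white noise seen by two receivers.

    The second receiver observes the same source `delay` samples later.
    """
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n + delay)]
    x = source[delay:]   # receiver 1 hears the source first
    y = source[:n]       # receiver 2 hears it `delay` samples later
    return x, y
```

The lag returned by `estimate_delay` is the time difference between the two reception apparatus expressed in samples; dividing it by the sampling rate gives the time difference in seconds.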
For example, Japanese Patent Laid-Open No. Hei 11-83982 discloses a sound source direction detection apparatus wherein a delay time corresponding to an assumed sound source direction angle between outputs of two microphones selected from among a plurality of microphones is provided to determine cross correlation coefficients and a value obtained by adding the cross correlation coefficients is displayed with respect to the sound source direction angle. With the above-described sound source direction detection apparatus, the incoming direction of sound can be detected accurately irrespective of the type of sound wave even when the SN (signal to noise) ratio is low.
Japanese Patent Laid-Open No. Hei 11-304906 discloses a sound source position estimation method wherein cross correlation functions of signals received by a plurality of microphones are calculated for all pairs of the microphones and time differences which provide maximum values of the cross correlation functions are set as preliminary estimated time differences, and then time differences which provide a maximum power of a delayed sum regarding all of the microphones are searched for around the preliminary estimated time differences and set as estimated time differences and the position of the sound source is calculated based on the estimated time differences. A time difference corresponds to a directional angle of a sound source. By providing delays to the individual microphones and adding them, the reception sensitivity in a particular direction can be raised. The sound source position estimation method is superior in noise-resisting property and requires a comparatively small amount of arithmetic operation.
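The correspondence between a time difference and a directional angle mentioned above may be sketched as follows for a single pair of microphones. The sketch assumes a far-field (plane wave) source, an angle measured from the direction perpendicular to the line joining the two microphones, and a speed of sound of 343 m/s; these assumptions and the function names are illustrative and are not taken from the cited publication.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def angle_to_delay(angle_deg, spacing_m, c=SPEED_OF_SOUND):
    """Time difference (s) between two microphones for a far-field source."""
    return spacing_m * math.sin(math.radians(angle_deg)) / c


def delay_to_angle(delay_s, spacing_m, c=SPEED_OF_SOUND):
    """Directional angle (degrees) corresponding to a time difference (s)."""
    ratio = c * delay_s / spacing_m
    # Measurement error can push the ratio slightly outside [-1, 1]; clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

Because a single pair of microphones yields only one angle, a position estimate requires time differences from several pairs at different locations, as in the method described above.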
Japanese Patent No. 2982766 discloses a method and an apparatus wherein, from a plurality of audio signals obtained by a plurality of microphones, code time series of polarities extracted from the signals themselves or from the signals after being whitened by inverse filtering using autoregression coefficients calculated from the signals are produced, and cross correlation functions of the code time series are calculated. Then, normalized powers are calculated from the cross correlation functions and a time average of the normalized powers is calculated, and then the sound source direction is estimated based on the time average.
The prior art apparatus and methods described above, however, cannot estimate the direction or directions or the position or positions of a signal source or sources sufficiently where significant noise is present or where a comparatively great number of signal sources generate signals at a time.
In the method of Japanese Patent Laid-Open No. Hei 11-304906, as described above, time differences which provide maximum values of cross correlation functions are set as preliminary estimated time differences, and then time differences which provide a maximum power of a delayed sum regarding all of the microphones are searched for around the preliminary estimated time differences and the position of the sound source is calculated based on the estimated time differences. If the method is applied to estimation of the directions or the positions of signal sources in a situation wherein a plurality of signal sources are present, it is necessary to determine preliminary estimated time differences corresponding to the individual signal sources from the cross correlation functions and then determine time differences which provide a maximum power of the delayed sum in the proximity of the individual preliminary estimated time differences. Therefore, the amount of calculation required for the search increases in proportion to the number of the signal sources.
Meanwhile, in the method disclosed in Japanese Patent No. 2982766, in order to reduce the hardware scale of the apparatus, cross correlation functions of the signals themselves are not calculated, but the direction of a signal source is estimated based on cross correlation functions of code time series only of polarities extracted from the signals themselves or from the signals after whitening.
In the method which involves the extraction only of polarities of the signals themselves, where noise of a low frequency having a comparatively high level is included in the received signals, the extracted code time series exhibits successive appearances of −1 or +1 over a period of approximately one half the period of the noise. Accordingly, the code time series corresponds not to the signal of a sound source but to the low frequency noise, and therefore, the direction of the sound source cannot be determined from the cross correlation functions of the code time series.
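The effect described above may be illustrated by the following sketch, in which a weak source signal of short period is mixed with strong noise of long period and the length of the longest run of identical polarity codes is measured. The sinusoidal test signals, their amplitudes and periods, and the function names are assumptions made only for this illustration.

```python
import math


def polarity_codes(signal):
    """Code time series: +1 for non-negative samples, -1 for negative ones."""
    return [1 if s >= 0.0 else -1 for s in signal]


def longest_run(codes):
    """Length of the longest run of identical consecutive codes."""
    if not codes:
        return 0
    best = run = 1
    for prev, cur in zip(codes, codes[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def tone(n, period, amplitude):
    """Hypothetical test signal: a sinusoid sampled midway between its zeros."""
    return [amplitude * math.sin(2.0 * math.pi * (k + 0.5) / period)
            for k in range(n)]


# Weak source with a period of 8 samples, strong noise with a period of 200.
source = tone(400, 8, 0.2)
noise = tone(400, 200, 1.0)
mixed = [s + v for s, v in zip(source, noise)]

clean_run = longest_run(polarity_codes(source))
noisy_run = longest_run(polarity_codes(mixed))
```

With these assumed signals, the codes of the source alone change sign every half period of the source, whereas the codes of the mixture stay constant for most of each half period of the noise, so that the code time series carries little information about the source.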
Meanwhile, where the method wherein the polarities of the signals after whitening are extracted is utilized, a characteristic unique to the codes from the sound source included in the received signals is lost through the whitening process. Therefore, the cross correlation functions are influenced significantly by noise, and consequently, the estimation performance for the sound source direction is deteriorated. It is to be noted that a whitening method is disclosed, for example, in “Improvement of the performance of cross correlation method for identifying aircraft noise with pre-whitening of signals”, The Journal of the Acoustical Society of Japan, vol. 13, No. 4, pp. 241–252, July 1992. This method, however, is directed to noise of an aircraft measured in the proximity of an airfield.
Japanese Patent No. 2985982 discloses a sound source direction estimation method wherein outputs of two microphones are band-divided first and then powers of the signals are determined for the individual frequency bands, and peak values of the powers are held and logarithms of the peak values are calculated. Then, cross correlation functions of the time differentiation processed signals for the individual frequency bands are determined and then weighted averaged, and then the sound source direction is calculated from the time differences with which the weighted averages take maximum values.
With the method just described, even where many reflected sounds are present, the direction of a sound source can be estimated based on the direct sound. According to the method, however, a great amount of calculation is required for logarithmic calculation for the frequency bands of the input signals from the microphones, and hardware of a large scale is required to perform such calculation. Further, where the power of a signal is comparatively low or where the power of noise is comparatively high, it is sometimes impossible to reduce the influence of reflected sound through the logarithmic processing. For example, if it is assumed that the power of dark noise is 1 and direct sound with the power of 2 arrives and then reflected sound whose power is 3 arrives, then the value after the logarithmic processing is 3.0 dB for the direct sound and 4.7 dB for the reflected sound. Accordingly, although the magnitude of the reflected sound before the logarithmic processing is 1.5 times that of the direct sound, the magnitude of the reflected sound after the logarithmic processing is 1.57 times, and the influence of the reflected sound is not reduced numerically.
It is to be noted here that the “reflected sound” includes both of the continuing direct sound and reflected sound coming thereto additionally. Usually, the power of the reflected sound itself does not become higher than the power of the direct sound. In a special situation where a directive microphone is used, if it is assumed that the direction in which the microphone exhibits a higher directivity coincides with or is similar to the incoming direction of the reflected sound while the direction in which the microphone exhibits a lower directivity coincides with or is similar to the incoming direction of the direct sound, then the power of the reflected sound itself may possibly be higher than the power of the direct sound.
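The decibel figures in the example of direct sound and reflected sound can be checked with the following short sketch, which assumes that the logarithmic processing is a conversion of power to decibels relative to the dark noise power of 1. The variable and function names are illustrative only.

```python
import math


def to_db(power, reference=1.0):
    """Power level in decibels relative to `reference`."""
    return 10.0 * math.log10(power / reference)


dark_noise, direct, reflected = 1.0, 2.0, 3.0

direct_db = to_db(direct, dark_noise)        # level of the direct sound
reflected_db = to_db(reflected, dark_noise)  # level of the reflected sound

linear_ratio = reflected / direct            # ratio before the logarithm
log_ratio = reflected_db / direct_db         # ratio after the logarithm
```

Computed without intermediate rounding, the ratio after the logarithmic processing is about 1.58, slightly above the figure obtained from the rounded decibel values, and in either case it exceeds the ratio of 1.5 before the processing.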
A technique of suppressing, in a situation in which much noise is present or a plurality of signal sources generate signals at a time, the influence of a signal from another signal source and emphasizing or separating a signal from a certain signal source is utilized in order to raise the recognition performance of a speech recognition apparatus in the case where the signal is an audio signal, or to raise the identification performance of a signal source identification apparatus which compares a received signal with signals measured in advance for possible kinds of signal sources to specify the signal source.
In the field of such signal emphasis and separation techniques as described above, it is a common technique to receive a signal by means of a plurality of reception apparatus, estimate delay times of the individual reception apparatus which depend upon the direction or the position of a signal source and the positions of the reception apparatus based on cross correlation functions between the received signals and so forth, use the estimated delay times to delay the received signals, and add the delayed received signals to emphasize or separate the signal of the signal source.
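The delay-and-add emphasis just described may be sketched as follows. The sketch assumes that the delay of each reception apparatus is already known as a non-negative whole number of samples and that the noise at each reception apparatus is independent; the receiver model `simulate_receivers` and the other function names are hypothetical.

```python
import random


def delay_and_sum(channels, delays):
    """Align channel i by skipping its first delays[i] samples, then average."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[k + d] for ch, d in zip(channels, delays)) / len(channels)
        for k in range(n)
    ]


def mean_squared_error(a, b):
    """Mean squared difference over the common length of a and b."""
    n = min(len(a), len(b))
    return sum((a[k] - b[k]) ** 2 for k in range(n)) / n


def simulate_receivers(source, delays, noise_std, seed=0):
    """Hypothetical receiver model.

    Channel i is the source delayed by delays[i] samples plus independent
    Gaussian noise of standard deviation noise_std.
    """
    rng = random.Random(seed)
    channels = []
    for d in delays:
        delayed = [0.0] * d + list(source)
        channels.append([s + rng.gauss(0.0, noise_std) for s in delayed])
    return channels
```

Under these assumptions the source components add coherently after alignment while the independent noise components partially cancel, so that the averaged output is closer to the source than any single channel.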
For example, Japanese Patent Laid-Open No. Hei 5-95596 discloses a noise reduction apparatus wherein audio signals received by a plurality of microphones are decomposed into signals of different frequency bands by respective band-pass filters and cross correlation functions between the different frequency band signals are determined, and time differences of the audio signals are detected from the cross correlation functions and then the audio signals are delayed based on the detection time differences and then added. With the noise reduction apparatus, by combining input audio signals, an audio signal can be emphasized and extracted while suppressing noise thereby to improve the SN ratio.
Japanese Patent Laid-Open No. Hei 9-251299 discloses a microphone array inputting type speech recognition apparatus and method wherein an input signal from a microphone array including a plurality of microphones is frequency-divided by a band-pass filter bank to obtain band-pass waveforms of the individual frequency bands for the different microphone channels, and band-pass power distributions are determined individually for assumed sound source positions or directions by a minimum variance method or the like. The band-pass power distributions of the different frequency bands are unified with regard to all of the frequency bands to estimate the sound source position or direction, and pertaining band-pass powers are extracted as audio parameters from the band-pass power distributions of the individual frequency bands based on the estimated sound source position or direction to perform speech recognition.
Japanese Patent No. 2928873 discloses a signal processing apparatus wherein signals of different frequency bands of wave motions detected by a plurality of wave motion collection circuits or time level variations of the signals are determined as signal components, and cross correlation functions of the signal components of the individual frequency bands are calculated and time differences between the signal components whose correlation value exceeds a preset threshold value are determined. Then, the signal components of the frequency bands whose time difference is included within a predetermined delay time are extracted, and the wave motion components arriving from a particular position corresponding to the predetermined delay time are outputted for the individual frequency bands, or such signal components are added to output a wave motion component arriving from the particular position.
However, the noise reduction apparatus disclosed in Japanese Patent Laid-Open No. Hei 5-95596 is designed so as to be utilized principally for a car telephone, and the sound source whose signal is to be emphasized is the voice of the driver of the automobile and it is assumed that a single signal source is involved. In other words, the noise reduction apparatus is not directed to emphasis of signals from a plurality of sound sources. Further, since the noise reduction apparatus presumes a rough position of the single sound source in advance, it is not directed to a process regarding a sound source positioned at an arbitrary position. Further, since the emphasis signal is obtained by delaying a signal prior to decomposition into signals of different frequency bands as it is, if comparatively high noise is included in a certain frequency band, then the noise cannot be suppressed sufficiently.
Meanwhile, in the microphone array inputting type speech recognition apparatus and method disclosed in Japanese Patent Laid-Open No. Hei 9-251299, the band-pass power distributions of the individual assumed sound source positions or directions are determined by a minimum variance method or the like. However, the minimum variance method is one of the methods which can be applied only where the number of sound sources is smaller than the total number of microphones included in the microphone array, and it involves a great amount of calculation. If a delay sum method is used in place of the minimum variance method, then the amount of calculation can be reduced by a certain amount, but this deteriorates the estimation accuracy of the sound source position or direction. Accordingly, where the number of signal sources and the positions or the directions of the signal sources are unknown, it is necessary to prepare a number of microphones which can be considered sufficient and either apply the minimum variance method or the like while the assumed signal source position or direction is changed, or set a plurality of presumed signal source positions or directions and apply the minimum variance method in parallel. Expensive or very large-scale hardware is required to realize such processing as just described.
The signal processing apparatus disclosed in Japanese Patent No. 2928873 is designed to extract or separate a wave motion component arriving from a predetermined position, but is not directed to extraction, separation or emphasis of a signal from a direction or a position of a signal source obtained by estimation where the direction or the position of the signal source is unknown and the incoming direction of the signal is unknown.
In summary, with the prior art apparatus and methods described above, where the direction or the position of a signal source is not known in advance and much noise is present or a comparatively great number of signal sources generate signals at a time and particularly the number of signal sources is greater than the number of reception apparatus, the direction or directions or the position or positions of one signal source or a plurality of signal sources cannot be estimated with a sufficient degree of accuracy, and a signal or signals from one signal source or a plurality of signal sources cannot be emphasized sufficiently.