The invention relates to a method of separating/extracting the signal of at least one sound source from a complex signal that is a mixture of acoustic signals produced by a plurality of sound sources, such as voice sources and various environmental noise sources; to an apparatus for separating a sound source that is used to implement the method; and to a recorded medium storing a program for carrying out the method on a computer.
An apparatus for separating a sound source of this kind is used in a variety of applications, for example a sound collector in a television conference system, a sound collector for transmitting a voice uttered in a noisy environment, or a sound collector in a system that distinguishes between types of sound sources.
A conventional technology for separating a sound source estimates the fundamental frequency of each signal in the frequency domain, extracts the corresponding harmonic structure, and collects the components belonging to each source to resynthesize its signal.
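The conventional pipeline just described (estimate a fundamental frequency, pick out its harmonics, resynthesize) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular prior-art document: the fundamental is estimated with a simple harmonic-sum score over DFT bins (one of many possible estimators), and the names `estimate_f0_bin`, `extract_harmonics` and the parameter `num_harmonics` are hypothetical. A direct O(N²) DFT keeps the example dependency-free.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform (dependency-free, for illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part, assuming a conjugate-symmetric spectrum."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def estimate_f0_bin(mag, num_harmonics=5, min_bin=2):
    """Harmonic-sum estimate: the bin whose first few multiples carry the most magnitude."""
    half = len(mag) // 2
    best_bin, best_score = min_bin, -1.0
    for b in range(min_bin, half):
        score = sum(mag[h * b] for h in range(1, num_harmonics + 1) if h * b < half)
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin


def extract_harmonics(x, num_harmonics=5):
    """Keep only the bins at multiples of the estimated f0, then resynthesize."""
    n = len(x)
    spectrum = dft(x)
    f0_bin = estimate_f0_bin([abs(c) for c in spectrum], num_harmonics)
    kept = [0j] * n
    for h in range(1, num_harmonics + 1):
        k = h * f0_bin
        if k < n // 2:
            kept[k] = spectrum[k]
            kept[n - k] = spectrum[n - k]  # mirrored bin keeps the output real
    return f0_bin, idft(kept)


if __name__ == "__main__":
    n, fs = 256, 8000  # 31.25 Hz per DFT bin
    # Synthetic "voiced" signal: fundamental at bin 8 with two harmonics
    voiced = [sum(math.cos(2 * math.pi * h * 8 * t / n) / h for h in (1, 2, 3))
              for t in range(n)]
    # Non-harmonic interferer at bin 37
    noise = [0.8 * math.cos(2 * math.pi * 37 * t / n) for t in range(n)]
    f0_bin, out = extract_harmonics([v + e for v, e in zip(voiced, noise)])
    print(f0_bin * fs / n)  # 250.0
```

Because only the multiples of the estimated fundamental survive, any component without that structure is discarded, and a wrong f0 estimate passes the wrong bins through; this is why such methods are tied to harmonic signals.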
However, this technology suffers from (1) the problem that separation is possible only for signals whose harmonic structure resembles that of vowel sounds in speech or of musical tones; (2) the difficulty of separating sound sources in real time, because estimating fundamental frequencies generally requires a long processing time; and (3) insufficient separation accuracy, because erroneous estimates of the harmonic structure let frequency components from other sound sources leak into the extracted signal, where they are perceived as noise.
A conventional sound collector in a communication system also suffers from howling, in which the voice from the remote end, reproduced by a loudspeaker, is mixed with the voice on the collector side. Known howling-suppression techniques include suppressing unnecessary components on the basis of an estimate of the harmonic structure of the signal to be collected, and forming a microphone array whose directivity is directed toward the sound source to be collected.
Because the former technique relies on harmonic structure, it is effective only when the signal to be collected has a pronounced pitch structure while the signals to be suppressed have a flat frequency response. Its howling-suppression effect is therefore reduced in a communication system in which both the source to be collected and the remote-end source are voices. The latter technique, using a microphone array, requires a large number of microphones to achieve satisfactory directivity, which makes a compact arrangement difficult. In addition, the sharper the directivity, the more severely performance degrades when the sound source moves, with a corresponding loss of howling suppression.
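The trade-off for the microphone-array approach can be illustrated with a delay-and-sum beamformer, which is one common way to give an array directivity; the text does not say which array method is meant, so this is an assumed stand-in. The sketch computes the normalized far-field response of a uniform linear array; the function name, the 5 cm spacing, the 1 kHz test frequency and the speed of sound are assumptions chosen for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 degrees C)


def delay_and_sum_response(num_mics, spacing_m, freq_hz, steer_deg, source_deg):
    """Normalized far-field response of a uniform linear delay-and-sum array.

    The array is steered toward steer_deg and a plane wave arrives from
    source_deg (both measured from broadside). Returns a value in [0, 1],
    where 1 means the wave passes with no loss.
    """
    k = 2 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    # Residual phase step between adjacent microphones after steering delays
    psi = k * spacing_m * (math.sin(math.radians(source_deg))
                           - math.sin(math.radians(steer_deg)))
    total = sum(cmath.exp(1j * m * psi) for m in range(num_mics))
    return abs(total) / num_mics


if __name__ == "__main__":
    # Array steered to broadside; the source has drifted to 30 degrees
    for mics in (2, 4, 8):
        print(mics, round(delay_and_sum_response(mics, 0.05, 1000.0, 0.0, 30.0), 3))
```

In this example, adding microphones at a fixed spacing sharpens the beam, so a source that drifts 30 degrees off the steering direction is attenuated more strongly by the 8-microphone array than by the 2-microphone one, which mirrors the two drawbacks described above.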
To detect the zone in which a sound source that is uttering a voice (a speaking source) is located within a space containing a plurality of sound sources, a known technique uses a plurality of microphones and locates the source from the differences in the times its acoustic signal takes to reach the individual microphones. The arrival-time difference for each pair of microphones is determined from the peak of the cross-correlation between their output voice signals, and the location of the sound source is derived from these differences.
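The cross-correlation step for one microphone pair can be sketched as below: the arrival-time difference is taken as the lag that maximizes the cross-correlation. This is a minimal illustration, not the cited technique's exact procedure; the function names, the white-noise test signal, the 3-sample delay and the 16 kHz sampling rate are assumptions.

```python
import random


def cross_correlation(x, y, lag):
    """r_xy(lag) = sum over n of x[n] * y[n + lag], using only overlapping samples."""
    return sum(x[n] * y[n + lag] for n in range(len(x)) if 0 <= n + lag < len(y))


def estimate_delay(x, y, max_lag):
    """Lag (in samples) at which y best matches x, i.e. y[n] is close to x[n - lag]."""
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cross_correlation(x, y, lag))


if __name__ == "__main__":
    rng = random.Random(0)
    source = [rng.gauss(0.0, 1.0) for _ in range(200)]
    mic1 = source
    mic2 = [0.0] * 3 + source[:-3]  # same signal, arriving 3 samples later
    fs = 16000  # assumed sampling rate in Hz
    lag = estimate_delay(mic1, mic2, max_lag=10)
    print(lag, "samples =", lag / fs, "s")
```

The direct computation also makes the cost visible: every candidate lag needs roughly one multiply-add per sample, so one pair and one frame cost about (2 × max_lag + 1) × N operations.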
Unfortunately, this detection technique requires a long computation time, because each cross-correlation function must be computed by multiplications and additions over a data length twice that of the data already read.
A histogram is an effective way to detect the peak among the cross-correlations. However, a histogram formed along the time axis introduces a time delay. To avoid this delay, one could divide the signal into frequency bands and form the histogram across all the bands. However, a cross-correlation function can only be formed from a signal whose bandwidth exceeds a certain value, so the signal can be divided into several bands at most. The histogram must therefore still be formed along the time axis from a signal segment of some length, which makes it difficult to detect the location of the sound source in real time.
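The histogram step can be illustrated as a vote count. Each vote is a delay estimate in samples, taken from one band's cross-correlation peak or from one time frame; the vote values, the function name `histogram_peak` and the `min_votes` threshold are hypothetical. With only a few bands the votes are too sparse to give a reliable peak, so further votes must be collected over later frames, which is where the time delay comes from.

```python
from collections import Counter


def histogram_peak(delay_votes, min_votes=3):
    """Most frequent delay among the votes, or None if it has fewer than min_votes."""
    if not delay_votes:
        return None
    delay, votes = Counter(delay_votes).most_common(1)[0]
    return delay if votes >= min_votes else None


if __name__ == "__main__":
    band_votes = [3, 3, -7, 5]              # one hypothetical estimate per band
    print(histogram_peak(band_votes))       # None: too few votes for a reliable peak
    accumulated = band_votes + [3, 2, 3]    # hypothetical votes from later frames
    print(histogram_peak(accumulated))      # 3
```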
A technique for estimating the direction of a sound source by dividing the outputs of a pair of microphones into a plurality of bands is disclosed in Japanese Laid-Open Patent Application No. 87,903/93. That technique requires computing a cross-correlation between the signals in each pair of corresponding bands, and therefore also suffers from a long processing time.
It is an object of the invention to provide a method and an apparatus capable of separating/extracting the acoustic signal of a sound source even when that signal has no harmonic structure, so that a sound source can be separated regardless of its type and in real time, and a recorded medium storing a program therefor.
It is another object of the invention to provide a method and an apparatus for separating a sound source with high accuracy and a reduced level of noise, and a recorded medium storing a program therefor.
It is a further object of the invention to provide a method and an apparatus for separating a sound source that suppress howling to a sufficiently low level for any signal, and a recorded medium storing a program therefor.
It is still another object of the invention to provide a method and an apparatus for detecting, in real time, the zone in which a sound source is located, and a recorded medium storing a program therefor.