1. Field of the Invention
The present invention relates to acoustic signal processing, and particularly to estimation of the number of sources of acoustic waves propagating through a medium, the directions of the sound sources, the frequency components of the acoustic waves coming from the sound sources, and the like.
2. Description of the Related Art
Recently, sound source localization and separation systems have been proposed in the field of robot audition research. In such a system, the number and the directions of plural target sound sources are estimated in a noisy environment (sound source localization), and each of the source sounds is separated and extracted (sound source separation). For example, F. Asano, “dividing sounds” Instrument and Control vol. 43, No. 4, p 325-330 (2004) discloses a method in which N source sounds are observed by M microphones in an environment in which background noise exists, a spatial correlation matrix is generated from data obtained by performing short-time Fourier transform (FFT) processing on each microphone output, and the main eigenvalues, which have larger values, are determined by eigenvalue decomposition, thereby estimating the number N of sound sources as the number of main eigenvalues. This method utilizes the characteristic that a signal having a directional property, such as a source sound, is mapped to the main eigenvalues, while the background noise, which has no directional property, is mapped to all the eigenvalues.
Namely, the eigenvectors corresponding to the main eigenvalues become basis vectors of the signal subspace spanned by the signals from the sound sources, and the eigenvectors corresponding to the remaining eigenvalues become basis vectors of the noise subspace spanned by the background noise signal. A position vector of each sound source can be searched for by applying the MUSIC method using the basis vectors of the noise subspace, and the sound from each sound source can be extracted by a beamformer whose directivity is steered toward the direction obtained as a result of the search.
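The eigenvalue-based source counting and the MUSIC search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited reference: it assumes a single narrowband frequency bin, a uniform linear array with half-wavelength spacing, simulated uncorrelated sources, and a hypothetical threshold rule (`ratio`) for deciding which eigenvalues count as main eigenvalues. All function names are illustrative.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    k = np.arange(n_mics)
    phase = 2j * np.pi * spacing_wavelengths * k * np.sin(np.deg2rad(theta_deg))
    return np.exp(phase)

def complex_noise(shape, rng):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def simulate_snapshots(thetas_deg, n_mics, n_snapshots, noise_std, rng):
    """Simulated data for one frequency bin: X = A S + noise."""
    A = np.stack([steering_vector(t, n_mics) for t in thetas_deg], axis=1)  # (M, N)
    S = complex_noise((len(thetas_deg), n_snapshots), rng)
    return A @ S + noise_std * complex_noise((n_mics, n_snapshots), rng)

def count_sources(eigvals, ratio=10.0):
    """Count eigenvalues larger than `ratio` times the smallest one.
    The threshold rule is a hypothetical choice for this sketch."""
    return int(np.sum(eigvals > ratio * eigvals.min()))

def music_spectrum(R, n_sources, grid_deg):
    """MUSIC pseudo-spectrum built from the noise-subspace eigenvectors of R."""
    n_mics = R.shape[0]
    _, eigvecs = np.linalg.eigh(R)                 # eigenvalues in ascending order
    noise_basis = eigvecs[:, : n_mics - n_sources]  # smallest M - N eigenvectors
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        proj = noise_basis.conj().T @ steering_vector(theta, n_mics)
        spectrum[i] = 1.0 / np.real(np.vdot(proj, proj))
    return spectrum

rng = np.random.default_rng(0)
X = simulate_snapshots([-20.0, 35.0], n_mics=4, n_snapshots=2000,
                       noise_std=0.1, rng=rng)
R = X @ X.conj().T / X.shape[1]        # spatial correlation matrix (M x M)
eigvals = np.linalg.eigvalsh(R)
n_est = count_sources(eigvals)         # estimated number of sound sources

grid = np.arange(-90.0, 90.5, 0.5)
P = music_spectrum(R, n_est, grid)
# Keep the n_est largest local maxima of the pseudo-spectrum.
peaks = [i for i in range(1, len(grid) - 1) if P[i] > P[i - 1] and P[i] >= P[i + 1]]
top = sorted(peaks, key=lambda i: P[i], reverse=True)[:n_est]
est = sorted(float(grid[i]) for i in top)
print("estimated source count:", n_est)
print("estimated directions (deg):", est)
```

Note that `noise_basis` has M - N columns, so the search is only possible while the estimated number of sources stays below the number of microphones.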
However, the noise subspace cannot be defined when the number N of sound sources is equal to the number M of microphones, and undetectable sound sources exist when the number N of sound sources exceeds the number M of microphones. Therefore, the number of estimable sound sources is lower than the number M of microphones. This method places no particularly severe limitation on the sound source, and it is mathematically simple. However, in order to deal with many sound sources, it has the limitation that the number of microphones must be higher than the number of sound sources.
A method in which the sound source localization and the sound source separation are performed using a pair of microphones is described in K. Nakadai et al., “real time active chase of person by hierarchy integration of audio-visual information” Japan Society for Artificial Intelligence AI Challenge Kenkyuukai, SIG-Challenge-0113-5, p 35-42, June 2001. This method focuses on the harmonic structure (a frequency structure including a fundamental wave and its harmonics) unique to a sound generated through a tube (articulator), such as the human voice. Harmonic structures having different fundamental frequencies are detected from data obtained by performing the Fourier transform on the sound signals obtained by the microphones. The number of detected harmonic structures is taken as the number of speakers, the direction of each harmonic structure is estimated together with a certainty factor using the interaural phase difference (IPD) and the interaural intensity difference (IID), and each source sound is estimated from the harmonic structure itself. Because plural harmonic structures can be detected from the Fourier transform, this method can deal with a number of sound sources that is not lower than the number of microphones. However, since the estimation of the number of sound sources, the directions, and the source sounds is based on the harmonic structure, the sound sources that can be dealt with are limited to sounds having a harmonic structure, such as the human voice, and the method cannot be adapted to various other sounds.
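How a phase difference between two microphones relates to a direction can be illustrated with a simple free-field, far-field model. The sketch below is an assumption-laden illustration and not the procedure of the cited reference, which additionally uses the IID and a certainty factor: it assumes two microphones separated by `mic_distance_m` with no head or body between them, a plane wave, a speed of sound of 343 m/s, and a phase difference that is not wrapped (that is, the microphone spacing is less than half a wavelength at the frequency in question). The function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C

def direction_to_ipd(theta_deg, freq_hz, mic_distance_m, c=SPEED_OF_SOUND):
    """Phase difference (rad) between two microphones for a far-field plane wave."""
    delay = mic_distance_m * np.sin(np.deg2rad(theta_deg)) / c
    return 2.0 * np.pi * freq_hz * delay

def ipd_to_direction_deg(ipd_rad, freq_hz, mic_distance_m, c=SPEED_OF_SOUND):
    """Invert the free-field model: sin(theta) = c * ipd / (2 * pi * f * d)."""
    sin_theta = c * ipd_rad / (2.0 * np.pi * freq_hz * mic_distance_m)
    # Clip so that measurement noise cannot push the argument outside [-1, 1].
    return float(np.rad2deg(np.arcsin(np.clip(sin_theta, -1.0, 1.0))))

# Round trip for one harmonic component at 500 Hz, microphones 0.2 m apart.
ipd = direction_to_ipd(30.0, freq_hz=500.0, mic_distance_m=0.2)
theta = ipd_to_direction_deg(ipd, freq_hz=500.0, mic_distance_m=0.2)
print("recovered direction (deg):", theta)
```

In a harmonic-structure approach, an estimate of this kind would be computed for each frequency component belonging to one harmonic structure and then combined into a single direction for that structure.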
Thus, the conventional methods face an antinomy: (1) when no limitation is placed on the sound source, the number of sound sources cannot be equal to or higher than the number of microphones, and (2) when the number of sound sources is equal to or higher than the number of microphones, a limitation such as the assumption of a harmonic structure is placed on the sound source. A system that can deal with a number of sound sources not lower than the number of microphones without limiting the sound source has not yet been established.