1. Field of the Invention
The present invention relates to an apparatus for processing acoustic signals, and in particular, to an apparatus capable of estimating the number of sources of sound waves propagating through a medium, the directions of the sources, and the frequency components of the sound waves arriving from the sources.
2. Related Art
Over recent years, in the field of robot audition research, methods have been proposed for estimating a number of a plurality of object source sounds in a noise environment and directions thereof (sound source specification), and separating and extracting the respective source sounds (sound source separation).
For instance, according to Asano, Futoshi, “Separating Sound”, Journal of the Society of Instrument and Control Engineers, Vol. 43, No. 4, 325-330, April 2004 described below, a method is presented in which, in a given environment with background noise, “N” number of source sounds are observed using “M” number of microphones, a spatial correlation matrix is generated from data obtained by performing fast Fourier transform (FFT) processing on the respective microphone outputs, and obtaining major eigenvalues with large values by performing eigenvalue decomposition on the matrix in order to estimate “N” number of sound sources in the form of the number of the major eigenvalues. This method is based on the characteristics that directional signals such as a source sound are mapped on major eigenvalues while nondirectional background noise is mapped on all eigenvalues. Eigenvectors corresponding to major eigenvalues become basis vectors of a signal partial space spread by signals from the sound sources, and the eigenvectors corresponding to the remaining eigenvalues become basis vectors of a noise partial space spread by background noise signals. By applying the MUSIC method using the basis vectors in the noise partial space, position vectors of the respective sound sources may be retrieved, and sound from the sound sources may be extracted by a beam former provided with directivities in the retrieved directions. However, when the number of sound sources “N” is equivalent to the number of microphones “M”, a noise partial space cannot be defined. In addition, when the number of sound sources “N” exceeds the number of microphones “M”, undetectable sound sources will exist. Accordingly, the number of sound sources which may be estimated will never equal or exceed the number of microphones “M”. While this method does not particularly impose any significant limitations regarding sound sources and is also mathematically aesthetic, the method does impose a limitation in that addressing a large number of sound sources will require a greater number of microphones.
Additionally, for instance, according to Nakadai, Kazuhiro, et al., “Real-Time Active Human Tracking by Hierarchical Integration of Audition and Vision”, The Japanese Society for Artificial Intelligence AI Challenge Study Group, SIG-Challenge-0113-5, 35-42, June 2001 described below, a method is proposed in which sound source specification and sound source separation are performed using a single pair of microphones. This method focuses on a harmonic structure (a frequency structure made up of a basic frequency and harmonics thereof) that is unique to sound produced through a tube (articulator) such as a human voice. By detecting harmonic structures with different basic frequencies from data obtained by Fourier-transforming acoustic signals captured by microphones, the method deems the number of detected harmonic structures to be the number of speakers, and estimates the directions of the speakers with belief factors using an interaural phase difference (IPD) and interaural intensity difference (IID) of each harmonic structure to estimate each source sound from the harmonic structures themselves. By detecting a plurality of harmonic structures from Fourier-transformed data, this method is capable of processing a greater number of sound sources than microphones. However, since a fundamental portion of the estimation of the number and directions of sound sources and source sounds is based on harmonic structures, the method is only capable of handling sound sources that have harmonic structures such as a human voice, and is unable to sufficiently respond to various sounds.
As described above, conventional techniques are faced with warring problems in that (1) if no limitations are imposed on sound sources, the number of sound sources may not equal or exceed the number of microphones, and (2) when arranging the number of sound sources to equal or exceed the number of microphones, limitations such as assumption of a harmonic structure must be imposed on sound sources. As a result, no methods have been established which is capable of handling a number of sound sources that exceeds the number of microphones without limiting sound sources.