1. Field of the Invention
The present invention relates to a method for extracting and recovering target speech from mixed signals, which include the target speech and noise observed in a real-world environment, by utilizing sound sources' locational information.
2. Description of the Related Art
Speech recognition technology has improved significantly in recent years, and speech recognition engines now achieve extremely high recognition rates in ideal environments, i.e., with no surrounding noise. However, it is still difficult to attain a desirable recognition rate in household or office environments, where sounds of daily activities and the like are present. To take advantage of the inherent capability of the speech recognition engine in such environments, pre-processing is needed to remove noise from the mixed signals and pass only the target speech, such as a speaker's speech, to the engine.
In this respect, Independent Component Analysis (ICA) is known to be a useful method. With this method, it is possible to separate the target speech from the observed mixed signals, which consist of the target speech and noise overlapping each other, without information on the transmission paths from the individual sound sources, provided that the sound sources are statistically independent.
In fact, it is possible to completely separate individual sound signals in the time domain if the target speech and the noise are mixed instantaneously, although there exist some problems such as amplitude ambiguity (i.e., output amplitude differs from its original sound source amplitude) and permutation (i.e., the target speech and the noise are switched with each other in the output). In a real-world environment, however, mixed signals are observed with time lags due to microphones' different reception capabilities, or with sound convolution due to reflection and reverberation, making it difficult to separate the target speech from the noise in the time domain.
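The amplitude ambiguity and the permutation described above for the instantaneous mixing case can be illustrated with a small numerical sketch. The source values, the 2x2 mixing matrix `A`, and the helper names `mix` and `inverse_2x2` are all hypothetical and chosen only for illustration. The sketch shows that a row-swapped and rescaled version of the ideal unmixing matrix separates the signals just as well, so the outputs cannot be tied to a particular source or amplitude without extra information.

```python
# Two hypothetical sources (values chosen only for illustration).
s1 = [1.0, -2.0, 0.5, 3.0]   # stands in for the target speech
s2 = [0.2, 0.1, -0.4, 0.3]   # stands in for the noise

# Hypothetical instantaneous mixing matrix: x = A s (no delays, no reverberation).
A = [[1.0, 0.5],
     [0.3, 1.0]]

def mix(M, u1, u2):
    """Apply a 2x2 matrix M to two equal-length signals, sample by sample."""
    y1 = [M[0][0] * a + M[0][1] * b for a, b in zip(u1, u2)]
    y2 = [M[1][0] * a + M[1][1] * b for a, b in zip(u1, u2)]
    return y1, y2

def inverse_2x2(M):
    """Return the inverse of a non-singular 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

# Observed mixed signals.
x1, x2 = mix(A, s1, s2)

# Ideal unmixing matrix, which recovers s1 and s2 exactly.
W = inverse_2x2(A)

# An equally valid separating matrix: rows of W swapped and rescaled.
# Independence alone cannot distinguish W_alt from W.
W_alt = [[ 2.0 * W[1][0],  2.0 * W[1][1]],
         [-0.5 * W[0][0], -0.5 * W[0][1]]]

# y1 is the noise scaled by 2.0, y2 is the speech scaled by -0.5:
# the outputs are separated, but switched and with changed amplitudes.
y1, y2 = mix(W_alt, x1, x2)
```

Both `W` and `W_alt` yield statistically independent outputs, which is why a separation criterion based only on independence leaves the output order and scale undetermined.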
For the above reason, when there are time lags and sound convolution, the separation of the target speech from the noise in mixed signals is performed in the frequency domain after, for example, the Fourier transform of the time-domain signals to the frequency-domain signals (spectra). However, for the case of processing superposed signals in the frequency domain, the amplitude ambiguity and the permutation occur at each frequency. Therefore, without solving these problems, meaningful signals cannot be obtained by simply separating the target speech from the noise in the mixed signals in the frequency domain and performing the inverse Fourier transform to get the signals from the frequency domain back to the time domain.
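The per-frequency ambiguity can be written out as follows. This is a sketch in an assumed notation that does not appear in the text above: s_i(t) denotes the source signals, x_j(t) the observed signals, h_ji(t) the impulse response from source i to microphone j, and S, X, and H their short-time Fourier transforms at frequency ω and frame k; W(ω) is the separation matrix estimated at frequency ω.

```latex
\begin{align}
x_j(t) &= \sum_i (h_{ji} * s_i)(t)
  &&\Longrightarrow&
  \mathbf{X}(\omega, k) &\approx \mathbf{H}(\omega)\,\mathbf{S}(\omega, k), \\
\mathbf{Y}(\omega, k) &= \mathbf{W}(\omega)\,\mathbf{X}(\omega, k)
  &&=&
  \mathbf{Y}(\omega, k) &= \mathbf{P}(\omega)\,\mathbf{D}(\omega)\,\mathbf{S}(\omega, k).
\end{align}
```

Here P(ω) is a permutation matrix and D(ω) is a diagonal scaling matrix. Because both may differ from one frequency to the next, applying the inverse Fourier transform to Y(ω, k) directly mixes components of different sources and distorts their spectra.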
In order to address these problems, several separation methods have been proposed to date. Among them, FastICA is characterized by its capability of sequentially separating signals from the mixed signals in descending order of non-Gaussianity. Since speech generally has higher non-Gaussianity than noise, this method is expected to reduce the permutation problem, because the signals corresponding to the speech are separated first and the signals corresponding to the noise are separated afterward.
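The idea of ordering signals by non-Gaussianity can be illustrated with excess kurtosis, which is one common measure of non-Gaussianity; the text above does not state which measure is used, so this choice is an assumption. In the sketch below, Laplacian-distributed samples stand in for speech and Gaussian samples stand in for noise. Both stand-ins, the sample count, and the function name `excess_kurtosis` are hypothetical.

```python
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; close to 0 for Gaussian data."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]
    m2 = sum(c * c for c in centered) / n    # second central moment
    m4 = sum(c ** 4 for c in centered) / n   # fourth central moment
    return m4 / (m2 * m2) - 3.0

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 20000

# Assumed stand-in for noise: Gaussian samples.
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Assumed stand-in for speech: Laplacian samples, generated as the
# difference of two independent exponential variates.
laplacian_like = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

k_noise = excess_kurtosis(gaussian_like)
k_speech = excess_kurtosis(laplacian_like)

# The heavier-tailed signal has the larger excess kurtosis, so a method that
# extracts components in descending order of non-Gaussianity would pick it first.
first_extracted = "speech-like" if k_speech > k_noise else "noise-like"
```

Real speech and real noise do not follow these distributions exactly, so the ordering is a tendency rather than a guarantee.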
Also, the amplitude ambiguity problem has been addressed by Ikeda et al. through the introduction of the split spectrum concept (see, for example, N. Murata, S. Ikeda and A. Ziehe, “An Approach To Blind Source Separation Based On Temporal Structure Of Speech Signals”, Neurocomputing, vol. 41, Issue 1-4, pp. 1–24, 2001; S. Ikeda and N. Murata, “A Method Of ICA In Time Frequency Domain”, Proc. ICA '99, pp. 365–371, Aussois, France, January 1999).
To address the permutation problem, a method has additionally been proposed in which the separation weights estimated at adjacent frequencies are used as the initial values of the separation weights. However, this method is not effective in real-world environments because it is not based on a priori information. It is also difficult with this method to identify the target speech among the separated output signals; an a posteriori judgment is therefore needed for the identification, which slows down the recognition process.