Despite many years of intensive effort by a large research community, automatic separation of competing or simultaneous speakers remains an unsolved, outstanding problem. Such competing or simultaneous speech commonly occurs in telephony or broadcast situations where either two speakers, or a speaker and some other sound (such as ambient noise), are received simultaneously over the same channel. To date, efforts that exploit speech-specific information to reduce the effects of multiple-speaker interference have been largely unsuccessful. For example, the assumptions underlying past blind signal separation approaches often do not hold in normal speaking and telephony environments.
The extreme difficulty that automated systems face in dealing with competing sound sources stands in stark contrast to the remarkable ease with which humans and most animals perceive and parse complex, overlapping auditory events in their surrounding world of sounds. This facility, known as auditory scene analysis, has recently been the focus of intensive research and mathematical modeling, which has yielded fascinating insights into the properties of the acoustic features and cues that humans automatically utilize to distinguish between simultaneous speakers.
A related yet more general problem occurs when the competing sound source is not speech, but is instead arbitrary yet distinct from the desired sound source. For example, when recording on location for a movie or news program, the sonic environment is often not as quiet as would be ideal. During sound production, it would be useful to have available methods that reduce undesired background or ambient sounds while preserving desired sounds, such as dialog.
The problem of speaker separation is also called “co-channel speech interference.” One prior art approach to the co-channel speech interference problem is blind signal separation (BSS), which approximately recovers unknown signals or “sources” from their observed mixtures. Typically, such mixtures are acquired by a number of sensors, where each sensor receives a different combination of the source signals. The term “blind” is employed because the only a priori knowledge of the signals is their statistical independence. An article by J. Cardoso (“Blind Signal Separation: Statistical Principles,” Proceedings of the IEEE, Vol. 86, No. 10, October 1998, pp. 2009-2025) describes the technique.
In general, BSS is based on the hypothesis that the source signals are stochastically mutually independent. The article by Cardoso noted above, and a related article by S. Amari and A. Cichocki (“Adaptive Blind Signal Processing-Neural Network Approaches,” Proceedings of the IEEE, Vol. 86, No. 10, October 1998, pp. 2026-2048), provide heuristic algorithms for BSS of speech. Such algorithms have originated from traditional signal processing theory and from various other backgrounds, such as neural networks, information theory, statistics, and system theory. However, most such algorithms deal with instantaneous mixtures of sources, and only a few methods examine the situation of convolutive mixtures of speech signals. The instantaneous mixture is the simplest case of BSS and can be encountered when multiple speakers are talking simultaneously in an anechoic room, with no reverberation effects or sound reflections. However, when dealing with real room acoustics (i.e., in a broadcast studio, over a speakerphone, or even in a phone booth), the effect of reverberation is significant. Depending upon the amount and type of the room noise, and the strength of the reverberation, the resulting speech signals that are received by the microphones may be highly distorted, which significantly reduces the effectiveness of such prior art speech separation algorithms.
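The instantaneous-mixture case described above can be illustrated with a minimal sketch. The example below is a hypothetical illustration, not the method of any of the cited articles: it uses a FastICA-style fixed-point iteration (one common algorithm in the BSS family Cardoso surveys), two synthetic "sources" (a square wave and a tone), and an invented 2x2 mixing matrix standing in for the unknown anechoic acoustics between two speakers and two microphones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "speakers": a square wave and a tone.
n = 5000
t = np.linspace(0, 8, n)
S = np.vstack([np.sign(np.sin(3 * t)), np.sin(7 * t)])

A = np.array([[1.0, 0.6],      # hypothetical, unknown mixing matrix:
              [0.5, 1.0]])     # each "microphone" hears both sources
X = A @ S                      # the two observed instantaneous mixtures

# Center and whiten the observations.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Xw = E @ np.diag(d ** -0.5) @ E.T @ X

# FastICA fixed-point iterations (tanh nonlinearity, deflation),
# relying only on the statistical independence of the sources.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wx = w @ Xw
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (Xw * g).mean(axis=1) - g_prime.mean() * w
        for j in range(i):                 # deflation: stay orthogonal
            w_new -= (w_new @ W[j]) * W[j]  # to earlier components
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < 1e-9
        w = w_new
        if converged:
            break
    W[i] = w

Y = W @ Xw  # recovered sources, up to permutation, sign, and scale
```

Note that the separation succeeds here only because the mixture is instantaneous; as discussed above, convolutive (reverberant) mixtures violate this model and require substantially more elaborate methods.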
To quote a recent experimental study: “ . . . reverberation and room noise considerably degrade the performance of BSSD (blind source separation and deconvolution) algorithms. Since current BSSD algorithms are so sensitive to the environments in which they are used, they will only perform reliably in acoustically treated spaces devoid of persistent noises.” (A. Westner and V. M. Bove, Jr., “Applying Blind Source Separation and Deconvolution to Real-World Acoustic Environments,” Proc. 106th Audio Engineering Society (AES) Convention, 1999.)
Thus, BSS techniques, while representing an area of active research, have not produced successful results when applied to speech recognition under co-channel speech interference. In addition, BSS requires more than one microphone, which is not practical in most broadcast and telephony speech recognition applications. It would be desirable to provide a technique for the simultaneous-speaker problem that requires only one microphone and that is inherently less sensitive to non-ideal room reverberation and noise.
Therefore, neither the currently popular single-microphone approaches nor the known multiple-microphone approaches, which have proven successful in addressing mild acoustic distortion, have provided satisfactory solutions for dealing with difficult co-channel speech interference and long-delay acoustic reverberation problems. The inherent infrastructure of existing state-of-the-art speech recognizers, which requires relatively short, fixed-frame feature inputs or prior statistical information about the interference sources, is largely responsible for this shortcoming.
If automatic speech recognition (ASR) systems, speakerphones, or enhancement systems for the hearing impaired are to become truly comparable to human performance, they must be able to segregate multiple speakers and focus on one among many, to “fill in” missing speech information interrupted by brief bursts of noise, and to tolerate changing patterns of reverberation due to different room acoustics. Humans with normal hearing are often able to accomplish these feats through remarkable perceptual processes known collectively as auditory scene analysis. The mechanisms that give rise to such an ability are an amalgam of relatively well-known bottom-up sound processing stages in the early and central auditory system, and less understood top-down attention phenomena involving whole brain function. It would be desirable to provide ASR techniques capable of solving the simultaneous speaker problem noted above. It would further be desirable to provide ASR techniques capable of solving the simultaneous speaker problem that are modeled, at least in part, on auditory scene analysis.
Preferably, such techniques should be usable in conjunction with existing ASR systems. It would thus be desirable to provide enhancement preprocessors that can be used to process input signals into existing ASR systems. Such techniques should be language independent and capable of separating different, non-speech sounds, such as multiple musical instruments, in a single channel.