In many hands-free sound capture scenarios (e.g., gaming, speech recognition, communication and so forth) there are two or more human speakers talking at the same time. Speech separation, which refers to simultaneous capture and separation of human voices by audio processing, is desirable in many such scenarios.
For example, in some game applications that involve speech recognition and voice commands, it is highly desirable to separate the voices of simultaneous talkers located in the same general area. These separated voices may each be sent for speech recognition such that the recognized commands may be applied to each player separately. Also, speech from one speaker may be sent to a corresponding recipient in the case of multiparty online gaming.
Sound source separation is generally similar, except that not all captured sounds need be speech. For example, sound source separation can be used as a speech or other sound enhancement technique, such as to separate the desired speech or sounds from undesired signals such as noise or ambient speech. As a more particular example, sound source separation may facilitate voice control of multimedia equipment, in which the voice control commands from one or more speakers are received in various acoustic environments (e.g., with differing noise levels and reverberation conditions).
Sound source/speech separation may be accomplished via a beamformer, which uses the spatial separation of the sources to weight the signals from an array of microphones separately, and thereby amplify/boost signals received from different directions differently. A nullformer operates similarly, but nulls/suppresses interference based on such spatial information. Beamformers are relatively simple, converge quickly, and are robust; however, they are somewhat imprecise and separate interfering signals less well in real-world situations, where reflections of the interfering source arrive from many different angles.
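By way of illustration only, the weighting performed by a beamformer may be sketched as a narrowband delay-and-sum beamformer for a uniform linear microphone array. This is a minimal sketch under stated assumptions, not any particular implementation: the function names, the plane-wave (far-field) model, the four-microphone geometry, the 2 kHz test frequency, and the speed of sound are all hypothetical choices made for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # meters per second (assumed room-temperature value)


def steering_vector(num_mics, spacing_m, freq_hz, angle_deg):
    """Per-microphone phase of a plane wave arriving from angle_deg.

    Assumes a uniform linear array and an angle measured from broadside.
    """
    delay = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay)
            for m in range(num_mics)]


def delay_and_sum_weights(num_mics, spacing_m, freq_hz, look_deg):
    """Weights that phase-align the look direction and average the channels."""
    d = steering_vector(num_mics, spacing_m, freq_hz, look_deg)
    return [x / num_mics for x in d]


def array_response(weights, spacing_m, freq_hz, angle_deg):
    """Magnitude gain of the weighted array toward angle_deg."""
    d = steering_vector(len(weights), spacing_m, freq_hz, angle_deg)
    return abs(sum(w.conjugate() * x for w, x in zip(weights, d)))


# Hypothetical setup: 4 microphones, 5 cm apart, 2 kHz tone, look direction 0 degrees.
weights = delay_and_sum_weights(4, 0.05, 2000.0, 0.0)
gain_look = array_response(weights, 0.05, 2000.0, 0.0)
gain_side = array_response(weights, 0.05, 2000.0, 60.0)
print("gain toward look direction:", gain_look)
print("gain toward 60 degrees:", gain_side)
```

The weights pass a signal from the look direction with unit gain and attenuate a signal from the other direction. The sketch models only the direct path; reflections arriving from other angles are not attenuated in the same way, which is consistent with the limitation noted above.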
Sound source/speech separation also may be accomplished by independent component analysis. This technique is based on statistical independence, and works by maximizing the non-Gaussianity or mutual independence of the separated sound signals. While independent component analysis can result in a high degree of separation, it has many parameters, which makes convergence more difficult and can lead to poor results. Indeed, independent component analysis depends more heavily on the initial conditions, because it takes a while to learn the coefficients, and the sources may have moved in that timeframe.
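By way of illustration only, maximizing non-Gaussianity may be sketched for the simplest case of two sources and two instantaneously mixed channels. This is a minimal sketch under stated assumptions, not a practical separator: the helper names, the synthetic sources, the mixing coefficients, and the grid search over a rotation angle (used here in place of an iterative learning rule) are all hypothetical, and real acoustic mixtures are convolutive rather than instantaneous.

```python
import math
import random


def whiten(x1, x2):
    """Center two channels, decorrelate them, and scale each to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    caa = sum(v * v for v in a) / n
    cbb = sum(v * v for v in b) / n
    cab = sum(p * q for p, q in zip(a, b)) / n
    # Rotation angle that diagonalizes the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cab, caa - cbb)
    z1, z2 = rotate(a, b, theta)
    s1 = math.sqrt(sum(v * v for v in z1) / n)
    s2 = math.sqrt(sum(v * v for v in z2) / n)
    return [v / s1 for v in z1], [v / s2 for v in z2]


def rotate(z1, z2, angle):
    """Apply a 2-D rotation to a pair of channels."""
    c, s = math.cos(angle), math.sin(angle)
    y1 = [c * u + s * v for u, v in zip(z1, z2)]
    y2 = [-s * u + c * v for u, v in zip(z1, z2)]
    return y1, y2


def kurtosis(y):
    """Excess kurtosis of a zero-mean, unit-variance channel (0 for a Gaussian)."""
    return sum(v ** 4 for v in y) / len(y) - 3.0


def separate(x1, x2, steps=90):
    """Pick the rotation of the whitened data with the largest squared kurtosis."""
    z1, z2 = whiten(x1, x2)
    best_angle, best_score = 0.0, -1.0
    for k in range(steps):
        angle = 0.5 * math.pi * k / steps
        y1, y2 = rotate(z1, z2, angle)
        score = kurtosis(y1) ** 2 + kurtosis(y2) ** 2
        if score > best_score:
            best_angle, best_score = angle, score
    return rotate(z1, z2, best_angle)


def correlation(a, b):
    """Pearson correlation, used only to check the demo result."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


# Hypothetical demo: one uniform source and one Laplacian (speech-like) source.
rng = random.Random(0)
n = 3000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
x1 = [2.0 * a + 0.5 * b for a, b in zip(s1, s2)]  # assumed mixing coefficients
x2 = [1.0 * a + 0.8 * b for a, b in zip(s1, s2)]
y1, y2 = separate(x1, x2)
print("|corr(y1, s1)|, |corr(y1, s2)|:",
      abs(correlation(y1, s1)), abs(correlation(y1, s2)))
print("|corr(y2, s1)|, |corr(y2, s2)|:",
      abs(correlation(y2, s1)), abs(correlation(y2, s2)))
```

After whitening, only a single rotation angle remains to be found in this two-channel case, so an exhaustive search is feasible. With more channels and with per-frequency filters, the number of coefficients grows quickly and iterative learning is needed, which is where the convergence and initialization difficulties noted above arise. The recovered signals match the sources only up to order, sign, and scale.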
While these technologies provide sound source/speech separation to an extent, there is still room for improvement. Attempts to combine these technologies have heretofore not provided any improvement over existing techniques.