Listening to and understanding the speech of two or more people talking simultaneously is difficult, and it has long been considered one of the most challenging problems in automatic speech recognition.
Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However, such models tend to perform well only when the speakers are of different gender or have markedly different voices (T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004). When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. Several models in the literature have nonetheless attempted to do so, either for recognition (P. Varga and R. K. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP, pp. 845-848, 1990; M. Gales and S. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352-359, September 1996) or for enhancement of speech (Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models,” vol. 40, no. 4, pp. 725-735, 1992; S. T. Roweis, “One microphone source separation,” in NIPS, 2000, pp. 793-799). Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector.
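The ambiguity described above can be illustrated with a minimal sketch, not taken from any of the cited works: each speaker is represented by a diagonal-covariance GMM over acoustic feature frames, and each frame is scored under both models. All parameter values and model names below are hypothetical; when two speakers' models are nearly identical, the per-frame log-likelihoods are nearly equal and the frame cannot be attributed to either speaker with confidence.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, weights, means, vars_):
    """Log-likelihood of one acoustic frame under a GMM (log-sum-exp)."""
    comps = [
        math.log(w) + log_gauss(x, m, v)
        for w, m, v in zip(weights, means, vars_)
    ]
    mx = max(comps)  # subtract the max for numerical stability
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

# Two hypothetical speaker models over 2-dimensional features,
# deliberately chosen to be very similar ("similar voices").
speaker_a = dict(weights=[0.6, 0.4],
                 means=[[0.0, 1.0], [2.0, -1.0]],
                 vars_=[[1.0, 1.0], [1.0, 1.0]])
speaker_b = dict(weights=[0.5, 0.5],
                 means=[[0.1, 0.9], [2.1, -0.9]],
                 vars_=[[1.0, 1.0], [1.0, 1.0]])

frame = [0.2, 0.8]  # one frame of acoustic features
la = gmm_loglik(frame, **speaker_a)
lb = gmm_loglik(frame, **speaker_b)
# The two scores differ by only a small margin, so the frame-level
# evidence does not identify which speaker produced this frame.
print(abs(la - lb))
```

In practice the frame-independent scoring shown here is what the HMM-based approaches cited above extend with temporal (state-transition) structure, which is one way to reduce, but not eliminate, this per-frame ambiguity.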
The field of speech recognition goes back many years and encompasses many commonly used techniques, methods, and approaches. U.S. Pat. Nos. 7,062,433; 7,054,810; 6,950,796; 6,154,722; and 6,023,673, all of which are incorporated herein by reference, may serve as general references for techniques of the speech recognition art.
There is clearly a need to improve capabilities in the art of speech recognition when it comes to separating and understanding the simultaneous speech of two or more speakers.