While progress has been made in improving the noise robustness of speech recognition systems, recognizing speech in the presence of a competing talker (mixed speech) remains a challenge. For single-microphone speech recognition in the presence of a competing talker, researchers have applied a variety of techniques to mixed speech samples and compared the results. These techniques include model-based approaches that use factorial Gaussian Mixture Model-Hidden Markov Models (GMM-HMM) to model the interaction between the target and competing speech signals, as well as their temporal dynamics. With this technique, joint inference, or decoding, identifies the two most likely speech signals, or spoken sentences.
In computational auditory scene analysis (CASA) and “missing feature” approaches, segmentation rules operate on low-level features to estimate a time-frequency mask that isolates the signal components belonging to each speaker. This mask may be used to reconstruct the signal or to inform the decoding process. Other approaches use non-negative matrix factorization (NMF) for separation and pitch-based enhancement.
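The masking idea can be illustrated with a minimal sketch. It makes several assumptions that are not part of the approaches described above: the mask is binary, it is computed from known per-speaker magnitudes (an "oracle" mask, whereas CASA systems must estimate the mask from the mixture alone), and the mixture magnitude is treated as the sum of the two sources. The function names `binary_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

def binary_mask(target_mag, interferer_mag):
    """Return 1.0 in each time-frequency cell where the target dominates, else 0.0.

    Assumption: both magnitudes are known (oracle case). A real CASA system
    would have to estimate this mask from the mixed signal.
    """
    return (target_mag > interferer_mag).astype(float)

def apply_mask(mixture_mag, mask):
    """Keep only the mixture cells assigned to the target speaker."""
    return mixture_mag * mask

# Toy spectrograms of shape (frequency bins, frames).
n_freq, n_frames = 4, 5
target = np.zeros((n_freq, n_frames))
target[:2, :] = 1.0        # target energy in the two lowest bins
interferer = np.zeros((n_freq, n_frames))
interferer[2:, :] = 1.0    # competing talker in the two highest bins

mixture = target + interferer   # simplifying assumption: magnitudes add
mask = binary_mask(target, interferer)
recovered = apply_mask(mixture, mask)
```

Because the two toy sources occupy disjoint frequency bins, the masked mixture matches the target exactly. With real speech the sources overlap in time and frequency, so the recovered signal is only an approximation.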
In one approach, a separation system uses factorial GMM-HMM generative models with 256 Gaussians to model the acoustic space for each speaker. While this is useful for a small vocabulary, it is a primitive model for a large vocabulary task. With a larger number of Gaussians, performing inference on the factorial GMM-HMM becomes computationally impractical. Further, such a system assumes that speaker-dependent training data is available and that the same closed set of speakers appears in both training and test, which may be impractical for large numbers of speakers.
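A rough count shows why the cost grows so quickly. The sketch below assumes that inference scores every combination of one Gaussian per speaker for each frame, with no pruning or approximation; actual systems may reduce this cost. The helper name `joint_gaussian_pairs` is hypothetical.

```python
def joint_gaussian_pairs(gaussians_per_speaker, num_speakers=2):
    """Count the joint Gaussian combinations scored per frame.

    Assumption: exhaustive evaluation, one Gaussian chosen per speaker.
    """
    return gaussians_per_speaker ** num_speakers

# Combinations per frame for two speakers at several model sizes.
counts = {k: joint_gaussian_pairs(k) for k in (256, 1024, 4096)}
for k, n in counts.items():
    print(f"{k} Gaussians per speaker -> {n} joint combinations per frame")
```

Under this assumption, the 256-Gaussian models already require 65,536 combinations per frame for two speakers, and the count grows quadratically as the per-speaker models are enlarged.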