In recent years, the accuracy of automatic speech recognition (“ASR”) systems has improved significantly thanks to the deep learning techniques used in these systems. In 2010, the word error rate (WER) on the widely accepted Switchboard conversation transcription benchmark task was over 20%; by 2016, due to developments in deep learning, it had been reduced to below 7%.
Although this progress has been made for single-speaker dictation, progress in ASR for multi-talker mixed speech separation, tracing, and recognition, often referred to as the cocktail-party problem, has been less impressive. Human listeners can easily perceive separate sources in an acoustic mixture, but the same task remains difficult for automatic computing systems, especially when only a single channel of mixed speech is available.
Current solutions are limited in one or more of the following ways: they function only for a closed set of talkers and fail to scale with increasing numbers of speakers or larger vocabularies; they separate only highly dissimilar signals (e.g., separating music from a talker) rather than performing the more difficult task of separating similar signals, such as multiple talkers; they rely on talker-dependent models that require identifying the talkers at training time and collecting data from those talkers, resulting in a limited vocabulary, grammar, and talker set; they assume that each time-frequency bin belongs to only one speaker; or they include portions that are not jointly trainable, which limits system performance.