Current automatic speech recognition systems continue to have difficulty processing speech in multi-speaker and high-noise environments, despite the availability of substantially increased processing power. To achieve high-performance speech recognition in such environments, what is needed is a speech recognition system that uses the same cues for recognition and noise robustness that human beings do. Such a system should be based on detailed neurobiological and psychoacoustic knowledge of human auditory function, achieving noise robustness through auditory stream separation and the use of noise-robust phonetic cues.
The human approach to noise robustness is based on high-resolution spectral analysis, followed by intelligent grouping of fine-grained sound features that can be ascribed to a common source. By contrast, conventional speech recognition systems group fine-grained sound features by indiscriminately blurring them together in a 22-point mel-scale filterbank and a 10-20 millisecond frame. This approach works passably in quiet environments, but it is the major limiting factor preventing conventional recognizers from achieving noise robustness: once the signal features have been blurred in with other sounds, they can never be recovered.
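The blurring performed by a conventional front end can be illustrated with a minimal sketch. Only the 22-band mel filterbank and the 10-20 millisecond frame come from the description above; the 16 kHz sample rate, the Hann window, the triangular filter shape, the particular mel-scale formula, and the two tone frequencies standing in for two separate sound sources are all assumptions chosen for illustration, not details of any particular recognizer.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; several variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        # Triangle: rises from lo to mid, falls from mid to hi, zero elsewhere.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

SAMPLE_RATE = 16000   # assumed sample rate
FRAME_MS = 20         # upper end of the 10-20 ms frame range
N_FILTERS = 22        # 22-point mel-scale filterbank

frame_len = SAMPLE_RATE * FRAME_MS // 1000
t = np.arange(frame_len) / SAMPLE_RATE

# Two hypothetical sources, each reduced to a single tone for simplicity.
TONE_A_HZ, TONE_B_HZ = 2000.0, 2100.0
mixture = np.sin(2 * np.pi * TONE_A_HZ * t) + np.sin(2 * np.pi * TONE_B_HZ * t)

# Short-time power spectrum of one windowed frame.
power = np.abs(np.fft.rfft(mixture * np.hanning(frame_len))) ** 2

# Each mel energy is a weighted sum over many FFT bins.
bank = mel_filterbank(N_FILTERS, frame_len, SAMPLE_RATE)
mel_energies = bank @ power

# Bands whose passband covers both tones: their contributions are summed
# into a single number and cannot be separated afterwards.
bin_hz = SAMPLE_RATE / frame_len
bin_a = int(round(TONE_A_HZ / bin_hz))
bin_b = int(round(TONE_B_HZ / bin_hz))
shared = [i for i in range(N_FILTERS)
          if bank[i, bin_a] > 0.0 and bank[i, bin_b] > 0.0]

print("FFT bins per frame:", power.size)
print("mel bands per frame:", mel_energies.size)
print("bands that sum both tones together:", shared)
```

Under these assumed settings, the frame's power spectrum is reduced to 22 band energies, and the two tones should fall inside the passband of at least one common band. Because each band energy is a single weighted sum, the individual contributions of the two sources are no longer distinguishable in the output.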