1. Technical Field
The present invention relates generally to speech recognition and, more particularly, to a system and method for rescoring N-best hypotheses output from an automatic speech recognition system by utilizing an independently derived text-to-speech (TTS) system to generate a synthetic waveform for each N-best hypothesis and comparing each synthetic waveform with the original speech waveform to select the final system output.
2. Description of Related Art
A common technique in speech recognition is to first produce a list of the N most-likely (“N-best”) hypotheses for each utterance and then rescore each of the N-best hypotheses using one or more knowledge sources not necessarily modeled by the speech recognition system which produced the N-best hypotheses. Advantageously, this “N-best rescoring” method enables additional knowledge sources to be brought to bear on the recognition task without having to integrate such sources into the initial decoding system.
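By way of illustration only, generic N-best rescoring may be sketched as follows. The sketch assumes that each hypothesis carries a first-pass log-score and that the additional knowledge source returns a log-score for a hypothesis text; the names `Hypothesis`, `rescore_nbest`, `toy_knowledge_source`, and the interpolation `weight` are hypothetical and are not taken from any particular recognition system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    score: float  # first-pass recognizer log-score (higher is better); assumed convention


def rescore_nbest(
    nbest: List[Hypothesis],
    knowledge_source: Callable[[str], float],
    weight: float = 1.0,
) -> List[Hypothesis]:
    """Re-rank hypotheses by first-pass score plus a weighted knowledge-source score."""
    return sorted(
        nbest,
        key=lambda h: h.score + weight * knowledge_source(h.text),
        reverse=True,
    )


def toy_knowledge_source(text: str) -> float:
    """Hypothetical stand-in for an external knowledge source (not a real model)."""
    preferred = {"recognize speech": 0.0}
    return preferred.get(text, -5.0)


nbest = [
    Hypothesis("wreck a nice beach", -10.0),
    Hypothesis("recognize speech", -11.0),
]
best = rescore_nbest(nbest, toy_knowledge_source)[0]
print(best.text)
```

In this toy case the first-pass ranking prefers the first hypothesis, while the added knowledge source penalizes it enough that the second hypothesis is ranked first after rescoring. The initial decoder is left unchanged, which is the advantage noted above.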
One such “N-best rescoring” method is disclosed in “An Articulatory-Like Speech Production Model with Controlled Use of Prior Knowledge” by R. Bakis, Frontiers in Speech, CD-Rom, 1993. With this method, an articulatory model which generates acoustic vectors (not speech waveforms) given a phonetic transcription is utilized to produce acoustics against which the original speech may be compared. Other “rescoring” methods are known to those skilled in the art.
As is understood by those skilled in the art, the techniques utilized for speech recognition and speech synthesis are inherently related. Consequently, increased knowledge and understanding of, and subsequent improvements to, one technique can have profound implications for the other. Due to recent advances in text-to-speech (TTS) systems which have enabled high-quality synthesis, it is to be appreciated that a TTS system can provide a source of knowledge about what the speech signal associated with each of the N-best hypotheses would look like. Currently, there exist no known systems or methods which utilize a TTS system for rescoring N-best hypotheses. Therefore, based on the similarities between speech recognition and speech synthesis, it is desirable to employ a TTS system as a knowledge source for use in rescoring N-best hypotheses.
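By way of illustration only, the use of a synthesizer as a knowledge source may be sketched as follows. The sketch assumes that a synthesizer is available as a callable mapping a hypothesis text to a sequence of feature values, and it compares each synthetic sequence with the original using dynamic time warping (DTW). The choice of DTW, the one-dimensional features, and the names `dtw_distance`, `select_by_synthesis`, and `toy_tts` are assumptions made for this example and are not stated in the foregoing description.

```python
from typing import Callable, Dict, List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    m = len(b)
    prev = [0.0] + [inf] * m  # row 0 of the cumulative cost table
    for i in range(1, len(a) + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def select_by_synthesis(
    original: Sequence[float],
    hypotheses: List[str],
    synthesize: Callable[[str], Sequence[float]],
) -> str:
    """Return the hypothesis whose synthetic signal is closest to the original."""
    return min(hypotheses, key=lambda h: dtw_distance(synthesize(h), original))


# Hypothetical stand-in for a TTS system: fixed toy feature sequences.
toy_tts: Dict[str, List[float]] = {
    "recognize speech": [0.0, 1.0, 2.0, 0.0],
    "wreck a nice beach": [2.0, 0.0, 2.0, 1.0, 1.0],
}

original = [0.0, 1.0, 1.0, 2.0, 0.0]  # toy features of the spoken utterance
chosen = select_by_synthesis(original, list(toy_tts), toy_tts.__getitem__)
print(chosen)
```

A practical system would operate on waveforms or multi-dimensional acoustic features rather than on the toy sequences shown here; time warping is used in the sketch only because synthetic and natural speech generally differ in duration.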