1. Technical Field
The present invention relates to speech recognition and machine translation, and more particularly to systems and methods for predicting a best-performing model or model combination using past “utterance context” information.
2. Description of the Related Art
All automatic speech recognition (ASR) applications, including voice-based information retrieval, speech-to-speech translation and spoken dialog systems, are sensitive to environmental, speaker, channel and domain mismatch with respect to the conditions under which the system is trained. This problem is more pronounced when the application is used in real-world settings. For example, hand-held speech-to-speech translation systems typically are not used in quiet rooms; they are used in the street, in a vehicle, etc., where there is background interference. Moreover, the translation systems may be used by more than one person, each of whom may have a different accent, gender, etc.
An ASR task is a sequential process in which human-human or human-machine interaction has a structure. More often than not, there is an environment/speaker/channel/topic dependency between consecutive utterances. For example, the topic dependency has been exploited in many spoken dialog systems in the form of a dialog state, and the speaker dependency has been exploited in the form of speaker adaptation.
Model adaptation, when it is applicable for ASR, is one way of addressing the mismatch problem, but with limited success. The real-world speech recognition performance improvements from adaptation hardly match those obtained in an offline, controlled experimental setting. A single acoustic model built by multi-style training to account for various acoustic and environmental conditions may be suggested. However, better performance is achieved if multiple acoustic models are trained separately for different conditions (including multi-style training) and a best model is selected during decoding. Moreover, the multiple decoding outputs need to be combined automatically.
Another problem involves a “context independent” use of ROVER (See J. Fiscus, “A Post-Processing System To Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction”, ASRU, 1997, incorporated herein by reference) or consensus based hypothesis combination (CHC) (See L. Mangu, E. Brill and A. Stolcke, “Finding Consensus Among Words: Lattice-Based Word Error Minimization”, Eurospeech, 1999, incorporated herein by reference) for improved speech recognition accuracy. The CHC method is a well-established and widely used speech recognition hypothesis combination method. It combines multiple speech recognition hypotheses, presented in the form of lattices, obtained using different acoustic and/or language models.
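By way of illustration only, the general idea of voting-based hypothesis combination may be sketched as follows. The sketch is a deliberate simplification and is not the ROVER or CHC procedure of the cited references: it assumes that the hypotheses have already been aligned word by word to equal length (with a hypothetical placeholder token "<eps>" marking an empty slot), whereas ROVER performs that alignment itself and CHC operates on lattices. The function name vote_aligned and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def vote_aligned(hypotheses, null="<eps>"):
    """Combine pre-aligned word sequences by per-slot majority vote.

    Assumption: every hypothesis has the same length, with `null`
    filling slots where a recognizer produced no word.
    """
    if not hypotheses:
        return []
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("hypotheses must be pre-aligned to equal length")

    combined = []
    for slot in zip(*hypotheses):
        counts = Counter(slot)
        # Pick the most frequent word in this slot; on a tie, the word
        # from the earliest-listed recognizer wins (an arbitrary choice).
        best = max(slot, key=lambda word: counts[word])
        if best != null:
            combined.append(best)
    return combined


if __name__ == "__main__":
    hyps = [
        ["the", "cat", "sat", "<eps>"],
        ["the", "cap", "sat", "down"],
        ["a", "cat", "sat", "<eps>"],
    ]
    print(" ".join(vote_aligned(hyps)))
```

Note that the vote in this sketch treats every utterance in isolation, which corresponds to the “context independent” use described above; no information from preceding utterances influences which recognizer output is favored.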