Aspects of the present invention relate to speech processing. Other aspects of the present invention relate to speech understanding.
Most automated speech recognition systems employ a graph decoder to decode an acoustic feature sequence, measured from input speech data, into a word sequence that is allowed by an underlying language. A graph decoder may use acoustic models of words or phonemes (e.g., Hidden Markov Models, or HMMs) to translate an acoustic feature sequence into the most likely word sequence based on a language model that describes the allowed word sequences.
Such an automated speech recognition system with a graph decoder can recognize only word sequences that are explicitly allowed in the corresponding language model. This limits the speech recognition system. For example, the sentence “change to channel two, please” may correspond to a valid word sequence according to a language model, but the sentence “change to, umm, channel two, please” may not, even though the two sentences mean the same thing, both linguistically and semantically.
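The limitation described above may be illustrated with a minimal sketch. The word graph, its state numbering, and the names GRAMMAR, FINAL_STATES, and is_allowed are hypothetical and are introduced here for illustration only. The sketch also omits acoustic scoring and checks only whether a word sequence is a path through the graph.

```python
# Hypothetical toy word graph: state -> {word: next state}.
# It stands in for a language model that lists the allowed word sequences.
GRAMMAR = {
    0: {"change": 1},
    1: {"to": 2},
    2: {"channel": 3},
    3: {"one": 4, "two": 4, "three": 4},
    4: {"please": 5},
}
# A sentence may end after the channel number or after "please".
FINAL_STATES = {4, 5}


def is_allowed(words):
    """Return True if the word sequence follows a complete path in GRAMMAR."""
    state = 0
    for word in words:
        transitions = GRAMMAR.get(state, {})
        if word not in transitions:
            # No arc for this word from the current state: the sequence
            # is outside the language model and cannot be recognized.
            return False
        state = transitions[word]
    return state in FINAL_STATES


print(is_allowed("change to channel two please".split()))
print(is_allowed("change to umm channel two please".split()))
```

Under these assumptions, the first sequence is accepted, whereas the second is rejected solely because the graph has no arc for the filler word “umm” after “to”.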
Different solutions have been used to improve the flexibility of a graph-decoder-based speech recognition system. In some recognition systems, different patterns of the same sentence may be explicitly modeled. In other recognition systems, the recognition of a word sequence may merely use the vocabulary without imposing any pre-defined sentence structure. In the former case, the modeling task may become overwhelming. In the latter case, the recognition result may become less meaningful because any word is now allowed to follow the previously recognized word even though most of the possible combinations may not correspond to meaningful sentences at all.
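The trade-off between the two approaches may be made concrete with a rough counting sketch. The function names filler_variants and unconstrained_count, the filler words, and the vocabulary size are assumptions chosen for illustration and do not describe any particular recognition system. The first function enumerates the patterns that would have to be modeled explicitly if an optional filler were permitted at every word boundary. The second counts the word sequences admitted when any word may follow any other word.

```python
from itertools import product


def filler_variants(words, fillers):
    """Enumerate every pattern of a sentence with an optional filler in each gap."""
    # Gaps: before the first word, between adjacent words, after the last word.
    gaps = len(words) + 1
    options = [None] + list(fillers)  # None means "no filler in this gap"
    variants = []
    for choice in product(options, repeat=gaps):
        sequence = []
        for i, filler in enumerate(choice):
            if filler is not None:
                sequence.append(filler)
            if i < len(words):
                sequence.append(words[i])
        variants.append(sequence)
    return variants


def unconstrained_count(vocab_size, length):
    """Count sequences of a given length when any word may follow any word."""
    return vocab_size ** length


sentence = "change to channel two please".split()
print(len(filler_variants(sentence, ["umm"])))        # one filler word
print(len(filler_variants(sentence, ["umm", "uh"])))  # two filler words
print(unconstrained_count(8, 5))                      # 8-word vocabulary, 5-word sequences
```

Under these assumptions, a single five-word sentence with one filler word already yields 64 patterns to model, and two filler words yield 729. By contrast, an unconstrained vocabulary of only eight words admits 32768 five-word sequences, most of which would not correspond to meaningful sentences.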