Many methods exist for post-processing speech recognition results with the goal of reducing word error rate; see (Ringer & Allen, Error correction via a Post-Processor for continuous Speech recognition, 1996; Ringer & Allen, Robust Error Correction of Continuous Speech Recognition, 1997; Jeong, et al., Speech Recognition Error Correction Using Maximum Entropy Language Model, 2004; U.S. Pat. No. 6,064,957). These post-processing methods are usually based on the noisy channel paradigm: the channel takes as input a user utterance containing a sequence of words, recognizes it, and returns the recognized sequence of words as a distorted (noised) version of the original. These methods usually rely on a corpus of utterances that must be transcribed word by word. Such transcription is a time-consuming, expensive and error-prone process.
The present invention can also be viewed through the same noisy channel paradigm, but with the following differences. The channel takes as input a user utterance with (a) a sequence of words contained in it, and (b) a concept, that is, a semantic tag representing the meaning of the uttered words. The channel outputs (a) a recognized sequence of words and (b) a recognized concept, that is, a recognized semantic tag.
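By way of illustration only, the input and output of such a channel may be sketched as a pair of records, each carrying a word sequence and a semantic tag. The type names `ChannelInput` and `ChannelOutput`, the tag `PAY_BILL`, and the sample utterance are hypothetical and chosen for this sketch; they are not prescribed by the description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ChannelInput:
    """What the user actually said (hypothetical type name)."""
    words: Tuple[str, ...]  # (a) the uttered sequence of words
    concept: str            # (b) semantic tag for the meaning of the words


@dataclass(frozen=True)
class ChannelOutput:
    """What the recognizer returned (hypothetical type name)."""
    words: Tuple[str, ...]  # (a) the recognized sequence of words
    concept: str            # (b) the recognized semantic tag


# Made-up example: one word is distorted by the channel, the tag is not.
spoken = ChannelInput(("i", "want", "to", "pay", "my", "bill"), "PAY_BILL")
recognized = ChannelOutput(("i", "want", "to", "play", "my", "bill"), "PAY_BILL")
```

In this example the recognized words differ from the uttered words, while the recognized concept matches the uttered concept.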
Another difference lies in the channel quality criterion: instead of word error rate, we are interested in semantic tag error rate. The post-processing therefore aims at reducing the semantic tag recognition error rate, using the recognized words only as a means to that end, not as a goal in themselves.
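The distinction between the two criteria may be illustrated by a minimal sketch that computes both on the same data. It assumes that word error rate is the word-level edit distance divided by the number of reference words, and that semantic tag error rate is the fraction of utterances whose recognized tag differs from the true tag. The function names, tags, and sample utterances are hypothetical.

```python
from typing import List, Sequence, Tuple

# One item is (word sequence, semantic tag); a hypothetical layout.
Item = Tuple[List[str], str]


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(refs: List[Item], hyps: List[Item]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(word_errors(r[0], h[0]) for r, h in zip(refs, hyps))
    return errors / sum(len(r[0]) for r in refs)


def tag_error_rate(refs: List[Item], hyps: List[Item]) -> float:
    """Fraction of utterances whose recognized tag is wrong."""
    wrong = sum(r[1] != h[1] for r, h in zip(refs, hyps))
    return wrong / len(refs)


# Made-up data: what was said, and what the channel returned.
refs = [("i want to pay my bill".split(), "PAY_BILL"),
        ("what is my balance".split(), "CHECK_BALANCE")]
hyps = [("i want to play my bill".split(), "PAY_BILL"),
        ("what is my balance".split(), "PAY_BILL")]

wer = word_error_rate(refs, hyps)
ter = tag_error_rate(refs, hyps)
```

On this made-up data the word error rate is 0.1 (one wrong word out of ten), while the semantic tag error rate is 0.5 (one wrong tag out of two): the first utterance contains a word error but keeps the correct tag, and the second is recognized word for word yet receives the wrong tag. The two criteria can thus diverge.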