In the standard Maximum A Posteriori (MAP) approach to speech recognition, the goal is to find the word sequence with the highest posterior probability given the acoustic observation. Recently, a number of alternative approaches have been proposed for directly optimizing the word error rate, the most commonly used evaluation criterion. For instance, a consensus decoding approach is described in Mangu et al., “Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks,” Computer Speech and Language, 14(4), pp. 373-400, 2000, the disclosure of which is incorporated herein by reference.
In the consensus decoding approach, a word lattice is converted into a confusion network, which specifies the word-level confusions at different time intervals. For each confusion set, the word with the highest score is selected and output. A benefit of the consensus decoding approach is that it converts extremely confusing word lattices into a much simpler form. Unfortunately, analyses of the confusion sets reveal that the word with the highest score is not always the correct word, so always selecting the highest-scoring word results in errors. Consequently, the consensus decoding approach is not ideal.
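By way of illustration only, the selection step described above can be sketched as follows. This is a minimal sketch, not the implementation of the cited reference: the representation of a confusion network as a list of word-to-score mappings, the function name consensus_decode, and the example words and scores are all assumptions introduced here for illustration.

```python
from typing import Dict, List

# Assumed representation: a confusion network is a list of confusion sets,
# one per time interval; each confusion set maps a candidate word to its score.
ConfusionNetwork = List[Dict[str, float]]


def consensus_decode(network: ConfusionNetwork) -> List[str]:
    """Select the highest-scoring word from each confusion set, in order."""
    hypothesis = []
    for confusion_set in network:
        # Pick the candidate whose score is largest within this set.
        best_word = max(confusion_set, key=confusion_set.get)
        hypothesis.append(best_word)
    return hypothesis


# Hypothetical confusion network; words and scores are invented for illustration.
example_network: ConfusionNetwork = [
    {"I": 0.9, "eye": 0.1},
    {"have": 0.6, "move": 0.4},
    {"it": 0.55, "a": 0.45},
    {"veal": 0.55, "fine": 0.45},
]

print(consensus_decode(example_network))
```

In this hypothetical example, the last confusion set shows the limitation noted above: if the spoken word were "fine", the selection rule would still output "veal" because it carries the higher score.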
Thus, what is needed is a way of improving speech recognition when using consensus decoding.