1. The Field of the Invention
The present invention relates to systems and methods for generating confidence scores. More particularly, the present invention relates to systems and methods for generating confidence scores from word lattices.
2. Background and Relevant Art
Confidence scores play an important role in a variety of technologies, such as speech recognition systems and dialog systems, and are particularly important in unsupervised learning, active learning, and word understanding. Generally, confidence scores are used to determine whether a word was recognized or understood correctly and enable certain actions to be taken. In a speech recognition system, for example, confidence scores represent how sure the speech recognition system is that the words were recognized correctly.
In the last decade, two fundamentally different approaches for computing confidence scores have been pursued. The first approach is based on acoustic measurements. The acoustic-based approach uses a two-pass algorithm on the speech. The first pass computes the best word hypotheses. The best word hypotheses are then re-scored to compute the confidence scores for each word in the best hypotheses. The first pass uses standard acoustic models, while the second pass uses acoustic models that normalize the log-likelihood functions. The acoustic approach is data-driven and requires the speech recognition system to explicitly model the acoustic channel.
The second approach for computing confidence scores is a lattice-based approach. In general, data can be organized or transformed into a lattice structure where each transition of the lattice represents a flot. Confidence scores can be assigned to each flot and those confidence scores can be used to determine which flots are preferable. However, lattices can be quite complex and it is often beneficial to reduce the complexity of a lattice. Simplifying the structure of a lattice can be done in a variety of different ways that are known in the art. Significantly, reducing the complexity of a lattice often results in lost data, which corresponds to less accurate confidence scores. Because a lattice can represent different types of data, confidence scores are relevant to many technologies, including dialog management, parsing technologies, and speech recognition.
In a lattice-based approach, confidence scores are typically computed in a single pass. The lattice-based approach does not require the transition hypotheses to be re-scored and is portable across various tasks and acoustic channels. In addition, the lattice-based approach requires no training and is suitable for unsupervised or on-line learning.
For example, a word lattice is often used in speech recognition, and one word lattice-based approach computes word posterior probabilities or confidence scores for a lattice structure that is referred to herein as a “sausage.” The sausage corresponds to the confusion networks created from the word lattice output of a speech recognizer. A sausage is thus a simplification of the original word lattice and has a particular topology. As a result, some of the original data is lost when the original word lattice is transformed into a sausage, and the confidence scores generated from the sausage are therefore less reliable. FIG. 1, for example, illustrates a sausage 102, that is, the confusion networks created from the lattice 100.
As illustrated, the topology of the sausage 102 is more straightforward than the topology of the word lattice 100. The sausage 102 is a sequence of confusion sets, where each confusion set is a group of words that compete in the same time interval. Each word has a posterior probability, which is the sum of the probabilities of all the paths in the lattice 100 that contain that word occurrence. In each confusion set, the sum of all posterior probabilities equals one. In addition, the sausage 102 preserves the time order of words, but loses the time information. One advantage of the sausage 102 is that it tends to minimize the Word Error Rate (WER) rather than the Sentence Error Rate (SER).
The sausage is formed by taking a word lattice as input and performing the following steps. First, the low probability links are pruned from the word lattice. A posterior probability for each link in the word lattice is then computed. Next, different occurrences of the same word around the same time interval are merged (intra-word clustering) and their posterior probabilities are summed. Finally, different words which compete around the same time interval are grouped (inter-word clustering) and confusion sets are formed as illustrated in FIG. 1.
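The posterior computation in the second step above can be illustrated with a forward-backward pass over a small lattice. The following is a minimal sketch only, under several assumptions that are not taken from the description: the function name link_posteriors is hypothetical, links are given as (source node, destination node, word, weight) tuples, each weight is a single multiplicative link score, nodes are numbered in topological order, and the pruning and clustering steps are omitted.

```python
from collections import defaultdict


def link_posteriors(links, start, end):
    """Return (word, posterior) for each link of an acyclic lattice.

    Assumptions (illustrative only): `links` holds tuples
    (src, dst, word, weight); node ids are integers in topological
    order (src < dst); a path's weight is the product of its link weights.
    """
    forward = defaultdict(float)   # total weight of partial paths start -> node
    backward = defaultdict(float)  # total weight of partial paths node -> end
    forward[start] = 1.0
    backward[end] = 1.0

    # Forward pass: visiting links by increasing source node guarantees
    # forward[src] is complete before it is used.
    for src, dst, _word, weight in sorted(links, key=lambda l: l[0]):
        forward[dst] += forward[src] * weight

    # Backward pass: visiting links by decreasing destination node
    # guarantees backward[dst] is complete before it is used.
    for src, dst, _word, weight in sorted(links, key=lambda l: l[1], reverse=True):
        backward[src] += weight * backward[dst]

    total = forward[end]  # summed weight of all complete paths
    return [
        (word, forward[src] * weight * backward[dst] / total)
        for src, dst, word, weight in links
    ]


# Toy lattice with two competing words in each of two time intervals.
toy_links = [
    (0, 1, "the", 0.6),
    (0, 1, "a", 0.2),
    (1, 2, "cat", 0.5),
    (1, 2, "cap", 0.3),
]
posteriors = dict(link_posteriors(toy_links, start=0, end=2))
```

In this toy lattice, every path passes through exactly one link per interval, so the posteriors of the competing links in each interval sum to one, mirroring the property of confusion sets described above.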
A consensus hypothesis, which is the word sequence obtained by choosing from each confusion set the word with the highest posterior probability, can be easily extracted from the sausage 102. The consensus word hypothesis of the sausage 102, however, may vary from the best path hypothesis inside the word lattice 100. The posterior probability estimates of words in the sausage are used as word confidence scores.
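Extraction of the consensus hypothesis can be sketched as follows. This is an illustration only: the representation of a sausage as a list of dictionaries mapping each word to its posterior probability, the function name consensus_hypothesis, and the example words and values are all assumptions made for the sketch.

```python
def consensus_hypothesis(sausage):
    """Pick the highest-posterior word from each confusion set.

    `sausage` is assumed to be a list of confusion sets, each a dict
    mapping word -> posterior probability. Returns (word, posterior)
    pairs, where the posterior serves as the word confidence score.
    """
    result = []
    for confusion_set in sausage:
        best_word = max(confusion_set, key=confusion_set.get)
        result.append((best_word, confusion_set[best_word]))
    return result


# Hypothetical sausage with two confusion sets; values are made up.
example_sausage = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.5, "cap": 0.4, "cut": 0.1},
]
hypothesis = consensus_hypothesis(example_sausage)
words = [word for word, _score in hypothesis]
```

Because each confusion set is resolved independently, the selected words need not lie on a single path of the original lattice, which is why the consensus hypothesis may differ from the best path hypothesis.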
In addition to the posterior probability, a local entropy can also be used as a confidence score. The local entropy is computed on each confusion set and accounts for more information than the posterior probability. The local entropy uses both the posterior probability of the winning word and the distribution of the posterior probabilities of competing words. In these examples, however, obtaining confidence scores for a word lattice requires that the lattice first be transformed into a set of confusion networks. This results in a loss of time information and does not account for the posterior probabilities of links that were pruned from the original word lattice. In addition, the consensus word hypothesis of the sausage may not correspond to the best path hypothesis of the original lattice from which the sausage was constructed.
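The local entropy mentioned above can be illustrated with a short sketch. The description does not give a formula, so the sketch assumes the local entropy is the Shannon entropy (natural logarithm) of the posterior probabilities within one confusion set; the function name local_entropy and the example values are hypothetical.

```python
import math


def local_entropy(confusion_set):
    """Shannon entropy of one confusion set (assumed definition).

    `confusion_set` maps word -> posterior probability; zero-probability
    entries contribute nothing and are skipped to avoid log(0).
    """
    return -sum(p * math.log(p) for p in confusion_set.values() if p > 0.0)


# Two hypothetical confusion sets whose winning word has the same
# posterior (0.5) but whose competitors are distributed differently.
two_way = {"cat": 0.5, "cap": 0.5}
three_way = {"cat": 0.5, "cap": 0.25, "cut": 0.25}

entropy_two_way = local_entropy(two_way)
entropy_three_way = local_entropy(three_way)
```

Under this assumed definition, the two sets receive different scores even though the winning posterior is identical, which illustrates how the distribution of the competing words enters the score.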