The present disclosure relates generally to a dynamic N-best algorithm to reduce recognition errors in computer based recognition systems and, in particular, to a method of dynamically re-scoring an N-best list created in response to a given input.
One type of computer based recognition system is a speech recognition system. Speech recognition is the process by which an acoustic signal received by a microphone or telephone is converted to a set of text words, numbers, or symbols by a computer. Speech recognition systems model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units referred to as phonemes. Upon receiving and digitizing an acoustic speech signal, the speech recognition system analyzes the digitized speech signal, identifies a series of acoustic models within the speech signal, and derives a list of potential word candidates corresponding to the identified series of acoustic models. Notably, the speech recognition system can determine a measurement reflecting the degree to which the potential word candidates phonetically match the digitized speech signal. Speech recognition systems return hypotheses about the user's utterance in the form of an N-best list that consists of utterance hypotheses paired with numeric confidence values representing the recognition engine's assessment of the correctness of each hypothesis.
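By way of illustration only, an N-best list of the kind described above can be pictured as a sequence of hypothesis and confidence pairs. The following Python sketch is a minimal example under stated assumptions: the names `Hypothesis` and `top_candidate`, and the sample confidence values, are hypothetical and are not taken from any particular recognition engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One entry of an N-best list (hypothetical structure)."""
    text: str          # the utterance hypothesis
    confidence: float  # the engine's numeric confidence value


def top_candidate(n_best):
    """Return the hypothesis with the highest confidence value."""
    return max(n_best, key=lambda h: h.confidence)


# Illustrative values only: a spoken letter with acoustically similar rivals.
n_best = [
    Hypothesis("B", 0.41),
    Hypothesis("D", 0.33),
    Hypothesis("E", 0.26),
]
best = top_candidate(n_best)
```

A system that always accepts `best` uses only the top entry of the list; the remaining entries are the material on which post-recognition processing may operate.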
Speech recognition systems are utilized to analyze the potential word candidates with reference to a contextual model. This analysis determines a probability that one of the word candidates accurately reflects received speech based upon previously recognized words. The speech recognition system factors subsequently received words into the probability determination as well. The contextual model, often referred to as a language model, can be developed through an analysis of many hours of human speech or, alternatively, a written corpus that reflects speaking patterns. Typically, the development of the language model is domain specific. For example, a language model may be built reflecting language usage within an automotive context, a medical context, or for a general user.
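As a simplified illustration of how a contextual model can assign a probability to a word candidate based upon a previously recognized word, the following Python sketch estimates bigram probabilities from a small written corpus. It is a sketch under stated assumptions: the function names, the three-sentence automotive corpus, and the use of unsmoothed bigram counts are hypothetical choices made for brevity, and practical language models are considerably more elaborate.

```python
from collections import Counter


def train_bigram(sentences):
    """Count adjacent word pairs and the contexts they occur in."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def bigram_probability(pair_counts, context_counts, prev, cur):
    """Estimate P(cur | prev) by relative frequency (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / context_counts[prev]


# Hypothetical domain-specific corpus (automotive context).
corpus = [
    "check the brake fluid",
    "check the oil level",
    "change the brake pads",
]
pairs, contexts = train_bigram(corpus)

# Two acoustically identical candidates following the word "the".
p_brake = bigram_probability(pairs, contexts, "the", "brake")
p_break = bigram_probability(pairs, contexts, "the", "break")
```

In this toy corpus the candidate "brake" receives a higher probability than "break" after "the", which reflects the domain-specific nature of the model.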
Post-recognition N-best processing algorithms that reorder N-best candidates created by a speech recognition system are sometimes used in production speech understanding applications to improve upon the accuracy obtained by always using the top candidate returned on the N-best list. Previous research into N-best processing algorithms has generally emphasized the use of domain knowledge encoded in the language models. For example, knowledge sources such as syntactic and semantic information encoded in the language models have been utilized as well as confidence values and class N-gram scores computed from valid utterances.
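The reordering of N-best candidates described above can be sketched, in generic form, as combining each candidate's confidence value with a score from an additional knowledge source and then re-sorting the list. The Python example below is an assumption-laden illustration: the function `rescore`, the linear weighting, and the sample scores are hypothetical, and the sketch does not represent any specific prior algorithm or the method of the present disclosure.

```python
def rescore(n_best, knowledge_score, weight=0.5):
    """Reorder (text, confidence) pairs using an extra knowledge source.

    The combined score is a linear interpolation (an assumed scheme)
    between the engine confidence and the knowledge-source score.
    """
    rescored = [
        (text, (1.0 - weight) * conf + weight * knowledge_score(text))
        for text, conf in n_best
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Illustrative N-best list: the top candidate is acoustically plausible
# but unlikely according to the (hypothetical) knowledge source.
n_best = [
    ("wreck a nice beach", 0.52),
    ("recognize speech", 0.48),
]
domain_scores = {"wreck a nice beach": 0.1, "recognize speech": 0.9}

reordered = rescore(n_best, lambda text: domain_scores.get(text, 0.0))
```

Here the candidate originally ranked second moves to the top of the list after rescoring, which is the kind of improvement over always using the top candidate that such post-processing aims for.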
The accuracy of a speech recognition system is dependent on a number of factors. One such factor is the context of a user spoken utterance. In some situations, for example where the user is asked to spell a word, phrase, number, or an alphanumeric string, little contextual information is available to aid in the recognition process. In these situations, the recognition of individual letters or numbers, as opposed to words, can be particularly difficult because of the reduced contextual references available to the speech recognition system. This can be particularly acute in a spelling context, such as where a user provides the spelling of a name. In other situations, such as a user specifying a password, the characters can be part of a completely random alphanumeric string. In that case, a contextual analysis of previously recognized characters offers little, if any, insight as to subsequent user speech.
Recognizing the names of the letters of the alphabet is known to be difficult for speech systems, yet it is also very important in speech systems where spelling is needed (e.g., to capture new names of entities such as persons or places). In current speech systems that are not tuned to any particular user's voice, the only way to reliably capture letter names is to use proxies for the letter names (e.g., “alpha” represents “a”, “bravo” represents “b”, and so forth). The longer phonetic value of the proxies makes them easier to distinguish from one another. The drawback for commercial systems is that the user cannot reasonably be expected to memorize an arbitrary list of proxies. Spelling is a desired feature in speech systems because the larger problem of recognizing arbitrary entity names, such as person or place names, is even more difficult.
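For illustration, the proxy approach amounts to a lookup from recognized proxy words to the letters they stand for. The short Python sketch below assumes, purely as an example, that the proxies follow the NATO phonetic alphabet beyond the two entries named above; the table contents and the function `decode_spelling` are hypothetical.

```python
# Partial proxy table. "alpha" and "bravo" come from the text above; the
# remaining entries are assumed to follow the NATO phonetic alphabet.
PROXY_TO_LETTER = {
    "alpha": "a",
    "bravo": "b",
    "charlie": "c",
    "delta": "d",
}


def decode_spelling(proxies):
    """Convert a sequence of recognized proxy words into a spelled string."""
    return "".join(PROXY_TO_LETTER[p.lower()] for p in proxies)


spelled = decode_spelling(["Delta", "alpha", "bravo"])
```

The lookup itself is trivial; the burden falls on the user, who must know every proxy in the table in order to spell a word.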
Other computer based recognition systems face similar issues when attempting to increase recognition accuracy. It is desirable to increase recognition accuracy without requiring the user to provide input to the computer based recognition engine via particular languages or symbols defined for the specific computer based recognition system.