1. Field of the Invention
The present invention is directed to text generation systems, such as speech-to-text, automatic character recognition (e.g., OCR) and fact extraction systems and, more particularly, to producing more meaningful confidence scores for text that is generated by such systems.
2. Description of the Related Art
In general, spoken document retrieval (SDR) is composed of two stages: transcription of speech and information retrieval (IR). Transcription of the speech is often referred to as speech-to-text (STT) or automatic speech recognition (ASR), and is often performed using a large vocabulary continuous speech recognizer (LVCSR). Information retrieval (IR) is a general term referring to all forms of data mining. One common form of data mining, for example, is query-based retrieval, where, based on a user's query, documents are retrieved and presented to the user, ordered by an estimated measure of their relevance to the query. Traditionally, this stage is performed on the text output of the first stage.
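By way of illustration only, the query-based retrieval described above can be sketched as follows. The scoring rule (the fraction of a document's words that match a query term) is an assumption chosen for brevity and is not drawn from any particular IR system. The function name `rank_documents` and the sample transcripts are hypothetical.

```python
def rank_documents(query, documents):
    """Order documents by a simple estimated relevance to the query.

    Relevance here is the fraction of a document's tokens that match a
    query term. This is an illustrative assumption, not a production measure.
    """
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        matches = sum(1 for token in tokens if token in query_terms)
        score = matches / len(tokens) if tokens else 0.0
        scored.append((doc_id, score))
    # Highest estimated relevance first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical transcripts produced by the first (transcription) stage.
transcripts = {
    "call_1": "my flight was delayed again",
    "call_2": "please reset my password",
    "call_3": "flight delayed flight cancelled",
}
ranking = rank_documents("flight delayed", transcripts)
```

In this sketch the ranking operates purely on the text output of the transcription stage, so any transcription error directly changes the estimated relevance.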
In transcribing spoken words to text, there is always a question of whether the words are transcribed correctly, particularly when the transcription is obtained automatically by an ASR system. The most accurate large vocabulary ASR systems receive clear voice signals and are trained, in a time-consuming process, to recognize the speech of each individual who uses the system. In applications that have numerous users, many of whom may use the system only once without first training it, and that receive low-grade audio signals, such as those obtained via a telephone system, transcription is difficult and the resulting accuracy is low.
To improve the accuracy of transcription or speech recognition in applications with many users for whom the system has not been trained, the context of the speech is commonly used. For example, in an interactive voice response (IVR) system that has speech output as well as input, communication with the system typically uses a very small vocabulary, often just “yes” or “no”. When more words may be included, a syntax may restrict recognition to certain words in a predefined order, such as “City, Boston” or “City, Chicago”. An example where a larger vocabulary is used is the transcription of communication between air traffic controllers and aircraft cockpits, which follows a predictable pattern. Because the pattern is known, it is possible to produce an ASR system that generates more accurate transcriptions of air traffic control communications than a general-purpose ASR system could.
However, there are many potential applications of ASR for which it is difficult to determine the rules that are followed in conversations, if any rules exist. LVCSRs address this problem by approximating conversational speech with a Markovian model, in which the probability that each word appears is determined by the last few words that were uttered.
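As a minimal sketch of such a Markovian approximation, the example below estimates a bigram model, in which the probability of each word depends only on the single preceding word. Practical LVCSRs typically condition on more than one preceding word and apply smoothing; both are omitted here. The function names, the `<s>` start-of-sentence marker and the three-sentence corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a hypothetical marker for the start of a sentence.
        tokens = ["<s>"] + sentence.lower().split()
        for previous, word in zip(tokens, tokens[1:]):
            pair_counts[previous][word] += 1
    model = {}
    for previous, counter in pair_counts.items():
        total = sum(counter.values())
        model[previous] = {w: count / total for w, count in counter.items()}
    return model


def word_probability(model, previous, word):
    """Return P(word | previous), or 0.0 for pairs never seen in training."""
    return model.get(previous, {}).get(word, 0.0)


# Hypothetical training corpus.
corpus = [
    "the flight is delayed",
    "the flight is cancelled",
    "the gate is open",
]
model = train_bigram_model(corpus)
p_flight_after_the = word_probability(model, "the", "flight")
```

Word pairs that never occur in the training corpus receive a probability of zero in this sketch, which illustrates how such a model may fit some parts of real conversations better than others.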
Most ASRs output recognition confidence scores or other additional information along with their text output. This output can then be used by IR systems that operate on the outputs of the ASR, as discussed in the concurrently filed application entitled METHOD FOR AUTOMATIC AND SEMI-AUTOMATIC CLASSIFICATION AND CLUSTERING OF NON-DETERMINISTIC TEXTS. For such systems, it is beneficial for the output of the ASR to be as rich and as accurate as possible, even in its non-textual outputs.
It would be possible to improve the operation of ASRs and of these client IR systems if a way could be found to augment and calibrate the outputs of ASRs, such as by automatically mapping how well various parts of the ASR's model fit real conversations and correcting the outputs accordingly. Furthermore, it would be beneficial if such augmentation and calibration could be done by a person who has no access to, or knowledge of, the internal operation of the ASR.
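To make the notion of calibration concrete, the sketch below shows one generic technique, histogram binning, which treats the ASR as a black box. It is offered only as an illustration and is not asserted to be the approach of the present invention. It assumes raw confidence scores in the range 0 to 1 and a set of words whose correctness has been checked against a reference transcript. The function names and the sample data are hypothetical.

```python
def fit_histogram_calibration(scores, correct, num_bins=5):
    """Map each confidence bin to the observed fraction of correct words.

    Uses only the ASR's outputs and reference labels, with no access to
    the recognizer's internals.
    """
    hits = [0] * num_bins
    totals = [0] * num_bins
    for score, is_correct in zip(scores, correct):
        index = min(int(score * num_bins), num_bins - 1)
        totals[index] += 1
        hits[index] += int(is_correct)
    # Assumption: a bin with no observations falls back to its midpoint.
    return [
        hits[i] / totals[i] if totals[i] else (i + 0.5) / num_bins
        for i in range(num_bins)
    ]


def calibrate(score, table):
    """Replace a raw confidence score with its bin's observed accuracy."""
    index = min(int(score * len(table)), len(table) - 1)
    return table[index]


# Hypothetical raw scores and correctness labels for five recognized words.
raw_scores = [0.95, 0.92, 0.91, 0.15, 0.12]
was_correct = [True, False, True, False, False]
table = fit_histogram_calibration(raw_scores, was_correct)
calibrated = calibrate(0.93, table)
```

Here the recognizer reported scores above 0.9 for three words, of which only two were correct, so a raw score of 0.93 is mapped to the observed accuracy of that bin, two thirds.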