The present invention relates generally to a system and method for classifying the confidence or quality of an automatically transcribed report or document.
Today's speech recognition technology enables a computer to transcribe spoken words into computer recognized text equivalents. Automatic Speech Recognition (ASR) is the process of converting an acoustic signal, captured by a transducer such as a microphone or a telephone, to a sequence of text words in a document. These words can be used for numerous applications including data entry and word processing. The development of speech recognition technology is primarily focused on accurate speech recognition.
The accuracy of a speech recognition system or a recognizer depends on many different variables including accents, regional language differences, subject matter and speech patterns. Because of this variability in accuracy, automatically transcribed documents typically require editing to correct errors made by the recognizer during transcription. In some cases, the error rate of a recognizer may be so high that editing a given document with low recognition accuracy requires more effort, time, and cost than having the document transcribed by a human transcriptionist in the first place. This dilemma often results in low consumer confidence in speech recognition systems or even abandonment of automatic speech recognition systems in environments where the recognizer accuracy is low.
As a result, Report Confidence Modeling (RCM) systems have been devised to rate and score a particular ASR system. A typical RCM system includes a mechanism to predict recognition accuracy by an ASR system. Predicted accuracy allows an ASR system to sort recognized documents based on their estimated accuracy (quality) and route them appropriately for further processing (for example, editing and formatting).
The idea of sorting recognized documents by predicted recognition accuracy comes from the assumption that editing recognized documents (correcting misrecognitions) provides productivity gains compared to typing if recognition quality is good (higher than a certain threshold). If recognition accuracy is not good enough, it is more efficient to type the document rather than correct misrecognitions. Accuracy of text generated by ASR can be predicted based on several factors, including, but not limited to, (a) the confidence values of recognized words; (b) the lexical classes of certain recognized words; (c) temporal characteristics of the recognized report; and (d) the speaker's historical behavior. RCM models may be static, factory models developed without reference to site-specific or user-specific data, or adapted models developed by collecting site-specific and user-specific data.
“Good” documents (i.e., those with high recognition accuracy) could be routed to transcriptionists or self-editing doctors for editing (error correction and formatting), while “bad” documents (i.e., those with low recognition accuracy) could be routed for further ASR processing or typed from scratch by a transcriptionist.
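The sorting and routing described above can be illustrated with a minimal sketch. It is not the implementation of any particular RCM system: the names `RecognizedDocument`, `predict_accuracy`, `route`, and `sort_and_route` are hypothetical, the 0.85 threshold is an arbitrary placeholder, and the predictor uses only factor (a), the confidence values of recognized words, by taking their mean as a stand-in for predicted accuracy.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class RecognizedDocument:
    """Hypothetical container for one automatically transcribed document."""
    doc_id: str
    word_confidences: List[float]  # per-word confidence values in [0, 1]


def predict_accuracy(doc: RecognizedDocument) -> float:
    """Assumed predictor: the mean word confidence stands in for accuracy.

    A real RCM model could also use lexical classes, temporal
    characteristics, and speaker history; those are omitted here.
    """
    if not doc.word_confidences:
        return 0.0  # no recognized words: treat as lowest quality
    return mean(doc.word_confidences)


def route(doc: RecognizedDocument, threshold: float = 0.85) -> str:
    """Return "edit" for a "good" document, "type" for a "bad" one."""
    return "edit" if predict_accuracy(doc) >= threshold else "type"


def sort_and_route(
    docs: List[RecognizedDocument], threshold: float = 0.85
) -> List[Tuple[str, str]]:
    """Sort documents by predicted accuracy (best first) and route each."""
    ranked = sorted(docs, key=predict_accuracy, reverse=True)
    return [(d.doc_id, route(d, threshold)) for d in ranked]
```

In this sketch, a document whose words were all recognized with high confidence would be sent to an editor, while one with uniformly low confidence would be sent for typing from scratch. The threshold would in practice be chosen from site-specific or user-specific data.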
There has been significant research on, and development of, confidence rating systems and measures for ASR systems. Some traditional confidence rating systems are based on the probability of the acoustic observation given the speech segment, normalized by the general probability of the acoustic observation. There have also been attempts to develop techniques for word confidence estimation that are independent of the architecture and operation of the word recognizer. Other confidence measurement systems use content level and semantic attributes, using the 10 best outputs of a speech recognizer and parsing the output with a phrase level grammar. Still others use out-of-vocabulary words and errors due to additive noise to produce an acoustic confidence measure.
A drawback of each of the above mentioned systems is that they focus on the confidence of an ASR system at the word level. This principle is known as dialogue management. Use of dialogue management is helpful to determine whether a particular statement has been reliably recognized and converted to text. This confidence rating can be combined with other tools to improve automatic transcription accuracy, such as a parser, to mitigate the dilemma of excessive editing of automatically transcribed documents. However, it is important to note again that these confidence measurements and tools are focused on a word level and are not combined to produce a document level confidence measurement.
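The normalized acoustic measure mentioned above can be written compactly as follows. The symbols are assumptions introduced here for illustration: O denotes the acoustic observation, s the hypothesized speech segment, and C(s) the resulting confidence value.

```latex
C(s) = \frac{p(O \mid s)}{p(O)}
```

Under this notation, a value of C(s) well above 1 indicates that the observation is much more likely under the hypothesized segment than in general, while a value near or below 1 indicates weak acoustic support for the hypothesis.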
Additional attempts to control the amount of editing include identifying certain speakers and accents that are poorly recognized by a recognizer or speech recognition engine. Those speakers can be identified in advance and dictations by those speakers may be routed directly to a human transcriptionist rather than a speech recognition system.
Despite this, traditional speech recognition systems suffer from an all-or-nothing condition. In other words, traditional systems are incapable of determining, in advance of editing an automatically transcribed document, the most efficient and cost-effective workflow. This latter principle is known as document management. As stated above, traditional confidence measurement systems are limited to dialogue management and cannot determine whether it is more effective to employ a recognizer and subsequently edit the automatically transcribed document or to abandon the recognizer and the automatically transcribed document entirely for a traditional, strictly human transcription approach.
Therefore, there exists a need for a system and method of determining a document level confidence measurement for an entire automatically transcribed document. The document level confidence measurement may include not just quantifying the quality of the automatic transcription, but also quantifying additional factors affecting how easily the document may be edited or transcribed.
There also exists a need for a system and method for optimizing the workflow of transcribing dictations based on the document level confidence measurement.
What is also needed is a system and method for, based on the document level confidence, determining the more efficient of two options: editing an automatically transcribed document or abandoning the automatically transcribed document for the traditional human transcription.
What is further needed is a speech application to implement a strategy of combining self editing and transcriptionists in a cost-effective manner.