In traditional transcription systems, a speaker is identified and recorded onto a recording medium, which may be either analog or digital. A speaker may dictate reports and the like into the system using a traditional recording device such as a standard telephone, a hand-held recording device, or a microphone connected to a computer. The recording or audio file may be transmitted physically or electronically to a central database known as a voice server. A voice server may store multiple records for multiple speakers and may be located offsite from the speaker, for example on the premises of a professional database administrator. An audio file may be routed from the voice server to a predetermined transcriptionist, or a recording may be physically transmitted to a transcriptionist, who may transcribe it in the traditional way by listening to the full audio file or recording and typing up a new document.
Such transcription systems are quite common in the medical field. In the medical field, it is often critical that transcriptions be legible, accurate, and completed in a timely manner. One common way medical transcripts are generated is that a physician or other medical professional dictates a report over a telephone line into a central recording system. A medical transcriptionist (“MT”), who may be employed by a hospital, clinic, or third-party transcription service, may access the central recording system via telephone, computer, or other traditional means. The physician-dictated recording may be played back as the MT transcribes a medical document. Once the medical document is complete, it may be forwarded to the physician or medical professional for final editing. This method has been found to be costly and time consuming. Given the nature of the information contained in medical records, it is important to transcribe medical records quickly, efficiently, and accurately.
A drawback of traditional transcription methods as applied to the medical field is that the recordings that MTs must listen to, and type documents from, tend to have low audio quality. Telephone systems typically are not optimized to produce high quality audio and instead tend to produce very poor quality audio. Physicians and medical professionals may also be dictating in a noisy environment such as a busy hospital or clinic, resulting in a large amount of background noise. Consequently, there is often significant variance in the audio quality of the recordings that require transcription.
Additionally, transcribed medical records produced from audio files created by medical professionals often require a great deal of editing in order to provide a specified medical document. The nature of medical documents typically requires customized text formatting, placement and organization of substantive text, and recognition of technical words not required in other professions. General patient history information may also be included in a particular document, with customized formatting and text placement, before the physician is provided with a final document. Therefore, the MT may not type a literal or truthful transcript of the recording, but rather may produce a formatted document, often referred to as a finished document. The customization needed to produce finished documents, which many medical documents require, increases the cost of transcription and the time associated with producing such documents.
Modern transcription methods often incorporate the use of an Automatic Speech Recognition (“ASR”) system in which a digital audio file undergoes analysis by a computer software program commonly known as a recognition engine, which produces a text document from the audio file. ASR systems map an acoustic signal generated by spoken words to the string of words that most likely represents the spoken words. The underlying techniques to perform this mapping are data-driven statistical pattern-recognition methods.
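The "most likely" mapping described above is commonly formalized as follows. The symbols are notation introduced here for illustration and do not appear in the original text: X denotes the sequence of acoustic observations and W a candidate word string. This is the textbook statistical formulation, offered as a sketch rather than as the formulation used by any particular recognition engine.

```latex
\[
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        \;=\; \arg\max_{W} P(X \mid W)\, P(W)
\]
```

The denominator P(X) does not depend on W and can therefore be dropped from the maximization. The term P(X | W) is what is usually estimated by an acoustic model, and P(W) by a language model.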
A typical ASR system consists of five basic components: (a) a signal processor module, (b) a decoder module, (c) an adaptation module, (d) language models, and (e) acoustic models. The signal processor module extracts feature vectors from the voice signal for the decoder. The decoder uses both acoustic and language models to generate the word sequence that is most likely for the given feature vectors. The feature vectors and resulting word sequences can also provide information used by the adaptation module to modify either the acoustic or language models for improved future speech recognition performance.
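The division of labor between the signal processor and the decoder can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names, the single energy feature per frame, and the idea of scoring a small fixed list of candidate word sequences are hypothetical simplifications. A real decoder searches a very large hypothesis space, and the adaptation module is omitted here.

```python
from typing import Callable, List, Sequence, Tuple

Words = Tuple[str, ...]


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Toy signal processor: one feature (mean energy) per non-overlapping frame.

    Hypothetical stand-in for real feature extraction; trailing samples that
    do not fill a whole frame are ignored.
    """
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def decode(
    features: Sequence[float],
    candidates: Sequence[Words],
    acoustic_logp: Callable[[Sequence[float], Words], float],
    language_logp: Callable[[Words], float],
) -> Words:
    """Toy decoder: return the candidate word sequence with the highest
    combined acoustic and language log-score for the given features."""
    return max(
        candidates,
        key=lambda words: acoustic_logp(features, words) + language_logp(words),
    )


# Hypothetical, hand-written scores standing in for trained models.
ACOUSTIC_SCORES = {
    ("wreck", "a", "nice", "beach"): -10.0,
    ("recognize", "speech"): -10.5,
}
LANGUAGE_SCORES = {
    ("wreck", "a", "nice", "beach"): -9.0,
    ("recognize", "speech"): -4.0,
}


def toy_acoustic_logp(features: Sequence[float], words: Words) -> float:
    # A real acoustic model would score the features; this one ignores them.
    return ACOUSTIC_SCORES[words]


def toy_language_logp(words: Words) -> float:
    return LANGUAGE_SCORES[words]
```

In this toy setup the acoustic scores slightly favor the first candidate, but the language score tips the combined total toward the second, which is the kind of interaction between the two models that the paragraph above describes.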
Different applications of speech recognition technology place different constraints on these ASR systems and require different algorithms. ASR systems used for transcription typically are Large-Vocabulary Continuous Speech Recognition (LVCSR) systems with vocabularies ranging from roughly 5,000 to 50,000 words. The term “continuous” denotes that the speech contains words that are run together as in natural speech (in contrast to “isolated word” speech recognition, in which each word is surrounded by pauses). ASR systems used for transcription usually are “speaker independent.” Speaker-independent systems can recognize speech from a speaker whose speech has never been presented to the system before. Recognition can be improved by adapting the speaker-independent acoustic models to more closely model an individual speaker's voice, thus creating a speaker-dependent model, and by adding user-specific or site-specific vocabulary and word usage to the language models, thus creating topic language models. Although ASR systems may improve overall efficiency in modern transcription methods, it has been found that not all speakers are good candidates for transcription systems that apply ASR methods.
In cases where a speaker may not be a good candidate for ASR, it has been found that traditional manual transcription or a combination of manual and ASR methods may be more efficient. However, in order to determine which method of transcription is best suited for a particular speaker, a modern transcription system incorporating ASR must typically be implemented and the results of the transcription must be analyzed. An ASR transcription system is a costly investment, which should be made only when it has been determined that enough of the physicians or medical professionals are well suited to use such a system for it to be economical. Even when it is known that a number of users of a transcription system are well suited to use ASR, the efficiency of the transcription system can be improved if it can be determined which of the users are well suited for ASR. There is thus a need for a system or method of determining whether a speaker is well suited for ASR-aided transcription.
Prior methods for determining whether a speaker is suitable for an ASR system required the generation or creation of documents whose sole purpose was scoring. These methods required a speaker to read predetermined text to create a voice file that was transcribed by ASR, and the transcription was compared to the original document from which the speaker read. Other prior methods involved a simple manual comparison of a transcript produced by a transcriptionist with an ASR-produced transcript. A subjective evaluation of whether there were too many discrepancies between the two was the only measure of speaker suitability. Such prior systems produce no objective estimates of the amount of work necessary to transform recognized text into a finished document, and do not include multiple evaluations. Thus there is a need in the art for an automated and objective test of whether a speaker is suitable for ASR in order to provide estimates of the amount of work necessary to transform recognized text, where the test may include multiple evaluations.
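To make concrete what an objective comparison between an ASR-produced transcript and a transcriptionist-produced transcript might look like, the following minimal sketch computes a word-level edit distance and the resulting word error rate. This is a widely used general-purpose metric, shown here only as an illustration; it is not presented as the test the text calls for. The function names and the sample sentences are hypothetical.

```python
from typing import Sequence


def word_edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference` (Levenshtein distance)."""
    # prev[j] holds the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j] + 1,                           # reference word missing from hypothesis
                cur[j - 1] + 1,                        # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference_text: str, hypothesis_text: str) -> float:
    """Edit distance divided by the number of reference words."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference text must contain at least one word")
    return word_edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a transcriptionist's text versus recognized text.
manual = "the patient denies chest pain"
recognized = "the patient and eyes chest pain"
rate = word_error_rate(manual, recognized)
```

A literal word-level comparison of this kind ignores the formatting and reorganization that distinguish a finished document from a literal transcript, so it would only approximate the editing effort discussed in this section.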
Furthermore, methods involving transcript comparison in the medical field have been found to be inefficient because many medical documents require customized editing, text location and placement, and text of a technical nature. It has been found that transcription time in these cases is significantly increased, along with the time and cost of manually evaluating a large number of documents produced by such a system, without a guaranteed measure of the accuracy of the transcription.
Another prior method of determining which method of transcription best suits a particular speaker incorporates analysis of audio file signal quality. This method has been found to be inaccurate because factors other than signal quality, such as background noise or a speaker's voice tone, inflection, or accent, may contribute to overall poor audio quality. Although the level of recognition accuracy and the signal quality are both necessary to determine the best method of transcription for a particular speaker, they do not reflect all of the information needed to confirm that a certain type of speech recognition is the most efficient method of producing a document.