It is desirable in many contexts to generate a structured textual document based on human speech. In the legal profession, for example, transcriptionists transcribe testimony given in court proceedings and in depositions to produce a written transcript of the testimony. Similarly, in the medical profession, transcripts are produced of diagnoses, prognoses, prescriptions, and other information dictated by doctors and other medical professionals. Transcripts in these and other fields typically need to be highly accurate, where accuracy is measured as the degree of correspondence between the semantic content (meaning) of the original speech and that of the resulting transcript. Such accuracy is needed because of the reliance placed on the transcripts and the harm that could result from an inaccuracy (such as providing an incorrect prescription drug to a patient).
It may be difficult to produce an initial transcript that is highly accurate for a variety of reasons, such as variations in: (1) features of the speakers whose speech is transcribed (e.g., accent, volume, dialect, speed); (2) external conditions (e.g., background noise); (3) the transcriptionist or transcription system (e.g., imperfect hearing or audio capture capabilities, imperfect understanding of language); or (4) the recording/transmission medium (e.g., paper, analog audio tape, analog telephone network, compression algorithms applied in digital telephone networks, and noises/artifacts due to cell phone channels).
The first draft of a transcript, whether produced by a human transcriptionist or an automated speech recognition system, may therefore include a variety of errors. Typically it is necessary to proofread and edit such draft documents to correct the errors contained therein. Transcription errors that need correction may include, for example, any of the following: missing words or word sequences; excessive wording; misspelled, mistyped, or misrecognized words; missing or excessive punctuation; and incorrect document structure (such as incorrect, missing, or redundant sections, enumerations, paragraphs, or lists).
In some circumstances, however, a verbatim transcript is not desired. In fact, transcriptionists may intentionally introduce a variety of changes into the written transcription. A transcriptionist may, for example, filter out spontaneous speech effects (e.g., pause fillers, hesitations, and false starts), discard irrelevant remarks and comments, convert data into a standard format, insert headings or other explanatory materials, or change the sequence of the speech to fit the structure of a written report.
Furthermore, formatting requirements may make it necessary to edit even phrases that have been transcribed correctly so that such phrases comply with the formatting requirements. For example, abbreviations and acronyms may need to be fully spelled out. This is one example of a kind of “editing pattern” that may need to be applied even in the absence of a transcription error.
Such error correction and other editing is often performed by human proofreaders and can be tedious, time-consuming, costly, and itself error-prone. In some cases, attempts are made to detect and correct errors using automatically generated statistical measures of the uncertainty of the draft-generation process. For example, both natural language processors (NLPs) and automatic speech recognizers (ASRs) produce such “confidence measures.” These confidence measures, however, are often unreliable, thereby limiting the usefulness of the error detection and correction techniques that rely on them.
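The limitation of confidence-based error detection can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Word` type, the `flag_low_confidence` function, the 0.5 threshold, and the confidence values are invented and do not come from any particular recognizer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical recognizer score in [0, 1]


def flag_low_confidence(words, threshold=0.5):
    """Return the words whose confidence falls below the threshold."""
    return [w.text for w in words if w.confidence < threshold]


# Illustrative draft (made-up scores). Suppose the speaker actually said
# "hypertension" but the recognizer produced "hypotension" with a high score.
draft = [
    Word("patient", 0.95),
    Word("has", 0.91),
    Word("hypotension", 0.93),  # wrong word, yet confidently recognized
    Word("and", 0.40),          # correct word, yet low confidence
    Word("asthma", 0.88),
]

flagged = flag_low_confidence(draft)
print(flagged)
```

In this made-up draft, the only flagged word is a correct one, while the misrecognized word passes unflagged. A proofreader relying solely on such flags would miss the error, which is the kind of unreliability described above.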
Furthermore, it may be desirable for a report or other structured document to include not only text but also data. In such a case the goal is not merely to capture spoken words as text, but also to extract data from those words and to include the data in the report. The data, although included in the report, may or may not be explicitly displayed to the user when the document is rendered. Even if not displayed to the user, the computer-readable nature of the data makes it useful for various kinds of processing that would be difficult or impossible to perform on bare text.
Consider, for example, a draft report generated from the free-form speech of a doctor. Such a draft report may include both: (1) a textual transcript of the doctor's speech, and (2) codes (also referred to as “tags” or “annotations”) that annotate the transcribed speech. Such codes may, for example, take the form of XML tags.
The doctor's speech may be “free-form” in the sense that the structure of the speech may not match the desired structure of the written report. When dictating, doctors (and other speakers) typically only hint at or imply the structure of the final report. Such “structure” includes, for example, the report's sections, paragraphs, and enumerations. Although an automated system may attempt to identify the document structure implied by the speech, and to create a report having that structure, such a process is error-prone. The system may, for example, put the text corresponding to particular speech in the wrong section of the report.
Similarly, the system may misclassify text, for example by treating it as describing an allergy when it in fact corresponds to some other kind of data. Such an error would be reflected in the document by an incorrect coding being applied to the text. Consider, for example, the sentence fragment “penicillin causes hives.” This text may be coded incorrectly by, for example, coding the text “penicillin” as a current medication rather than as an allergen.
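The “penicillin causes hives” example can be made concrete with a short sketch using Python's standard `xml.etree.ElementTree` module. The tag names (`report`, `allergy`, `allergen`, `reaction`, `medication`) and the `status` attribute are hypothetical, chosen only for illustration; they are not taken from any actual coding system.

```python
import xml.etree.ElementTree as ET

# Hypothetical encodings of the same transcribed fragment.
CORRECT = (
    "<report><allergy><allergen>penicillin</allergen> causes "
    "<reaction>hives</reaction></allergy></report>"
)
MISCODED = (
    '<report><medication status="current">penicillin</medication>'
    " causes hives</report>"
)


def extract_codes(xml_text):
    """Return (tag, text) pairs for each coded element that directly holds text."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, el.text.strip())
        for el in root.iter()
        if el is not root and el.text and el.text.strip()
    ]


def rendered_text(xml_text):
    """Return the plain text a reader would see, with all codes stripped."""
    return "".join(ET.fromstring(xml_text).itertext())


print(rendered_text(CORRECT))   # penicillin causes hives
print(rendered_text(MISCODED))  # penicillin causes hives
print(extract_codes(CORRECT))   # [('allergen', 'penicillin'), ('reaction', 'hives')]
print(extract_codes(MISCODED))  # [('medication', 'penicillin')]
```

Both encodings render as identical text, so proofreading the visible transcript alone would not reveal that “penicillin” carries the wrong code in the second one. The difference appears only in the extracted data.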
When data are extracted from speech, it is desirable that such data be coded accurately. Some existing systems that extract data from speech to produce structured documents, however, do not provide a mechanism for the accuracy of the extracted data to be human-verified, thereby limiting the confidence with which such documents may be relied upon.
Some systems allow the accuracy of extracted data to be verified, but only do so as a separate work step after the textual content of the document has been verified for speech recognition errors. This data verification process involves displaying the extracted codes themselves, which makes the verification process difficult due to the complexities of the coding systems, such as the Controlled Medical Vocabulary (CMV) coding system, that are commonly used to encode data in documents. Such existing techniques for verifying extracted data are therefore of limited utility.
What is needed, therefore, are improved techniques for verifying the correctness of data extracted from speech into documents.