Automatic speech-to-text systems convert spoken dictation into text. In one typical application, an author (e.g. a doctor or a lawyer) dictates information into a telephone handset or a portable recording device. The speech-to-text system then automatically processes the dictation audio to create a draft text document. Optionally, a human transcriptionist then verifies the accuracy of the document and fixes occasional errors. Typically, authors want to spend as little time as possible dictating. They usually focus only on the content and rely on the transcriptionist to compose a readable, syntactically correct, stylistically acceptable, and formally compliant document. For this reason, there is a considerable discrepancy between what the speaker literally said and the final document.
In the specific application of medical dictation, there are many kinds of differences between the literal dictated speech and the final document, including, for example:
Punctuation marks are typically not dictated.
No instructions on the formatting of the report are dictated.
Frequently, section headings are only implied. (e.g. “vitals are” becomes “PHYSICAL EXAMINATION: VITAL SIGNS:”)
In enumerated lists, speakers typically use phrases like “number one . . . next number . . . ”, which need to be turned into “1 . . . 2 . . . ”
The dictation usually begins with a preamble (e.g. “This is doctor XYZ . . . ”) which does not appear in the final report. Similarly, there are typically phrases at the end of the dictation which should not be transcribed (e.g. “End of dictation. Thank you.”)
There are specific standards regarding the use of medical terminology—transcriptionists frequently expand dictated abbreviations (e.g. “CVA” becomes “cerebrovascular accident”) or otherwise use equivalent but different terms (e.g. “nonicteric sclerae” becomes “no scleral icterus”)
The dictation typically has a more narrative style (e.g. “She has no allergies.”, “I examined him”). In contrast, the final report is normally more impersonal and structured (e.g. “ALLERGIES: None.”, “he was examined”).
For the sake of brevity, speakers frequently omit function words. (“patient” vs. “the patient”, “denies fever pain” vs. “he denies any fever or pain”)
Because the dictation is spontaneous, disfluencies are quite frequent, in particular false starts, corrections, and repetitions. (e.g. “22-year-old female, sorry, male 22-year-old male” vs. “22-year-old male”)
The dictation may contain instructions to the transcriptionist, as well as so-called normal reports: pre-defined text templates that are invoked by a short phrase like “This is a normal chest x-ray.”
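To make the kinds of mappings listed above concrete, the following is a minimal sketch of a rule-based normalizer. The rules, their ordering, and the function name `normalize` are all illustrative assumptions; a real transformation model would need to learn such mappings from data rather than enumerate a handful of hand-written patterns.

```python
import re

# Hypothetical, hand-written rewrite rules illustrating a few of the
# dictation-to-document discrepancies described in the text. Each rule is
# (pattern, replacement); rules are applied in order.
RULES = [
    (r"^this is doctor \w+\s*", ""),                   # strip the preamble
    (r"\s*end of dictation\.?( thank you\.?)?$", ""),  # strip the closing phrase
    (r"\bvitals are\b", "PHYSICAL EXAMINATION: VITAL SIGNS:"),  # implied heading
    (r"\bnumber one\b", "1."),                         # enumeration (simplified:
    (r"\bnext number\b", "2."),                        # handles only two items)
    (r"\bCVA\b", "cerebrovascular accident"),          # expand abbreviation
    (r"(?<!the )\bpatient\b", "the patient"),          # restore omitted function word
]

def normalize(dictation: str) -> str:
    """Apply the illustrative rewrite rules to a verbatim transcript."""
    text = dictation.strip()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.strip()
```

For example, `normalize("This is doctor Smith vitals are blood pressure 120 over 80 end of dictation thank you")` yields `"PHYSICAL EXAMINATION: VITAL SIGNS: blood pressure 120 over 80"`. Even this toy version shows why fixed rules fall short: the enumeration rule breaks for lists longer than two items, and context-sensitive phenomena such as disfluency repair cannot be captured by local substitutions at all.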
In addition to the above, speech recognition output contains certain recognition errors, some of which may occur systematically. Other application domains (e.g. law) may show different or additional discrepancies, such as instructions to insert an address or a legal citation.
These phenomena pose a problem that goes beyond literal speech recognition itself. The speech recognizer is meant to produce an accurate verbatim transcription of the recorded utterance. But even with a perfectly accurate verbatim transcript of the user's utterances, the transcriptionist would still need to perform a significant amount of editing to obtain a document that conforms to the customary standards. Preferably, this manual editing should be reduced as far as possible. We refer to such efforts to transform the unstructured speech recognition output into well-formed, structured document text as transformation modeling. Transformation modeling also has the general capacity to correct some of the systematic speech recognition errors.