1. Field of the Invention
The present invention relates to automatic speech recognition and, more particularly, to techniques for automatically transcribing speech.
2. Related Art
It is desirable in many contexts to generate a written document based on human speech. In the legal profession, for example, transcriptionists transcribe testimony given in court proceedings and in depositions to produce a written transcript of the testimony. Similarly, in the medical profession, transcripts are produced of diagnoses, prognoses, prescriptions, and other information dictated by doctors and other medical professionals. Transcripts in these and other fields typically need to be highly accurate, as measured by the degree of correspondence between the semantic content (meaning) of the original speech and the semantic content of the resulting transcript, because of the reliance placed on the resulting transcripts and the harm that could result from an inaccuracy (such as providing an incorrect prescription drug to a patient). High degrees of accuracy may, however, be difficult to obtain consistently for a variety of reasons, such as variations in: (1) features of the speakers whose speech is transcribed (e.g., accent, volume, dialect, speed); (2) external conditions (e.g., background noise); (3) the transcriptionist or transcription system (e.g., imperfect hearing or audio capture capabilities, imperfect understanding of language); or (4) the recording/transmission medium (e.g., paper, analog audio tape, analog telephone network, compression algorithms applied in digital telephone networks, and noises/artifacts due to cell phone channels).
At first, transcription was performed solely by human transcriptionists who would listen to speech, either in real-time (i.e., in person by “taking dictation”) or by listening to a recording. One benefit of human transcriptionists is that they may have domain-specific knowledge, such as knowledge of medicine and medical terminology, which enables them to interpret ambiguities in speech and thereby to improve transcript accuracy. Human transcriptionists, however, have a variety of disadvantages. For example, human transcriptionists produce transcripts relatively slowly and are subject to decreasing accuracy over time as a result of fatigue.
Various automated speech recognition systems exist for recognizing human speech generally and for transcribing speech in particular. Speech recognition systems which create transcripts are referred to herein as “automated transcription systems” or “automated dictation systems.” Off-the-shelf dictation software, for example, may be used by personal computer users to dictate documents in a word processor as an alternative to typing such documents using a keyboard.
Automated dictation systems typically attempt to produce a word-for-word transcript of speech. Such a transcript, in which there is a one-to-one mapping between words in the spoken audio stream and words in the transcript, is referred to herein as a “verbatim transcript.” Automated dictation systems are not perfect and may therefore fail to produce perfect verbatim transcripts.
In some circumstances, however, a verbatim transcript is not desired. In fact, transcriptionists may intentionally introduce a variety of changes into the written transcription. A transcriptionist may, for example, filter out spontaneous speech effects (e.g., pause fillers, hesitations, and false starts), discard irrelevant remarks and comments, convert data into a standard format, insert headings or other explanatory materials, or change the sequence of the speech to fit the structure of a written report.
In the medical domain, for example, spoken reports produced by doctors are frequently transcribed into written reports having standard formats. For example, referring to FIG. 1B, an example of a structured and formatted medical report 111 is shown. The report 111 includes a variety of sections 112-138 which appear in a predetermined sequence when the report 111 is displayed. In the particular example shown in FIG. 1B, the report includes a header section 112, a subjective section 122, an objective section 134, an assessment section 136, and a plan section 138. Sections may include text as well as sub-sections. For example, the header section 112 includes a hospital name section 120 (containing the text “General Hospital”), a patient name section 114 (containing the text “Jane Doe”), a chart number section 116 (containing the text “851D”), and a report date section 118 (containing the text “10/1/1993”).
Similarly, the subjective section 122 includes various subjective information about the patient, included both in text and in a medical history section 124, a medications section 126, an allergies section 128, a family history section 130, and a social history section 132. The objective section 134 includes various objective information about the patient, such as her weight and blood pressure. Although not illustrated in FIG. 1B, the objective section 134 may include sub-sections for containing the illustrated information. The assessment section 136 includes a textual assessment of the patient's condition, and the plan section 138 includes a textual description of a plan of treatment.
Note that information may appear in a different form in the report 111 from the form in which such information was spoken by the dictating doctor. For example, the date in the report date section 118 may have been spoken as “october first nineteen ninety three,” “the first of october ninety three,” or in some other form. The transcriptionist, however, transcribed such speech using the text “10/1/1993” in the report date section 118, perhaps because the hospital specified in the hospital name section 120 requires that dates in written reports be expressed in such a format.
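By way of illustration only, the conversion of a spoken date into the written form “10/1/1993” may be sketched as follows. The function name normalize_date, the word tables, and the default century are assumptions made for this sketch; it handles only days spoken as “first” through “tenth” and does not represent the method of any particular transcription system.

```python
# Hypothetical sketch: map a spoken date to the M/D/YYYY form used in report 111.
MONTHS = {name: i for i, name in enumerate(
    "january february march april may june july august "
    "september october november december".split(), start=1)}
SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
# Assumption: only the ordinals "first" through "tenth" are recognized as days.
ORDINALS = {w: i for i, w in enumerate(
    "first second third fourth fifth sixth seventh eighth ninth tenth".split(),
    start=1)}


def _below_100(words):
    """Sum number words such as ['ninety', 'three'] into an integer."""
    return sum(TENS.get(w, 0) + SMALL.get(w, 0) for w in words)


def normalize_date(spoken, default_century=1900):
    """Return 'M/D/YYYY' for a spoken date; words such as 'the' and 'of' are ignored."""
    words = spoken.lower().split()
    month = next((MONTHS[w] for w in words if w in MONTHS), None)
    day = next((ORDINALS[w] for w in words if w in ORDINALS), None)
    year_words = [w for w in words if w in SMALL or w in TENS]
    if month is None or day is None or not year_words:
        raise ValueError(f"cannot interpret spoken date: {spoken!r}")
    if len(year_words) >= 2 and year_words[0] in SMALL:
        # A leading word such as "nineteen" is read as the century.
        year = SMALL[year_words[0]] * 100 + _below_100(year_words[1:])
    else:
        # Assumption: a two-digit year belongs to default_century.
        year = default_century + _below_100(year_words)
    return f"{month}/{day}/{year}"


print(normalize_date("october first nineteen ninety three"))
print(normalize_date("the first of october ninety three"))
```

Both spoken forms quoted above map to the same written text under these assumptions, which is the sense in which the report is not a verbatim transcript.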
Similarly, information in the medical report 111 may not appear in the same sequence as in the original audio recording, due to the need to conform to a required report format or for some other reason. For example, the dictating physician may have dictated the objective section 134 first, followed by the subjective section 122, and then by the header section 112. The written report 111, however, contains the header section 112 first, followed by the subjective section 122, and then the objective section 134. Such a report structure may, for example, be required for medical reports in the hospital specified in the hospital name section 120.
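The re-sequencing described above may be sketched, purely as an illustration, as a sort of dictated sections into a predefined report order. The names REPORT_ORDER and order_sections and the placeholder section texts are hypothetical; the section order itself follows the report 111 of FIG. 1B.

```python
# Hypothetical sketch: arrange dictated sections in the order required by the report.
REPORT_ORDER = ["header", "subjective", "objective", "assessment", "plan"]


def order_sections(dictated):
    """Sort (section_name, text) pairs into the predefined report order."""
    rank = {name: i for i, name in enumerate(REPORT_ORDER)}
    return sorted(dictated, key=lambda pair: rank[pair[0]])


# Sections in the order the physician dictated them (placeholder text).
dictated = [
    ("objective", "<objective text>"),
    ("subjective", "<subjective text>"),
    ("header", "<header text>"),
]
for name, text in order_sections(dictated):
    print(name, text)
```

The sketch presumes that each span of speech has already been assigned to a section, which is itself a non-trivial part of the problem.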
The beginning of the report 111 may have been generated based on a spoken audio stream such as the following: “this is doctor smith on uh the first of october um nineteen ninety three patient ID eighty five one d um next is the patient's family history which i have reviewed . . . .” It should be apparent that a verbatim transcript of this speech would be difficult to understand and would not be particularly useful.
Note, for example, that certain words, such as “next is the,” do not appear in the written report 111. Similarly, pause-filling utterances such as “uh” do not appear in the written report 111. In addition, the written report 111 organizes the original speech into the predefined sections 112-138 by re-ordering the speech. As these examples illustrate, the written report 111 is not a verbatim transcript of the dictating physician's speech.
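The omission of pause-filling utterances may be illustrated with a minimal sketch. The filler set below contains only the two utterances appearing in the example speech above; the name strip_fillers is hypothetical, and a practical system would likely need a broader, context-sensitive treatment.

```python
# Hypothetical sketch: drop pause-filling utterances from a verbatim word sequence.
FILLERS = {"uh", "um"}


def strip_fillers(utterance, fillers=FILLERS):
    """Return the utterance with filler words removed, preserving word order."""
    return " ".join(w for w in utterance.split() if w.lower() not in fillers)


verbatim = ("this is doctor smith on uh the first of october um "
            "nineteen ninety three patient ID eighty five one d")
print(strip_fillers(verbatim))
```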
In summary, a report such as the report 111 may be more desirable than a verbatim transcript for a variety of reasons (e.g., because it organizes information in a way that facilitates understanding). It would, therefore, be desirable for an automatic transcription system to be capable of generating a structured report (rather than a verbatim transcript) based on unstructured speech.
Referring to FIG. 1A, a dataflow diagram is shown of a prior art system 100 for generating a structured document 110 based on a spoken audio stream 102. Such a system produces the structured textual document 110 from the spoken audio stream 102 using a two-step process: (1) an automatic speech recognizer 104 generates a verbatim transcript 106 based on the spoken audio stream 102; and (2) a natural language processor 108 identifies structure in the transcript 106 and thereby creates the structured document 110, which has the same content as the transcript 106, but which is organized into the structure (e.g., report format) identified by the natural language processor 108.
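The two-step dataflow of the system 100 may be sketched as the composition of two functions. Everything in this sketch is a stand-in: toy_recognizer merely decodes text bytes in place of real speech recognition, and toy_structurer splits on a single cue word in place of real natural language processing. The sample input is invented for illustration.

```python
# Hypothetical sketch of the two-step process: recognize first, then structure.
from typing import Callable, Dict


def two_step_pipeline(audio: bytes,
                      recognize: Callable[[bytes], str],
                      structure: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Step 1 yields a verbatim transcript; step 2 organizes that same content."""
    transcript = recognize(audio)   # corresponds to recognizer 104 -> transcript 106
    return structure(transcript)    # corresponds to processor 108 -> document 110


def toy_recognizer(audio: bytes) -> str:
    # Stand-in only: treats the "audio" as already-encoded text.
    return audio.decode("ascii")


def toy_structurer(transcript: str) -> Dict[str, str]:
    # Stand-in only: everything after the cue word "plan" goes in the plan section.
    head, _, rest = transcript.partition(" plan ")
    return {"assessment": head, "plan": rest}


document = two_step_pipeline(
    b"patient is stable plan follow up in two weeks",
    toy_recognizer, toy_structurer)
print(document)
```

Because the second step sees only the transcript, any word misrecognized in the first step is carried into the structured document unchanged.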
For example, some existing systems attempt to generate structured textual documents by: (1) analyzing the spoken audio stream 102 to identify and distinguish spoken content in the audio stream 102 from explicit or implicit structural hints in the audio stream 102; (2) converting the “content” portions of the spoken audio stream 102 into raw text; and (3) using the identified structural hints to convert the raw text into the structured report 110. Examples of explicit structural hints include formatting commands (e.g., “new paragraph,” “new line,” “next item”) and paragraph identifiers (e.g., “findings,” “impression,” “conclusion”). Examples of implicit structural hints include long pauses that may denote paragraph boundaries, prosodic cues that indicate ends of enumerations, and the spoken content itself.
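The use of explicit structural hints may be sketched as follows, assuming the speech has already been converted to a sequence of words. The cue words are those given as examples above; the function name structure_from_hints, the "preamble" bucket, and the sample dictation are hypothetical, and implicit hints such as pauses and prosody are not modeled.

```python
# Hypothetical sketch: separate explicit structural hints from content words.
SECTION_CUES = {"findings", "impression", "conclusion"}


def structure_from_hints(words):
    """Group content words into sections and paragraphs using spoken cues."""
    sections = {"preamble": [[]]}   # content spoken before any section cue
    current = "preamble"
    i = 0
    while i < len(words):
        if words[i:i + 2] == ["new", "paragraph"]:
            sections[current].append([])        # formatting command, not content
            i += 2
        elif words[i] in SECTION_CUES:
            current = words[i]                  # paragraph identifier, not content
            sections.setdefault(current, [[]])
            i += 1
        else:
            sections[current][-1].append(words[i])
            i += 1
    # Drop empty paragraphs and empty sections.
    return {name: [" ".join(p) for p in paras if p]
            for name, paras in sections.items() if any(paras)}


spoken = ("findings no acute fracture new paragraph lungs are clear "
          "impression normal study").split()
print(structure_from_hints(spoken))
```

Note that this sketch treats every occurrence of a cue word as a hint, so a cue word spoken as ordinary content would be misplaced.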
For various reasons described in more detail below, the structured document 110 produced by the system 100 may be sub-optimal. For example, the structured document 110 may contain incorrectly transcribed (i.e., misrecognized) words, the structure of the structured document 110 may fail to reflect the desired document structure, and content from the spoken audio stream 102 may be inserted into the wrong sub-structures (e.g., sections, paragraphs, or sentences) in the structured document.
Furthermore, in addition to or instead of generating the structured document 110 based on the spoken audio stream 102, it may be desirable to extract semantic content (such as information about medications, allergies, or previous illnesses of the patient described in the audio stream 102) from the spoken audio stream 102. Although such semantic content may be useful for generating the structured document 110, such content may also be useful for other purposes, such as populating a database of patient information that can be analyzed independently of the document 110. Prior art systems, such as the system 100 shown in FIG. 1A, however, typically are designed to generate the structured document 110 based primarily or solely on syntactic information in the spoken audio stream 102. Such systems, therefore, are not useful for extracting semantic content.
What is needed, therefore, are improved techniques for generating structured documents based on spoken audio streams.