Field of the Invention
The present invention relates to document transcription systems, and more particularly, to techniques for training document transcription systems.
Related Art
It is desirable in many contexts to record human speech in a written document. In general, the term “transcription” refers to the process of recording speech in a textual document referred to as a “transcript” of the speech. In the legal profession, for example, transcriptionists transcribe testimony given in court proceedings and in depositions to produce a written transcript of the testimony. Similarly, in the medical profession, transcripts are produced of diagnoses, prognoses, prescriptions, and other information dictated by doctors and other medical professionals. Transcripts in these and other fields typically need to be highly accurate (as measured in terms of the degree of correspondence between the original speech and the resulting transcript) because of the reliance placed on the resulting transcripts and the harm that could result from an inaccuracy (such as providing an incorrect prescription drug to a patient). High degrees of reliability may, however, be difficult to obtain consistently for a variety of reasons, such as variations in: (1) features of the speakers whose speech is transcribed (e.g., accent, volume, dialect, speed); (2) external conditions (e.g., background noise); (3) the transcriptionist or transcription system (e.g., imperfect hearing or audio capture capabilities, imperfect understanding of language); or (4) the recording/transmission medium (e.g., paper, analog audio tape, analog telephone network).
At first, transcription was performed solely by human transcriptionists who would listen to speech, either in real-time (i.e., in person by “taking dictation”) or by listening to a recording. One benefit of human transcriptionists is that they may have domain-specific knowledge, such as knowledge of medicine and medical terminology, which enables them to interpret ambiguities in speech and thereby to improve transcript accuracy. Human transcriptionists, however, have a variety of disadvantages. For example, human transcriptionists produce transcripts relatively slowly and are subject to decreasing accuracy over time as a result of fatigue.
Various automated speech recognition systems exist for recognizing human speech generally and for transcribing speech in particular. Speech recognition systems which create transcripts are referred to herein as “automated transcription systems” or “automated dictation systems.” Off-the-shelf dictation software, for example, may be used by personal computer users to dictate documents in a word processor as an alternative to typing such documents using a keyboard.
Automated transcription systems, and speech recognizers more generally, use both “acoustic models” and “language models” to recognize speech. In general, an acoustic model maps audio signals to phonemes or parts of phonemes. A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, such as the “m” in “mat” and the “b” in “bat.” During speech recognition, an acoustic model is used to identify the phonemes represented by portions of the audio signal being recognized. Such a sequence of phonemes may then be combined to recognize the words, sentences, and other syntactic elements spoken by the speaker. Various kinds of acoustic models, such as those which utilize Hidden Markov Models (HMMs), are well-known to those having ordinary skill in the art.
A particular acoustic model represents a particular mapping between speech and text. Although such a mapping could be specified manually by the designer of the transcription system, manual creation of such a mapping would be prohibitively time-consuming and would not likely produce an accurate acoustic model. Instead, acoustic models typically are created using a semi-automated process referred to as “training.” The term “training” refers to the process of adapting the parameters of an acoustic model (or of a speech recognition system more generally) for optimal performance in a new domain (e.g., medical or legal) and/or in conjunction with a new speaker.
Referring to FIG. 1A, a dataflow diagram is shown of a prior art system 100 for training a set of acoustic models 112. In the system 100, the acoustic models 112 are trained using a training database 101 consisting of two closely connected data sources: (1) training speech 102 (e.g., in the form of audio recordings of speech) in a particular target domain and/or from a particular speaker; and (2) verbatim transcripts 104 of the speech 102. Because the transcripts 104 are known to be verbatim transcripts of the training speech 102, the combination of the training speech 102 and transcripts 104 implicitly defines mappings between phonemes and text, as required by acoustic models. The process of training may be viewed as a process by which these mappings are extracted from the training speech 102 and corresponding transcripts 104 and then represented in a form which may be used subsequently to perform speech recognition on other speech 126 in the same domain. While “speaker dependent” systems can only reliably recognize speech spoken by the speaker of the training speech 102, “speaker independent” systems use training speech spoken by several different speakers, and corresponding transcripts, to train speaker-independent models which may be used to recognize speech from any speaker.
More specifically, a dictionary 108 which maps text to phonetic symbols is used to translate 106 the transcripts 104 into a sequence of dictionary symbols 110 representing the sequence of phonemes in the transcript 104. For example, the sentence “this is a cat” may be translated into the following sequence of dictionary symbols: “dh ih s ih s ax k ae t,” where each dictionary symbol represents a phoneme in the original sentence.
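By way of illustration only, the translation step may be sketched as a simple table lookup. The dictionary contents and the names DICTIONARY and translate below are hypothetical and are not taken from any particular system; a practical dictionary would be far larger and may list several pronunciations per word.

```python
# Hypothetical pronunciation dictionary; the four entries are illustrative only.
DICTIONARY = {
    "this": ["dh", "ih", "s"],
    "is": ["ih", "s"],
    "a": ["ax"],
    "cat": ["k", "ae", "t"],
}


def translate(transcript, dictionary):
    """Translate a transcript into a flat sequence of dictionary symbols."""
    symbols = []
    for word in transcript.lower().split():
        if word not in dictionary:
            # A real system would need a fallback for unknown words.
            raise KeyError(f"no pronunciation for {word!r}")
        symbols.extend(dictionary[word])
    return symbols


symbols = translate("this is a cat", DICTIONARY)
print(" ".join(symbols))  # dh ih s ih s ax k ae t
```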
A base set of acoustic models 112 may be predefined. Each of the acoustic models 112 typically is associated with a set of Gaussian models, each of which has a set of mean values and variances. Before such models 112 have been trained, they may have initial values, such as mean values of zero and variances of some predetermined large number. From the acoustic models 112, a sequence of acoustic models 116 corresponding to the dictionary symbols 110 may be identified 114. More than one acoustic model may correspond to each dictionary symbol.
An association is made between these models 116 and the training speech 102 by aligning 118 the speech 102 onto the sequence of models 116, thereby producing timing data 120 specifying a temporal mapping between the models 116 and frames in the training speech 102. A frame is a short audio segment, typically 5-10 milliseconds in duration. Each of the acoustic models 116 may be aligned with a plurality of frames. In the example provided above, the “ih” models may be assigned to frames from the corresponding sound in speech for the word “this” as well as the same sound in speech for the word “is.” Parameters of the models 116 (such as their means and variances) may then be derived from characteristics of the speech 102 in the corresponding frames. Such derivation of acoustic model parameters, and subsequent updating of the acoustic models 112, is referred to as “training” 122 the acoustic models 112. In general, the resulting parameter values indicate probabilities that particular observed sounds represent particular phonemes or parts of phonemes.
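The derivation of model parameters from aligned frames may be sketched as follows. This is a minimal sketch under simplifying assumptions that are not part of the system described above: each frame is reduced to a single numeric feature (real systems use multi-dimensional feature vectors), each model is a single Gaussian, and the alignment is supplied directly rather than computed. The function name train_models and all data values are hypothetical.

```python
from collections import defaultdict


def train_models(frame_features, alignment):
    """Estimate a (mean, variance) pair for each model from its aligned frames.

    frame_features: one numeric feature per frame (an assumed simplification).
    alignment: the model name assigned to each frame, i.e. the timing data.
    """
    frames_by_model = defaultdict(list)
    for feature, model in zip(frame_features, alignment):
        frames_by_model[model].append(feature)

    parameters = {}
    for model, values in frames_by_model.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        parameters[model] = (mean, variance)
    return parameters


# Frames for "this is": the "ih" model collects frames from both words.
alignment = ["dh", "ih", "ih", "s", "ih", "s"]
frame_features = [1.0, 2.0, 4.0, 5.0, 3.0, 7.0]
parameters = train_models(frame_features, alignment)
print(parameters["ih"])  # mean and variance pooled over three "ih" frames
```

Note that the "ih" parameters are pooled from frames belonging to both "this" and "is," which is why a misaligned transcript contaminates the estimates for every model it touches.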
The process just described may be repeated for multiple instances of training speech and corresponding verbatim transcripts. Once the acoustic models 112 have been trained in this manner, speech recognition 124 may be performed on other speech 126 by using the trained acoustic models 112 to identify the phonemes that most likely correspond to frames in the speech 126. Text 128 corresponding to the speech 126 may be produced by reversing the mapping going from words to phonemes to models. Because the parameters of the acoustic models 112 were derived from the correspondence between the training text 104 and the training speech 102, speech recognition performed in this way will likely produce poor results if the training text 104 does not accurately represent the training speech 102.
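Identification of the most likely phoneme for each frame may be illustrated with a per-frame Gaussian likelihood comparison. The sketch below assumes one-dimensional features and a single Gaussian per model with a variance greater than zero; the names log_likelihood, classify_frames, and MODELS, and all parameter values, are hypothetical. A practical recognizer would instead search over whole sequences of models and would also apply a language model.

```python
import math


def log_likelihood(x, mean, variance):
    """Log density of x under a one-dimensional Gaussian (variance must be > 0)."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)


def classify_frames(frame_features, models):
    """Pick, for each frame independently, the model with the highest likelihood."""
    return [
        max(models, key=lambda name: log_likelihood(x, *models[name]))
        for x in frame_features
    ]


# Hypothetical trained parameters: model name -> (mean, variance).
MODELS = {"ih": (3.0, 1.0), "s": (6.0, 1.0)}
labels = classify_frames([2.5, 6.2, 4.4], MODELS)
print(labels)  # ['ih', 's', 'ih']
```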
As described above, acoustic models 112 typically are trained based on a training database 101 which includes both recorded utterances 102 and text transcriptions 104 which are known to be verbatim transcripts of the recorded utterances 102. In the dictation domain, for example, the database 101 typically is created by first creating the text 104 and then having speakers speak the text 104 to produce the training speech 102. Text 104 typically is created or collected from existing sources. If a domain-specific acoustic model is desired, such existing sources may be domain-specific sources, such as medical reports if a medical-specific acoustic model is desired. If a generic acoustic model is desired, the existing sources may, for example, be text obtained from a newspaper.
Sections of the training text 104 may then be displayed to a speaker or speakers, who may read the text aloud. A dedicated “speech collection” computer program may record the speech 102 and store it along with the corresponding source text 104, thereby enabling a mapping between source text 104 and spoken utterances 102 to be recorded.
In conversational systems, the training database 101 typically is created by manually transcribing either pre-existing speech or speech created specifically for the purpose of training. For example, chosen subjects may be asked to speak or converse on a given topic. The resulting conversation may be recorded to produce training speech 102, and a human transcriptionist may listen to the spoken recording and produce a verbatim transcript 104 of the speech 102. As a result, an audio file, verbatim transcript of the audio file, and mapping between utterances in the audio file and words in the transcript 104 may be produced.
Regardless of the manner in which the training database 101 is created, the quality of the resulting acoustic models 112 typically is highly reliant on the accuracy of the correspondence between the training speech 102 and the corresponding transcripts 104. In particular, it is typically required that there be an exact or near-exact temporal alignment between the training speech 102 and the corresponding transcripts 104. If such a close temporal alignment does not exist, then the timing data 120 will specify a correlation between text (in the transcripts 104) and audio (in the training speech 102) which do not represent the same speech as each other, and the resulting acoustic models 112 will be poorly trained. Although some training systems are able to identify poorly trained phonemes and to discard the resulting training data (i.e., acoustic model parameters) in response, such an approach reduces the amount of training data, which in turn reduces the accuracy of the resulting acoustic models 112. For these reasons, verbatim transcripts typically are required for conventional acoustic model training to be performed effectively.
It can be difficult to use such training techniques, therefore, in domains in which it is difficult to obtain a large quantity of training speech and corresponding verbatim transcripts. Examples of such domains include the medical and legal domains. In the case of the “prompted speech collection” approach, it may be prohibitively expensive or otherwise impossible to enlist doctors, lawyers, and other professionals who are able to spend the time necessary to recite large amounts of training text 104, and thereby to create the audio recordings 102 necessary to produce the training database 101. Similarly, in the case of the “conversational” approach, the abundance of obscure domain-specific terms in the training speech 102 and the lack of trained medical/legal transcriptionists with knowledge of such terms may make it difficult to produce the large volume of accurate verbatim transcripts 104 that is needed for high-quality training to be performed. In either case, it may be difficult and/or prohibitively expensive to generate the training database 101, given the need for verbatim transcripts 104 of training speech 102 to perform conventional acoustic model training.
In some circumstances, however, large existing bodies of recorded speech and corresponding transcripts may exist. The medical transcription industry, for example, regularly produces a variety of medical reports based on the recorded speech of doctors and other medical professionals. Such reports, however, typically are not suitable for use in the kind of conventional acoustic model training illustrated in FIG. 1A, because such reports typically are not verbatim transcripts of the recorded speech for a variety of reasons.
One reason for a mismatch between the recorded speech and corresponding document is a failure by the transcriptionist to recognize and transcribe the speech accurately. In addition to such errors, however, transcriptionists may intentionally introduce a variety of changes into the written transcription. A transcriptionist may, for example, filter out spontaneous speech effects (e.g., pause fillers, hesitations, and false starts), discard irrelevant remarks and comments, convert data into a standard format, insert headings or other explanatory materials, or change the sequence of the speech to fit the structure of a written report as required by a certain medical institution or physician.
For example, referring to FIG. 12, an example of a structured and formatted medical report 1200 is shown. The report includes a variety of sections 1202-1230 which appear in a predetermined sequence when the report 1200 is displayed. In the particular example shown in FIG. 12, the report includes a header section 1202, a subjective section 1212, an objective section 1224, an assessment section 1226, and a plan section 1228. Sections may include text as well as sub-sections. For example, the header section 1202 includes a hospital name section 1210 (containing the text “General Hospital”), a patient name section 1204 (containing the text “Jane Doe”), a chart number section 1206 (containing the text “851D”), and a report date section 1208 (containing text “10/1/1993”).
Similarly, the subjective section 1212 includes various subjective information about the patient, included both in text and in a medical history section 1214, a medications section 1216, an allergies section 1218, a family history section 1220, a social history section 1222, and a signature section 1230. The objective section 1224 includes various objective information about the patient, such as her weight and blood pressure. Although not illustrated in FIG. 12, the information in the objective section 1224 may include sub-sections for containing the illustrated information. The assessment section 1226 includes a textual assessment of the patient's condition, and the plan section 1228 includes a textual description of a plan of treatment. Finally, the signature section 1230 includes a textual representation of the doctor's signature.
Note that information may appear in a different form in the report from the form in which such information was spoken by the dictating doctor. For example, the date in the report date section 1208 may have been spoken as “October first nineteen ninety three,” “the first of October ninety three,” or in some other form. These alternative ways of speaking the same date are referred to herein as “alternative spoken forms” of the date. More generally, each way of speaking a particular concept is referred to herein as a “spoken form” of the concept. The transcriptionist, however, transcribed such speech using the text “10/1/1993” in the report date section 1208, perhaps because written reports in the hospital specified in the hospital section 1210 require that dates be expressed in reports in such a format.
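The relationship between one written form and its alternative spoken forms may be illustrated with a short sketch that expands the written date into the two spoken forms mentioned above. The function spoken_forms and its lookup tables are hypothetical and deliberately partial: only the first three ordinals are listed, and years are spoken as two two-digit groups, which does not suit years such as 1905 or 2000.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ORDINALS = {1: "first", 2: "second", 3: "third"}  # partial table, for illustration
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out a number from 0 to 99 (0 yields an empty string)."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return (TENS[n // 10] + " " + ONES[n % 10]).strip()


def spoken_forms(written_date):
    """Return two alternative spoken forms of a month/day/year date string."""
    month, day, year = (int(part) for part in written_date.split("/"))
    month_name = MONTHS[month - 1]
    day_word = ORDINALS[day]  # raises KeyError for days missing from the table
    short_year = two_digits(year % 100)
    long_year = two_digits(year // 100) + " " + short_year
    return [
        f"{month_name} {day_word} {long_year}",
        f"the {day_word} of {month_name} {short_year}",
    ]


forms = spoken_forms("10/1/1993")
print(forms)
```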
Similarly, information in the medical report 1200 may not appear in the same sequence in the report 1200 as in the original audio recording, due to the need to conform to a required report format or some other reason. For example, the dictating physician may have dictated the objective section 1224 first, followed by the subjective section 1212, and then by the header 1202. The written report 1200, however, contains the header 1202 first, followed by the subjective section 1212, and then the objective section 1224. Such a report structure may, for example, be required for medical reports in the hospital specified in the hospital section 1210.
The beginning of the report 1200 may have been generated based on a spoken audio stream such as the following: “this is doctor smith on uh the first of October um nineteen ninety three patient ID eighty five one d um next is the patient's family history which I have reviewed . . . ” It should be apparent that a verbatim transcript of this speech would be difficult to understand and would not be particularly useful.
Note, for example, that certain words, such as “next is,” do not appear in the written report 1200. Similarly, pause-filling utterances such as “uh” do not appear in the written report 1200. Furthermore, certain terms, such as dates, have been recorded in the report 1200 using particular canonical forms (e.g., in the report date section 1208). In addition, the written report 1200 organizes the original speech into the predefined sections 1202-1230 by re-ordering the speech. As these examples illustrate, the written report 1200 is not a verbatim transcript of the dictating physician's speech.
Although a report such as the report 1200 may be more desirable than a verbatim transcript for a variety of reasons (e.g., because it organizes information in a way that facilitates understanding), the report is not useful as training text in the traditional acoustic model training process described above with respect to FIG. 1A, precisely because the report 1200 is not the kind of verbatim transcript required for traditional acoustic model training.
In summary, although a large body of existing documents corresponding to speech may be available in certain circumstances, such documents may not be verbatim transcripts of the corresponding speech. If conventional acoustic model training were applied to such speech and corresponding documents, the resulting acoustic models would be sub-optimal, perhaps to such an extent that they would not be suitable for use in speech recognition.
It would be advantageous, however, to be able to use such reports to train acoustic models because of the abundance of existing reports in domains such as medicine and law. Although new verbatim transcripts could be generated based on existing recorded spoken audio streams, generating large volumes of such transcripts would be tedious, time-consuming, and costly. Furthermore, it would inefficiently require two transcripts to be generated for each recorded audio stream (one verbatim transcript to be used for acoustic model training, and one non-verbatim transcript to be used for traditional purposes).
Referring to FIG. 1B, a dataflow diagram is shown of a prior art system 150 which attempts to solve the problem just described. The system 150 includes spoken audio 152 and a corresponding non-literal transcript 154 of the audio 152, produced by a transcriptionist 156. As described in more detail below, the non-literal transcript 154 includes information from the audio 152, but is not a literal (verbatim) transcript of the audio 152. An attempt is made, either manually or automatically, to align 158 the audio 152 with the non-literal transcript 154, thereby producing timing data 160 specifying temporal correlations between portions of the audio 152 and text in the non-literal transcript 154.
The audio 152, timing data 160 and non-literal transcript 154 are provided to a confidence filter 164, which measures the degree of “fit” between the frames and corresponding word models. If the fit for a particular frame does not satisfy a confidence threshold, the confidence filter 164 marks the frame as unusable. The confidence filter 164 thereby produces a set of filtered labels 166 which identify the frames that satisfied the confidence threshold. The audio 152, non-literal transcript 154, and filtered labels 166 are provided to a trainer 162, which produces a set of trained acoustic models 168 based on the portions of the spoken audio stream 152 and non-literal transcript 154 identified by the filtered labels 166.
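The thresholding performed by such a confidence filter may be sketched as follows. How the degree of fit is measured is not specified above; the sketch assumes a single numeric score per frame in which higher values indicate a better fit. The function name confidence_filter and the score values are hypothetical.

```python
def confidence_filter(fit_scores, threshold):
    """Return the indices of the frames whose fit satisfies the threshold.

    fit_scores: one assumed fit score per frame, higher meaning a better fit.
    The returned indices play the role of the filtered labels; every other
    frame is treated as unusable.
    """
    return [index for index, score in enumerate(fit_scores) if score >= threshold]


fit_scores = [0.91, 0.42, 0.77, 0.15, 0.88]
filtered_labels = confidence_filter(fit_scores, threshold=0.75)
print(filtered_labels)  # [0, 2, 4]
```

Only the frames identified by the returned indices would be passed on for training.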
One problem with the approach illustrated in FIG. 1B is that a large amount of training data from the initial acoustic models 168 may be discarded because so much of the non-literal transcript 154 fails to match the corresponding portions of the spoken audio 152. In particular, such an approach may tend to systematically discard training data that do not take the same form as the text in the non-literal transcript 154. For example, if the word “November” in the spoken audio stream 152 is aligned with the text “11” in the non-literal transcript 154, such training data will be discarded even though “November” and “11” represent the same semantic content. If the spoken audio stream 152 consistently contains the word “November” when the non-literal transcript contains the text “11”, training data of this kind will consistently be discarded. The approach illustrated in FIG. 1B, therefore, has limited usefulness.
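The systematic nature of this loss may be illustrated by counting discards per spoken word. The sketch below is a deliberate simplification: it uses a literal text mismatch between an aligned spoken word and its transcript token as a stand-in for a failed confidence test, whereas the filter described above operates on the acoustic fit of frames. The function name count_discards and the aligned pairs are hypothetical.

```python
from collections import Counter


def count_discards(aligned_pairs):
    """Count, per spoken word, how often its aligned transcript text differs."""
    discarded = Counter()
    for spoken, written in aligned_pairs:
        if spoken.lower() != written.lower():
            discarded[spoken.lower()] += 1
    return discarded


# Hypothetical alignments of spoken words with non-literal transcript tokens.
aligned_pairs = [
    ("on", "on"),
    ("November", "11"),
    ("patient", "patient"),
    ("November", "11"),
    ("November", "11"),
]
discards = count_discards(aligned_pairs)
print(discards)  # Counter({'november': 3})
```

Every occurrence of the spoken word is lost in this example, so no training data for that word survive, however many reports are processed.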
What is needed, therefore, are improved techniques for training speech recognition systems and, in particular, improved techniques for training transcription systems based on non-literal transcripts of speech.