The present invention relates generally to transcription methods and apparatus, and more particularly, to a continuous speech voice transcription method and apparatus for use in transcribing structured reports, such as radiology reports, and the like.
Transcription is a major bottleneck in timely radiology reporting, for example. Radiology images may be acquired, read, and dictated in a few minutes, but many days may pass until the transcription is complete. Similar problems occur in medicine, law and other areas of endeavor.
Report transcription has traditionally been a process that involves a number of people. In its most primitive form, the dictation is recorded on a cassette that is collected and carried to the transcriptionist, where the cassettes are put in an “in” basket. The transcriptionist sequentially processes incoming cassettes, transcribing and printing the reports, and sending them back to the radiologists for signing. If there are corrections, there is another cycle using the transcriptionist.
A more advanced form of transcription uses a communication network to record the voice report such that the transcriptionist can retrieve the recording directly at a workstation without transporting the physical cassette. The wait in a transcription queue to generate a typical report can range from several hours in an efficient hospital to several days in less efficient hospitals.
Thus, if the transcription were performed automatically, the text report could be available at the end of the dictation with no waiting for several hours to several days for the transcriptionist to complete the transcription task. The computing horsepower of a computer workstation can be used to perform the automatic transcription. What is needed is an automatic transcription algorithm that can transcribe the dictated reports.
Electronic reports may be structured, such that either a fill in the blanks report or a full structured report is generated. The two variations are related. The structured report starts with a basic report for a pathology, such as mammography for example. In order to make the electronic report complete and like other reports, the American College of Radiology has established a mammography reporting form. The form is a basic report with areas that can be filled in with words that are selected from a list of words for that blank in that report. This “fill in the blanks” reporting makes all mammography reports very similar, using simple variations on the language for each of the individual reports.
The structured form of the fill in the blanks report is much more useful in computer processing to determine outcomes of the treatment. The processing performed by the computer can ignore filler text and process the data contained in the filled in blanks to generate the report. Some of the blanks describe the severity of the pathology. Other blanks describe the changes in the pathology as a result of whatever treatment has been performed. Over a period of time, the progression of the contents of the blanks on the form shows a picture of the progress of the patient. After a period of time, the outcome for the patient can be evaluated.
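The idea that a computer can skip the filler text and work only with the blanks can be illustrated with a minimal Python sketch. The template wording, the blank names, and the word lists below are hypothetical and are not taken from any actual reporting form.

```python
# Hypothetical blank vocabularies; not taken from any real reporting form.
ALLOWED = {
    "severity": ["mild", "moderate", "severe"],
    "change": ["improved", "unchanged", "worsened"],
}

# Hypothetical filler text with two named blanks.
TEMPLATE = "Findings are {severity} and have {change} since the prior study."


def render(blanks):
    """Produce the readable report; reject words not allowed for a blank."""
    for name, word in blanks.items():
        if word not in ALLOWED[name]:
            raise ValueError(f"{word!r} is not allowed in blank {name!r}")
    return TEMPLATE.format(**blanks)


def severity_trend(reports):
    """Compare the first and last severity blanks, ignoring all filler text."""
    order = ALLOWED["severity"]
    first = order.index(reports[0]["severity"])
    last = order.index(reports[-1]["severity"])
    if last < first:
        return "improving"
    if last > first:
        return "worsening"
    return "stable"


# Three successive reports for one (invented) patient.
history = [
    {"severity": "severe", "change": "unchanged"},
    {"severity": "moderate", "change": "improved"},
    {"severity": "mild", "change": "improved"},
]
print(render(history[-1]))
print(severity_trend(history))
```

Only the contents of the blanks are stored and compared; the readable sentence is regenerated from the template whenever it is needed.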
A yet more structured form of the electronic report is a collection of codes. The codes point to selected phrases in a dictionary of codes. The SNOMED dictionary maintained by the National Library of Medicine is one such dictionary. This micro-glossary has words and phrases that are useful in describing a large number of pathologies including the location and severity of the pathology. Reading the report requires converting the codes to text form. For a computer to read the codes is trivial, since the codes are an almost ideal representation of information for the computer. The outcomes must be assessed, and thus the report should be organized to make the assessment easy. As a result, structured reporting, including the use of codes to describe the pathology and its progress, will be used more in the future.
The radiologist should be able to generate the report while looking at an image that is to be evaluated. With the attention on the image, the radiologist can progress through the image in an orderly fashion, making sure that all aspects of the diagnosis are properly covered in the report. The transcription should therefore be something that can be done while looking at the image that is diagnosed. The radiologist should not have to look at every word generated to make certain that the word is properly spelled, for example. While not as important a requirement, it would be beneficial if the radiologist could dictate the report without using hands. The radiologist should use his or her hands to manipulate images, change to historical images for comparison, and magnify selected areas of the images for detail.
A number of transcription devices and methods are currently available. They fall into several categories including isolated word recognition, continuous speech recognition, batch transcription after dictation, and on the fly transcription while dictating. Generally the transcription devices require a training cycle. A new user must train the system to recognize a vocabulary of words. The isolated word recognition devices use patterns of each individual word in performing the recognition. A typical training cycle requires one-half hour to several hours to say each of the words or phrases required for training.
The transcription devices are generally organized to recognize free text spoken by an individual. The transcription devices are advertised with a description of the number of tens of thousands of words that can be recognized by the system. These devices use a decision procedure that requires the recognition of isolated words from a very large vocabulary for a single individual. Many isolated words are short and easy to confuse with other words. When the vocabulary available is large, there is more possibility of confusion with other words in the vocabulary.
Isolated word recognition devices may be used to generate fill in the blanks reports. The blanks are filled with isolated words. If the words could be restricted to only the few words that are available for the particular blanks, the recognition problem would be very much easier and the performance much better. Similarly, structured reporting using codes could be performed effectively by isolated word recognition devices using small vocabularies. However, this has not been done in the past.
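The benefit of restricting each blank to its own short word list can be sketched as follows. This is an illustration only: string similarity from Python's standard `difflib` module stands in for acoustic similarity, and the blank names and word lists are hypothetical.

```python
import difflib

# Hypothetical per-blank vocabularies (assumed for illustration).
BLANK_VOCABULARY = {
    "severity": ["mild", "moderate", "severe"],
    "change": ["improved", "unchanged", "worsened"],
}


def fill_blank(blank, heard):
    """Pick the closest allowed word for this blank.

    `heard` is a noisy recognizer output. Because only a few candidates
    are considered, even a poor match usually resolves correctly.
    """
    candidates = BLANK_VOCABULARY[blank]
    # cutoff=0.0 means the best candidate is always returned.
    best = difflib.get_close_matches(heard, candidates, n=1, cutoff=0.0)
    return best[0]


print(fill_blank("severity", "sevear"))
print(fill_blank("change", "unchangd"))
```

With a vocabulary of tens of thousands of words, the same noisy input would have many near neighbors; with three candidates per blank, it has essentially one.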
The batch transcription after dictation method uses dictated voice reports that are transcribed at a later time. However, this method is not desirable. The radiologist must review the transcribed text at a later time to determine that the information has been accurately transcribed. Time delays associated with the batch processing method also make the approach less desirable.
Continuous speech recognition devices are useful. When a radiologist does not have to speak each word as an isolated word, generation of the report can proceed much more quickly with less attention from the radiologist. However, while it is desirable, continuous speech recognition of free text is generally not performed. The problem is technically difficult. The usual result is transcribed text with many errors.
In view of the above, it is believed that a report transcription tool that generates reports using a limited number of sounds may be advantageously employed in a number of disciplines.
A number of patents relate to voice recognition, and the like. The patents may be grouped into those disclosing template matching, single word recognition, hidden Markov model, and subphrase recognition. Template matching is a generic approach. The cited patents typically have different measurements to match against templates of words. Single word recognition requires an easily recognized beginning and ending of a word. This technique requires the individual to use an artificial one-word-at-a-time speaking style. Hidden Markov model techniques use probabilistic techniques to determine the most probable next word in a sequence. The probabilistic model is generated by analyzing long sequences of text. Subphrase recognition operates on very short segments of sounds, generally less than one word. The hidden Markov model above could be used for such combinations of subphrases to recognize words.
The concept of a sound alphabet is disclosed in U.S. Pat. No. 4,829,576, for example, but others also contain the basic idea of recognizing sounds from an alphabet of sounds. Many have very complex sounds representing not only basic phoneme sounds but sequences of phonemes with transitions therebetween.
Template matching is disclosed in U.S. Pat. No. 5,329,608 issued to Bocchieri, which matches subword strings, not phrases and is believed to generally exhibit poor performance. U.S. Pat. No. 5,142,585 issued to Taylor discloses template matching of stored reference vocabulary words, single word recognition, and adaptation to the surrounding sounds and speaking peculiarities of the individual. U.S. Pat. No. 4,712,242 issued to Rajasekaran discloses speaker independent recognition of a small vocabulary which uses zero crossings and dynamic time warping, and spoken words are compared to reference templates of the individual vocabulary words. U.S. Pat. No. 4,284,846 issued to Marley discloses single word recognition using template matching, and wherein transitions and transition glides are characterized. U.S. Pat. No. 4,910,784 issued to Doddington discloses single word recognition using binary feature components and template matching to reference words.
U.S. Pat. No. 5,425,128 issued to Morrison discloses single word recognition with training on each word to be recognized. U.S. Pat. No. 4,866,778 issued to Baker discloses single word at a time recognition. U.S. Pat. No. 4,811,399 issued to Landell discloses a single word recognizer using “sound templates”. U.S. Pat. No. 4,336,421 issued to Welch discloses using inter-segment and inter-string boundaries for grouping speech for single sound segment recognition, which is related to single word recognition. U.S. Pat. No. 5,165,095 issued to Borcherding discloses speaker independent templates for recognizing telephone numbers. U.S. Pat. No. 4,780,906 issued to Rajasekaran discloses speaker independent limited word recognition using energy measures and zero crossing rate to generate feature vectors. U.S. Pat. No. 4,388,495 issued to Hitchcock discloses speaker independent single word at a time recognition, and uses zero crossings to establish voiced, fricative, and silence, and uses templates of zero crossings. U.S. Pat. No. 5,231,670 issued to Goldhor discloses dictation events and text events that are used in recognizing single words and commands. U.S. Pat. No. 5,054,085 issued to Meisel discloses pitch, spectrum parameters, and time measurements for speech recognition. The training process is involved, and a scheme for mapping an individual's speech to general speech templates is disclosed. U.S. Pat. No. 5,524,169 issued to Cohen discloses location specific libraries of templates adapted to speech accents, and includes place names, proper names, and business establishments in the template set. U.S. Pat. No. 4,763,278 issued to Rajasekaran discloses speaker independent recognition of a small vocabulary, and uses zero crossing rates for measurement, wherein templates of quantized zero crossing rates are used to recognize words. U.S. Pat. No. 5,526,466 issued to Takizawa discloses single word recognition using durations of speech units. U.S. Pat. No. 5,212,730 issued to Wheatley discloses speaker independent recognition of names for access, and includes extensive training. U.S. Pat. No. 4,618,984 issued to Subrata discloses single word recognition using adaptive training from continuous speech, which uses a method of shortening the training cycle to make the method easier.
The use of hidden Markov models is disclosed in U.S. Pat. No. 5,509,104 issued to Lee, which discloses a hidden Markov model approach to small vocabulary, speaker independent speech recognition. U.S. Pat. No. 5,033,087 issued to Bahl discloses that Markov models are used with variations in sound structures of phonemes in various contexts to recognize words. U.S. Pat. No. 5,278,911 issued to Bickerton discloses multiple examples of individual words for training, and uses a neural network and hidden Markov model for recognition. U.S. Pat. No. 4,852,180 issued to Levinson discloses speaker independent, continuous speech recognition, and implements a continuously variable-duration hidden Markov model. U.S. Pat. No. 4,783,804 issued to Juang discloses a hidden Markov model.
Subphrase recognition is disclosed in U.S. Pat. No. 4,829,576 issued to Porter, which uses a text string from a recognized utterance to find the same text string in other places in sample text, and uses the results to limit the words to be recognized and/or the probability of the words that can be next. In contrast, the present invention converts the text to a sound alphabet that can be searched for matches to the sounds that are being spoken. U.S. Pat. No. 4,181,813 issued to Marley discloses a phoneme recognizer that uses delta mod at two different rates to recognize attacks and transitions, and uses speech “waveform characteristics” in a phoneme decision tree to recognize the phonemes. U.S. Pat. No. 5,208,897 issued to Hutchens discloses a system that recognizes sub-syllables, maps collections of sub-syllables to syllables, and maps collections of syllables to words for word recognition.
Other speech-related techniques are disclosed in a number of patents. U.S. Pat. No. 4,713,777 issued to Klovstad discloses use of a grammar graph and non-speech segment recognition. U.S. Pat. No. 4,757,541 issued to Beadles discloses general analysis to identify a group of phonemes followed by optical analysis of lip shape to determine the member of the group. U.S. Pat. No. 3,812,291 issued to Brodes discloses use of “Property Filters”, wherein binary signals generated from the Property Filters are matched to previous reference patterns. U.S. Pat. No. 5,168,548 issued to Kaufman inserts selected recognized words into canned text reports. U.S. Pat. No. 4,087,632 issued to Hafer discloses a feature extractor that uses the Coker vocal tract model to extract tongue position and motion along with other variables, and uses formants in the modeling and matching to library words. U.S. Pat. No. 4,713,778 issued to Baker discloses dynamic programming with grammar graphs applied to acoustic speech sound parameters, and recognizes keywords. U.S. Pat. No. 4,718,088 issued to Baker discloses a training method for use with U.S. Pat. No. 4,713,778. U.S. Pat. No. 5,027,406 issued to Roberts discloses single word or continuous speech recognition using “word models”. When the recognizer is confused, it presents possible words and asks the user to select. The selection is used to update the word model for that word and add the word to the vocabulary, if it is not present.
Accordingly, it is an objective of the present invention to provide for a continuous speech voice transcription method and apparatus for use in transcribing structured reports, such as radiology reports, and the like, using a limited number of sounds and requiring a very short training cycle.
To meet the above and other objectives, the present invention provides for an apparatus and method that embody a new approach for performing automatic speech transcription. The approach is based on simple recognition of a vocabulary of sounds (a sound alphabet) followed by a translation to text. The translation to text uses a novel technique based on matching spoken sounds to sounds of previous text sequences.
The system comprises a microphone that is coupled to a sound processor or computer. The sound processor is coupled to a printer or display that displays the transcribed text. A sound dictionary is created that is stored in a memory of the sound processor that represents a translation between text and sound. The sound dictionary may be formed by using a sound translation guide similar to that contained in Webster's dictionary, for example. The sound dictionary is used for any individual that uses the system.
The system is trained for each specific individual that is to use it because of differences in each individual's voice characteristics. The individual that is to use the system speaks a predefined set of words containing a limited number of predefined sounds into the microphone. The sound processor processes the sounds to create a sound alphabet that is specific to the individual. The sound alphabet is represented by a set of symbols, such as single and double letters, for example, and may be generated using Cepstral coefficients, for example.
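One way to picture a per-speaker sound alphabet is as a set of stored feature templates, one per symbol, with recognition by nearest template. The Python sketch below makes that assumption explicit; it is not the disclosed implementation. The two-element feature vectors are invented stand-ins for Cepstral coefficients, and the symbol names (including `_` for silence) are hypothetical.

```python
import math


def train_sound_alphabet(training_frames):
    """Average the feature vectors recorded for each symbol.

    training_frames maps a symbol to the feature vectors measured while
    the speaker said a training word containing that sound.
    """
    alphabet = {}
    for symbol, frames in training_frames.items():
        dims = len(frames[0])
        alphabet[symbol] = [
            sum(frame[d] for frame in frames) / len(frames) for d in range(dims)
        ]
    return alphabet


def recognize_sound(alphabet, frame):
    """Return the symbol whose template is nearest to the spoken frame."""
    return min(alphabet, key=lambda symbol: math.dist(alphabet[symbol], frame))


# Invented training data for one speaker (assumption, for illustration).
training = {
    "ah": [[1.0, 0.0], [0.5, 0.0]],
    "ee": [[0.0, 1.0], [0.0, 0.5]],
    "_": [[0.0, 0.0]],  # silence, treated as just another sound
}
alphabet = train_sound_alphabet(training)
print(recognize_sound(alphabet, [0.7, 0.1]))
```

Because the templates come from the speaker's own voice, the symbols that come out of `recognize_sound` are speaker independent even though the templates are not.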
The sound alphabet for the individual is used to recognize sounds spoken by the individual and which are output for comparison to sounds contained in sound strings generated by applying text to the sound dictionary. When the individual speaks a particular sound, a corresponding sound from the sound alphabet is accessed. Each sound that is accessed in the sound alphabet is compared to the sounds in the sound string. When the spoken sound string matches the sound string from the text string, the corresponding phrase, or string of text, contained in the recorded text is accessed and output for printing or display.
Thus, during transcription, the individual speaks phrases into the microphone. The phrases are processed to recognize the sounds of the individual's sound alphabet. The sounds of the individual's sound alphabet are then compared with sounds from the sound strings. When sound matching occurs, text phrases contained in the recorded text that match the sounds from the sound dictionary are accessed. The recorded text that corresponds to the sounds from the translated text sound string is output to the printer or display as part of a printed transcription.
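The matching step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed algorithm: the sound dictionary entries and symbol names are hypothetical, and the search requires an exact match, whereas a practical system would need to tolerate recognition errors.

```python
# Hypothetical sound dictionary: word -> sound-alphabet symbols.
SOUND_DICTIONARY = {
    "the": ["dh", "uh"],
    "lungs": ["l", "uh", "ng", "z"],
    "are": ["ah", "r"],
    "clear": ["k", "l", "ee", "r"],
    "heart": ["h", "ah", "r", "t"],
    "is": ["ih", "z"],
    "normal": ["n", "aw", "r", "m", "uh", "l"],
}


def translate_text(words):
    """Translate recorded text to a sound string.

    Also records, for every sound, the index of the word it came from,
    so that a match in the sound string can be mapped back to text.
    """
    sounds, owners = [], []
    for index, word in enumerate(words):
        for symbol in SOUND_DICTIONARY[word]:
            sounds.append(symbol)
            owners.append(index)
    return sounds, owners


def transcribe(spoken, words):
    """Find the spoken sound sequence in the translated text, return its text."""
    if not spoken:
        return None
    sounds, owners = translate_text(words)
    n = len(spoken)
    for start in range(len(sounds) - n + 1):
        if sounds[start:start + n] == spoken:
            first, last = owners[start], owners[start + n - 1]
            return " ".join(words[first:last + 1])
    return None


recorded = "the lungs are clear the heart is normal".split()
spoken = ["h", "ah", "r", "t", "ih", "z", "n", "aw", "r", "m", "uh", "l"]
print(transcribe(spoken, recorded))
```

The text that is output is copied from the recorded text, so its spelling and word order come from the recorded text rather than from the recognizer.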
The present invention only processes a limited set of text relating to a specific area, such as radiology, for example, and thus only a limited number or set of sounds is required for the translated sound string. Phrases that are spoken by the individual are processed using the individual's sound alphabet to access matching phrases contained in the text. Thus, spoken phrase segments are matched to phrases contained in the prerecorded text, which is output as part of the transcription.
The present invention provides for speaker dependent, continuous speech recognition using a limited number of sounds, and makes the speech easier to recognize. Training for the processing requires only speaking a set of words with the desired sounds imbedded therein. The training process takes about thirty seconds. The system and method are therefore speaker dependent, in that the present approach functions only for the individual that is currently using the system and who has generated a training set of sounds.
The present invention recognizes phrases, not words. The phrases are connected speech. The only departures from continuous sounds are those that occur naturally, such as glottal stops and pauses between words. The lack of speech is treated as just another sound in the sound sequence. A natural phrasing is in terms of sentences. When a period is found in the text, a new search for a phrase that matches the continuation of the speech is initiated.
The present invention can generate long strings of text with only a few words. That is, the sound sequence for a text string can be a “code word” for the text string. A simple example is “Normal Chest”. Many radiologists dictate a long string of text that is used as a report when the patient has a normal chest. Many other text strings may also be triggered by a phrase. The present invention works well in generating radiology reports because radiologists use a very limited vocabulary to produce the reports.
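The code word idea amounts to a lookup from a short sound sequence to a long text string. A minimal Python sketch, assuming hypothetical symbol names (with `_` marking the pause between words) and placeholder report text that is invented for illustration:

```python
# Hypothetical sound sequence for the spoken phrase "Normal Chest".
NORMAL_CHEST = ("n", "aw", "r", "m", "uh", "l", "_", "ch", "eh", "s", "t")

# Placeholder canned text; a real report would be supplied by the radiologist.
CANNED_TEXT = {
    NORMAL_CHEST: (
        "The lungs are clear. The heart size is normal. "
        "No acute abnormality is seen."
    ),
}


def expand_code_word(spoken):
    """Return the long text triggered by a spoken sound sequence, if any."""
    return CANNED_TEXT.get(tuple(spoken))


print(expand_code_word(list(NORMAL_CHEST)))
```

A sequence that is not in the table returns `None`, in which case the phrase would be handled by ordinary matching against the recorded text.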
In contrast to the prior art techniques, the present invention uses a very simple sound alphabet without transition sounds, a very short training cycle, and uses a small body of text that is translated to the sound alphabet. The present invention searches the sequence of translated sounds from the sample text for similar sound sequences representing whole phrases, not single words. The corresponding text segments are then reported as the recognized speech translated to text.
The short training cycle permits the recognition to be speaker dependent, but does not require a large investment of time to use the device. The speaker dependent recognition has much better performance than a speaker independent recognition process. The use of the sound alphabet permits translation of the voice sounds to a speaker independent representation of the sounds. The use of a body of recorded text for the matching process limits the vocabulary to a single subject. The use of the body of text also provides sequences of words in proper order without requiring an extensive analysis process to derive parameters of a hidden Markov model. The recorded body of text provides examples of the next word in sequence instead of a probabilistic model of which words might be next in a sequence as is done in Markov processing.
The present invention employs processing to recognize whole phrases without an established beginning or end. The sequence of sounds as represented in the translated text is without beginning or end. The sounds that are processed thus come from continuous speech. The use of a limited vocabulary provides fewer possibilities for errors. The present invention recognizes whole phrases, with much reduced error rates compared to conventional approaches.