The present invention relates generally to a system and method for the loading and unloading of dynamic grammars and section-based language models in a speech recognition system.
For most speech recognition applications, training speakers to dictate in an organized fashion is essential to increasing the efficiency of the system. A speaker trained to dictate certain language necessary for domain-based diagnosis, reporting, and billing documents can greatly increase the accuracy and efficiency of report generation. In addition, a speaker trained to dictate certain section-specific information in an organized and orderly fashion will further increase the accuracy and efficiency of the speech recognition system. However, even the best trained speaker can improve the accuracy and efficiency of an overall speech recognition system by only so much.
Speech recognition systems have for many years been designed with language models specific to certain domains. For example, a speech recognition system with a language model in the radiology domain will improve efficiency of the speech recognition engine when compared to such a system implemented with a general domain language model. The domain-specific language model is typically created using documents from the intended discipline of the speaker.
Specific domain language models are advantageous over general domain language models because the data within the specific domain language model is uniquely tailored to the intended speakers in that particular domain. The data within the specific domain language model is narrower when compared to the general domain language model; hence, any speech recognition engine will be able to work more accurately and efficiently with a narrower domain.
Notwithstanding the advancements in speech recognition over the last few years, further advancement is still possible. For example, it is well known that different medical disciplines require certain documents and reports. It is also well known that documents can be further broken down into sections and sub-sections. In the medical field, virtually every medical document consists of headings and subheadings where information related to these headings and subheadings is often quite distinct in structure and content from other sections of the document. For example, a discharge summary report will likely have a section that deals with the patient's history and physical examination, typically a narrative section. There may be another section that concerns the principal diagnosis, which is typically a list of disease names. Another section may include medications which themselves have an organization and content quite distinct from everything else in the document. This relationship between document structure and content is pervasive in medical reports and also common in other disciplines. Although some sections are more narrative and some are more structured in very specific ways, these structures tend to be fairly limited and repetitive within a given section. Narrative sections can also be highly repetitive, utilizing a limited number of lexical and structural patterns. It is possible to exploit these repetitive patterns to improve accuracy and efficiency in report generation through automatic speech recognition.
Distinct section organization of reports and repetitive structural and lexical characteristics of sections are not limited to the medical domain; they are also found in other domains, such as public safety, insurance, and many others.
Most automatic speech recognition applications accommodate the particular domain by developing domain-specific language models that relate to the discipline itself rather than to any kind of structural and organizational regularity in reports in the specific domain. Hence, in the medical domain, there typically exists either a general medical language model or, more likely, a language model that is very specific to the discipline or sub-discipline. For example, language models might be developed which are very specific to the documents and the language that are used by physicians in general in oncology, pediatrics, or other particular sub-disciplines.
In the event a physician practices across several medical domains, the physician may switch dictation domains from general domain dictation to specific domain dictation; or from one specific domain to another specific domain. The physician may dictate a letter that has general medical content which is quite different from a technical report such as a cardiac operative note. In this example, the speech recognition system needs to be nimble enough to switch from a general language model to a more specific language model.
It has been found that a speech recognition system having the ability to change domains within the context of a single document is desirable. Complicating this situation is the fact that there are no standards for the structure and organization of medical reports. Therefore, there exists a need for a speech recognition system having the ability to change domains within the context of a single document in any arbitrary way.
There have been attempts to improve speech recognition by using a language model that changes domains within the document context. Such a system is described in published U.S. patent application 20040254791 entitled “Method and Apparatus for Improving the Transcription Accuracy of Speech Recognition Software” with listed inventors Coifman, et al. Coifman et al. use standard and already existing automatic speech recognition technologies to perform contextual and adaptive ASR by domain, document type, and speaker. Coifman, et al. teach the use of sub-databases having high-likelihood text strings that are created and prioritized such that those text strings are made available within definable portions of computer-transcribed dictations as a first-pass vocabulary for text matches. If there is no match within the first-pass vocabulary, Coifman, et al. teach a second pass where the voice recognition software attempts to match the speech input to text strings within a more general vocabulary. This system as taught by Coifman, et al. is known as a two-pass system. A drawback exists in the two-pass system in that it requires an assumption that there exists well-defined structured data, most likely input field type data. Such a system is not applicable in any environment existing off-line, such as a traditional telephony dictation system, without structure because there is no mechanism to identify structural units, their respective contents, and how the units will interact with the system. Unlike a free-form dictation approach, the two-pass system requires defined and clearly delimited data fields within which the speaker dictates.
In addition, the two-pass system requires the use of a fixed set of word combinations or “text strings” for each data field which limits the repertoire of text strings to those that have been observed to have been dictated or are allowed in certain sections or fields.
Further, the two-pass system requires the use of a general vocabulary recognition system if no match is made to this repertoire of text strings, as opposed to a speech recognition system that has vocabulary and grammatical constraints provided by knowledge of the text strings that have been observed to have been dictated in certain sections or fields.
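By way of illustration only, the control flow of the two-pass approach described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Coifman, et al.: the field names, the text strings, the `FIRST_PASS` table, and the `general_recognizer` stand-in are all hypothetical, and typed text is used in place of audio input so that the example is self-contained.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-field sub-databases of high-likelihood text strings,
# listed in priority order. Names and contents are illustrative assumptions.
FIRST_PASS: Dict[str, List[str]] = {
    "medications": ["aspirin 81 mg daily", "lisinopril 10 mg daily"],
    "principal_diagnosis": ["congestive heart failure", "pneumonia"],
}


def general_recognizer(utterance: str) -> str:
    """Stand-in for a general-vocabulary recognizer (assumption)."""
    return utterance.lower().strip()


def two_pass_recognize(field: Optional[str], utterance: str) -> Tuple[str, str]:
    """Return (recognized_text, pass_used) for an utterance in a data field."""
    normalized = utterance.lower().strip()
    # First pass: try the fixed repertoire of text strings for this field.
    for candidate in FIRST_PASS.get(field, []):
        if candidate == normalized:
            return candidate, "first-pass"
    # Second pass: no match, so fall back to the general vocabulary.
    return general_recognizer(utterance), "second-pass"


if __name__ == "__main__":
    print(two_pass_recognize("medications", "Aspirin 81 mg daily"))
    print(two_pass_recognize("medications", "patient tolerated the procedure well"))
    # Without a delimited field, the first pass has nothing to match against.
    print(two_pass_recognize(None, "aspirin 81 mg daily"))
```

The sketch makes the noted drawbacks concrete: the first pass can only be applied when the current data field is known, and any utterance outside the fixed repertoire drops directly to the general vocabulary without any section-specific constraint.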
Heretofore, there has been no system or method for loading and unloading of dynamic grammars and section-based language models in an automatic speech recognition system.
There exists a need for such a system and method that can operate with clearly defined data fields, but does not require the use of data fields within which the speaker dictates.
There also exists a need for such a system and method that is constrained by knowledge of the text strings that have been observed to have been dictated in certain sections or fields.
There also exists a need for such a system and method that is not limited by vocabulary and grammatical constraints provided by knowledge of the text strings that have been observed to have been dictated in certain sections or fields.
There also exists a need for such a system and method that dynamically identifies the larger context in which words are being used, with or without the presence of headings or key words, and applies section language models or grammars when there is evidence in the dictation that it could be used.