In many industries, there is a need to be able to quickly and accurately describe an event orally and have the information from that description accurately entered into a system to be further processed. The more accurately the system can determine what has been described or spoken, the quicker and more accurately the information can be processed and stored or used such as to generate a report.
The main obstacle to such a system involves solving the complex problem of determining what has been said, or matching the word or words of a spoken utterance to the terms of a template in order to make the correct determination.
Although such an utterance matching determination is useful in a number of industries, one of the industries that would greatly benefit from such a system is the medical industry. In the medical industry, medical professionals regularly generate reports based on review and examination pertaining to the treatment and care of a patient by providing information through a number of input methods, each of which has advantages and disadvantages. These methods include handwriting, typing, dictation and speech recognition systems, among others. Clearly, handwriting and typing are extremely slow methods of inputting information about a subject such as a patient. Further, handwriting, and sometimes typing, often require the person describing the situation to look away from the subject being described. These methods are slow and usually increase the time needed to properly describe, for example, an examination or investigation. In the medical profession, this delay is undesirable and can further impact not only immediate patient care, particularly when the patient is in a critical condition, but also long-term healthcare costs. Equally problematic is the fact that handwritten and typed information is merely text, not actionable data. In order for textual information to be used, it must first be read by a human, who can then act on the information. Actionable data, on the other hand, can be acted upon by automated processes. A simple example is the time invested and mistakes made in processing a familiar handwritten prescription contrasted with the streamlined processing of a prescription entered as data into a patient's electronic medical record.
Dictation allows an individual such as a medical professional to speak the substance of the information into a recording device. From this recording, a transcript is later prepared, often manually. The person dictating typically must review the transcribed report for accuracy. Because typically someone other than the person dictating actually prepares the transcript from the recording made by the professional, errors result from the transcriber's inability to accurately identify what was said. After the professional is satisfied with the accuracy of the transcript, a final report can be prepared, although spelling and grammatical errors often also appear in the transcript and thus in the final report. In addition, it takes time for a dictated report to be transcribed, reviewed, edited, and approved for final distribution. Finally, and most importantly, the resulting transcription is merely text (to be read) not actionable data.
Further, speech recognition technologies are known for entering spoken descriptions into a computer system. These technologies permit a user, such as a medical professional, to speak into a recording device so that, through the use of speech recognition software, a transcription for the medical report can be prepared. For purposes of this application, speech recognition is defined to be synonymous with voice recognition. The transcription or report that results from this process can then be reviewed by the professional, either on a display device (real-time or off-line) or on paper (off-line), and edited, if necessary. This approach, however, is not problem-free.
Problems with conventional speech recognition technologies include erroneous transcription. Transcription error rates typically range from 5% to 15% depending on the speaker's skill with the language used to prepare the report, the environment, and vocabulary. Equally important, speech recognition errors are unpredictable, with even simple words and phrases being misrecognized as completely nonsensical words and phrases. In order to prevent these recognition errors from appearing in the final report, the medical professional must very carefully review the transcribed report. Given the large number of reports that many medical professionals are required to prepare in a single day, they often attempt to review the transcribed text as it is produced by speech recognition software by glancing at the transcribed text on the display device while receiving or analyzing the data or image about which the transcription or report is being prepared.
In some reporting environments, however, this approach is time-consuming and can cause errors in the transcribed text to be overlooked and/or cause errors to creep into the report. For example, for radiologists, the traditional approach to report preparation using speech recognition software is particularly problematic. It is not easy for a radiologist to go from examining the intricate details of an X-ray to reviewing written words, then return to examining the X-ray without losing track of the exact spot on the X-ray or the precise details of the pathology that he or she was examining before reviewing the text transcribed from his or her dictated observations. In addition, the displayed report occupies space on the display device, preventing the device from displaying other content, such as images. Finally, as with dictation, the resulting transcription is merely text (to be read), not actionable data.
Structured reporting technologies are also known. They permit, for example, a medical professional to record data about a patient using a computer user interface, such as a mouse and/or keyboard. The medical report is automatically generated from this information in real-time.
The primary problem with current structured reporting technologies is that they may require that a medical professional take an unacceptable amount of time to complete a report when using a traditional computer user interface. Medical reports often require very large structured reporting data sets. As a result, navigating these data sets may be complex and entering findings may become a lengthy process that requires time that medical professionals could use more effectively attending to other tasks, such as seeing additional patients, preparing additional medical reports, or reading medical literature.
Some structured reporting systems may include the limited use of speech recognition software to support navigation and data entry, in which a user selects an item on-screen by reading its name aloud instead of clicking it with a mouse or enters a numeric value into an on-screen data entry box by speaking it aloud instead of typing it in. While this use of speech recognition allows the reporting interface to be operated in a “hands free” manner, it does not make navigating the structured data set any faster—quite the contrary—nor does it remove the need to look at the reporting interface to see the list of available choices at each data-entry point.
Attempts have been made to improve the efficiency with which reports, including medical reports, are prepared. Often these methods use what are termed “macros”. A macro is a rule or pattern that specifies how a certain input sequence (often a sequence of words) should be mapped to an output sequence (also often a sequence of words) according to a defined procedure. The mapping process instantiates (transforms) a macro into a specific output sequence.
Traditional macros include simple macros and complex macros. A “simple macro” is a text string identified by a name. For example, a macro corresponding to the text string “No focal liver mass or intrahepatic duct dilatation” may be identified by the macro name “Normal liver”. A “complex macro” includes at least one shortcut, or placeholder, such as a blank slot or pick-list, for example as shown in FIG. 1B. The placeholders indicate where the user may, or must, insert additional text. Some technologies that record and transcribe the spoken word utilize macros. For example, when the name of a macro is mentioned in a voice command or selected in a user interface, the associated text and placeholders are included in the medical report. The text can then be selected on-screen and edited, and any placeholders can be selected on-screen and filled in by the medical professional to generate narrative text.
Certain simple macros and the names by which each is identified are shown in the following chart:
Macro Name             Macro Content
“Right dominant”       The coronary circulation is right dominant.
“Normal coronaries”    The ____ coronary arteries are patent without significant disease.
“LAD lesion”           The left anterior descending artery has a ____ stenosis in the ____ segment.
The macro content (right column of the chart) can be orally identified to the system that is being used to prepare the report by simply mentioning the macro name (left column of the chart). The system then includes the associated content (text and/or placeholders) in the report. According to this technology, the user is forced into a rigid command-like dialog. The user speaks the macro name, which the reporting system then uses to select a macro content, followed by the names of terms in various pick-lists (which the reporting system then uses to record terms from pick-lists), such as the sequence below:
    Pathology: mass
    Size: small
    Shape: oval
    Margins: smooth
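The difference between a simple macro and a complex macro with placeholders can be sketched as follows. This is a minimal illustration only, not the behavior of any particular reporting system: the `MACROS` table, the `instantiate` function, and the placeholder names `severity` and `segment` are assumptions introduced here (the chart above shows only unnamed blanks).

```python
import re

# Hypothetical macro library. The text follows the chart above; "{...}"
# marks a placeholder, and the placeholder names are assumptions.
MACROS = {
    "right dominant": "The coronary circulation is right dominant.",
    "lad lesion": (
        "The left anterior descending artery has a {severity} stenosis "
        "in the {segment} segment."
    ),
}


def instantiate(name, **fills):
    """Look up a macro by its spoken or typed name and fill its placeholders."""
    template = MACROS[name.strip().lower()]
    # Placeholders the user has not filled remain visible as blanks.
    return re.sub(
        r"\{(\w+)\}",
        lambda m: fills.get(m.group(1), "____"),
        template,
    )


simple = instantiate("Right dominant")
filled = instantiate("LAD lesion", severity="moderate", segment="proximal")
partial = instantiate("LAD lesion", severity="mild")
```

A simple macro is returned unchanged, whereas a complex macro stays incomplete until every placeholder has been supplied, which is the step that ties the user to the on-screen interface.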
Some reporting systems allow reports to be generated through the use of an extensive set of macros or a macro library. A macro library may include tens, hundreds, or even thousands of macros created, for example, by users to match a specific reporting style, or by commercial vendors and licensed as “comprehensive” macro sets. While large macro sets can be advantageous and permit a wide variety of reports to be prepared more rapidly under a wider range of circumstances, the sheer size of the library can be a significant disadvantage as memorizing all of the macro names may be simply infeasible for the user.
To lessen this problem, large macro libraries may include a user interface that categorizes macros and provides for visual navigation of the extensive macro library. However, this navigation approach has all of the disadvantages of a structured reporting user interface. Navigating an on-screen interface that categorizes the macros in the macro library takes significant time. It also requires a medical professional to remove his or her visual focus from other clinical activities, such as reviewing the medical images which are the subject of the report or even attending to a patient. Navigating an on-screen interface may be a significant distraction that may lead to errors, as well as increase the time it takes to prepare a report.
In addition, macros alone are usually insufficient to complete a medical report. Many medical reports consist of a combination of text strings recorded as macros (and perhaps subsequently edited) and unstructured free-form statements entered directly by the user (transcribed or typed).
Overall, dictation, speech recognition, and structured reporting, including structured reporting using traditional macros, constrain medical reporting, for example, by limiting the speed with which reports can be created and by forcing physicians to adjust their visual and cognitive focus back and forth between clinical information (e.g., images, test results, and patients) and the reporting system's user interface. Medical professionals need to be able to enter information quickly and efficiently, oftentimes so as to transmit the resulting information rapidly to other medical professionals (e.g., referring physicians).
Again, with respect to radiology, a single user may need to enter an enormous amount of information in a single day. Understanding and obtaining the information requires intense visual focus on one or more images, such as X-ray images, computed tomography (“CT”) scans, magnetic resonance images (“MRI”), and ultrasound loops. Having to look away from an image to a computer user interface on which the report text or data entry interface appears is a significant, time-consuming distraction that again can lead to errors and longer reporting times.
To overcome these disadvantages, an individual performing an examination, review or observation should be able to use the spoken word or utterances to enter the necessary information while continuing to perform their core task, for example, a medical professional visually examining medical images or reviewing clinical data.
Existing reporting systems organize content into a set of named macros. For example, in a radiology reporting system, a macro name is typed, selected on-screen, or spoken aloud, matched against the set of macro names, and the corresponding macro selected and recorded into memory. Matching the input name against the set of macro names is a basic text string matching problem, making this a relatively simple task. The downside to this approach is that any information to be recorded using macros must be pre-coordinated into a manageable number of named macros. Exhaustive pre-coordination of related information is obviously intractable. Covering even a modest subset of combinations would yield an impracticably large number of macros.
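The two points in the paragraph above, that name lookup is simple and that pre-coordination grows multiplicatively, can be illustrated with a short sketch. The option lists, the function names, and the field counts below are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical option lists; a real library would have far more entries.
ANATOMY = ["chest", "shoulder", "pelvic"]
PROCEDURES = ["x-ray", "CT", "MRI", "ultrasound"]
HISTORY = ["", "asthma", "bronchitis"]  # "" means no history qualifier


def precoordinated_names(*fields):
    """Enumerate one macro name per combination of field values."""
    return [" ".join(part for part in combo if part) for combo in product(*fields)]


def library_size(field_sizes):
    """Number of macros needed to cover every combination of the fields."""
    return prod(field_sizes)


def lookup(name, names):
    """Basic string matching of an input name against the macro names."""
    wanted = name.strip().lower()
    return next((n for n in names if n.lower() == wanted), None)


names = precoordinated_names(ANATOMY, PROCEDURES, HISTORY)
small_library = len(names)
large_library = library_size([20] * 6)  # six fields with twenty options each
```

Even this toy library of three short fields needs one macro per combination, and the count multiplies with every field added, which is why exhaustive pre-coordination does not scale.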
As a consequence, the use of macros in existing reporting systems is typically limited to pre-selected combinations such as procedure, anatomy, and patient history (disease state). In a radiology reporting system, for example, macro names include “chest x-ray”, “chest x-ray asthma”, “chest x-ray bronchitis”, “shoulder x-ray”, “pelvic x-ray”.
In an effort to reduce the amount of dictation that must be performed without exploding the number of macros, some reporting systems allow a macro to include pick-lists containing additional text that can be selected with a mouse or microphone button. Unfortunately, using this feature requires that a user look at the reporting interface, and away from the image display, in order to select from the various on-screen pick-lists, which greatly limits the effectiveness and use of this feature.
In the case of a simple macro with no hierarchy, existing techniques based on word matching can be used to compute how well an utterance matches the macro text. A term-matching algorithm, for instance, might compute the relative match between an utterance and a macro as being equal to the percentage of terms in the macro text that are matched to a word in the utterance.
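A percentage-based term match of the kind described above might look like the following sketch. The tokenizer and the function name `percent_match` are assumptions made for illustration; the macro text is the “Normal liver” example given earlier.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def percent_match(utterance, macro_text):
    """Fraction of distinct macro terms that also occur in the utterance."""
    macro_terms = set(tokenize(macro_text))
    if not macro_terms:
        return 0.0
    utterance_words = set(tokenize(utterance))
    return len(macro_terms & utterance_words) / len(macro_terms)


NORMAL_LIVER = "No focal liver mass or intrahepatic duct dilatation"
score = percent_match("no focal liver mass", NORMAL_LIVER)
```

Here four of the eight distinct macro terms appear in the utterance, so the score is 0.5. Note that the measure is one-directional: extra words in the utterance do not lower it.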
Term-matching algorithms may use the words in a given vocabulary to populate a term vector space in which each dimension corresponds to a separate word in the vocabulary. The individual dimensions of a term vector space are commonly weighted to reflect the relative infrequency with which terms are used; that is, greater weight is given to terms which occur less frequently. Given a vocabulary term vector space, a given set of terms can be represented as a term vector, where each term in the set has a non-zero (weighted) value in the corresponding dimension of the term vector. Not all words are equally important, or equally useful, when it comes to matching. Weighted term vectors are used to compute a probabilistic score of the degree to which terms match.
Somewhat more sophisticated term-matching algorithms account for the relative match between an utterance and a macro in a bidirectional manner; that is, they attempt to capture how well the macro matches the utterance in addition to how well the utterance matches the macro. In such algorithms, the relative match score is often computed as the dot product of the utterance term vector and macro text term vector. Whether a simple percentage or a dot product is used, the relative degree of the match is typically expressed as a numeric score and threshold filters are applied to categorize the accuracy of the match, such as exact match, partial match, or no match.
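The weighted term vectors and dot-product scoring described in the two paragraphs above can be sketched as follows. The specific weighting (a logarithmic rarity weight), the unit-length normalization, the treatment of unseen words, and the threshold values are all assumptions chosen for illustration; the text does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rarity_weights(corpus):
    """Give greater weight to terms that occur in fewer texts (assumed formula)."""
    n = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(tokenize(doc)))
    return {t: math.log(n / df) + 1.0 for t, df in doc_freq.items()}


def term_vector(text, weights):
    """Sparse unit-length vector with one weighted dimension per distinct term."""
    unseen = max(weights.values(), default=1.0)  # assumption: unseen words are rare
    vec = {t: weights.get(t, unseen) for t in set(tokenize(text))}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def categorize(score, exact=0.99, partial=0.5):
    """Threshold filter; the cut-off values are hypothetical."""
    if score >= exact:
        return "exact match"
    if score >= partial:
        return "partial match"
    return "no match"


MACRO_TEXTS = [
    "No focal liver mass or intrahepatic duct dilatation",
    "The coronary circulation is right dominant",
]
WEIGHTS = rarity_weights(MACRO_TEXTS)


def match_score(utterance, macro_text):
    return dot(term_vector(utterance, WEIGHTS), term_vector(macro_text, WEIGHTS))
```

Because both vectors are normalized, the score is bidirectional: unmatched words on either side lower it, unlike the one-directional percentage measure.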
In the case of a complex macro that includes a term-hierarchy such as that shown in FIG. 1B, simply matching the words in an utterance against the set of terms that occur in the template, as per existing term-matching techniques, will not produce a useful result, in large part because doing so ignores the semantics of the term-hierarchy. For example, matching the utterance “medium mass in the liver” against the set of terms in the template in FIG. 1B produces a percentage score of 0.27 and a term vector dot product score of 0.52. These low scores reflect the inclusion of all the terms in the hierarchy of the SIZE group 14 and the hierarchy of the ORGAN group 16, despite the fact that only one term can be selected from each hierarchy when filling in the template.
An alternative existing approach is to use a finite-state automaton to walk the utterance and the term-hierarchy in parallel (or, more precisely, to walk the utterance and a depth-first traversal of the term-hierarchy in parallel), attempting to match words of the utterance with terms of the term-hierarchy. Using a finite-state automaton to match the utterance “large mass in the liver” to the template in FIG. 1B begins with matching the utterance against the hierarchy of the SIZE group 14, yielding a match for “large” 15; followed by a match with “mass” 12 in the hierarchy root; and finally with a match for “liver” 17 in the hierarchy of the ORGAN group 16.
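The dilution effect can be reproduced on a small scale. The template below is a hypothetical stand-in, not the template of FIG. 1B, so it does not reproduce the 0.27 figure; the group contents and both scoring functions are assumptions made only to show why flattening a hierarchy lowers the score.

```python
# Hypothetical template: a root term plus two pick-list groups.
TEMPLATE = {
    "SIZE": ["small", "medium", "large"],
    "ROOT": ["mass"],
    "ORGAN": ["liver", "kidney", "spleen", "pancreas"],
}


def flat_percent(utterance, template):
    """Percentage match against every term in the template, ignoring groups."""
    words = set(utterance.lower().split())
    all_terms = [t for terms in template.values() for t in terms]
    return sum(t in words for t in all_terms) / len(all_terms)


def groups_covered(utterance, template):
    """Fraction of groups for which at least one term was spoken."""
    words = set(utterance.lower().split())
    hits = sum(any(t in words for t in terms) for terms in template.values())
    return hits / len(template)


utterance = "medium mass in the liver"
flat = flat_percent(utterance, TEMPLATE)
covered = groups_covered(utterance, TEMPLATE)
```

The utterance supplies one term for every group, yet the flat score is only 3 of 8 terms (0.375), because the alternatives that can never be selected together are all counted in the denominator.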
The disadvantage of matching using finite-state automata is that such techniques are critically dependent on the order of the words in the utterance precisely matching the order of the terms in the template. For example, finite-state automata matching techniques will not match the utterance “liver has a large mass” with the template shown in FIG. 1B because the word “liver” precedes the keyword “mass” in the utterance, but follows it in the template. Nor do finite-state automata approaches account for situations in which no terms match in a given hierarchy or where there are only partial matches within a given hierarchy or with the hierarchy root. These limitations become acute as the size and complexity of the term-hierarchy increases.
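The order dependence can be seen in a simplified sequential matcher. This is a sketch of the general idea rather than any cited automaton: the slot order SIZE, root, ORGAN follows the walk described above, but the alternative terms in each slot and the function name are hypothetical.

```python
# Slots in template order. Only "large", "mass" and "liver" come from the
# text above; the other terms are hypothetical alternatives.
SLOTS = [
    ("SIZE", {"small", "medium", "large"}),
    ("ROOT", {"mass"}),
    ("ORGAN", {"liver", "kidney"}),
]


def sequential_match(utterance, slots):
    """Walk the utterance left to right, filling slots strictly in order."""
    words = utterance.lower().split()
    pos = 0
    filled = {}
    for name, terms in slots:
        # Skip filler words until one fits the current slot.
        while pos < len(words) and words[pos] not in terms:
            pos += 1
        if pos == len(words):
            return None  # ran out of words before this slot was matched
        filled[name] = words[pos]
        pos += 1
    return filled


in_order = sequential_match("large mass in the liver", SLOTS)
reordered = sequential_match("liver has a large mass", SLOTS)
```

The second utterance carries the same three terms, but “liver” is consumed while the matcher is still looking for a size, so the ORGAN slot is never filled and the match fails.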
These disadvantages are intrinsic to the approach, but are not a disadvantage in domains in which a user is limited to reading from a scripted or prompted dialog; for example, when reading aloud a choice from among a set of displayed “fill-in-the-blank” options. These are the kinds of domains for which matching using finite-state automata has been cited in the past.
Another problem with term-based matching is that it treats each utterance and template as a simple “bag of words”. For example, consider matching the utterance “mass in the lower right quadrant of the left breast” against the following term-hierarchy:
    PATHOLOGY: mass
    ANATOMY: [left breast, right breast]
    LOCATION: [upper left quadrant, upper right quadrant, lower left quadrant, lower right quadrant]
Term-based matching yields an inconclusive result with respect to the “ANATOMY” field and the “LOCATION” field, because both “left breast” and “right breast” include the term “breast”, which is a word of the utterance, and both “lower left quadrant” and “lower right quadrant” include the terms “lower” and “quadrant” of the utterance.
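The ambiguity can be demonstrated directly with the term-hierarchy from the example above. The scoring function is a minimal bag-of-words sketch; its name and the tie-reporting helper are assumptions made for illustration.

```python
HIERARCHY = {
    "PATHOLOGY": ["mass"],
    "ANATOMY": ["left breast", "right breast"],
    "LOCATION": [
        "upper left quadrant",
        "upper right quadrant",
        "lower left quadrant",
        "lower right quadrant",
    ],
}


def bag_score(utterance, term):
    """Fraction of the term's words found anywhere in the utterance."""
    words = set(utterance.lower().split())
    term_words = term.split()
    return sum(w in words for w in term_words) / len(term_words)


def best_terms(utterance, hierarchy):
    """For each field, return every term that ties for the top score."""
    result = {}
    for field, terms in hierarchy.items():
        scores = {t: bag_score(utterance, t) for t in terms}
        top = max(scores.values())
        result[field] = [t for t in terms if scores[t] == top]
    return result


best = best_terms("mass in the lower right quadrant of the left breast", HIERARCHY)
```

Because word positions are discarded, “left” and “right” each count toward every term that contains them, so two candidates tie in both the ANATOMY and LOCATION fields.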
Based on the disadvantages described above, there is a need for an utterance matching system that allows users, such as medical professionals, to match the words in a spoken utterance to the terms in a template hierarchy to select the best matching template or set of templates. There is a need for an utterance matching system that accounts for variation in utterance expression such as word order, grammatical form, incomplete phrasings, extraneous terms, synonymous terms, and multi-term phrasings. Therefore, there is a need for a system that operates independent of word order, form, construction or pattern of the utterance, but relies on structure, semantics and content thereby allowing a user to enter information about a subject such as a patient into a system using utterances, thereby not requiring the user to be distracted from visual focus of the subject.