In many industries, there is a need to be able to quickly and accurately describe information orally and have the information from that description accurately entered into a system to be further processed. The more accurately the system can determine what has been described or spoken, the quicker and more accurately the information can be processed and stored or used such as to generate a report.
The main obstacle to such a system involves solving the complex problem of determining what has been said, or matching the word or words of a spoken utterance to the terms of a template in order to make the correct determination.
Although such an utterance matching determination is useful in a number of industries, one of the industries that would certainly benefit from such a system is the medical industry. In the medical industry, medical professionals regularly generate reports based on review and examination of a patient by providing information through a number of input methods, each of which has certain advantages and disadvantages. These input methods include handwriting, typing, dictation, and speech recognition systems, among others. Handwriting and typing are slow methods of inputting information about a subject such as a patient. Further, handwriting and sometimes typing require the person describing the situation to look away from the subject being described. These methods often extend the time necessary to input a proper description of, for example, an examination or investigation. In the medical profession, this extended input time is undesirable and can impact not only immediate patient care, particularly when the patient is in a critical condition, but also long term healthcare costs and the overall efficiency of a medical professional or institution. Equally problematic is the fact that handwritten and typed information is merely text, not actionable data. In order for textual information to be used by certain types of systems, it must first be reviewed by a human, who can then act on the information. Actionable data, on the other hand, can be acted upon by automated processes. Such actionable data not only reduces the time a medical professional or other person must invest to act on the data, but also reduces mistakes arising from human interpretation of handwriting. For example, a handwritten prescription must be read and interpreted by a person before it can be processed, whereas a prescription entered as actionable data can be imported directly into a patient's electronic medical record.
Another input method, dictation, permits an individual such as a medical professional to verbalize the substance of the information into a recording device. From this recording, a written transcript is prepared, either by the person dictating or, typically, by a second person who listens to the recording at a later time. The person dictating typically must review the transcribed report for accuracy. Because someone other than the person dictating often prepares the transcript from the recording, errors result from the transcriber's inability to accurately identify what was said. After the medical professional is satisfied with the accuracy of the transcript, a final report can be prepared, although spelling and grammatical errors often also appear in the transcript and thus in the final report. In addition, it takes time for a dictated report to be transcribed, reviewed, edited, and approved for final distribution. Finally, and most importantly, the resulting transcription is merely readable text, not actionable data.
Another input device permits transferring spoken words into actionable data without requiring a person to transcribe a recording, specifically, a speech recognition device. Such devices are known for entering spoken descriptions into a computer system. These technologies permit a user, such as a medical professional, to speak into a recording device and, through the use of speech recognition software, a transcription for the medical report can be prepared. For purposes of this application, the term “speech recognition” is defined to be synonymous with “voice recognition”. The transcription or report that results from this process can then be revised by the professional, either on a display device (real-time or off-line) or on paper (off-line), and edited, if necessary. This approach, however, is not without its drawbacks.
Problems with conventional speech recognition technologies include erroneous transcription. Transcription error rates typically range from 5% to 15% depending on the speaker's precision and skill with the language used to prepare the report, the environment in which the speaker is verbalizing the information, and the difficulty of vocabulary in the verbalization. Equally important, speech recognition errors are unpredictable, with even simple words and phrases sometimes misrecognized as completely nonsensical words and phrases. In order to prevent these recognition errors from appearing in the final report, the medical professional must very carefully review the machine-transcribed report. Given the large number of reports that many medical professionals are required to prepare in a single day, they often attempt to review the transcribed text as it is produced by speech recognition software by glancing at the transcribed text on the display device while multi-tasking, for example, while receiving or analyzing the data or image about which the transcription or report is being prepared.
This multi-tasking approach is time consuming and makes it easy to overlook errors in the transcribed text. For certain medical professionals, for example, radiologists, the traditional approach to report preparation using speech recognition software is particularly problematic. It is not easy for a radiologist to go from examining the intricate details of an X-ray to reviewing written words, then return to examining the X-ray without losing track of the exact spot on the X-ray or the precise details of the pathology that he or she was examining before reviewing the text transcribed from his or her dictated observations. In addition, the displayed report occupies space on the display device, preventing the display device from showing other content, such as images. Finally, as with dictation, the resulting transcription is merely readable text, not actionable data.
Structured reporting technologies also are known. They permit, for example, a medical professional to record data about a patient using a computer user interface, such as a mouse and/or keyboard. The medical report is automatically generated from this information in real-time.
The primary problem with current structured reporting technologies is that they may require that a medical professional take an unacceptable amount of time to complete a report when using a traditional computer user interface. Medical reports often require very large structured reporting data sets. As a result, navigating these data sets may be complex and entering findings may become a lengthy process that requires time that medical professionals could use more effectively attending to other tasks, such as seeing additional patients, preparing additional medical reports, or reading medical literature.
Some structured reporting systems may include the limited use of speech recognition software to support navigation and data entry, in which a user selects an item on-screen by reading its name aloud instead of clicking it with a mouse or enters a numeric value into an on-screen data entry box by speaking it aloud instead of typing it in. While this use of speech recognition allows the reporting interface to be operated in a “hands free” manner, it does not make navigating the structured data set any faster—quite the contrary—nor does it remove the need to look at the reporting interface to see the list of available choices at each data-entry point.
Attempts have been made to improve the efficiency with which reports, including medical reports, are prepared. Often these methods use what are termed “macros”. A macro is a rule or pattern that specifies how a certain input sequence (often a sequence of words) should be mapped to an output sequence (also often a sequence of words) according to a defined procedure. The mapping process transforms a macro into a specific output sequence. For example, a “simple macro” is a text string 11, as illustrated in FIG. 1A, identified by a name. Another example is a macro corresponding to the text string “No focal liver mass or intrahepatic duct dilatation” identified by the macro name “Normal liver”.
In a reporting system that uses macros, a macro name is typed, selected on-screen, or spoken aloud, matched against the set of macro names in the system, and the corresponding macro selected and recorded into memory. Matching the input name against the set of macro names is a basic text string matching problem, making this a relatively simple task.
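The name-matching step described above can be sketched as a simple dictionary lookup. The following is a minimal illustration only, assuming a hypothetical in-memory macro library; the names `MACROS`, `normalize`, and `expand_macro` are not part of any system described here, and the normalization rule (lowercasing and collapsing whitespace) is an assumption.

```python
from typing import Optional

# Hypothetical macro library: macro name -> macro content.
# The two entries reuse example macros mentioned in this description.
MACROS = {
    "normal liver": "No focal liver mass or intrahepatic duct dilatation",
    "right dominant": "The coronary circulation is right dominant.",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so typed and spoken variants compare equal."""
    return " ".join(name.lower().split())


def expand_macro(input_name: str) -> Optional[str]:
    """Return the macro content whose name exactly matches the input, else None."""
    return MACROS.get(normalize(input_name))


if __name__ == "__main__":
    print(expand_macro("Normal  Liver"))  # finds the "normal liver" macro
    print(expand_macro("liver normal"))   # no such macro name, so None
```

Because the lookup is an exact string match on the macro name, the user must recall each name precisely, which is the limitation discussed next.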
However, there are certain disadvantages associated with systems using only “macro name”-activated reporting. The downside to this approach is that any information to be recorded using macros must be pre-coordinated into a manageable number of named macros. Exhaustive pre-coordination of related information is obviously intractable. Covering even a modest subset of combinations would yield an impractically large number of macros. For example, some reporting systems permit reports to be generated through the use of an extensive set of macros or a macro library. A macro library may include tens, hundreds, or even thousands of macros created, for example, by users to match a specific reporting style, or by commercial vendors and licensed as “comprehensive” macro sets. While large macro sets can permit a wide variety of reports to be prepared more rapidly under a wider range of circumstances, the sheer size of the library can be a significant disadvantage as memorizing all of the macro names may be simply infeasible for the user.
To reduce the effects of this disadvantage, large macro libraries may include a user interface that categorizes macros and provides for visual navigation of the extensive macro library. However, this visual navigation approach has nearly all of the disadvantages of a structured reporting user interface. Navigating an on-screen interface that categorizes the macros in the macro library takes significant time. It also requires a medical professional to remove his or her visual focus from other clinical tasks, such as reviewing the medical images which are the subject of the report or even attending to a patient. Navigating an on-screen interface may be a significant distraction that may lead to errors, as well as increase the time it takes to prepare a report.
Also, in an effort to reduce the amount of dictation that must be performed without exploding the number of macros, certain systems may use what is called a “complex macro”. A “complex macro” includes at least one short cut, or placeholder, such as a blank slot or pick-list, an example of which is shown in FIG. 1B. The placeholders indicate where the user may, or must, insert additional text. Some technologies that record and transcribe the spoken word utilize simple or complex macros. For example, by stating the name of the macro in a voice command or selecting the name in a user interface, the associated text and placeholders are included in the medical report. The text can then be selected on-screen and edited, and any placeholders can be selected on-screen and filled in by the medical professional to generate narrative text.
Certain simple macros and the names by which each is identified are shown in the following chart:
Macro Name | Macro Content
“Right dominant” | The coronary circulation is right dominant.
“Normal coronaries” | The coronary arteries are patent without significant disease.
“LAD lesion” | The left anterior descending artery has a ____ stenosis in the ____ segment.
The macro content (right column of chart) can be orally identified to the system that is being used to prepare the report by simply mentioning the macro name (left column of chart). The system then imports the associated content (text and/or placeholders) in the report. The user speaks the macro name, which the reporting system then uses to select a macro content, followed by the names of terms in various pick-lists (which the reporting system then uses to record terms from pick-lists), such as the sequence below:
        Pathology: mass
        Size: small
        Shape: oval
        Margins: smooth
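A command-style dialog of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: the template sentence, the pick-list contents beyond the four spoken terms above, and the names `TEMPLATE`, `PICK_LISTS`, and `fill_macro` are all hypothetical and chosen only for illustration.

```python
# Hypothetical complex macro: each {field} is a placeholder backed by a pick-list.
TEMPLATE = "There is a {size} {shape} {pathology} with {margins} margins."

# Hypothetical pick-lists; only "mass", "small", "oval", and "smooth" come from
# the example sequence in the text.
PICK_LISTS = {
    "pathology": ["mass", "cyst"],
    "size": ["small", "medium", "large"],
    "shape": ["oval", "round", "irregular"],
    "margins": ["smooth", "spiculated"],
}


class Blanks(dict):
    """Dictionary that renders any unfilled placeholder as a visible blank."""

    def __missing__(self, key):
        return "____"


def fill_macro(template, pick_lists, spoken_items):
    """Fill placeholders from spoken 'Field: term' items; reject unknown terms."""
    values = Blanks()
    for item in spoken_items:
        field, _, term = item.partition(":")
        field, term = field.strip().lower(), term.strip().lower()
        if term not in pick_lists.get(field, []):
            raise ValueError(f"'{term}' is not in the pick-list for '{field}'")
        values[field] = term
    return template.format_map(values)


if __name__ == "__main__":
    spoken = ["Pathology: mass", "Size: small", "Shape: oval", "Margins: smooth"]
    print(fill_macro(TEMPLATE, PICK_LISTS, spoken))
```

The sketch accepts only exact field names and exact pick-list terms, which mirrors the rigid dialog that this kind of input imposes on the user.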
However, because this type of technology often offers only “macro name” type information input, the user is forced into a rigid command-like dialog. In addition, using the “complex macro” feature requires that a user look at the reporting interface, and away from the image display, in order to select from the various on-screen placeholders, greatly limiting the effectiveness and use of this feature.
Whether using a simple macro or a complex macro, certain techniques for the step of matching the utterance with the macro (or macro name) are known in the art.
For example, a term-matching algorithm might compute the relative match between an utterance and a macro as being equal to the percentage of terms in the macro text that are matched to a word in the utterance. Term-matching algorithms may use the words in a given vocabulary to populate a term vector space in which each dimension corresponds to a separate word in the vocabulary. The individual dimensions of a term vector space are commonly weighted to reflect the relative infrequency with which terms are used; that is, greater weight is given to terms which occur less frequently. Given a vocabulary term vector space, a given set of terms can be represented as a term vector, where each term in the set has a non-zero (weighted) value in the corresponding dimension of the term vector. Not all words are equally important, or equally useful, when it comes to matching. Weighted term vectors are used to compute a probabilistic score of the degree to which terms match.
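The percentage score and the infrequency weighting described above can be sketched as follows. This is an illustrative sketch only: the tokenizer, the specific weighting formula (a smoothed inverse document frequency), and the function names are assumptions, not a method prescribed by this description.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def percent_match(utterance, macro_text):
    """Fraction of distinct macro terms that also occur in the utterance."""
    macro_terms = set(tokenize(macro_text))
    if not macro_terms:
        return 0.0
    return len(macro_terms & set(tokenize(utterance))) / len(macro_terms)


def infrequency_weights(macro_texts):
    """Give greater weight to terms that occur in fewer macro texts.

    Uses log(N / document_frequency) + 1, one common choice among many.
    """
    n = len(macro_texts)
    doc_freq = Counter(t for text in macro_texts for t in set(tokenize(text)))
    return {term: math.log(n / count) + 1.0 for term, count in doc_freq.items()}


def term_vector(text, weights):
    """Sparse term vector: each term present gets its weighted value."""
    return {t: weights.get(t, 0.0) for t in set(tokenize(text))}


if __name__ == "__main__":
    print(percent_match("no liver mass", "No focal liver mass"))
    weights = infrequency_weights(["normal liver", "normal spleen", "normal kidney"])
    print(term_vector("normal liver", weights))
```

In the usage example, "normal" appears in every macro text and therefore receives the lowest weight, while "liver" appears in only one and is weighted more heavily.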
Somewhat more sophisticated term-matching algorithms account for the relative match between an utterance and a macro in a bidirectional manner; that is, they attempt to capture how well the macro matches the utterance in addition to how well the utterance matches the macro. In such algorithms, the relative match score is often computed as the dot product of the utterance term vector and macro text term vector. Whether a simple percentage or a dot product is used, the relative degree of the match is typically expressed as a numeric score and threshold filters are applied to categorize the accuracy of the match, such as exact match, partial match, or no match.
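A dot-product score with threshold filters might be sketched as below. The sketch assumes the term vectors are normalized to unit length so that the score falls between 0 and 1, and the threshold values are arbitrary placeholders; neither assumption is stated in this description.

```python
import math


def dot(u, v):
    """Dot product of two sparse term vectors stored as dictionaries."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def normalized_dot(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def categorize(score, exact=0.99, partial=0.4):
    """Apply threshold filters; the cut-off values here are assumptions."""
    if score >= exact:
        return "exact match"
    if score >= partial:
        return "partial match"
    return "no match"


if __name__ == "__main__":
    utterance = {"liver": 1.0, "mass": 1.0}
    macro = {"liver": 1.0, "cyst": 1.0}
    score = normalized_dot(utterance, macro)
    print(round(score, 2), categorize(score))
```

Because the normalized dot product is symmetric in its two arguments, it penalizes both macro terms missing from the utterance and utterance terms missing from the macro, which is the bidirectional behavior described above.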
However, for a macro that includes a term-hierarchy such as the illustration in FIG. 1B, simply matching the words in an utterance against the set of terms that occur in the template, as per existing term-matching techniques, will not produce a useful result, in large part because doing so ignores the semantics of the term-hierarchy. For example, matching the utterance “medium mass in the liver” against the set of terms in the template in FIG. 1B produces a percentage score of 0.27 and a term vector dot product score of 0.52. These low scores reflect the inclusion of all the terms in the hierarchy of the SIZE group 14 and the hierarchy of the ORGAN group 16, despite the fact that only one term can be selected from each hierarchy when filling in the template.
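The dilution effect can be illustrated with a small sketch. The template below is hypothetical and much smaller than the one in FIG. 1B, which is not reproduced here, so the resulting numbers differ from the 0.27 and 0.52 cited above. The second function is included only to make the reasoning concrete (at most one term per group can ever be selected); it is not a technique attributed to any existing system.

```python
import re

# Hypothetical stand-in for a term-hierarchy template; group contents are assumed.
TEMPLATE = {
    "ROOT": ["mass"],
    "SIZE": ["small", "medium", "large"],
    "ORGAN": ["liver", "kidney", "spleen", "pancreas"],
}


def words(text):
    """Set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_percentage(utterance, template):
    """Treat every term in every group as something the utterance should contain."""
    spoken = words(utterance)
    all_terms = [term for group in template.values() for term in group]
    return sum(term in spoken for term in all_terms) / len(all_terms)


def one_per_group_percentage(utterance, template):
    """Count each group once, since only one of its terms can be selected."""
    spoken = words(utterance)
    filled = sum(any(term in spoken for term in group) for group in template.values())
    return filled / len(template)


if __name__ == "__main__":
    utterance = "medium mass in the liver"
    print(flat_percentage(utterance, TEMPLATE))           # 3 of 8 terms
    print(one_per_group_percentage(utterance, TEMPLATE))  # 3 of 3 groups
```

Even though the utterance supplies a value for every group, the flat score stays low because the unselected alternatives in each group count against it.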
An alternative existing approach is to use a finite-state automaton to walk the utterance and the term-hierarchy in parallel (more precisely, to walk the utterance and a depth-first traversal of the term-hierarchy in parallel), attempting to match words of the utterance with terms of the term-hierarchy. Using a finite-state automaton to match the utterance “large mass in the liver” to the template in FIG. 1B begins with matching the utterance against the hierarchy of the SIZE group 14, yielding a match for “large” 15; followed by a match with “mass” 12 in the hierarchy root; and finally with a match for “liver” 17 in the hierarchy of the ORGAN group 16.
A disadvantage of using finite-state automata for the matching step is that it depends on the order of the words in the utterance precisely matching the order of the terms in the template. For example, finite-state automata matching techniques will not match the utterance “liver has a large mass” with the template shown in FIG. 1B because the word “liver” precedes the keyword “mass” in the utterance, but follows it in the template. Nor do finite-state automata approaches account for situations in which no terms match in a given hierarchy or where there are only partial matches within a given hierarchy or with the hierarchy root. These limitations become acute as the size and complexity of the term-hierarchy increases.
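The order dependence can be seen in a minimal sketch of such a matcher. The state sequence below is an assumed simplification of a depth-first traversal of the template in FIG. 1B; the terms other than “large”, “mass”, and “liver” are hypothetical, as are the names `TEMPLATE_STATES` and `fsa_match`.

```python
# Ordered states: each state accepts exactly one term from its set, in this order.
TEMPLATE_STATES = [
    ("SIZE", {"small", "medium", "large"}),
    ("PATHOLOGY", {"mass"}),
    ("ORGAN", {"liver", "kidney", "spleen"}),
]


def fsa_match(utterance, states=TEMPLATE_STATES):
    """Walk the utterance left to right, advancing one state per accepted word.

    Words that the current state does not accept are skipped as filler.
    Returns the field bindings if every state was reached in order, else None.
    """
    bindings = {}
    current = 0
    for word in utterance.lower().split():
        if current < len(states) and word in states[current][1]:
            bindings[states[current][0]] = word
            current += 1
    return bindings if current == len(states) else None


if __name__ == "__main__":
    print(fsa_match("large mass in the liver"))  # all three states reached
    print(fsa_match("liver has a large mass"))   # None: "liver" arrives too early
    print(fsa_match("mass in the liver"))        # None: no SIZE term, so no progress
```

The second and third calls fail for the two reasons given above: the words arrive in a different order than the template expects, or a hierarchy has no matching term at all.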
These disadvantages are intrinsic to the approach, but are not a disadvantage in domains in which a user is limited to reading from a scripted or prompted dialog; for example, when reading aloud a choice from among a set of displayed “fill-in-the-blank” options. These are the kinds of domains for which matching using finite-state automata has been cited in the past.
Another problem with term-based matching is that it treats each utterance and template as a simple “bag of words”. For example, consider matching the utterance “mass in the lower right quadrant of the left breast” against the following term-hierarchy:
        PATHOLOGY: mass
        ANATOMY: [left breast, right breast]
        LOCATION: [upper left quadrant, upper right quadrant, lower left quadrant, lower right quadrant]
Term-based matching yields an inconclusive result with respect to the “ANATOMY” field and the “LOCATION” field. Both “left breast” and “right breast” include the term “breast”, which is a word of the utterance, and the words “left” and “right” both occur somewhere in the utterance. Likewise, both “lower left quadrant” and “lower right quadrant” include the terms “lower” and “quadrant” of the utterance.
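The ambiguity in this example can be reproduced with a short sketch. The scoring rule (fraction of an option's words found anywhere in the utterance) is one assumed form of bag-of-words matching, and the function name is hypothetical.

```python
UTTERANCE = "mass in the lower right quadrant of the left breast"

ANATOMY = ["left breast", "right breast"]
LOCATION = [
    "upper left quadrant",
    "upper right quadrant",
    "lower left quadrant",
    "lower right quadrant",
]


def best_bag_of_words_options(utterance, options):
    """Return every option tied for the highest word-overlap score.

    Word order and adjacency are ignored, as in any bag-of-words comparison.
    """
    spoken = set(utterance.lower().split())
    scores = {
        option: len(spoken & set(option.split())) / len(option.split())
        for option in options
    }
    best = max(scores.values())
    return sorted(option for option, score in scores.items() if score == best)


if __name__ == "__main__":
    print(best_bag_of_words_options(UTTERANCE, ANATOMY))
    print(best_bag_of_words_options(UTTERANCE, LOCATION))
```

Each call returns two tied options rather than one, because “left” and “right” both appear in the utterance and the comparison has no notion of which word modifies “breast” and which modifies “quadrant”.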
Clearly, known techniques for matching an utterance with a macro text have distinct disadvantages.
In addition, macros alone are usually insufficient to complete a report such as a medical report. The use of macros in existing reporting systems is typically limited to pre-selected combinations of information such as procedures, anatomy, and patient history including disease state. In a radiology reporting system, examples of macro names include “chest x-ray”, “chest x-ray asthma”, “chest x-ray bronchitis”, “shoulder x-ray”, and “pelvic x-ray”. Accordingly, many medical reports consist of a combination of text strings recorded as macros (and perhaps subsequently edited) and then, because the available macros do not capture certain information that the medical professional wishes to convey, unstructured free-form statements are entered directly by the user either via transcription or typing.
When multiple types of input methods (e.g., macros and free-form text) are needed to convey the desired information, the existing reporting systems may require the medical professional to utilize a mouse, keyboard, or other physical input device to navigate through the various input options. Accordingly, the medical professional must not only look away from the task (e.g., interacting with or testing the patient or reading a medical image) but also use his or her hands to navigate through the various on-screen interfaces. If, for example, the medical professional is conducting an ultrasound or using another medical device to perform a test, his or her hands may have to be removed from the medical device (thereby possibly interrupting the test/procedure) to navigate through the medical reporting screens.
Overall, dictation, speech recognition, and structured reporting, any of which may include using simple or complex macros, permit only limited medical reporting processes. For example, such known methods limit the speed with which reports can be created and require medical professionals to adjust their visual and cognitive focus back and forth between clinical information (e.g., images, test results, and patients) and the reporting system's user interface. Medical professionals need to be able to enter information quickly and efficiently, possibly to transmit the resulting information rapidly to other medical professionals (e.g., referring physicians or pharmacists) or to move on to another value-added task.
Again, regarding the example in which the user is a radiologist, each user may need to enter an enormous amount of information in a single day. Understanding and obtaining the information requires intense visual focus on one or more images, such as X-ray images, computed tomography (“CT”) scans, magnetic resonance imaging (“MRI”) scans, and ultrasound loops. Having to look away from an image to a computer user interface on which the report text or data entry interface appears is a significant, time-consuming distraction that can lead to errors and longer reporting times.
To overcome these disadvantages, an individual performing an examination, review, or observation should be able to use spoken words or utterances to enter the necessary information while continuing to perform their core task, for example, a medical professional visually examining medical images or reviewing clinical data.
Clearly, there is a need for a system and methods configured to compare spoken utterances about a subject to the terms in a template hierarchy that accounts for variation in utterance expression such as word order, grammatical form, incomplete phrasings, extraneous terms, synonymous terms, and multi-term phrasings; generate more accurate matches such that the user needs to make fewer edits and can maintain more visual focus on the subject; and, after conducting the comparison, provide the output to the user as actionable data.