The present invention relates to a method and system for processing spoken language, and more specifically to using semantic recognition methods to determine when a spoken utterance is part of a knowledge domain, to correct incorrect or incorrectly recognized utterances, and to suggest context-specific related domain sentences.
Speech recognition is a process in which an acoustic signal is converted to words and symbols by a computer. Nearly all speech recognizers use Hidden Markov Models (HMMs) to identify the most likely string of words in an utterance. HMMs are statistical models that associate spoken words with structural acoustic features found in speech signals. This association is made in part by training with speech data. HMM-based recognizers also use word-level models with co-occurrence frequency information to improve recognition accuracy. Algorithms such as Viterbi beam search direct the search for likely word strings using these models. While this technique is accurate to a point, it can only produce statistical word string hypotheses, because these statistical models are inadequate for representing the semantic structure of an utterance. Yet semantic information is central to determining whether an utterance is part of a domain of knowledge and, if it is not, to suggesting a meaningful correction. With semantic analysis, a speech recognition system can anticipate and suggest to the user other sentences they may wish to speak in a specific context. Semantic understanding also adds a valuable dimension to processing speech, making it useful for decision support, automatic summarization, and coding. While the utility of semantic understanding is widely acknowledged in the speech community, the development of semantic recognition algorithms is difficult.
Some have tried to use methods from natural language processing (NLP) to extract semantic information. Speech recognition systems that employ NLP are less common than purely statistically driven recognizers, and normally apply it as a sequential stage after statistical word recognition. Often the goal has been to process information for queries or commands spoken by users (Romero, Method and system for semantic speech recognition, U.S. patent application 20020111803; Vanbuskirk, Method and apparatus for disambiguating lists of elements for speech interfaces, U.S. Pat. No. 6,523,004).
Most NLP systems use parsers, which employ grammar rules to identify phrases in a text string, and then use slot-filler logic to extract a semantic item. For example, a system might have a rule R1=<SUBJ><is normal> that, when applied to the input string “The heart is normal”, creates the output <SUBJ=heart> and semantically fills a slot such as Appearance: (Normal heart). A key problem with this type of semantic extraction is defining the relevant concepts in a circumscribed area of knowledge, that is, a domain. Few tools and methods are available for systematically categorizing domain knowledge, especially for medium to large scale domains. Knowledge engineers often spend months creating even modest knowledge-bases. Without a semantic knowledge-base, a parser cannot successfully extract the knowledge of interest.
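The slot-filler example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule R1 is approximated with a regular expression, and the function name apply_r1 and the dictionary layout are hypothetical, not taken from any cited system.

```python
import re

# Hypothetical encoding of rule R1 = <SUBJ><is normal> as a regular expression.
# An optional leading article is skipped; the remaining words before
# "is normal" are captured as the subject.
R1 = re.compile(
    r"^(?:the\s+)?(?P<subj>[a-z]+(?:\s+[a-z]+)*?)\s+is\s+normal\.?$",
    re.IGNORECASE,
)


def apply_r1(sentence):
    """Return the filled slots if R1 matches the sentence, otherwise None."""
    match = R1.match(sentence.strip())
    if match is None:
        return None
    subj = match.group("subj").lower()
    # The slot name "Appearance" follows the example in the text.
    return {"SUBJ": subj, "Appearance": f"Normal {subj}"}


if __name__ == "__main__":
    print(apply_r1("The heart is normal"))
    print(apply_r1("The heart is red"))
```

The second call returns None because no rule covers the input. Each new concept or phrasing needs its own rule and slot, which is where the knowledge engineering effort described above accumulates.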
Parsers employed in conjunction with speech recognition are usually given the speech recognizer's top hypothesis after the initial recognition step. Chang and Jackson (U.S. Pat. No. 6,567,778) improved the confidence score for the top hypothesis generated by a speech recognizer by using a parser to determine whether to accept the words used in certain semantic slots. If the slot was rejected, their program could ask the user to repeat only the information necessary to fill that slot, rather than requiring the user to repeat the entire stream of input speech. Weber (U.S. Pat. No. 6,532,444) used a similar approach to semantic speech processing, additionally providing a backup grammar file of general rules if context-specific processing failed. One problem with parsers is their high computational cost. While speech processing continues to increase in speed, natural language parsers operate in O(n^3) time, where n equals the number of word tokens, and thus are not suitable for interactive speech processing except for very short sentences.
Few have used NLP techniques to guide the speech recognizer's own search over different word string hypotheses, because of the computational complexity. In Crespo (U.S. patent application 20010041978), a method was described that used NLP to restrict searching to an N-best list of salient words using grammar rules. In Asano et al. (U.S. Pat. No. 5,991,721), actual language examples were used instead of grammar rules. The speech recognizer's hypotheses were compared to these examples using a similarity metric applied to each of the individual words, supplemented by a thesaurus. The best match was then used as the top-rated hypothesis. While this method assures some similarity between the input words and the actual words in the example database, it cannot guarantee that the semantic meaning of the entire sentence is equivalent. Asano also does not describe methods for creating the example database. Considering the complexity and expressiveness of language, creating a good example database is a major obstacle to using this method in even a medium size knowledge domain.
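The limitation of word-level matching can be illustrated with a minimal sketch. The metric below is an assumption chosen for simplicity (the fraction of hypothesis words that match an example word directly or through a thesaurus); it is not the metric of the cited patent, and the thesaurus entries and function names are hypothetical.

```python
# Hypothetical thesaurus: each word maps to words treated as equivalent.
THESAURUS = {
    "normal": {"unremarkable"},
}


def words_match(a, b):
    """True if two words are identical or linked by the thesaurus."""
    return a == b or b in THESAURUS.get(a, set()) or a in THESAURUS.get(b, set())


def similarity(hypothesis, example):
    """Share of hypothesis words found in the example, scaled by the longer sentence."""
    hyp = hypothesis.lower().split()
    ex = example.lower().split()
    if not hyp or not ex:
        return 0.0
    matched = sum(1 for w in hyp if any(words_match(w, e) for e in ex))
    return matched / max(len(hyp), len(ex))


def best_hypothesis(hypotheses, examples):
    """Pick the hypothesis whose best-matching example scores highest."""
    return max(hypotheses, key=lambda h: max(similarity(h, e) for e in examples))


if __name__ == "__main__":
    examples = ["the heart is unremarkable"]
    print(best_hypothesis(["the hard is normal", "the heart is normal"], examples))
    print(similarity("there is evidence of atelectasis",
                     "there is no evidence of atelectasis"))
```

In this sketch the hypothesis that drops the word "no" still scores 5/6 against the example sentence, even though its meaning is reversed. High word-level similarity therefore does not establish that the whole sentence means the same thing.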
While these approaches offer improvements for processing speech, they do not answer the question of whether an utterance is semantically valid in a particular knowledge domain. For example, the sentence “The patient's heart is red” is a syntactically and semantically valid English sentence. However, it is not something that would usually be said in the domain of radiology. Currently, no parser exists that would flag such out-of-domain sentences except in very simple domains. State of the art speech recognition systems often employ context free grammar (CFG) rules, which can be programmed to allow or disallow certain utterances. These rules are only employed for small scale domains, since the cost and complexity of building CFGs for large scale domains such as radiology are prohibitive.
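As a rough illustration of how CFG rules allow or disallow utterances, the sketch below enumerates every sentence licensed by a toy grammar and checks membership. The grammar, its symbols, and the function names are all hypothetical, and the enumeration approach only works because this toy grammar is non-recursive.

```python
from itertools import product

# Hypothetical toy grammar covering a tiny slice of radiology dictation.
# Keys are nonterminals; any symbol that is not a key is a terminal word.
GRAMMAR = {
    "S": [["NP", "is", "FINDING"]],
    "NP": [["the", "ORGAN"], ["the", "patient's", "ORGAN"]],
    "ORGAN": [["heart"], ["lung"]],
    "FINDING": [["normal"], ["enlarged"]],
}


def expand(symbol):
    """Yield every word tuple a symbol can produce (grammar must be non-recursive)."""
    if symbol not in GRAMMAR:
        yield (symbol,)
        return
    for production in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in production)):
            yield tuple(word for part in parts for word in part)


LANGUAGE = set(expand("S"))


def in_domain(sentence):
    """True if the sentence is one of the word strings the grammar licenses."""
    return tuple(sentence.lower().rstrip(".").split()) in LANGUAGE


if __name__ == "__main__":
    print(in_domain("The patient's heart is enlarged"))
    print(in_domain("The patient's heart is red"))
```

Four nonterminals license only eight sentences here. Covering what radiologists actually say would require vastly more rules, which is the cost problem noted above.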
An important unmet need for semantically driven speech recognition in medium to large scale domains exists not only because it can help with identification of misrecognized utterances, but because it can increase the confidence that the speech recognizer accurately identified an utterance in a given knowledge domain.
Users dictating text with speech recognition find that errors are hard to detect. For example, researchers from the departments of radiology at the University of Washington in Seattle and the Mayo Clinic in Rochester, Minn., evaluated the performance of a single continuous speech recognition software product (IBM MedSpeak T/Radiology, version 1.1) among radiologists. The overall error rate of the transcribed reports was 10.3%. The rate of significant errors (class 2 or 3) was 7.8%, with an average of 7 significant errors per 87-word report. Subtle significant errors (class 3) were found in 1.2% of cases, or on average, about 1 error per 87-word report [Kanai K M, Hangiandreou N J, Sykes A M, Eklund H E, et al. The Evaluation of the Accuracy of Continuous Speech Recognition Software System in Radiology. Journal of Digital Imaging Vol. 13 (2), Supplement 1: 211-212. May 2000.]. Subtle errors required careful proofreading to catch. Of the 84 total reports transcribed in the study, 50 had one or more subtle significant errors.
Physicians can miss subtle insertion, deletion, or substitution errors. For example, the sentence “there is no evidence of atelectasis” can become “there is evidence of atelectasis” through a deletion error. This type of error changes the meaning of the report and could affect clinical care. Semantic detection of such errors would be extremely useful in avoiding medical errors.
Lai and Vergo (U.S. Pat. No. 6,006,183) created a user interface for displaying the output of speech recognition by changing the display properties for each word based on the speech recognizer's confidence scores. While this system makes it easier to spot certain types of speech recognition errors, other errors escape detection because semantic analysis is not performed. It is known that humans perform better transcription than computers because they implicitly perform domain-dependent semantic analysis. Al-Aynati and Chorneyko showed that correcting pathology reports dictated with speech recognition took an extra 67 minutes a week compared to human transcriptionists [Al-Aynati M M, Chorneyko K A. Comparison of voice-automated transcription and human transcription in generating pathology reports. Arch Pathol Lab Med. 2003 June 127(6):721-5]. This difference was mainly in the time needed to spot and correct errors.
Significant savings in time could be realized if the speech recognition system identified and classified the semantic content of each recognized sentence or sentence fragment. For example, if a radiologist were dictating a normal chest x-ray report, a semantic classifier should categorize all the report findings as normal, and display this to the user with a green color code next to each finding. Then, if the speech recognition system made an error that significantly changed the semantic content (as in the deletion of the word “no” in the above example), the semantic classifier could generate a red color code to indicate an abnormal finding. A user would then easily spot the error by noting the incorrect color. Currently, no such semantic classifier operating in conjunction with a speech recognizer exists for a medium to large scale domain such as medicine.
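The color-coding idea can be sketched with a deliberately naive stand-in for a semantic classifier. The keyword and negation lists below are hypothetical and far too small for real use; the sketch only shows how a change in semantic class would surface as a change in color.

```python
# Hypothetical term lists; a real domain would need a far larger knowledge base.
ABNORMAL_TERMS = {"atelectasis", "effusion", "pneumothorax"}
NEGATION_WORDS = {"no", "without"}


def classify(sentence):
    """Return 'red' for an un-negated abnormal finding, otherwise 'green'."""
    words = set(sentence.lower().rstrip(".").split())
    mentions_abnormal = bool(words & ABNORMAL_TERMS)
    negated = bool(words & NEGATION_WORDS)
    return "red" if mentions_abnormal and not negated else "green"


if __name__ == "__main__":
    dictated = "there is no evidence of atelectasis"
    recognized = "there is evidence of atelectasis"  # deletion error drops "no"
    print(dictated, "->", classify(dictated))
    print(recognized, "->", classify(recognized))
```

In a report the radiologist expects to be entirely normal, the misrecognized sentence is the only one marked red, so the error stands out without word-by-word proofreading. Keyword matching of this kind does not scale to a domain the size of medicine.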
Although semantic understanding could prove useful if coupled to a speech recognition system, there are several problems this approach must overcome. It is hard to build rule-based grammars for natural language. There does not exist a complete syntactic rule base for English, and not even the best unification grammars have been able to extract all the relevant semantic knowledge in a moderately complex knowledge domain such as radiology.
Additionally, a commercially useful system with semantic understanding would need to be extensible, since real world domains such as internal medicine are continually adding new knowledge. Low-cost, easy-to-use methods for adding new concepts and linguistic expressions would be crucial to the system's design.