In order to make coded data available in a setting where a large subset of the information resides in natural language documents, a technology called natural language understanding (NLU) is required. This technology allows a computer system to "read" free-text documents, convert the language in these documents to concepts, and capture these concepts in a coded form in a medical database. NLU has been a topic of interest for many years. However, it represents one of the most difficult problems in artificial intelligence. Various approaches have been tried with varied degrees of success. Most current systems are still in the research stage, and have either limited accuracy or the capability to recognize only a very limited set of concepts.
NLU systems which have been developed for use in the field of medicine include those of Sager et al. ("Natural language processing and the representation of clinical data", JAMIA, vol. 1, pp. 142-160, 1994), and Gabrielli ("Computer assisted assessment of patient care in the hospital", J. Med. Syst., vol. 12, p. 135, 1989). One approach has been to make use of regularities in speech patterns to break sentences into their grammatical parts. Many of these systems work well in elucidating the syntax of sentences, but they fall short in consistently mapping the semantics of sentences.
The concepts and ultimate database representation of the text must be derived from its semantics. Systems which rely upon the use of semantic grammars include those of Sager et al. (Medical Language Processing: Computer Management of Narrative Data, Addison-Wesley, Menlo Park, Calif., 1987) and Friedman et al. ("A general natural-language text processor for clinical radiology," JAMIA, vol. 1, pp. 161-174, 1994). Zingmond and Lenert have described a system which performs semantic encoding of x-ray abnormalities ("Monitoring free-text data using medical language processing", Comp. Biomed. Res., vol. 265, pp. 467-481, 1993).
A few systems have been developed which use a combination of semantic and syntactic techniques, e.g., Haug et al. (as described in "A Natural Language Understanding System Combining Syntactic and Semantic Techniques," Eighteenth Annual Symposium on Computer Applications in Medical Care, pp. 247-251, 1994 and "Experience with a Mixed Semantic/Syntactic Parser," Nineteenth Annual Symposium on Computer Applications in Medical Care, pp. 284-288, 1995) and Gunderson et al. ("Development and Evaluation of a Computerized Admission Diagnoses Encoding System," Comp. Biomed. Res., vol. 29, pp. 351-372, 1996).
Bayesian networks, also known as causal or belief networks, are trainable systems that have been used to apply probabilistic reasoning to a variety of problems. These networks are described in some detail in Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, Calif., 1988) and Neapolitan (Probabilistic Reasoning in Expert Systems, Wiley, New York, N.Y., 1990).
All of the above references are incorporated herein by reference.
The present invention uses a probabilistic model of the meaning of medical reports to extract and encode medical concepts. It makes use of Bayesian networks to map from groups of words and phrases to concepts. This approach has the potential to bridge the gap between free-text and coded medical data and to allow computer systems to provide the advantages of both. Natural language is common in medical systems and is becoming more common. Not only are dictation and transcription widespread in medical information systems, but new technologies (e.g., computer systems that convert speech to text) are beginning to arrive that will make free-text documents easier and less expensive to produce. Accordingly, a system that allows free-text data to be transformed to coded data will be increasingly valuable in medical applications. The inventive system disclosed herein was developed for use in the encoding of free-text diagnoses and for the encoding of x-ray reports. However, the inventive system could also be used in legal and other fields.
It is desirable to provide a method for capturing and manipulating large amounts of medical data within medical information system databases wherein natural language free-text data is extracted and encoded to provide standardized coded data. In particular, it is desirable to provide a method and system which make use of trainable Bayesian networks to provide accurate mapping of free-text words into a coded form. Moreover, it is desirable to provide a computer system designed to perform the method of this invention efficiently and automatically.
It is the general objective of this invention to provide a method for converting natural language free-text into encoded data for use in medical information system databases.
It is a further objective of this invention to provide a computerized method for extracting and encoding the information contained within free-text data.
It is a further objective of this invention to provide a method for encoding free-text medical information using a probabilistic Bayesian network, which can be trained to improve encoding accuracy.
It is a further objective of this invention to provide an encoding method that is capable of accurate recognition and encoding in applications requiring the identification of a large number of concepts.
It is a further objective of this invention to provide an encoding method that can be trained to improve its accuracy.
It is a further objective of this invention to provide a method for encoding free-text data that employs spell checking.
It is a further objective of this invention to provide a method of encoding free-text data that uses a synonym parser to replace words or phrases in the free-text data with equivalent expressions.
It is a further objective of this invention to provide a method of encoding free-text medical data by applying a transformational grammar.
It is a further objective of this invention to provide a method for extracting and encoding medical concepts from free-text data using a probabilistic model.
These and other objectives of this invention are achieved by a method comprising the steps of receiving free-text data and other information; performing synonym checking; performing spell checking; performing syntactic parsing; performing grammar transformation; performing semantic analysis; and writing discrete concepts, as standardized medical codes, into a medical database.
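The sequence of steps recited above can be pictured as a chain of small functions, each feeding the next. The following Python sketch is illustrative only and is not the disclosed implementation: the lookup tables, the concept codes, and every function name are hypothetical, the parsing and grammar steps are trivial stand-ins, and the semantic step is a plain dictionary lookup rather than a Bayesian network.

```python
import difflib

# All tables below are hypothetical, invented for illustration only.
SYNONYMS = {"cxr": "chest x-ray", "chest film": "chest x-ray"}
VOCABULARY = ["chest", "x-ray", "shows", "pneumonia", "infiltrate",
              "in", "the", "right", "lower", "lobe"]
CONCEPT_CODES = {"pneumonia": "C001", "infiltrate": "C002"}  # made-up codes


def replace_synonyms(text):
    """Synonym checking: swap known phrases for a canonical form."""
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    return text


def correct_spelling(text):
    """Spell checking: map unknown words to the closest vocabulary entry."""
    fixed = []
    for word in text.split():
        if word in VOCABULARY:
            fixed.append(word)
            continue
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)


def parse_syntax(text):
    """Syntactic parsing stand-in: plain whitespace tokenization."""
    return text.split()


def transform_grammar(tokens):
    """Grammar transformation stand-in: drop a few function words."""
    function_words = {"in", "the"}
    return [t for t in tokens if t not in function_words]


def analyze_semantics(tokens):
    """Semantic analysis stand-in: direct lookup, not a Bayesian network."""
    return [CONCEPT_CODES[t] for t in tokens if t in CONCEPT_CODES]


def encode_report(text, database):
    """Run the steps in the recited order and store the resulting codes."""
    text = replace_synonyms(text.lower())
    text = correct_spelling(text)
    tokens = transform_grammar(parse_syntax(text))
    codes = analyze_semantics(tokens)
    database.extend(codes)  # a list stands in for the medical database
    return codes


database = []
encode_report("CXR shows pneumnia in the right lower lobe", database)
```

In this sketch the misspelled word "pneumnia" is repaired by the spell-checking step before the semantic step sees it, which is one reason the early normalization steps precede semantic analysis.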
In the presently preferred embodiment of the invention, the semantic parser uses a probabilistic (Bayesian) network to perform statistical pattern recognition to form a mapping between terms and concepts. Improved system performance is obtained by training the Bayesian network. The inventive system has the advantage that it is capable of accurate recognition of a large number of concepts and that, once set up, its accuracy can be improved through a simple training program.
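As a rough illustration of how a trainable probabilistic model can map terms to concepts, the Python sketch below uses a naive Bayes classifier, which is the simplest Bayesian network (one concept node with each word node as its child). This structure is an assumption made for illustration; the network topology actually used by the invention is not specified in this passage. The class name, the training examples, and the concept labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesConceptMapper:
    """Hypothetical term-to-concept mapper using naive Bayes."""

    def __init__(self):
        self.concept_counts = Counter()          # training examples per concept
        self.word_counts = defaultdict(Counter)  # word frequencies per concept
        self.vocabulary = set()

    def train(self, words, concept):
        """Add one labeled example; more examples refine the estimates."""
        self.concept_counts[concept] += 1
        for word in words:
            self.word_counts[concept][word] += 1
            self.vocabulary.add(word)

    def posterior(self, words):
        """Return P(concept | words) for every trained concept."""
        total = sum(self.concept_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for concept, count in self.concept_counts.items():
            score = math.log(count / total)  # prior P(concept)
            denom = sum(self.word_counts[concept].values()) + vocab_size
            for word in words:
                # Laplace-smoothed likelihood P(word | concept)
                score += math.log((self.word_counts[concept][word] + 1) / denom)
            log_scores[concept] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def classify(self, words):
        """Return the most probable concept for the given words."""
        post = self.posterior(words)
        return max(post, key=post.get)


# Made-up training phrases and concept labels.
mapper = NaiveBayesConceptMapper()
mapper.train(["patchy", "infiltrate", "right", "lower", "lobe"], "pneumonia")
mapper.train(["consolidation", "left", "lung"], "pneumonia")
mapper.train(["enlarged", "cardiac", "silhouette"], "cardiomegaly")
mapper.train(["heart", "size", "enlarged"], "cardiomegaly")

best = mapper.classify(["infiltrate", "lobe"])
```

Each call to `train` adjusts the word and concept counts, so accuracy can be improved simply by supplying more labeled examples, without changing the program itself.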
Additional detail and further developments of this invention are described in SYMTEXT: A Natural Language Understanding System for Encoding Free Text Medical Data, by Spencer B. Koehler, one of the inventors. This document is Dr. Koehler's Ph.D. dissertation, published by the University of Utah in June of 1998. The reader should note that this dissertation was written and published after the filing date of the provisional patent application (Sep. 30, 1997) from which this patent application claims priority. This dissertation, cited on the Information Disclosure Form, is hereby incorporated by reference in this application for the material contained therein to provide additional scientific background for this invention. It is not the intent of the applicant that any additional new matter be included in this application by the incorporation of this Koehler dissertation.