The use of statistical or probabilistic methods such as Maximum Entropy and Markov Models to normalize or rationalize text during electronic document processing is generally known. Text normalization, the process of identifying variants and bringing them to a common (normalized) form, is an important aspect of successful information retrieval from medical documents such as health records, clinical notes, radiology reports and discharge summaries. In the medical domain, a significant part of the general problem of text normalization is abbreviation and acronym disambiguation. Throughout the remainder of this document, the word “abbreviation” is used to mean both “abbreviation” and “acronym” since the two words can be used interchangeably for the purposes of this document and invention. Numerous abbreviations are used routinely throughout such medical text and identifying their meaning is critical to understanding the document.
A problem is presented by the fact that abbreviations are highly ambiguous with respect to their meaning. The Unified Medical Language System (UMLS) is a database containing biomedical information and tools developed at the National Library of Medicine. Using the UMLS as an example, “RA” can stand for “rheumatoid arthritis,” “renal artery,” “right atrium,” “right atrial,” “refractory anemia,” “radioactive,” “right arm,” “rheumatic arthritis” and other terms. It has been estimated that about 33% of the abbreviations in the UMLS are ambiguous. In addition to problems associated with text interpretation, abbreviations constitute a major source of errors in a system that automatically generates lexicons for medical natural language processing (NLP).
When processing documents to identify those that contain a specific term, it would be desirable to identify all the documents that also use an abbreviation for the specific term. For example, if searching for documents containing the term “rheumatoid arthritis,” it would be desirable to retrieve all those documents that use the abbreviation “RA” in the sense of “rheumatoid arthritis.” At the same time, it is desirable not to identify documents that use the same abbreviation, but with a sense different from that of “rheumatoid arthritis.” Continuing with the above example, it would be desirable that the search not identify those documents where “RA” means “right atrial.”
This abbreviation normalization is effectively a special case of word sense disambiguation (WSD). Approaches to WSD include supervised machine learning techniques, where some amount of training data is marked up by hand and used to train a classifier. One technique involves using a decision tree classifier. Black, An Experiment in Computational Discrimination of English Word Senses, IBM Journal of Research and Development, 32(2), pp. 185–194 (1988). Fully unsupervised learning methods such as clustering have also been successfully used. Schütze, Automatic Word Sense Discrimination, Computational Linguistics, 24(1) (1998). A hybrid class of machine learning techniques for WSD relies on a small set of hand-labeled data used to bootstrap a larger corpus of training data. Hearst, Noun Homograph Disambiguation Using Local Context in Large Text Corpora, In Proc., 7th Annual Conference of the University of Waterloo Center for the New OED and Text Research, Oxford (1991); Yarowsky, Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, In Proc., ACL-95, pp. 189–196 (1995).
One way to take context into account is to encode the type of discourse in which the abbreviation occurs. “Discourse” can, for example, be defined as the type of medical document and the medical specialty. As a more particular example, “RA” in a cardiology report can be normalized to “right atrial,” while “RA” in a rheumatology note can be normalized to “rheumatoid arthritis.” Unfortunately, this method of using the global context to resolve abbreviation ambiguity suffers from a number of drawbacks that limit its use in automatic document processing applications. First, it requires a database of abbreviations and their expansions linked with possible contexts in which particular expansions can be used. This is a labor intensive and error-prone task. Second, it requires a rule-based system for assigning correct expansions to their abbreviations. Any such system would likely become large and difficult to maintain. Third, the distinctions made between various expansions are likely to be coarse. For example, it may be possible to distinguish between “rheumatoid arthritis” and “right atrial,” since the two terms likely appear in very separable contexts. However, distinguishing between “rheumatoid arthritis” and “right atrium” becomes more of a challenge and may require introducing additional rules that further complicate the system.
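The discourse-based approach described above can be pictured as a hand-built lookup table. The following is only a minimal sketch under stated assumptions: the table name `DISCOURSE_RULES`, the function `expand_by_discourse`, and the use of medical specialty alone as the discourse key are hypothetical and do not come from any particular system.

```python
from typing import Optional

# Hypothetical hand-maintained table: (abbreviation, specialty) -> expansion.
# Every entry has to be authored and kept current by a person.
DISCOURSE_RULES = {
    ("RA", "cardiology"): "right atrial",
    ("RA", "rheumatology"): "rheumatoid arthritis",
}


def expand_by_discourse(abbreviation: str, specialty: str) -> Optional[str]:
    """Return the expansion registered for this discourse, or None if no rule exists."""
    return DISCOURSE_RULES.get((abbreviation, specialty))


print(expand_by_discourse("RA", "cardiology"))
print(expand_by_discourse("RA", "nephrology"))  # no rule was written for this discourse
```

Because the key holds a single expansion per discourse, this sketch cannot separate “right atrial” from “right atrium” within a cardiology report without additional, finer-grained rules, which illustrates the coarseness noted above.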
Maximum Entropy is a statistical technique that has been used for Natural Language Processing. A useful aspect of this technique is that it allows the predefinition of characteristics of the objects being modeled. The model is built from training data represented as feature vectors, whose predefined features act as constraints; the probability space is distributed uniformly among the candidates that the constraints do not cover. Features are represented by indicator functions of the following kind.
$$
F(o, c) =
\begin{cases}
1, & \text{if } o = x \text{ and } c = y \\
0, & \text{otherwise}
\end{cases}
$$

where “o” stands for outcome and “c” stands for context. This function maps contexts and outcomes to a binary set. For example, in a simplified part-of-speech tagging task, let x=“noun” be the outcome and let y=“the” be the context, where the context is the word immediately preceding the word being tagged. Then F(o,c)=1 whenever o=“noun” and c=“the”. In other words, the feature fires when a word preceded by “the” is classified as a noun.
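The indicator function can be written directly in code. The following is a minimal sketch; the helper name `make_indicator` and the variable `noun_after_the` are hypothetical and are introduced only for illustration.

```python
from typing import Callable

# A feature maps an (outcome, context) pair to 0 or 1.
Feature = Callable[[str, str], int]


def make_indicator(x: str, y: str) -> Feature:
    """Build F(o, c), which is 1 when outcome o equals x and context c equals y, else 0."""
    def feature(o: str, c: str) -> int:
        return 1 if (o == x and c == y) else 0
    return feature


# Part-of-speech example from the text: outcome "noun", preceding word "the".
noun_after_the = make_indicator(x="noun", y="the")

print(noun_after_the("noun", "the"))
print(noun_after_the("verb", "the"))
print(noun_after_the("noun", "a"))
```

Only the first call satisfies both conditions; the other two fail on the outcome and on the context respectively.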
The maximum entropy distribution is found using the Generalized Iterative Scaling (GIS) algorithm, a procedure for finding the distribution of maximum entropy that conforms to the constraints imposed by the empirical distribution of the modeled properties in the training data.
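For illustration, a textbook form of GIS for a conditional model p(o | c) with binary features can be sketched as follows. This is a simplified sketch under stated assumptions, not a description of any particular system: the function name `train_gis`, the toy data, and the handling of zero expectations (such features are simply not updated) are all assumptions made here.

```python
import math


def train_gis(data, outcomes, features, iterations=200):
    """Fit p(o | c) by GIS. data: list of (context, outcome); features: f(o, c) -> 0 or 1."""
    contexts = sorted({c for c, _ in data})
    # GIS assumes the feature values of every (outcome, context) pair sum to a
    # constant C, so a slack feature pads each pair up to C.
    C = max(1, max(sum(f(o, c) for f in features) for c in contexts for o in outcomes))

    def slack(o, c):
        return C - sum(f(o, c) for f in features)

    feats = list(features) + [slack]
    n = len(data)
    # Empirical expectation of each feature over the training pairs.
    empirical = [sum(f(o, c) for c, o in data) / n for f in feats]
    weights = [0.0] * len(feats)

    def predict(c):
        """Return {outcome: probability} for context c under the current weights."""
        scores = {o: math.exp(sum(w * f(o, c) for w, f in zip(weights, feats)))
                  for o in outcomes}
        z = sum(scores.values())
        return {o: s / z for o, s in scores.items()}

    for _ in range(iterations):
        dists = {c: predict(c) for c in contexts}
        # Model expectation of each feature over the observed contexts.
        expected = [sum(dists[c][o] * f(o, c) for c, _ in data for o in outcomes) / n
                    for f in feats]
        for i in range(len(feats)):
            # Assumption of this sketch: skip features with a zero expectation.
            if empirical[i] > 0 and expected[i] > 0:
                weights[i] += math.log(empirical[i] / expected[i]) / C
    return predict


# Toy data: after "the" the tag is "noun" 9 times out of 10;
# after "to" the tag is "verb" 8 times out of 10.
toy_data = ([("the", "noun")] * 9 + [("the", "verb")]
            + [("to", "verb")] * 8 + [("to", "noun")] * 2)
toy_features = [
    lambda o, c: 1 if (o == "noun" and c == "the") else 0,
    lambda o, c: 1 if (o == "verb" and c == "to") else 0,
]
model = train_gis(toy_data, ["noun", "verb"], toy_features)
print(model("the"))
print(model("to"))
```

Each iteration moves the model's expected feature counts toward the empirical counts, so on this toy data the fitted probabilities should approach the observed rates (0.9 for “noun” after “the”, 0.8 for “verb” after “to”).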
There remains a need for an automated or at least semi-automated method (i.e., one that can be performed by an electronic data processing system) for generating training data used by statistical text normalization modeling systems. The method should be capable of generating training data that will enable the text normalization modeling systems to normalize the text to a relatively high degree of accuracy. A system of this type that can be used to normalize abbreviations and acronyms in medical text would be particularly useful.