The present invention relates generally to a system and method for categorization of strings of words. More specifically, the present invention relates to a system and method for normalizing a string of words for use in a system for categorization of words in a predetermined categorization scheme.
There are a number of systems capable of analyzing a section of text to determine the significance of that section of text. Some exemplary fields of information classification and analysis include the identification of names of people, their roles or positions within various organizations, product names, and names of organizations, as in the information extraction systems developed for the DARPA and NIST MUC experiments. On a very basic level, the analysis of text typically involves two steps. First, the relevant string of text, or the string of interest, should be identified. This may require some form of isolation from other strings of text or other text within a given string. Second, once the bounds of the text of interest have been determined, the text should be characterized and tagged with labels. Some algorithms may combine these two steps into a single step. Traditionally, data has been classified on a number of different levels including complete texts, sections of documents, paragraphs, single sentences, and even strings of words within a sentence.
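By way of illustration only, the two steps described above (identifying the bounds of a string of interest, then characterizing and tagging it) might be sketched as follows. The lexicon, the labels, the function names, and the example sentence are all hypothetical and are assumed solely for this sketch; they are not drawn from any particular prior system, and a practical system would likely use a far richer identification technique than exact phrase lookup.

```python
import re

# Hypothetical lexicon mapping phrases of interest to labels (assumed for illustration).
LEXICON = {
    "chief executive officer": "ROLE",
    "acme corporation": "ORGANIZATION",
}

def identify_spans(text):
    """Step 1: find the bounds (start, end) of each string of interest."""
    lowered = text.lower()  # offsets carry over unchanged for plain ASCII input
    spans = []
    for phrase in LEXICON:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, lowered):
            spans.append((match.start(), match.end()))
    return sorted(spans)

def tag_spans(text, spans):
    """Step 2: characterize each bounded string and attach its label."""
    return [(text[start:end], LEXICON[text[start:end].lower()])
            for start, end in spans]

sentence = "Jane Doe is the Chief Executive Officer of Acme Corporation."
tagged = tag_spans(sentence, identify_spans(sentence))
print(tagged)
```

Keeping the two steps in separate functions mirrors the description above; an algorithm that combines them into a single step would simply return the labels together with the bounds.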
Traditional systems utilize broad categories when classifying information and do not permit the classification of words, or strings of words, into the categories of complex ontologies or nomenclatures. In fact, traditional systems may prove unwieldy or unmanageable when applied to several thousand or more distinctions, as would be required to characterize text within even a moderately complex ontology or nomenclature. Prior classification schemes are, therefore, comparatively high-level and may produce ineffective classification of information for particular applications.
One example of a nomenclature that is very complex in nature is found in medicine and medical diagnosis. For example, a medical diagnosis may include information related to the intensity of a particular malady, the anatomical site of an affliction, or complications relating to the diagnosis. One physician within the medical profession may state a diagnosis in a way that is not necessarily identical to the way another physician will state the same diagnosis. This may be due to, for example, a difference in word order, different complications associated with a particular diagnosis, association of the diagnosis with a slightly different part of the anatomy, or a medical problem having a range of different intensities. This list is not intended to be exhaustive, but is illustrative of the number of different reasons why an identical diagnosis can be stated in a number of forms. For a given lexicon of medical problems, there may be millions or even billions of ways to express diagnoses of medical problems.
One example of such a complex hierarchically-organized nomenclature is the SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) nomenclature. SNOMED is a common index or dictionary against which data can be encoded, stored, and referenced. The SNOMED CT nomenclature includes hundreds of thousands of differing categories of medical diagnoses based on the large number of different concepts within the medical field. A much smaller subset of the SNOMED CT, the Dictaphone SNOMED CT Clinical Subset, includes only about 7,000 of the approximately 100,000 disease and findings categories that the entire SNOMED CT ontology includes.
In the hospital or clinical setting, dictation of diagnoses and patient records is common. Once dictated, the speech can be converted directly into text either manually or with speech recognition systems. Due to differences in spoken medical diagnoses, however, systems may not properly recognize or classify a particular diagnosis. Therefore, a system and method for recognizing and classifying text such as a medical diagnosis will preferably be configured to accommodate a large degree of variability within the input text strings. Such variability may be due to, for example, dictation by medical professionals including professionals in different departments of a hospital, professionals in different hospitals, professionals having different specialties, professionals having different backgrounds, dictation at different time periods, and dictation in different contexts.
Therefore, what is needed is a system and a method for classifying words and strings of words into categories of complex ontologies or nomenclatures, such as would exist in, for example, the SNOMED CT nomenclature. The present invention seeks to address this and other potential shortcomings of prior art systems and methods when applied to complex nomenclatures, including, for example, complex hierarchically organized nomenclatures.