A significant amount of work has been done in the last 25 years in the area of natural language understanding. In its broadest terms, natural language understanding encompasses processes by which documents in human-readable form are converted to a computer-readable form. Among the applications for natural language understanding are indexing and retrieval of free text documents, and coding of documents by subject matter. Because manual methods are time-consuming, require highly trained individuals to review text, and are often inaccurate owing to human error and inconsistent use of terms and codes, there is a strong desire to develop robust and reliable computer systems that can perform these tasks.
Current natural language understanding systems for indexing, coding, and retrieval of free text are time consuming and somewhat imprecise. Existing systems use conventional word matching or concept matching. These systems use only words or concepts, rather than concept extraction that is independent of language and terminology. For example, U.S. Pat. No. 4,868,733 to Fujisawa et al uses “concepts” represented by words, and links or “relations” between the “concepts.” The concepts, however, are in reality words, or terms, arranged hierarchically such that certain terms subsume other terms.
U.S. Pat. No. 6,061,675 to Wical et al describes a knowledge catalog that stores different senses and forms of terms within static and dynamic ontologies for particular areas of knowledge (i.e. particular industries). The ontologies contain words that define terminology specific to different industries and fields of study. U.S. Pat. No. 4,815,005 to Oyanagi et al describes a main associative memory unit, or ontology, that stores knowledge “data”, each piece of data consisting of an object, an attribute and a value. Each object is represented by a “node”. Examples of objects are “bird”, “tire” or “man.” U.S. Pat. No. 4,967,371 to Muranaga et al describes a frame-based technology in which “objects” are represented by frames that store information related to the particular object. Additionally, objects can be connected to one another. This frame-based technology, however, remains reliant on words or terms that describe objects of interest and derives values for slots within particular frames such that only terms are interrelated in a hierarchical structure.
Of particular interest in the area of natural language understanding is the coding of medical language to allow consistent classification and storage of medical information using commercial and proprietary coding systems. Examples of methods that relate to systems for coding data are described in U.S. Pat. No. 5,809,476 to Ryan, and U.S. Pat. No. 6,292,771 to Haug et al.
All of the above systems share a common shortfall in that they rely on words and terms to define concepts in free text.
In order to know how to make computers better understand language, it is necessary to understand how language works. The way language works is based on the world view of users of language. A common representation of how language works is shown in FIG. 1. The diagram in FIG. 1 is commonly referred to as the semantic triangle. The three vertices of the semantic triangle represent the basic components that are commonly used to define how language works. At one vertex are words and terms, which in their broadest scope comprise language with all of its syntactic rules. At the second vertex are concepts, which in their totality make up the world view. World views are constrained by the physical capabilities of the human body to perceive reality. For example, the concept of color exists because most people can see colors. Finally, at the third vertex are the real world objects that are the focus of concepts and words. In totality, the sum of objects makes up reality.
Concepts can be differentiated from words and language, and from real world objects, by recognizing that a word or term is simply a label applied to the object or concept. Word and term formation is partially based on physical characteristics of the objects that the words denote. For example, the words “bark” and “quack” mimic the sounds made by the animals with which the words are associated. The concept is the sum of all of the definitions given to the object in a particular culture that applies the label to the object. For example, the label “dog” applies to a concept shared by people in English speaking countries about a particular real object. However, an individual can contemplate the notion of “dog” without having the physical object in front of him. Hence, the concept is disembodied from the real world object. Further, when presented with the particular object an individual can recognize it based on the concept of that object without appealing to the word or term used to label it. Thus the concept attached to “dog” is independent of the word as well.
It is well known that the words or terms used to refer to concepts/objects vary across cultures, because of the variety of languages that exist. However, it is also true that the concepts attached to real objects may vary across cultures as well. For example, in western culture, the concept for “dog” does not include the definition of being a food item, whereas in certain eastern cultures this definition is included in the concept. Functionally, the full definition given to a specific concept, such as “dog”, in a given culture is understood by the totality of other concepts related to the specific concept. For example, a full definition of the concept “dog” may include the following related concepts:
    Is an “animal”
    Is a “pet”
    Has “fur”
    Has “sharp teeth”
    Eats “meat”
    Makes sound “barking”
    Is owned by “person”
As can be seen from the example, the concepts that construct the full definition of a specific concept can indicate several aspects about the concept: 1) state of being (animal, pet), 2) physical qualities (fur, sharp teeth), 3) how it acts on other concepts (eats meat, barks), and 4) how other concepts act on it (owned by person). Further, it can be seen that certain reciprocal relationships can exist between concepts. For example, the definition of the concept “dog” as something owned by a person implies the definition of the concept “person” as something that can own a dog.
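The “dog” definition and its reciprocal relationship can be illustrated with a short sketch. The following is only one possible representation, assuming that a concept is stored as a set of labeled relations to other concepts; the class name `ConceptGraph`, the relation labels, and the `INVERSE` table are hypothetical and are introduced here purely for illustration.

```python
from collections import defaultdict

# Hypothetical table of reciprocal relations. Only the "owned by" / "can own"
# pair is taken from the example in the text.
INVERSE = {"is owned by": "can own"}


class ConceptGraph:
    """Stores each concept as a set of (relation, related concept) pairs."""

    def __init__(self):
        self.relations = defaultdict(set)

    def relate(self, subject, relation, obj):
        """Record a relation and, where one is known, its reciprocal."""
        self.relations[subject].add((relation, obj))
        inverse = INVERSE.get(relation)
        if inverse is not None:
            self.relations[obj].add((inverse, subject))

    def definition(self, concept):
        """Return the related concepts that make up a concept's definition."""
        return sorted(self.relations.get(concept, set()))


graph = ConceptGraph()
for relation, obj in [
    ("is a", "animal"),
    ("is a", "pet"),
    ("has", "fur"),
    ("has", "sharp teeth"),
    ("eats", "meat"),
    ("makes sound", "barking"),
    ("is owned by", "person"),
]:
    graph.relate("dog", relation, obj)

# The reciprocal relation on "person" was never entered directly;
# it follows from "dog is owned by person".
print(graph.definition("person"))
```

In this sketch, defining “dog” as something owned by a person is enough to make “person” carry the definition of something that can own a dog.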
The terms, or words, selected in a particular language are used to express “concepts,” which are notions of the “objects” that exist in our understanding of the world or “reality.” However, humans do not understand language only as a collection of labels applied to concepts. For example, in the sentence:
“In China the people wear fur coats and regularly eat dog meat.”
it is understood by the reader that “fur” and “eat meat” refer to qualities of the people, not the dog. Despite the conceptual relationship of these concepts to “dog”, the syntactic structure of the sentence and knowledge of the way language works and of reality allow a human reader to extract a completely different meaning from the text. Additionally, a human reader is able to extract from this text the knowledge that the dog is most likely not a pet, even though this concept is not explicitly in the sentence. Further, a reader is able to discern that the sentence is related to the more general subject of garments and dietary habits of people in China.
For a computer-based system to fully understand natural language, a combination of all three vertices of the semantic triangle must be used. Previous systems have relied on the relationship of words and groupings of words in free text to extract concepts from free text documents. This approach ignores the fact that concepts exist independent of language. Further, it limits the ability of previous systems to extract concepts that are not explicitly represented in the free text. Equally important, by relying on words, word groupings and grammar to extract concepts from free text, previous systems lack the ability to index and code documents in more than one language.
It would therefore be desirable to develop a system of natural language understanding for a computer-based system that uses both terms and concepts to provide a more accurate and thorough understanding of free text by extracting concepts to which words, terms or phrases are attached as known grammaticalisations in a specific language. For example, although the words used in a sentence may be important and may relate to the particular topic(s) represented by the sentence, a higher level of conceptual understanding of the words, and of their syntactic and semantic relationships to one another, will help glean more information from the sentence and will permit more accurate and complete indexing, coding, and query analysis.
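The idea of concepts to which terms are attached as language-specific labels can be illustrated with a minimal sketch. It assumes a hypothetical lexicon keyed by language and term; the concept identifiers (`C_DOG`, `C_FUR`), the sample vocabulary, and the function name `extract_concepts` are invented for illustration and do not appear in the text above. The sketch shows only the attachment of terms to language-independent concepts. It performs no syntactic or semantic analysis, which the discussion above identifies as equally necessary.

```python
import re

# Hypothetical lexicon: (language, term) -> language-independent concept identifier.
LEXICON = {
    ("en", "dog"): "C_DOG",
    ("fr", "chien"): "C_DOG",
    ("de", "hund"): "C_DOG",
    ("en", "fur"): "C_FUR",
    ("fr", "fourrure"): "C_FUR",
}


def extract_concepts(text, language):
    """Return the set of concept identifiers whose terms occur in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        LEXICON[(language, token)]
        for token in tokens
        if (language, token) in LEXICON
    }


english = extract_concepts("The dog has fur.", "en")
french = extract_concepts("Le chien a de la fourrure.", "fr")

# Different words in different languages resolve to the same concepts.
print(english == french)
```

Because an index built on the concept identifiers does not depend on the language of the source text, documents in either language could in principle be indexed and coded together.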