1. Field of Invention
The present invention relates generally to the field of natural language processing. More specifically, the present invention is related to the discovery of terminology from comparative and contextual usage.
2. Discussion of Prior Art
Terminology, defined as the grouping of words through the combination of several common and possibly uncommon words in a particular order, is primarily based on a contextual configuration and usage. For example, the word groupings “hard drive” and “entropy minimization technique” are combinations of common and uncommon words creating a term for the representation of a new concept or idea. A term such as hard drive, once it is used often enough, may find its way into common usage in the general population and thus may no longer qualify as technical terminology. The latter term, “entropy minimization technique”, combines a pre-existing term, entropy, with other common words to create a new term.
In other cases, terminology is produced by creating a single word that has not previously existed. This is sometimes accomplished by combining morphemes, or meaningful sub-segments of words. For example, the term hyperlink combines the morpheme hyper, meaning above or beyond, with the morpheme link, which means to connect.
Terminology is also determined by contextual usage of commonly used words in various fields and disciplines to mean something slightly different from, or to have a completely new definition relative to, that which is commonly understood by the general population. For example, in computer terminology, mouse refers to a pointing input device. By contrast, it is understood to be a small furry rodent by the general population. In academic fields, particularly the natural sciences, technical terminology is created by combining the discoverer's name with another word (e.g., Nash equilibrium after the famous mathematician John Nash) or by combining word morphemes from other languages such as Latin (e.g., Acer saccharum for sugar maple). In medicine, new terms are created to describe newly discovered diseases and may be created by using the name of a well-known patient with the illness (e.g., Lou Gehrig's disease) or by using the name of the doctor discovering the disease (e.g., Raynaud's disease after A. G. Maurice Raynaud). Finally, some terminology develops as an acronym and evolves into popular use as a word. For example, the word radar was originally the acronym R.A.D.A.R., which stands for Radio Detection and Ranging. Some computer acronyms are close to becoming words. One such acronym is G.U.I. (Graphical User Interface), which is already pronounced as if it were a word (i.e., like “gooey”).
Common terminology is generally used to improve communication between members affiliated with a group specified by a task, idea, or profession. This improvement is realized because terminology acts as a handle to a long description of an idea, thereby reducing the number of words needed to communicate an idea. In specified groups that communicate certain ideas and concepts frequently, terminology is useful for efficient communications and serves as a common ground for information exchange.
However, a difficulty lies in outsiders attempting to understand communications between members affiliated with a particular group; they may have trouble learning what these terms mean or even what the terminology is for the particular group. Likewise, two groups that have worked independently for some time might develop their own terminology and then have trouble collaborating because of the necessity of sorting out the terminology of common ground ideas and concepts between the groups. In addition, the use of language evolves and changes over time; the understanding of terminology of a group at one point in time does not necessarily guarantee an understanding of the terminology of the same group at a later point in time.
In one scenario, a team of biologists researching gene influence on a particular metabolic process for protein synthesis may choose to consult molecular geneticists or molecular genetics references on relevant issues. However, they will quickly become limited by a great deal of unfamiliar terminology in the literature and references they have chosen to consult; they may not even recognize the relevance of a particular reference if foreign terminology becomes an influencing factor. In another scenario, a company may wish to send operations to an affiliate in another part of the country. The use of terminology and “in-house” words may create communication problems in how affiliates create contracts and agreements, and also in how an affiliate to which work is sent provides services and responds to requests. By learning the terminology of a company before an engagement, an affiliate to which work is sent improves its ability to provide services and contract terms. Language use and word usage are defined as how a word is presented in an expression regarding its association with other words. (Determining how a word is used in an expression constructed specifically to communicate with a particular subset of members is a goal of the present invention. In the present invention, terminology discovery illustrates how, based on word usage, language use, and context, members of a specified subset communicate differently from members of a population at large.)
According to the non-patent literature entitled “Methods of Automatic Term Recognition,” by Kageura and Umino, current approaches to automatic term recognition (ATR) in mining of text literature include an information retrieval (IR) approach, a linguistic approach, and a statistical approach. An IR approach finds terms in a given set of documents by a measure of term frequency or weighted term frequency, thus facilitating document organization by category. Documents sharing a common category have a high frequency of a common set of terms and a low frequency of terms that appear in documents of other categories; term frequency is therefore useful for classifying documents. However, the relationship between term frequency and technical terminology is less clear; current approaches do little to identify distributional characteristics that are unique to technical terminology.
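By way of illustration only, weighted term frequency of the kind used in an IR approach can be sketched as follows. The weighting shown (relative term frequency multiplied by the logarithm of inverse document frequency) is one common choice assumed here for the example and is not taken from the cited reference; the function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption for this sketch: whitespace tokenization and a plain
    log-scaled inverse document frequency.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample documents.
docs = [
    "the hard drive stores the data",
    "the entropy of the data is low",
    "the mouse moves the pointer",
]
scores = tf_idf(docs)
```

In this sketch, a word that occurs in every document (such as "the") receives a weight of zero, while a word confined to one document receives the highest weight. The weight separates documents from one another but does not by itself indicate whether a highly weighted word is technical terminology.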
A linguistic approach involves the use of language grammar models to find patterns of grammatical constructs (e.g. parts of speech, syntactic structure) that indicate terminology. For example, the non-patent literature entitled “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text,” by Justeson and Katz, reports that technical terms are used more formally than other terms so that they only occur in a few different forms of noun phrases. This approach is limited by its application solely to complex, multi-word terms and to formal terms similar to those used in scientific literature. Furthermore, it is necessary for text to be manually tagged for grammatical constructs before linguistic processing can begin.
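By way of illustration only, a part-of-speech pattern filter of the general kind used in a linguistic approach can be sketched as follows. The sketch assumes text that has already been tagged, uses a hypothetical single-letter tag set (A for adjective, N for noun, any other letter for other parts of speech), and applies a simplified pattern of one or more adjectives or nouns followed by a noun. It is not the full algorithm of the cited reference.

```python
import re

def candidate_terms(tagged_tokens):
    """Return multi-word candidates whose tag sequence matches (A|N)+ N.

    tagged_tokens is a list of (word, tag) pairs. Tags other than
    "A" and "N" are mapped to "x" so that each token occupies exactly
    one character and string offsets line up with token positions.
    """
    tags = "".join(tag if tag in ("A", "N") else "x" for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    candidates = []
    # The regex is greedy, so only the longest span at each position is kept.
    for match in re.finditer(r"[AN]+N", tags):
        candidates.append(" ".join(words[match.start():match.end()]))
    return candidates

# Hypothetical, manually tagged sentence.
sentence = [
    ("the", "D"), ("entropy", "N"), ("minimization", "N"), ("technique", "N"),
    ("protects", "V"), ("the", "D"), ("hard", "A"), ("drive", "N"),
]
candidates = candidate_terms(sentence)
single = candidate_terms([("the", "D"), ("mouse", "N"), ("moves", "V")])
```

Because the pattern requires at least two tokens, a single-word term such as "mouse" yields no candidate in this sketch, which reflects the limitation to multi-word terms noted above.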
The last of these approaches, a statistical approach, attempts to find multi-word sequences that are possibly complex technical terms using various methods, most of which are based on the frequency of specific words and word co-occurrences (e.g., bi-grams or n-grams) in a set of documents. While this method is highly useful for finding unique, complex terms in a given corpus of documents, it is difficult to ascertain whether these complex terms are actually technical terms rather than terms in common general use. Thus, a statistical approach incurs the same limitations as an IR approach. In addition, it offers limited means of finding simple, single-word terms or of disambiguating different contextual uses of terminology.
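By way of illustration only, counting word co-occurrences as bi-grams can be sketched as follows. The function name frequent_bigrams, the threshold parameter min_count, and the sample documents are assumptions made for this example.

```python
from collections import Counter

def frequent_bigrams(documents, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # zip pairs each token with its successor, giving the bi-grams.
        counts.update(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}

# Hypothetical sample documents.
docs = [
    "the hard drive failed",
    "replace the hard drive",
    "one of the options",
    "one of the drives",
]
frequent = frequent_bigrams(docs)
```

In this sketch the pair ("hard", "drive") and the pair ("of", "the") pass the same frequency threshold, which illustrates why frequency of co-occurrence alone does not distinguish a technical term from a commonly used word sequence.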
While ATR approaches are able to provide information and statistics about the text in documents, ATR approaches are limited in their ability to find types of terminology, indicated by a new or different usage, a new word, or both, in a document collection. This is because technical terminology is not determined solely by the frequencies of terms, frequencies of term co-occurrences, or patterns of grammatical constructs. (According to the present invention, technical terminology is determined by the usage of a word by a definable audience or group that is significantly different from the way a word is commonly used by the general population. Technical terminology is also defined, of course, as the use of a new term.)
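By way of illustration only, the notion of a word being used differently by a group than by the general population can be sketched by comparing the words that surround a target word in two text collections. This sketch is not the method of the present invention; the context window, the Jaccard overlap measure, the function names, and the sample sentences are all assumptions made for the example.

```python
def context_words(sentences, target, window=2):
    """Collect the words appearing within `window` positions of `target`."""
    contexts = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                before = tokens[max(0, i - window):i]
                after = tokens[i + 1:i + 1 + window]
                contexts.update(before + after)
    return contexts

def usage_overlap(group_sentences, general_sentences, target, window=2):
    """Jaccard overlap of the two context sets, between 0.0 and 1.0.

    A value near 0 suggests the target word keeps different company in
    the two collections. Returns 0.0 if the word appears in neither.
    """
    group = context_words(group_sentences, target, window)
    general = context_words(general_sentences, target, window)
    union = group | general
    return len(group & general) / len(union) if union else 0.0

# Hypothetical sample sentences for a group and for the general population.
group = ["click the mouse button twice", "plug the mouse into the port"]
general = ["the cat chased the mouse", "a mouse ate the cheese"]
overlap = usage_overlap(group, general, "mouse")
```

In this sketch the only context word shared by the two collections is "the", so the overlap for "mouse" is low. A practical comparison would likely need larger collections and some treatment of very common words.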
U.S. Pat. No. 6,101,515, to Wical et al., discloses a system for automatically determining the meaning of a term using a collection of documents that have been categorized by another system. Each term found is matched to a category providing the meaning of the term. Limitations of the disclosed approach include the necessity of deciding what the terms are prior to processing, which would require a great deal of manual effort. Another limitation lies in the fact that categories or semantic topics are determined by a separate system; therefore, how well the terms are specified depends heavily on the classification algorithm and the quality of the document database used. For example, some terms may be about subtle topics or may make fine distinctions that cannot be automatically detected from the documents used. Furthermore, categories and terms may not have a simple one-to-one relationship as is assumed in the disclosed approach; there may be several terms used to express several concepts that are all associated with a particular topic.
Both U.S. Pat. No. 6,212,494 B1, to Boguraev, and the non-patent literature entitled “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text,” by Justeson and Katz, disclose linguistic approaches requiring the use of parts of speech, syntax, and rules of grammar to analyze the context of a potential term to help identify terminology. These approaches are limited by their dependence on linguistic constructs to effectively “reverse engineer” the construction of terminology as it appears in written text.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.