An Appendix containing a listing of programs in a kernel for performing term matching and term manipulations is attached. The programs are written in the C++ programming language. The Appendix contains material that is subject to copyright protection. The copyright owner has no objection to anyone who requires a copy of the program disclosed therein for purposes of understanding or analyzing the invention, but otherwise reserves all copyright rights whatsoever. This includes making a copy for any other purposes including the loading of a processing device with code in any form or language.
1. Field Of The Invention
This invention relates to classification of vocabularies, and more specifically to classification of structured vocabularies in a single workstation or distributed system environment.
2. Discussion Of Background Information
Today there are many areas of science and technology. Each area has its own terms and concepts, and together these define a vocabulary specific to that area. Given a narrative input containing terms that may relate to the concepts of a particular science or technology, it may be desired to identify which of those concepts relate to the input. To accomplish this, the input may be compared with known concepts in an attempt to match the terms received in the input with those concepts. A term received in the input may in fact relate to or match multiple concepts. Therefore, it may be necessary to classify the concepts of each term to arrive at a more refined concept that relates to the input text.
FIG. 1 shows a flow chart of an example process for classifying input. As shown in FIG. 1, an input containing multiple terms related to a particular area of science or technology may be received (S1). The input is compared with known concepts to extract the particular terms that relate to concepts of that area (S2). These terms are then classified into a specific concept by comparing them further with known concepts (S3). An output is created in which the terms extracted in S2 have been classified to a particular concept related to the area of science or technology (S4). This process may be performed automatically by a processing device that receives an input, performs steps S1 through S4, and provides an output with minimal user intervention. However, due to the complexity of many scientific and technical terms and phrases, the results of the extraction and the classification may need to be reviewed by a user to allow further refinement of the results of those stages.
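By way of illustration only, steps S1 through S4 might be sketched in C++ as follows. This is a minimal sketch under stated assumptions and is not the listing of the Appendix: the `Thesaurus` type, the function names `extractTerms`, `classifyTerms`, and `classifyInput`, and the simplification that every term is a single word mapped to exactly one concept are all hypothetical.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical thesaurus: maps a known (lowercase) term to its concept.
using Thesaurus = std::map<std::string, std::string>;
using Classified = std::vector<std::pair<std::string, std::string>>;

// S2: keep only the words of the input that match a known term.
// Assumption: terms are single words; punctuation is treated as a separator.
std::vector<std::string> extractTerms(const std::string& input,
                                      const Thesaurus& thesaurus) {
    std::string cleaned;
    for (unsigned char c : input) {
        cleaned += std::isalnum(c) ? static_cast<char>(std::tolower(c)) : ' ';
    }
    std::istringstream words(cleaned);
    std::vector<std::string> terms;
    std::string word;
    while (words >> word) {
        if (thesaurus.count(word) > 0) terms.push_back(word);
    }
    return terms;
}

// S3: associate each extracted term with its concept.
Classified classifyTerms(const std::vector<std::string>& terms,
                         const Thesaurus& thesaurus) {
    Classified result;
    for (const auto& term : terms) {
        auto it = thesaurus.find(term);
        if (it != thesaurus.end()) result.emplace_back(term, it->second);
    }
    return result;
}

// S1 through S4: receive the input, extract, classify, and return the output.
Classified classifyInput(const std::string& input, const Thesaurus& thesaurus) {
    return classifyTerms(extractTerms(input, thesaurus), thesaurus);
}
```

For example, with an invented thesaurus that maps "headache" to one concept and "nausea" to another, calling `classifyInput` on a sentence mentioning both would return two term and concept pairs in the order the terms appear. The user review described above would take place between the calls to `extractTerms` and `classifyTerms`, and again on the returned result.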
Currently, this type of processing is only performed using standalone single workstations. These workstations are operated by a single user, and receive the input, perform the extraction and classification, and produce the output. Generally these workstations may contain a thesaurus containing concepts and terms related to the specific area of science and technology. The workstation may also contain a knowledge base, which is a repository of abbreviations, algorithms, fillers (stop words such as “of” and “the”), proximity data (words that are similar in spelling or general concept, e.g., teeth and teething), suffix data (word suffixes with unique meaning), and/or word synonym data (pairs of words that have equivalent or closely related meanings, e.g., car and automobile). Finally, a synonym database may also be used by the workstation. The synonym database contains word synonyms related to terms. The thesaurus, knowledge base, and synonym database contain information that may be used during extraction and classification to compare against the terms received in the input.
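As an illustration of how such knowledge base data might be applied to a phrase before comparison, consider the following C++ sketch. It is an assumption-laden example rather than a description of any existing workstation: the `KnowledgeBase` structure, its member names, the `normalize` function, and the order in which abbreviations, fillers, and word synonyms are applied are hypothetical, and proximity and suffix data are omitted.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical in-memory knowledge base holding three of the data kinds
// named in the text (proximity and suffix data are left out of this sketch).
struct KnowledgeBase {
    std::set<std::string> fillers;                     // stop words such as "of" and "the"
    std::map<std::string, std::string> abbreviations;  // abbreviation -> expansion
    std::map<std::string, std::string> wordSynonyms;   // word -> preferred equivalent
};

// Normalize a lowercase, whitespace-separated phrase:
// expand abbreviations, drop fillers, then replace word synonyms.
std::string normalize(const std::string& phrase, const KnowledgeBase& kb) {
    std::istringstream words(phrase);
    std::string word;
    std::string result;
    while (words >> word) {
        auto abbr = kb.abbreviations.find(word);
        if (abbr != kb.abbreviations.end()) word = abbr->second;

        if (kb.fillers.count(word) > 0) continue;  // skip stop words

        auto syn = kb.wordSynonyms.find(word);
        if (syn != kb.wordSynonyms.end()) word = syn->second;

        if (!result.empty()) result += ' ';
        result += word;
    }
    return result;
}
```

With the fillers "of" and "the", a made-up abbreviation entry mapping "mfr" to "manufacturer", and the synonym pair car and automobile from the text, `normalize("mfr of the car", kb)` yields "manufacturer automobile".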
Single workstations that perform extraction and classification processing have several drawbacks. Generally, a single workstation uses a single environment of one thesaurus, one knowledge base, and a limited set of synonyms. Some applications may demand multiple thesauri and versions of thesauri, with a unique knowledge base for each, and a set of synonyms that are tailored to the application. Also, a single workstation can only support a single user, and lacks the capability to support hundreds of users in a multi-tiered organization with intercepting lines of authority and reporting. Single workstations use a single controlled vocabulary for all processing, and lack the capability to be expanded to include generalized areas (domains, i.e., generalizations of multiple concepts) that tier down to specific items (studies or work packages). Current systems are not compatible with legacy systems. Current systems do not allow customer control of the assignment and use of approved term synonym lists.
Moreover, current systems do not remove duplicate terms within an input before extraction and classification. Current systems have no management and maintenance tools that allow for the establishment of domains, the establishment of work packages within domains, the assignment of processing environments to work packages, and the assignment of personnel to domains and work packages. Current systems do not allow the loading of multiple thesauri, the maintenance of thesauri, the establishing and maintaining of multiple tiers of term synonym tables, or the association of term synonyms at the enterprise, domain, and work package levels. Current systems neither assign user roles nor prevent any user from doing any work on the system. Further, current systems use a single knowledge base, and do not support copying and associating knowledge bases with various thesauri.
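To make the notion of multiple tiers of term synonym tables concrete, the following C++ sketch shows one possible lookup across enterprise, domain, and work package levels. The `SynonymTiers` structure, the `resolveSynonym` function, and in particular the precedence rule (the most specific tier wins) are assumptions made for illustration; the text above does not specify how the tiers are to be consulted.

```cpp
#include <initializer_list>
#include <map>
#include <string>

using SynonymTable = std::map<std::string, std::string>;

// Hypothetical grouping of the three tiers named in the text.
struct SynonymTiers {
    SynonymTable enterprise;   // broadest tier
    SynonymTable domain;       // applies to one domain
    SynonymTable workPackage;  // applies to one work package (most specific)
};

// Look up a term's synonym, consulting the most specific tier first.
// Assumed precedence: work package, then domain, then enterprise.
// Returns nullptr when no tier has an entry for the term.
const std::string* resolveSynonym(const std::string& term,
                                  const SynonymTiers& tiers) {
    for (const SynonymTable* table :
         {&tiers.workPackage, &tiers.domain, &tiers.enterprise}) {
        auto it = table->find(term);
        if (it != table->end()) return &it->second;
    }
    return nullptr;
}
```

Under this assumed rule, an entry at the work package level overrides an entry for the same term at the enterprise level, while terms defined only at a broader tier remain available to every work package beneath it.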
The present invention may be directed to a method for classifying structured vocabulary that includes: receiving input including one or more terms, where the terms are related to an area of technology; extracting every term from the input; reviewing results from the extracting and manually modifying the extracted terms if appropriate; classifying each extracted term, where the classification associates a classified term to each extracted term, and where each classified term is related to the area of technology; reviewing results from the classifying and manually modifying the classification results if appropriate; and generating a result output containing each term and the associated classified term.
The present invention may also be directed to a method for classifying structured vocabulary that includes: receiving input including one or more terms, where the terms are related to an area of technology; classifying each term, where the classification associates a classified term to each term, and where each classified term is related to the area of technology; reviewing results from the classifying and manually modifying the classification results if appropriate; and generating a result output containing each term and the associated classified term.
The extracted terms may be filtered, where the filtering removes duplicate extracted terms to produce one or more unique terms, and the classification is performed on the one or more unique terms. The input may be categorized into one of one or more work packages, where the one work package is part of a domain, and each domain includes one or more work packages.
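The filtering of duplicate extracted terms might, for example, be sketched in C++ as follows. The function name `uniqueTerms` is hypothetical, and the choices to compare terms by exact string equality and to keep the first occurrence of each term in its original order are assumptions of this sketch.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Remove duplicate extracted terms, keeping the first occurrence of each
// term and preserving the original order of the remaining terms.
std::vector<std::string> uniqueTerms(const std::vector<std::string>& extracted) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> unique;
    for (const auto& term : extracted) {
        // insert() reports through .second whether the term was newly added.
        if (seen.insert(term).second) unique.push_back(term);
    }
    return unique;
}
```

Classification would then be performed once per unique term rather than once per occurrence, which reduces both the processing and the amount of manual review.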
Further, the present invention may be directed to a system for classifying structured vocabulary that includes: one or more networks; one or more client computing devices that are operatively connected to the one or more networks; one or more databases that are operatively connected to the one or more networks; and one or more servers that are operatively connected to the one or more networks, where the servers receive input from the clients, and the input includes one or more terms related to an area of technology and causes the servers to perform: extracting every term from the input; reviewing results from the extracting and manually modifying the extracted terms if appropriate; classifying each extracted term, where the classification associates a classified term to each extracted term, and where each classified term is related to the area of technology; reviewing results from the classifying and manually modifying the classification results if appropriate; and generating a result output containing each term and the associated classified term.
Moreover, the present invention may be directed to a system for classifying structured vocabulary that includes: a workstation; and one or more databases that are operatively connected to the workstation, where the workstation receives input that includes one or more terms related to an area of technology, and the input causes the workstation to perform: extracting every term from the input; reviewing results from the extracting and manually modifying the extracted terms if appropriate; classifying each extracted term, where the classification associates a classified term to each extracted term, and where each classified term is related to the area of technology; reviewing results from the classifying and manually modifying the classification results if appropriate; and generating a result output containing each term and the associated classified term.
Additionally, the present invention may be directed to an article comprising a storage medium having instructions stored therein that, when executed, cause a computing device to perform: receiving input comprising one or more terms, where the terms are related to an area of technology; extracting every term from the input; reviewing results from the extracting and manually modifying the extracted terms if appropriate; classifying each extracted term, where the classification associates a classified term to each extracted term, and where each classified term is related to the area of technology; reviewing results from the classifying and manually modifying the classification results if appropriate; and generating a result output containing each term and the associated classified term.