1. Field of the Invention
The present invention is directed toward the field of morphological and ontological systems, and more particularly toward techniques to automatically discover categories in a terminological knowledge base.
2. Art Background
In general, knowledge bases include information arranged to reflect ideas, concepts, or rules regarding a particular problem set. Knowledge bases have application for use in natural language processing systems (a.k.a. artificial linguistic or computational linguistic systems). Natural language processing knowledge bases store information about language, including how terminology relates to other terminology in that language. For example, such a knowledge base may store information that the term “buildings” is related to the term “architecture,” because there is a linguistic connection between these two terms.
Natural language processing systems use knowledge bases for a number of applications. For example, natural language processing systems use knowledge bases of terminology to classify information or documents. One example of such a natural language processing system is described in U.S. Pat. No. 5,694,523, entitled “Content Processing System for Discourse,” issued to Kelly Wical on Dec. 2, 1997, which is expressly incorporated herein by reference.
Terminological knowledge bases also have application for use in information search and retrieval systems. In this application, a knowledge base may be used to identify terms related to the query terms input by a user. Examples of the use of a knowledge base in an information search and retrieval system are described in U.S. patent application Ser. No. 09/095,515, entitled “Hierarchical Query Feedback in an Informative Retrieval System,” Inventor Mohammad Faisal, filed on Jun. 10, 1998, and U.S. patent application Ser. No. 09/170,894, entitled “Ranking of Query Feedback Terms in an Information Retrieval System,” Inventors Mohammad Faisal and James Conklin, filed on Oct. 13, 1998, both of which are incorporated herein by reference.
One type of terminological knowledge base, disclosed in U.S. patent application Ser. No. 09/095,515, associates one or more terms or concepts with categories of the knowledge base. For example, the category “operating systems” may include a number of concepts that, although associated with the category “operating systems,” are not categories themselves. For this example, the terms “UNIX,” “Windows '98,” and “Mac OS8” may be associated with the knowledge base category “operating systems.” In one implementation for a terminological knowledge base, there may be hundreds or even thousands of these terms associated with a single category.
As discussed above, natural language processing systems use terminological knowledge bases to classify information, such as documents. If these natural language processing systems classify terms primarily based on categories, then it is desirable to provide as many categories as possible while still maintaining the accuracy of the ontological distinctions. If a single category has hundreds or thousands of terms associated with it, then the categorization of a particular document to a term loses distinction as the number of terms in that category grows large. Accordingly, a document classified in a category that has too many associated terms becomes difficult to index accurately with regard to the proper classification of subject matter in that document. Similarly, if the number of concepts in a single category grows too large, then the performance of terminological knowledge bases for use in information search and retrieval systems becomes degraded. For example, if a category of a knowledge base is used to identify additional subject matter areas from a search query, and a single category is associated with 1,000 terms, then the use of that category to identify additional subject matter may become overly inclusive (i.e., too many subject matter areas are identified through the single category in the knowledge base). Accordingly, it is desirable to limit the number of concepts or terms associated with a single category of a terminological knowledge base.
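By way of illustration only, the over-inclusiveness described above may be sketched as follows. The dictionary layout, the function name `related_terms`, and the placeholder terms `term-0` through `term-998` are assumptions made for this example and are not drawn from the referenced applications.

```python
# Illustrative sketch only: a knowledge base is modeled (by assumption) as a
# mapping from a category name to the list of terms associated with it.

def related_terms(knowledge_base, query_term):
    """Return every other term that shares a category with query_term."""
    related = []
    for terms in knowledge_base.values():
        if query_term in terms:
            related.extend(t for t in terms if t != query_term)
    return related


# A category with a handful of terms yields a focused set of related terms.
small_kb = {"operating systems": ["UNIX", "Windows '98", "Mac OS8"]}

# A category with 1,000 terms (placeholder names) yields 999 related terms
# for a single query term, which is the over-inclusive case.
large_kb = {"operating systems": ["UNIX"] + [f"term-{i}" for i in range(999)]}

small_result = related_terms(small_kb, "UNIX")
large_result = related_terms(large_kb, "UNIX")
print(small_result)
print(len(large_result))
```

In this sketch, the number of related terms returned for a query grows in direct proportion to the size of the category, which is why limiting the number of terms per category narrows the feedback presented to a user.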
One way of controlling the number of concepts associated with a single category is to split the category into one or more subcategories. Using this approach, terms within that single category that are related may become subcategories beneath the parent or original category. One approach to splitting or dividing categories is through a linguist's manual interpretation of each category to determine both whether a category should be subdivided, and if so, which terms associated with that category should be subdivided. The manual process of making these determinations is laborious. In addition, if different linguists use different criteria, the knowledge base may grow to include subcategories based on underlying principles that may differ. Accordingly, it is desirable to automate the process of splitting one or more groups of terms associated with a single category to generate one or more subcategories.
A terminological system automatically generates sub-categories from categories of a knowledge base. The knowledge base includes a plurality of hierarchically arranged categories, as well as terms associated with the categories. A subset of the categories of the knowledge base is designated “dimensional categories.” The system also stores a corpus of documents, including themes and corresponding theme weights for each document. A target category in the knowledge base is selected to generate sub-categories for some of the terms associated with the target category. A set of themes from the corpus of documents is selected for each term. Dimensional category vectors, one for each term, are generated by associating the set of themes for a term with a dimensional category in the knowledge base. The dimensional category vectors for each term are analyzed to determine if one or more clusters of terminological groups exist. If one or more terminological groups exist, then the terminological groups form terms associated with a new sub-category.
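By way of illustration only, the steps summarized above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the sample corpus, the theme weights, the theme-to-dimension mapping, the term “Linux,” the use of cosine similarity, the greedy grouping, and the 0.8 threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical dimensional categories (an assumed subset of the knowledge base).
DIMENSIONAL_CATEGORIES = ["business", "science", "leisure"]

# Hypothetical mapping from a theme to the dimensional category it falls under.
THEME_TO_DIMENSION = {
    "servers": "business",
    "networking": "business",
    "kernels": "science",
    "video games": "leisure",
    "graphics": "leisure",
}

# Hypothetical corpus: each document carries themes and theme weights.
CORPUS = [
    {"UNIX": 10, "servers": 8, "kernels": 5},
    {"Linux": 9, "servers": 7, "networking": 4},
    {"Windows '98": 10, "video games": 9},
    {"Mac OS8": 8, "graphics": 7, "video games": 2},
]


def dimensional_vector(term):
    """Sum theme weights per dimensional category over documents with the term."""
    totals = defaultdict(float)
    for themes in CORPUS:
        if term not in themes:
            continue
        for theme, weight in themes.items():
            dimension = THEME_TO_DIMENSION.get(theme)
            if dimension is not None:
                totals[dimension] += weight
    return [totals[d] for d in DIMENSIONAL_CATEGORIES]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_terms(terms, threshold=0.8):
    """Greedily group terms whose vectors resemble a cluster's first member."""
    vectors = {t: dimensional_vector(t) for t in terms}
    clusters = []
    for term in terms:
        for cluster in clusters:
            if cosine(vectors[term], vectors[cluster[0]]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    # Only groups of two or more terms are candidates for a new sub-category.
    return [c for c in clusters if len(c) > 1]


# Terms associated with the target category "operating systems".
groups = cluster_terms(["UNIX", "Linux", "Windows '98", "Mac OS8"])
print(groups)
```

With this sample data, the terms whose themes fall mainly under the “business” dimension form one group and the terms whose themes fall under the “leisure” dimension form another; each group would be a candidate for a new sub-category beneath “operating systems.” Any clustering technique could be substituted for the greedy grouping used here.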