1. Field of the Invention
The invention generally relates to automatically discovering a concept hierarchy from a corpus of documents. More particularly, the invention is a method, system and computer program for automatically discovering concepts from a corpus of documents and automatically generating a labeled concept hierarchy.
2. Related Art
An enormous amount of information is generated every day; most of this information is in the form of soft documents. The information is fed into the computer systems of various establishments and organizations, and onto the World Wide Web, for the purpose of storage and circulation. The volume of soft documents generated can be estimated from the fact that there are about 4 billion static web pages on the Internet, and that the web is growing at a rate of about 7.3 million new pages per day.
Searching for relevant information in a huge volume of data is a difficult task if the information is not organized in some logical manner. The complexity of the search increases as the size of the data space increases. This may result in searches that miss relevant information or return redundant results. Therefore, it is essential that the information be stored and arranged in a logical manner; such storage makes it easier to browse and retrieve the stored information as and when required.
The problem of organizing this large volume of information/documents can be equated with the problem of arranging books in a library. In a library there are books dealing with diverse subjects. The books are organized in a logical manner, according to the subject they deal with, or according to the author, or according to some other characteristic (such as the publisher or the date of publication). The underlying objective is to create a system wherein a user can easily locate the relevant book. This logical arrangement of books not only helps a user in locating the required book but also helps a librarian in arranging the books in the relevant sections.
In a similar manner, we now also have soft documents that deal with numerous topics. These soft documents need to be classified and arranged in a logical manner. A ‘Document Taxonomy’ logically arranges documents into categories. A category is a predefined parameter (or characteristic) for clustering documents pertaining to that specified parameter. For example, a taxonomy dealing with financial reports may categorize relevant documents into categories such as annual statements and periodic statements, which can be further categorized according to the different operational divisions. The documents to be populated in a predefined category can be identified based on the content and ideas reflected therein. A given category in a taxonomy is populated with documents that reflect the same ideas and content. Taxonomy creation facilitates mining of relevant information from a huge corpus of documents by allowing a manageable search space (category) to be created. This facilitates easier browsing, navigation and retrieval.
Taxonomy construction is a challenging task and requires in-depth knowledge of the subject for which the taxonomy is being generated. As such, taxonomy construction is usually done manually by experts in that particular domain. One example of a manually created taxonomy structure is the directory structure of Yahoo. Manual taxonomy construction is usually time-consuming and costly. Further, with the development of science and technology, new fields are being identified and novel terms are being coined. This makes updating of taxonomies a difficult task.
The organization of documents within categories in a taxonomy is facilitated if the content and ideas of a document can be identified automatically, without having to read through every document in a large corpus. The salient ideas reflected in the documents can be defined as ‘Concepts’. For example, a document dealing with ‘Renewable energy systems’ may have concepts like windmill, solar energy, solar lighting, natural resources, biofuel and others. The concepts are arranged in a hierarchical structure, where related concepts are arranged close to each other and more general concepts are nearer to the top of the hierarchy. The concept hierarchy can be regarded as “a tree” (data structure), where the most general concept forms the root of the tree and the most specific ones are the leaves. The following is an example of a concept hierarchy: if Science is taken as the root, it may have Physics, Chemistry, and Biology as its “children” nodes. In turn, Physics, Chemistry and Biology may have their own children nodes; for example, Physics may have Mechanics, Electromagnetism, Optics, and Thermal Physics as its children nodes; Chemistry may have Organic chemistry and Inorganic chemistry as its children nodes; and Biology may have Botany and Zoology as its children nodes. These nodes may further have children, and so on, until leaves (i.e., nodes having no children) are reached. Leaves signify the most specific classifications of science. Indeed, in one such hierarchy, neurology, pathology, nuclear magnetism, and alkenes may form the leaves of the hierarchy.
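The tree described above can be sketched in a few lines of Python. This is only an illustration of the data structure: the names `ConceptNode`, `leaves` and `depth` are hypothetical and are not taken from the invention, and the hierarchy is cut off at the second level of the Science example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptNode:
    """One node of a concept hierarchy: a label plus its more specific children."""
    label: str
    children: List["ConceptNode"] = field(default_factory=list)


def leaves(node: ConceptNode) -> List[str]:
    """Collect the labels of all leaves (most specific concepts), depth first."""
    if not node.children:
        return [node.label]
    found: List[str] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def depth(node: ConceptNode) -> int:
    """Number of levels from this node down to its deepest leaf."""
    return 1 + max((depth(child) for child in node.children), default=0)


# The Science example from the text, truncated at the second level.
science = ConceptNode("Science", [
    ConceptNode("Physics", [
        ConceptNode("Mechanics"),
        ConceptNode("Electromagnetism"),
        ConceptNode("Optics"),
        ConceptNode("Thermal Physics"),
    ]),
    ConceptNode("Chemistry", [
        ConceptNode("Organic chemistry"),
        ConceptNode("Inorganic chemistry"),
    ]),
    ConceptNode("Biology", [
        ConceptNode("Botany"),
        ConceptNode("Zoology"),
    ]),
])
```

In this truncated tree the root is the most general concept, and `leaves(science)` returns the eight sub-disciplines; a fuller hierarchy would simply attach further children to those nodes.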
Concepts organized in a hierarchical structure make it easier for a user to perform a search pertaining to these concepts. Further, searching for concepts also helps in populating categories in a Document Taxonomy with documents associated with those concepts. A category can contain more than one concept. Similarly, a concept can be used in more than one category. A conceptual search locates documents relevant to a concept based on keywords associated with the concept. A conceptual search may be used as a first step in identifying documents for a category in a taxonomy. Thus, automated concept and concept hierarchy generation considerably reduces the time and cost involved in manual taxonomy construction.
The process of automated concept extraction and concept hierarchy generation involves the following two steps: (a) Identification and extraction of concepts from the corpus of documents; and (b) Arrangement of concepts in a concept hierarchy.
a) Identifying and extracting concepts from the corpus of documents: Concepts represent the key ideas in a document. The key ideas of a document are often well represented by a set of keywords. These keywords are extracted from the corpus of documents, and then related keywords are clustered together to represent a concept.
b) Concept hierarchy generation: The above-mentioned step of concept extraction usually results in a number of concepts being generated. Many of these concepts are related and many times a concept can be broken into further sub-concepts. A logical relationship among concepts is required to be identified. A concept hierarchy representing this logical relationship between concepts is generated.
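Step (a) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: keywords are chosen by plain document frequency, and two keywords are treated as "related" when they co-occur in at least `min_cooccur` documents. The stop-word list, the function names and the threshold are all hypothetical. Step (b), arranging the resulting clusters into a hierarchy, is not covered here.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative stop-word list (an assumption, not from the text).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "from", "for"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(docs, top_n):
    """Pick the top_n words by document frequency (ties broken alphabetically)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def cluster_keywords(docs, keywords, min_cooccur=2):
    """Merge keywords that co-occur in at least min_cooccur documents (union-find)."""
    parent = {k: k for k in keywords}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    token_sets = [set(tokenize(doc)) for doc in docs]
    for a, b in combinations(keywords, 2):
        shared = sum(1 for tokens in token_sets if a in tokens and b in tokens)
        if shared >= min_cooccur:
            parent[find(a)] = find(b)

    clusters = {}
    for k in keywords:
        clusters.setdefault(find(k), set()).add(k)
    return sorted(sorted(cluster) for cluster in clusters.values())


docs = [
    "solar energy powers solar lighting",
    "solar energy and wind energy",
    "windmill turbines harvest wind",
    "wind drives the windmill",
]
keywords = extract_keywords(docs, top_n=4)
concepts = cluster_keywords(docs, keywords)
```

On this toy corpus the four keywords fall into two clusters, one around solar energy and one around wind and windmills, each standing in for a concept.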
Numerous methods have been developed for extracting concepts and generating concept hierarchies. Most of these methods use lexical information to extract concepts and to arrange them in hierarchical order.
“Automatic Concept Extraction From Plain Text”, presented in AAAI Workshop on Learning for Text Categorization, Madison, July 1998 by Boris Gelfand, Marilyn Wulfekuhler and William F. Punch III describes a system for extracting concepts from unstructured text. This system depends on lexical relationships among words and uses WordNet to find these relationships. WordNet is a lexical reference system of words. In WordNet, words are arranged according to lexical concepts. For example, nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Certain semantic features, often called “base words”, are extracted from raw text and are then linked together in a Semantic Relationship Graph (SRG). Base words constitute the nodes of an SRG, and semantically related “base words” are linked to each other. For those “base words” that do not have a direct semantic relationship in the lexical database but are linked via a connecting word, this connecting word is added as an “augmenting word”. For example, if the two words “biology” and “physics” appear in the SRG and are not directly related, then it is likely that an “augmenting word” like “science” will be introduced into the SRG. Nodes that are not connected to enough other nodes are removed from the graph. The resulting graph captures semantic information of the corpus and is used for classification. Finally, the SRG is partitioned into sub-graphs in order to obtain classes of various documents.
“WEBSOM—Self Organizing Maps of Document Collections”, presented in WSOM97 Workshop on Self-Organizing Maps, Espoo, Finland, 1997, by Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen describes a method that uses a corpus of documents to extract a set of keywords that act as features for these documents. Suppose there are five documents to be classified and fifty keywords have been extracted out of these documents. These keywords are then used as features for these documents. For each of these documents, a “feature vector” of fifty dimensions is generated. Each element in the feature vector corresponds to the frequency of occurrence of the corresponding keyword in the document. These documents are mapped on a two dimensional map. Documents that are “close” to each other, according to the distance between their “feature vectors” are clustered together and are mapped close to each other on the map. This map provides a visual overview of the document collection wherein “similar” documents are clustered together.
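The feature-vector step described above can be sketched as follows. This is a simplified illustration, not the WEBSOM implementation: it builds keyword-frequency vectors and compares them with Euclidean distance, but it does not train a self-organizing map or produce the two-dimensional layout. The function names, the three-keyword vocabulary and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def feature_vector(doc, keywords):
    """One dimension per keyword; each element is that keyword's frequency in doc."""
    counts = Counter(doc.lower().split())
    return [counts[k] for k in keywords]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(index, vectors):
    """Index of the vector closest to vectors[index], excluding itself."""
    others = [j for j in range(len(vectors)) if j != index]
    return min(others, key=lambda j: euclidean(vectors[index], vectors[j]))


keywords = ["solar", "wind", "stock"]
docs = [
    "solar solar wind",
    "solar wind wind",
    "stock stock stock",
]
vectors = [feature_vector(doc, keywords) for doc in docs]
```

Here the two energy documents are closer to each other than either is to the finance document, which is the property a map of the collection would rely on when placing "similar" documents near each other.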
“Finding Topic Words for Hierarchical Summarization”, presented in International Conference on Research and Development in Information Retrieval, 2001, by D. Lawrie, W. Bruce Croft and A. Rosenberg describes a method for constructing topic hierarchies for the purpose of summarization. A topic hierarchy organizes topic terms into a hierarchy where the lower level topic terms cover the same vocabulary as their parents. This method uses conditional probabilities of occurrence of words in the corpus for extracting topic terms and for creating the topic hierarchy. The relationship between any two words is expressed in a graph. This graph is a directed graph in which nodes are terms in the corpus, connected by edges that are weighted by “subsumption” probability. A term ‘x’ subsumes a term ‘y’ if ‘x’ is a more general description of ‘y’. Nodes that have the highest subsumption probability and connect many other nodes are discovered as terms at a higher level of abstraction. This process is recursively repeated to discover terms at higher levels in the hierarchy.
“Deriving Concept Hierarchies From Text”, presented in International Conference on Research and Development in Information Retrieval, 1999, by M. Sanderson and Bruce Croft describes a means for automatically deriving a hierarchical organization of concepts from a corpus of documents, based on subsumption probabilities of a pair of words. Words form the nodes in a concept hierarchy derived by this method. The word ‘p’ is the parent of the word ‘c’ if the word ‘p’ is a more general description of the word ‘c’. The hierarchy generated captures the hierarchical relationship between words of the text.
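A subsumption test of this kind can be sketched with document co-occurrence counts. The sketch below is an illustration under assumptions, not a reproduction of the cited method: a word ‘p’ is treated as the parent of ‘c’ when ‘p’ appears in at least a fraction `threshold` of the documents containing ‘c’, while ‘c’ does not appear in every document containing ‘p’. The 0.8 default, the function names and the toy corpus are illustrative choices.

```python
def doc_sets(docs):
    """Map each term to the set of document indices in which it occurs."""
    index = {}
    for i, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            index.setdefault(term, set()).add(i)
    return index


def cond_prob(index, x, y):
    """P(x | y): fraction of the documents containing y that also contain x."""
    docs_y = index.get(y, set())
    if not docs_y:
        return 0.0
    return len(index.get(x, set()) & docs_y) / len(docs_y)


def subsumes(index, p, c, threshold=0.8):
    """True if p is the more general term: P(p | c) is high but P(c | p) < 1."""
    return cond_prob(index, p, c) >= threshold and cond_prob(index, c, p) < 1.0


def parent_child_pairs(docs, threshold=0.8):
    """All (parent, child) word pairs that satisfy the subsumption test."""
    index = doc_sets(docs)
    terms = sorted(index)
    return [(p, c) for p in terms for c in terms
            if p != c and subsumes(index, p, c, threshold)]


docs = [
    "science physics",
    "science chemistry",
    "science physics optics",
    "science biology",
]
pairs = parent_child_pairs(docs)
```

On this corpus ‘science’ comes out as a parent of every other word and ‘physics’ as a parent of ‘optics’. Note that the nodes are individual words, which is the limitation the following paragraph contrasts with.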
In contrast, the present system organizes concepts into a hierarchy. The bottom layer of nodes in the hierarchy consists of words. Internal nodes of the hierarchy are concepts (clusters of words) organized into different levels of abstraction. The hierarchy captures the relationships between concepts. Also, a node cannot belong to more than one parent in the hierarchy constructed by Sanderson and Croft. The present system does not suffer from this limitation.
In addition to the above mentioned research papers on the subject, various patents have been granted in the areas related to concept extraction and concept hierarchy construction.
U.S. Pat. No. 5,325,298 titled “Method for generating or revising context vectors for a plurality of word stems” and U.S. Pat. No. 5,619,709 titled “System and methods of context vector generation and retrieval” describe methods for generating context vectors which may be used in storage and retrieval of documents and other information. These context vectors are used to represent contextual relationships among documents. The relationships may be used to group related documents together.
U.S. Pat. No. 5,873,056 titled “Natural language processing system for semantic vector representation which accounts for lexical ambiguity” presents a method for automatic classification and retrieval of documents by their general subject content. It uses a lexical database to obtain subject codes, which are used for classification and retrieval. U.S. Pat. No. 5,953,726 titled “Method and apparatus for maintaining multiple inheritance concept hierarchies” deals with modification of concept properties and concept hierarchies.
The above methods and patents attempt to solve various problems associated with automated concept extraction and concept hierarchy generation. Still, various gaps remain, and the above-mentioned research papers and patents fail to address one or more of the following concerns.
Most of the systems that depend on lexical databases for concept extraction are restricted in their scope by the extent of coverage of those databases. Usually, lexical databases are not specialized enough for dealing with topics related to specialized subjects. Moreover, advancement in science and technology leads to the emergence of new fields and the coining of new terms; for example, ‘biometrics’ is a term that has been coined recently. Such fields and terms may not find reference in these databases.
Further, most of the systems that use probabilistic models for concept extraction and concept generation are deficient in their ability to handle the problems of ‘data sparsity’, ‘polysemy’ and the occurrence of ‘redundant keywords’.
The problem of data sparsity occurs because keywords are chosen from a corpus of documents. The occurrence frequency of a keyword in a collection of documents (as opposed to a single document) is sparse. This may result in an inaccurate weight being assigned to the keyword, which in turn distorts the measure of similarity between any two keywords.
Polysemy refers to the problem where one word has more than one meaning. For example, the word ‘club’ can mean a suit in cards, or a weapon, or a gathering. Obtaining the contextual meaning of the word is important for the purpose of generating concepts and arranging them in a hierarchy. Prior work in word sense disambiguation differentiates meanings of a word using lexical references that store pre-defined definitions of words. Further, conventional word sense disambiguation focuses on the lexical meanings of words, and the contextual meanings of a word are generally neglected. For example, a sense of the word ‘car’ according to lexical knowledge refers to a transportation vehicle, but the word ‘car’ may have two contextual senses, one related to ‘car insurance’ and the other to ‘car racing’.
The problem of ‘redundant keywords’ refers to the case where redundant words occurring in the corpus are chosen as keywords. For example, the term ‘reporter’ can occur numerous times in a corpus of documents dealing with sports. This term may be chosen as a keyword on the basis of occurrence frequency. However, this term has no bearing on the field of sports, and its use as a keyword for concept generation would result in inaccuracies.
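The ‘reporter’ example can be reproduced on a toy corpus. The sketch below only illustrates the problem, under the assumption that keywords are picked purely by raw occurrence count; it does not show any remedy, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def top_by_frequency(docs, n):
    """Pick the n most frequent words in the corpus (ties broken alphabetically)."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def document_share(docs, term):
    """Fraction of documents in which the term appears at least once."""
    hits = sum(1 for doc in docs if term in doc.lower().split())
    return hits / len(docs)


docs = [
    "reporter asked striker about goal",
    "reporter said bowler took wicket",
    "reporter noted sprinter won race reporter",
]
chosen = top_by_frequency(docs, 1)
```

A purely frequency-based selection picks ‘reporter’ first, even though the word occurs in every document and therefore says nothing about which sport a given document concerns.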
In view of the above shortcomings, there exists a need for an automated approach for concept extraction and concept hierarchy generation that overcomes the above-mentioned drawbacks.