1. Field of the Invention
The present invention relates to automatically naming a set of documents for document organization, and deals more particularly with a method, system and computer program for naming a cluster of words and phrases, using a lexical database to provide a name that appropriately brings out the essence of the same.
2. Description of the Related Art
A document can be viewed generally as a collection of words (such as a report, a news article, or a web page) or, more simply, as a collection of characters that can be typed on a keyboard or a typewriter. With advances in modern technology and an ever-increasing reliance on computers, the quantity of soft documents generated has increased sharply. Large corporations today typically generate and store hundreds of thousands of soft documents, or even more. As a result, data or document retrieval becomes difficult and time consuming, which creates a need for a system that classifies documents appropriately and efficiently. An efficient classification ensures that related documents are grouped together, which usually results in more efficient retrieval, browsing, navigation, and content organization of the entire document set, thereby making it easier to access. For example, a news provider (which could be a newspaper publisher, a radio station, a television station, or any other organization providing news) may have documents pertaining to finance, politics, sports, entertainment, classified advertisements, general advertisements, and other topics. If all these documents are lumped together without distinction, it will be difficult to search efficiently for a particular news article. Hence, all documents should preferably be classified under relevant subjects, and related or similar documents should be clustered together. For example, in most cases it would make sense for all documents pertaining to sports to constitute one category. Similarly, all documents pertaining to finance may fall under one category. Since the news provider may have archives of many such documents, their overall quantity tends to become quite large. Hence, in many cases, a further sub-classification may be required.
Continuing with the abovementioned example, the category of “sports” may have to be further divided into two or more classes; for example, a sub-category containing all articles related to “tennis” and another containing all articles related to “football” may emerge.
As can be seen from the above discussion, proper classification of documents is indeed an important issue for organizations such as libraries and big corporations that have large quantities of documents. Proper classification helps in logically arranging documents and reduces the time and effort spent on searching for a document on a particular subject.
In order to classify the documents appropriately, it is important to label a cluster of documents in the best manner possible. A label is a descriptive or identifying word or phrase that brings out the essence of the documents and can be used to uniquely identify the same. Traditional classification methods have relied on the author or some other professional (such as a library science professional) to label or index the documents, so that these labels or indices can be further used to classify the documents. Although this option of manually labeling and classifying documents may result in high quality, it is usually time consuming and expensive. Moreover, if the data associated with the set of documents becomes large, the effort involved in manual labeling often becomes monumental, and sometimes simply infeasible. Indeed, in the absence of such manual labeling, one is handicapped by the lack of any proper automatic labeling method.
In the past, numerous methods have been proposed for automatically generating labels of documents. Most methods use a few words from within the document to constitute the label. In such cases, the labels simply contain either the most frequent or the most descriptive words appearing in the document. Such methods may not generate labels that bring out the essence of the document completely. For instance, continuing with the aforementioned example, news articles on football games and tennis games are likely to have the word “reporter” occurring very frequently in them. If a labeling method that chooses the most frequently occurring words as the label were used, the word “reporter” would very likely occur in the label of the category containing the two documents (on football games and tennis games), and the method might even place these documents in the same category or sub-category. Since tennis and football articles cover different subjects, their classification under the same category or sub-category is not appropriate, and it may result in confusion at the time of searching for the documents. At a minimum, the label “reporter” would not appropriately bring out the context, essence, or import of either of these documents. Therefore, what is needed is a method to label a document in a way that brings out the subject matter of the document, including its key concepts and context. This discussion shows a need for concept based labeling of documents.
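The shortcoming of frequency based labeling described above can be illustrated with a minimal sketch. The two short texts, the stop word list, and the function name `frequency_label` are hypothetical and chosen only for illustration; they do not reproduce any particular prior art system.

```python
from collections import Counter

# Hypothetical stop word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "on", "of", "and"}

def frequency_label(docs, n=1):
    """Label a group of documents with its n most frequent non-stop words."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(n)]

# Toy articles: one on tennis, one on football.
articles = [
    "the reporter said the tennis final was close reporter notes",
    "reporter on the football game and the reporter view",
]

label = frequency_label(articles)  # ["reporter"]
```

Because “reporter” is the most frequent word in both toy articles, it becomes the label, although it says nothing about tennis or football.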
One prior art method uses Self-Organizing Maps (SOM) to classify and label documents. Typically, a document has many features, such as frequency of occurrence of a particular keyword associated with it. A document is therefore represented as a feature vector with the feature values (that is, the frequency of occurrence of the corresponding keyword) as its elements. Representing documents in this way enables one to use SOMs and to do cluster analysis of documents. WEBSOM and LabelSOM are two techniques that employ SOMs to cluster and label document collections.
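The feature vector representation described above can be sketched as follows. The keyword list, the sample texts, and the function name `feature_vector` are assumptions made for illustration only.

```python
from collections import Counter

def feature_vector(text, keywords):
    """Return one element per keyword: its frequency of occurrence in text."""
    counts = Counter(text.lower().split())
    return [counts[keyword] for keyword in keywords]

# Hypothetical keyword list and documents.
keywords = ["tennis", "football", "reporter"]
docs = [
    "reporter covers tennis match tennis final",
    "reporter covers football game",
]

vectors = [feature_vector(doc, keywords) for doc in docs]
# vectors == [[2, 0, 1], [0, 1, 1]]
```

Each document becomes a point in a space with one dimension per keyword, which is the form of input that cluster analysis techniques such as SOMs operate on.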
“WEBSOM—Self Organizing Maps of Document Collections”, presented in Proc. Workshop on Self-Organizing Maps (WSOM97), Espoo, Finland, 1997 by Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen describes a method that uses a list of documents to extract a set of keywords that act as features for these documents. Suppose there are five documents to be classified and fifty keywords have been extracted out of these documents. These fifty words are used as features for these documents. For each of these documents, a vector of fifty dimensions is generated. Each element in the feature vector corresponds to the frequency of occurrence of the corresponding keyword in the document. These documents are mapped on a two-by-two map. Documents whose feature vectors are “close” to each other, according to the distance between the vectors, are clustered together and are mapped close to each other on the map. Hence, this map provides a visual overview of the document collection wherein “similar” documents are clustered together. However, the method does not label documents. Moreover, the clustering uses only words appearing in the documents.
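The notion of documents being “close” to each other can be sketched by assigning each document vector to the nearest of a set of stored map unit vectors. This shows only the assignment step under the assumption of a Euclidean distance; the training of the map is omitted, and the unit vectors, document vectors, and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_unit(doc_vector, unit_vectors):
    """Index of the map unit whose stored vector is closest to the document."""
    return min(
        range(len(unit_vectors)),
        key=lambda i: euclidean(doc_vector, unit_vectors[i]),
    )

# Hypothetical stored vectors for two map units (features: tennis, football, reporter).
units = [[2, 0, 1], [0, 2, 1]]
doc_vectors = [[3, 0, 1], [0, 1, 1], [2, 0, 2]]

assignments = [nearest_unit(v, units) for v in doc_vectors]
# assignments == [0, 1, 0]
```

Documents assigned to the same unit form one cluster; the first and third toy documents, both dominated by the first feature, land together.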
“LabelSOM: On the Labeling of Self-Organizing Maps”, 1999 by Andreas Rauber describes an approach for automatically labeling a SOM (http://www.ifs.tuwien.ac.at/~andi). The output in this method is an N×M grid wherein a cluster of documents is mapped to a grid element, and this cluster is given a label using the words in the documents that have been mapped to this grid location. Documents to be mapped to a cluster are determined using the Euclidean distance between the documents and the stored feature vector representing the cluster. Each such cluster is thereafter labeled using certain elements from the stored feature vector. This is done by determining the contribution of each element in the feature vector towards the overall Euclidean distance; the elements that are the most distinguishing ones for that cluster are selected to form the label. The resulting labeled map allows the user to understand the structure and the information available in the map.
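One possible reading of this labeling step is sketched below. It assumes that a distinguishing element is one that contributes little to the Euclidean distance between the stored vector and the mapped documents while having a non-negligible stored value; this selection rule, the threshold `min_weight`, and all names and data are assumptions for illustration, not a statement of the published method.

```python
def per_feature_error(weight, docs):
    """Per-feature sum of squared differences between the stored cluster
    vector and each document vector mapped to the cluster."""
    return [
        sum((doc[k] - weight[k]) ** 2 for doc in docs)
        for k in range(len(weight))
    ]

def label_cluster(weight, docs, keywords, n_labels=2, min_weight=0.5):
    """Pick keywords with the smallest distance contribution, ignoring
    features whose stored value is below min_weight (assumed threshold)."""
    errors = per_feature_error(weight, docs)
    candidates = [k for k in range(len(weight)) if weight[k] >= min_weight]
    candidates.sort(key=lambda k: errors[k])
    return [keywords[k] for k in candidates[:n_labels]]

# Hypothetical cluster: stored vector and the two documents mapped to it.
cluster_keywords = ["tennis", "serve", "football", "reporter"]
stored = [2.0, 1.0, 0.0, 1.0]
mapped = [[2, 1, 0, 1], [2, 1, 0, 3]]

cluster_label = label_cluster(stored, mapped, cluster_keywords)
# cluster_label == ["tennis", "serve"]
```

Even under this reading, the label is drawn only from keywords that occur in the documents themselves.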
However, neither WEBSOM nor LabelSOM addresses the issue of naming a document with a meaningful name or phrase that appropriately brings out the import of the documents. Indeed, both use frequently occurring words in the documents as labels and, in the example given above, these techniques might label news articles on football games and tennis games under the heading of “reporter”. Moreover, the resulting label must consist of one or a few words that appear in the documents.
“Automated Concept Extraction from Plain Text”, AAAI Workshop on Learning for Text Categorization, Madison, July 1998 by Boris Gelfand, Marilyn Wulfekuhler and William F. Punch III describes a system for extracting concepts from unstructured text. This method identifies relationships between words in the text using a lexical database and identifies groups of these words that form closely tied conceptual groups. This method extracts certain semantic features from raw text, which are then linked together in a Semantic Relationship Graph (SRG). The output, the SRG, is a graph wherein words that are semantically related (according to the lexical database) are linked to each other. Furthermore, in this graph, if two words are not directly linked to each other but are linked by a connecting word in the lexical database, then this connecting word is added to the graph as an “augmented word” that connects these two words. For example, if two words, “priest” and “government”, appear in the SRG, and if they are not directly related, then it is likely that an “augmented word” such as “authority” will be added to the SRG and connected to both “priest” and “government.” Finally, the SRG is partitioned into sub-graphs in order to obtain classes of various documents. However, this paper does not address the issue of labeling a document or a set of documents; in other words, a strong need still remains as to how such classes should be labeled so that the corresponding labels exhibit the context, concepts, and import of the documents contained therein.
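The addition of an “augmented word” can be sketched with a toy graph construction. The dictionary `LEXICON` below is a hypothetical stand-in for a lexical database, and the function names are assumptions; the sketch covers only link creation, not feature extraction or graph partitioning.

```python
from itertools import combinations

# Toy stand-in for a lexical database: word -> directly related words.
LEXICON = {
    "priest": {"authority", "church"},
    "government": {"authority", "state"},
    "church": {"priest"},
    "authority": {"priest", "government"},
    "state": {"government"},
}

def related(a, b):
    """True if the toy lexicon links the two words directly."""
    return b in LEXICON.get(a, set()) or a in LEXICON.get(b, set())

def build_srg(words):
    """Link directly related words; otherwise add shared connecting
    words from the lexicon as augmented nodes."""
    graph = {word: set() for word in words}
    for a, b in combinations(words, 2):
        if related(a, b):
            graph[a].add(b)
            graph[b].add(a)
            continue
        connectors = (LEXICON.get(a, set()) & LEXICON.get(b, set())) - set(words)
        for connector in sorted(connectors):
            graph.setdefault(connector, set()).update({a, b})
            graph[a].add(connector)
            graph[b].add(connector)
    return graph

srg = build_srg(["priest", "government"])
# "authority" is added as an augmented word linked to both input words.
```

In this toy run the two input words remain unlinked to each other, and the only path between them passes through the augmented word.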
“Automatic Labeling of Document Clusters”, 2000 by Alexandrin Popescul and Lyle H. Ungar describes a method for labeling document clusters (http://www.cis.upenn.edu/~popescul/Publications/labeling KDD00.pdf). The method uses a statistical method called “χ² test of significance” for each word at each node in a hierarchy starting at the root and recursively moving down the hierarchy. If the hypothesis, that a word is equally likely to occur in all of the children of a given node, cannot be rejected, then it is marked as a feature of the current subtree. Thereafter this word is assigned to the current node's bag of node-specific words and removed from all the children nodes. After having reached the leaf nodes, each node is labeled by its bag of node-specific words. However, this labeling is of a very rudimentary form insofar as it merely picks words that exist in the document as the label for the document. These words, when used as labels, may not depict the context, concept, or the exact import of the document.
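The test applied to a single word at a single node can be sketched as a χ² test on a table of word occurrences versus all other word occurrences in each child. The 0.05 significance level, the word counts, and the function names are assumptions for illustration; the recursive traversal of the hierarchy is omitted.

```python
# Critical values of the chi-square distribution at the 0.05 level,
# indexed by degrees of freedom (number of children minus one).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def chi_square(word_counts, totals):
    """Chi-square statistic for one word across the children of a node.

    word_counts[i] is the number of occurrences of the word in child i and
    totals[i] is the total number of word occurrences in child i. Assumes
    the word occurs at least once and is not the only word present.
    """
    grand_word = sum(word_counts)
    grand_total = sum(totals)
    stat = 0.0
    for observed_word, n in zip(word_counts, totals):
        cells = (
            (observed_word, grand_word),
            (n - observed_word, grand_total - grand_word),
        )
        for observed, overall in cells:
            expected = n * overall / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def is_node_specific(word_counts, totals):
    """True if equal likelihood across children cannot be rejected, so the
    word stays with the current node rather than with its children."""
    degrees = len(totals) - 1
    return chi_square(word_counts, totals) <= CRITICAL_05[degrees]

# Hypothetical counts in two children ("tennis" and "football" clusters).
reporter_stays = is_node_specific([50, 50], [1000, 1000])  # True
tennis_stays = is_node_specific([80, 0], [1000, 1000])     # False
```

An evenly spread word such as “reporter” is kept at the parent node, while a word concentrated in one child is left to characterize that child.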
In addition to the abovementioned research papers on the subject, various patents have also been granted in the area of concept extraction and labeling.
U.S. Pat. No. 5,077,668 titled “Method and apparatus for producing an abstract of a document”, U.S. Pat. No. 5,638,543 titled “Method and apparatus for automatic document summarization”, U.S. Pat. No. 5,689,716 titled “Automatic method of generating thematic summaries” and U.S. Pat. No. 5,918,240 titled “Automatic method of extracting summarization using feature probabilities” deal with automatically producing an abstract of a document that is indicative of the content of the document. In all these inventions, certain phrases and sentences are picked from the document itself, based on predetermined heuristics, and are then juxtaposed to form the summary. However, these inventions merely summarize the document and do not address the issue of labeling.
U.S. Pat. No. 5,642,518 titled “Keyword assigning method and system therefor” describes a keyword assigning system for automatically assigning keywords to a large amount of text data. The domain-wise keywords are extracted from one of the many available text data inputs based on occurrence frequencies of domain-specific words stored in a memory. Thereafter, the text data to which a keyword is to be assigned is inputted. Finally, a keyword is extracted from the input text data using the domain-wise keywords. This keyword is assigned as the label to the input text data. However, this invention merely extracts words from within the input text data and uses them as labels. The label so assigned may not be very relevant to the document from a contextual point of view.
From a study of the various methods stated above, it is clear that although many attempts have been made at concept extraction and labeling of documents, none of these methods deals with labeling documents in a manner that reveals the context or the key concepts of the documents. Indeed, such methods merely restrict themselves to picking text from the documents themselves and using it as labels. Therefore, what is needed is a method, system and computer program for labeling a document or a set of documents in a manner that clearly brings out their key concepts and import. Moreover, little effort has been devoted to labeling a set of related words and phrases, as opposed to labeling documents directly.