The volume of electronic information generated and available has rapidly increased with advancements in electronics including digital processing, communications and storage. There have also been improvements to assist with analysis and retrieval of electronic data from databases or other data compilations. For example, systems and methods for enabling document clustering have been introduced to assist with analysis of large collections of documents. These systems and methods generate clusters of documents that are related in some way to one another. For example, the documents of the document collection may be analyzed, and documents that contain certain terms may be considered to be related to one another and may be placed in the same cluster. Clustering may be implemented by filtering the documents of the collection according to the frequency of occurrence of terms in documents of the collection, topics of the documents, overlap of subject matter of the documents and/or other criteria.
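As a rough illustration of clustering by term frequency, the following minimal sketch places each document in a cluster keyed by its most frequent non-stopword term. It is not the method of any particular clustering system; the names `STOPWORDS`, `tokenize` and `cluster_by_top_term`, the tiny stopword list and the sample documents are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"})


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cluster_by_top_term(documents):
    """Group document ids by each document's most frequent term.

    `documents` maps a document id to its text. Ties between equally
    frequent terms are broken alphabetically so the result is deterministic.
    Documents with no usable terms are skipped.
    """
    clusters = defaultdict(list)
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        if not counts:
            continue
        top_term = min(counts, key=lambda t: (-counts[t], t))
        clusters[top_term].append(doc_id)
    return dict(clusters)


if __name__ == "__main__":
    docs = {
        "d1": "The disk drive failed and the drive was replaced",
        "d2": "Enjoy the drive, a scenic drive to the coast",
        "d3": "Golf swing tips: swing smoothly",
    }
    print(cluster_by_top_term(docs))
```

In this toy collection, documents d1 and d2 fall into the same cluster because both are dominated by the term "drive", even though they concern different subjects. Practical systems typically combine term frequency with other criteria such as topic or subject matter overlap.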
One of the long-standing issues in document clustering concerns identifying the meaning of a cluster. In one approach, prominent terms within each cluster are identified and selected. These prominent terms may be presented to the user as labels which attempt to provide a general indication of the semantic content of each cluster as a whole.
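The selection of prominent terms as labels can be sketched as follows. This is a minimal example under stated assumptions, not a definitive labeling technique: prominence is approximated here by the number of documents in the cluster that contain a term, and the names `STOPWORDS`, `tokenize` and `label_clusters`, the parameter `k` and the sample clusters are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"})


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def label_clusters(clusters, k=3):
    """Return up to k label terms for each cluster.

    `clusters` maps a cluster id to a list of document texts. A term's
    prominence is the number of documents in the cluster containing it;
    ties are broken alphabetically so the result is deterministic.
    """
    labels = {}
    for cluster_id, texts in clusters.items():
        doc_freq = Counter()
        for text in texts:
            # Count each term at most once per document.
            doc_freq.update(set(tokenize(text)))
        ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
        labels[cluster_id] = ranked[:k]
    return labels


if __name__ == "__main__":
    clusters = {
        0: ["the disk drive failed", "replace the hard drive disk"],
        1: ["a long drive on the golf course", "golf drive distance"],
    }
    print(label_clusters(clusters, k=2))
```

With these sample clusters, the term "drive" is selected as a label for both clusters, although one cluster concerns storage hardware and the other concerns golf.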
In general, cluster labels can be helpful in clarifying the meaning of clusters. However, the utility of a cluster label is severely limited when the label is a polysemous word. For example, WordNet (located at www.cogsci.princeton.edu/~wn) gives 33 senses for the word “drive”: 12 as a noun and 21 as a verb. A user may be able to select the correct sense for a cluster label such as “drive” by comparison with the remaining labels of the cluster and direct inspection of the cluster file(s) in which the label occurs. However, such analysis is time-consuming, and users may not have the time or disposition to carry out meaning discovery tasks. Moreover, manual inspection is of no avail in situations where a machine, rather than a person, needs to have the correct meaning for the cluster label. These situations are typically present when document clustering is done within a language unknown to the user and cluster labels have to be automatically translated to provide the user with an indication as to whether a given cluster may be of interest. When cluster labels are translated from the unknown language to the language of the user, polysemous words will most likely have several different translations, and establishing what the cluster is about with a reasonable degree of certainty may be nearly impossible.
At least some aspects of the disclosure provide methods and apparatus for disambiguating labels of document clusters.