The present invention, in some embodiments thereof, relates to document cluster labeling and, more specifically, but not exclusively, to fusion of multiple labeling algorithms on a cluster of documents.
Standard document clustering algorithms do not provide labels that characterize the resulting clusters. Labels are instead provided by cluster labeling algorithms, which examine the contents of the documents in a cluster to find a label that best describes the topic(s) of the cluster and helps distinguish the clusters from each other. Given a cluster of documents that is as coherent as possible, a cluster labeling algorithm returns at least one label that may best describe the cluster's main topic. Labeling clusters of documents is a fundamental task in information retrieval, with applications including multi-document summarization, user profiling, and the like. For example, document cluster labeling algorithms are used for business intelligence and financial performance management, enterprise content management, business analytics and optimization, and user profiling in customer and social analysis.
Direct labeling algorithms extract the label(s) from the documents of the cluster. Examples of direct labeling algorithms include feature selection, most frequent document terms (keywords, phrases, n-grams, and the like), most frequent terms in the cluster centroid, anchor text, named entities, cluster hierarchy, and the like. Indirect labeling algorithms extract the label from relevant external label sources. Examples of indirect labeling algorithms include using labels extracted from Wikipedia categories, Freebase structured data, DBpedia structured data, and the like.
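By way of a non-limiting illustration, one of the direct labeling approaches named above, most frequent document terms, may be sketched as follows. The sketch is only one possible reading under stated assumptions: the function names `tokenize` and `label_cluster`, the small stopword list, the choice of ranking terms by the number of cluster documents that contain them, and the sample cluster are all hypothetical and are not taken from any particular labeling algorithm.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "by", "at"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def label_cluster(documents, num_labels=1):
    """Return the terms that appear in the largest number of cluster documents.

    Assumption: a term is counted once per document (document frequency),
    so that a single long document does not dominate the label choice.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Rank by descending document frequency; break ties alphabetically
    # so that the result is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:num_labels]]


# Hypothetical three-document cluster.
cluster = [
    "The bank raised interest rates on loans.",
    "Interest rates at the central bank are rising.",
    "A loan from the bank carries an interest charge.",
]
labels = label_cluster(cluster, num_labels=2)
print(labels)
```

In this sketch the candidate labels are drawn only from the cluster's own text, which is what distinguishes a direct labeling algorithm from an indirect one; an indirect algorithm would instead look up candidate labels in an external source such as Wikipedia categories.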