In the current business scenario, organizing and analyzing a huge volume of electronic records is a challenging task. To achieve the business objectives of an organization, categorizing the electronic records into different groups based on record similarity is a commonly deployed step. When the user does not know the number of groups to be formed or the nature of the groups, an unsupervised approach such as clustering is usually applied. In clustering, the system forms groups automatically by comparing each document with the other documents and using a similarity threshold to decide group membership. A few documents from the collection are selected as the cluster centers around which the groups are formed. Clustering textual answers to a survey questionnaire is one of the significant mechanisms for generating meaningful insights from textual responses.
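The threshold-based grouping described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the whitespace tokenizer, the Jaccard similarity measure, the threshold value of 0.3, and the function names `tokens`, `jaccard`, and `leader_cluster` are all assumptions chosen for illustration. Each document is compared with the existing cluster centers, joins the most similar center if the similarity reaches the threshold, and otherwise becomes a new center.

```python
def tokens(text):
    """Split a record into a set of lowercase word tokens (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(docs, threshold=0.3):
    """Group documents around center documents using a similarity threshold.

    Returns (centers, clusters), where centers holds the index of each
    center document and clusters holds the member indices of each group.
    """
    centers = []
    clusters = []
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        best, best_sim = None, 0.0
        # Find the most similar existing center.
        for c, center_idx in enumerate(centers):
            sim = jaccard(doc_tokens, tokens(docs[center_idx]))
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            clusters[best].append(i)
        else:
            # No center is similar enough: this document starts a new group.
            centers.append(i)
            clusters.append([i])
    return centers, clusters


docs = [
    "delivery was late",
    "late delivery again",
    "great customer support",
    "customer support was great",
]
centers, clusters = leader_cluster(docs)
print(centers)   # [0, 2]
print(clusters)  # [[0, 1], [2, 3]]
```

Note that the result contains only index groups; nothing in the output describes what each group is about, which is the labeling gap discussed next.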
Most clustering techniques do not provide descriptive labels for the clusters. To identify a good descriptive label for a set of documents, the user has to go through the documents manually, read and understand them, and only then create a descriptive label.
Automatic cluster labeling as disclosed in the prior art faces many challenges. A single word or a set of words used as a label is not a sufficient descriptor and fails to provide a descriptive label. A complete sentence as a label is too lengthy for many situations. A complete sentence, or the words and/or phrases in a centroid vector, is also not very useful because it is too lengthy and might not provide good coverage. The most frequent single word and/or phrase also fails to provide good coverage. Complex semantic analysis does not help because it is more time consuming than the clustering itself.
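The coverage problem with a most-frequent-word label can be made concrete with a small sketch. The definition of coverage used here (the fraction of records in the cluster that contain the label word) is an assumption for illustration, as are the function names `most_frequent_word` and `coverage` and the sample records.

```python
from collections import Counter


def most_frequent_word(records):
    """Return the word with the highest total count (ties broken alphabetically)."""
    counts = Counter(w for r in records for w in r.lower().split())
    return min(counts, key=lambda w: (-counts[w], w))


def coverage(label, records):
    """Fraction of records containing the label word (assumed definition).

    Assumes a non-empty list of records.
    """
    hits = sum(1 for r in records if label in r.lower().split())
    return hits / len(records)


cluster = [
    "refund refund refund please",
    "refund delayed",
    "app crashes often",
    "login fails",
    "slow checkout page",
]
label = most_frequent_word(cluster)
print(label)                     # refund
print(coverage(label, cluster))  # 0.4
```

In this hypothetical cluster the most frequent word appears in only two of the five records, so a label built from it says nothing about the remaining three.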
Many solutions for cluster labeling are provided in the prior art. One of them discloses extracting verb phrases and noun phrases from a given cluster using a natural language parser. The method then calculates the Kullback-Leibler divergence for each extracted keyword or combination of keywords. The most discriminative keywords for a given cluster are selected as the cluster labels. However, these labels are not good enough as cluster labels, and the method is computationally intensive. In addition, because of inherent limitations of the clustering process, a cluster might not contain a single theme or phrase that covers all the records in the cluster. Further, prior art techniques that label a cluster using the single most frequent phrase or keyword do not exemplify all the records in the given cluster. Thus, prior art techniques fail to provide an automatic way to generate a descriptive label that reflects most of the content of the given cluster.
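The Kullback-Leibler scoring step mentioned above can be sketched as follows. The source does not give the exact formula used in the prior art, so this is only one plausible reading: each term is scored by its contribution p(w) * log(p(w) / q(w)) to the divergence between the cluster's term distribution p and the whole collection's term distribution q. Whitespace-separated single words stand in for the parser-extracted noun and verb phrases, and the names `term_distribution`, `kl_scores`, and `top_labels` are hypothetical.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Relative frequency of each word across the given documents."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_scores(cluster_docs, all_docs):
    """Per-term contribution to KL(cluster || collection).

    Assumes cluster_docs is a subset of all_docs, so every cluster term
    has a non-zero probability in the collection distribution.
    """
    p = term_distribution(cluster_docs)
    q = term_distribution(all_docs)
    return {w: pw * math.log(pw / q[w]) for w, pw in p.items()}


def top_labels(cluster_docs, all_docs, k=2):
    """Pick the k highest-scoring terms (ties broken alphabetically)."""
    scores = kl_scores(cluster_docs, all_docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


all_docs = ["late delivery", "delivery delayed", "great support", "support helpful"]
cluster = all_docs[:2]
print(top_labels(cluster, all_docs))  # ['delivery', 'delayed']
```

Even in this toy case the output is a short list of discriminative words rather than a descriptive label, and the scoring requires term statistics over the entire collection for every cluster.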