The present invention relates to a system and method for interactively classifying and analyzing data, and is particularly applicable to classification and analysis of textual data.
It is becoming increasingly common for organizations to provide a helpdesk service to their customers. Typically, a customer will call the helpdesk to ask for information and to seek solutions to problems relating to the operation of products, the performance of services, necessary procedures and forms, etc. Typically, helpdesks are staffed by knowledgeable human operators, who often spend considerable time with each caller in order to answer the caller's questions. As a result, helpdesk operation is very expensive and manpower-intensive. Much of the helpdesk operator's time is spent solving identical or nearly identical problems over and over again. A need arises for a technique by which the solutions to frequently recurring problems may be automated in order to improve the efficiency of helpdesk operation. In particular, what is needed is a technique that can aid in identification of helpdesk inquiry and problem categories that are most amenable to automated fulfillment or solution.
The present invention relates to a system and method for interactively classifying and analyzing data, and is particularly applicable to classification and analysis of textual data. The present invention is useful in a variety of situations, and is particularly advantageous in aiding in identification of helpdesk inquiry and problem categories that are most amenable to automated fulfillment or solution.
The present invention is useful in identifying candidate helpdesk problem categories that are most amenable to automated solutions. In a preferred embodiment, the present invention uses clustering techniques to identify collections of problems from free-form text descriptions. It then facilitates a human user's modifications to the collections as appropriate to improve the coherence and usefulness of the classification. Measures of cluster goodness, such as intra-cluster cohesion and inter-cluster distinctness, are used to help the user determine which classes are the best candidates for automated solutions. Clusters are named automatically to convey some idea of their contents. Documents within each cluster may be viewed in sorted order by typicality. Ultimately, the user may use all of this information in combination to interactively modify the text categories to produce a classification that will be useful in authoring solutions.
The helpdesk application area is only one of many areas to which the present invention may be advantageously applied. One of ordinary skill in the art would recognize that any set of text documents may be classified and subsequently analyzed using the present invention.
The present invention is a method, system, and computer program product for interactive classification and analysis. In order to carry out the method, a dictionary is generated including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set. A count of occurrences of each word in the dictionary within each document in the document set is generated. The set of documents is partitioned into a plurality of clusters, each cluster containing at least one document. A name is generated for each cluster. A centroid of each cluster in the space of the dictionary is generated. A cohesion score is generated for each cluster. A distinctness score is generated for each cluster. A table including the name of each cluster and the cohesion score and distinctness score for each cluster is displayed.
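By way of non-limiting illustration, the dictionary generation step may be sketched as follows. The sketch assumes whitespace tokenization, lowercasing, and selection of the `max_words` most frequent words; the function name `build_dictionary` and these choices are hypothetical and are not prescribed by the present description, which requires only that the subset be chosen based on frequency of occurrence.

```python
from collections import Counter


def build_dictionary(documents, max_words=3):
    """Return the max_words most frequent words in the document set.

    Assumption: words are lowercased, whitespace-separated tokens, and
    ties in frequency are broken alphabetically for determinism.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


docs = [
    "printer will not print",
    "reset my password",
    "password expired reset password",
]
dictionary = build_dictionary(docs, max_words=3)
print(dictionary)  # ['password', 'reset', 'expired']
```

Other selection rules, such as a minimum frequency threshold or removal of very common words, would fit the same description equally well.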
Further, for at least one cluster, the documents contained in the cluster are displayed, sorted based on their similarity to other documents in the cluster. The similarity of a document to other documents may be determined by calculating the distance of the document to the centroid of the cluster. The documents may be sorted in order of descending distance of the document to the centroid of the cluster, or in order of ascending distance of the document to the centroid of the cluster.
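As an illustrative sketch only, the sorting of documents by distance to the cluster centroid may be expressed as below. The helper names (`cosine_distance`, `centroid`, `sort_by_typicality`) are hypothetical, and the treatment of an all-zero vector as having distance 0 is an assumption made so that the example runs on any input.

```python
import math


def cosine_distance(x, y):
    """Negative cosine of the angle between x and y; lower is closer."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return -dot / norm if norm else 0.0  # assumption: zero vectors get 0


def centroid(rows):
    """Per-word average of the count vectors in one cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sort_by_typicality(rows, most_typical_first=True):
    """Return document indices ordered by distance to the centroid."""
    center = centroid(rows)
    return sorted(
        range(len(rows)),
        key=lambda i: cosine_distance(center, rows[i]),
        reverse=not most_typical_first,
    )


cluster_rows = [[1, 0], [1, 1], [3, 1]]
print(sort_by_typicality(cluster_rows))         # [2, 0, 1]
print(sort_by_typicality(cluster_rows, False))  # [1, 0, 2]
```

Ascending distance places the most typical documents first; descending distance surfaces the outliers, which may help a user decide whether a cluster should be split.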
Further, editing input may be received from a user and the displayed table modified based on the received editing input. The editing input may comprise an indication of a cluster to be split, in which case the displayed table is modified by splitting the indicated cluster. The editing input may comprise an indication of a cluster to be deleted, in which case the displayed table is modified by deleting the indicated cluster.
The count may be generated by generating a matrix having rows and columns, each column corresponding to a word in the dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
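The matrix described above may be sketched, purely for illustration, as a list of rows in which each row is a document and each column a dictionary word. The function name `count_matrix` and the whitespace tokenization are assumptions of the sketch.

```python
def count_matrix(documents, dictionary):
    """Build a documents-by-words matrix of occurrence counts."""
    column = {word: j for j, word in enumerate(dictionary)}
    matrix = []
    for text in documents:
        row = [0] * len(dictionary)
        for token in text.lower().split():  # assumption: whitespace tokens
            j = column.get(token)
            if j is not None:  # words outside the dictionary are ignored
                row[j] += 1
        matrix.append(row)
    return matrix


dictionary = ["password", "reset", "printer"]
docs = [
    "reset my password",
    "printer will not print",
    "password expired reset password",
]
matrix = count_matrix(docs, dictionary)
print(matrix)  # [[1, 1, 0], [0, 0, 1], [2, 1, 0]]
```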
The set of documents may be partitioned by partitioning the set of documents into a plurality of clusters using a k-means partitioning procedure. The k-means partitioning procedure may include determining a distance between a centroid and a document vector using a distance function of:

$$d(X, Y) = -\frac{X \cdot Y}{\lVert X \rVert \cdot \lVert Y \rVert},$$
wherein X is the centroid, Y is the document vector, and d(X,Y) is the distance between the centroid and the document vector.
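For illustration, a minimal k-means partitioning using the negative-cosine distance d(X,Y) may be sketched as follows. The seeding of centroids with the first k document vectors, the iteration cap, and the retention of a centroid whose cluster becomes empty are assumptions of this sketch and are not taken from the present description.

```python
import math


def cosine_distance(x, y):
    """d(X, Y) = -(X . Y) / (||X|| * ||Y||); lower means more similar."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return -dot / norm if norm else 0.0  # assumption: zero vectors get 0


def mean_vector(rows):
    """Per-column average of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def k_means(rows, k, max_iter=100):
    """Partition rows into k clusters; returns (labels, centroids)."""
    centroids = [list(row) for row in rows[:k]]  # assumption: simple seeding
    labels = [None] * len(rows)
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(centroids[c], row))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
        # Recompute each centroid as the mean of its member documents.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


rows = [[2, 0, 0], [0, 3, 0], [1, 0, 0], [0, 1, 1]]
labels, centroids = k_means(rows, k=2)
print(labels)  # [0, 1, 0, 1]
```

Because d(X,Y) depends only on the angle between the vectors, a long document and a short document with the same word proportions are treated as equally close to a centroid.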
A name may be generated for each cluster by, for each cluster, including in the name of the cluster at least one word, the word selected from the dictionary based on a frequency of occurrence in the cluster. Likewise, a name may be generated for each cluster by, for each cluster, including in the name of the cluster a plurality of words, each word selected from the dictionary based on a frequency of occurrence in the cluster.
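One possible naming scheme, shown only as a sketch, concatenates the dictionary words that occur most often in the documents of the cluster. The function name `name_cluster`, the comma separator, and the alphabetical tie-breaking are hypothetical choices made for the example.

```python
def name_cluster(rows, dictionary, num_words=2):
    """Name a cluster after its most frequent dictionary words.

    rows holds the count vectors of the documents in the cluster, with
    one column per dictionary word.
    """
    totals = [sum(col) for col in zip(*rows)]
    ranked = sorted(
        range(len(dictionary)),
        key=lambda j: (-totals[j], dictionary[j]),
    )
    # Words that never occur in the cluster are left out of the name.
    return ",".join(dictionary[j] for j in ranked[:num_words] if totals[j] > 0)


dictionary = ["password", "reset", "printer"]
cluster_rows = [[1, 1, 0], [2, 1, 0]]
print(name_cluster(cluster_rows, dictionary))  # password,reset
```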
The centroid of a cluster may be generated by generating a vector having a plurality of entries, each entry corresponding to a word in the dictionary and having a value equal to an average, over the documents contained in the cluster, of the values of the entries in the matrix corresponding to the word in the dictionary.
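As a brief illustration, the centroid of a cluster may be computed from the rows of the count matrix that belong to the cluster. The function name `centroid` is hypothetical.

```python
def centroid(rows):
    """Average each dictionary-word column over the cluster's documents."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Two documents in one cluster, three dictionary words.
cluster_rows = [[1, 1, 0], [2, 1, 0]]
print(centroid(cluster_rows))  # [1.5, 1.0, 0.0]
```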
The cohesion score for each cluster may be generated by generating a cohesion score based on an average negative cosine distance of the centroid of the cluster to the documents contained in the cluster. The distinctness score for each cluster may be generated by generating a distinctness score based on an average negative distance of the centroid of the cluster to a closest centroid among centroids of other clusters.
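The two scores may be sketched as follows, for illustration only. Because the distance is the negative cosine, the negative distance is the cosine similarity itself, so cohesion is taken here as the average cosine similarity between the centroid and the documents of the cluster. The description states only that distinctness is based on the negative distance to the closest other centroid; taking one minus that similarity, so that larger values indicate a more distinct cluster, is an assumption of this sketch, as are all function names.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y, i.e. the negative of d(X, Y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # assumption: zero vectors get 0


def centroid(rows):
    """Per-word average of the count vectors in one cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cohesion(rows):
    """Average similarity of the cluster's documents to its centroid."""
    center = centroid(rows)
    return sum(cosine_similarity(center, row) for row in rows) / len(rows)


def distinctness(index, centroids):
    """One minus the similarity to the closest other centroid (assumed form)."""
    nearest = max(
        cosine_similarity(centroids[index], other)
        for j, other in enumerate(centroids)
        if j != index
    )
    return 1.0 - nearest


clusters = [[[2, 0], [4, 0]], [[0, 1], [0, 3]]]
centroids = [centroid(rows) for rows in clusters]
scores = [(cohesion(rows), distinctness(i, centroids))
          for i, rows in enumerate(clusters)]
print(scores)
```

In this toy example every document points in the same direction as its centroid and the two centroids are orthogonal, so both clusters are maximally cohesive and maximally distinct. A cluster with high cohesion and high distinctness would be a natural candidate for an automated solution.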
The displayed table may be user-editable and the user may be provided with the capability to delete and split clusters.