The present invention relates to a system and method for automatic generation of a comparison list given two different classifications.
Document classification, or grouping of documents, provides a means for a reader to quickly locate a set of similar documents that are most relevant to the reader's needs. In the past, such classifications were generated manually by a human expert or automatically via a computer program that compares the text of different documents based on frequency of word occurrence. Examples of electronic document classifications include folders of email messages, categorizations of help desk problem tickets, and logical groupings of research abstracts by subject.
A problem arises when comparing two different classifications in a domain of similar or identical documents. Different classifications may arise either because of a change in the method for generating a classification (e.g., human expert vs. automatic) or because the underlying set of documents being classified has changed (e.g., additional documents being authored over time). A comparison consists of a list in which each of the classes contained in one classification is matched with the single most similar class in a second classification. Past approaches to this problem have focused primarily on comparing classifications on the same document set, where the primary goal has been to find out which classification was better or more complete. A need arises for a technique which will provide automatic generation of such a list given two different classifications, and automatic sorting of the list in order of similarity.
The present invention is a system and method for automatic generation of a comparison list given two different classifications, and automatic sorting of the list in order of similarity. The two classifications may be over the same set of documents or two different (but somewhat similar) sets of documents. The approach of this invention is more flexible than past approaches, since it can apply to classifications on different document sets. The present invention does not discover which classification is "better", but rather discovers the key similarities and differences between classifications.
In order to perform the method of the present invention, a first dictionary is generated including a subset of words contained in a first document set, the first document set including at least one document and having an associated first classification including at least one class, each class having a class name. A second dictionary is generated including a subset of words contained in a second document set, the second document set including at least one document and having an associated second classification including at least one class, each class having a class name. A common dictionary including words that are common to both the first dictionary and the second dictionary is generated. A count of occurrences of each word in the common dictionary within each document in each document set is generated. A centroid of each class in the space of the common dictionary is generated. A nearest centroid in the second classification for each centroid in the first classification is determined. A list is generated including class names of each class in the first classification and a class name of a corresponding nearest class in the second classification, and the class names in the first classification are sorted based on a distance from a nearest centroid in the second classification.
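The dictionary steps described above can be illustrated with a minimal Python sketch. The text does not say how the subset of words is chosen, so selecting the most frequent words is an assumption here, as are the names `tokenize`, `build_dictionary`, `common_dictionary`, and the `max_words` parameter.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a word is a lowercase run of letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(documents, max_words=2000):
    # Assumption: the "subset of words" is the max_words most frequent words.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(max_words)}


def common_dictionary(first, second):
    # Words present in both dictionaries, sorted to give a stable word order.
    return sorted(first & second)


first_docs = ["printer driver fails to install", "password reset request"]
second_docs = ["cannot install printer", "reset my password please"]

common = common_dictionary(build_dictionary(first_docs),
                           build_dictionary(second_docs))
print(common)
```

Sorting the common dictionary fixes the position of each word, which lets every later vector share the same coordinate order.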
According to one aspect of the present invention, the count of occurrences is generated by generating a matrix having rows and columns, each column corresponding to a word in the common dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
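As an illustration of such a matrix, the following sketch builds one row per document and one column per common-dictionary word. The function name `count_matrix` and the tokenization rule are assumptions made for the example, not details taken from the text.

```python
import re


def count_matrix(documents, dictionary):
    # Map each dictionary word to its column index.
    column = {word: j for j, word in enumerate(dictionary)}
    matrix = []
    for doc in documents:
        row = [0] * len(dictionary)
        # Assumption: words are lowercase runs of letters or digits.
        for token in re.findall(r"[a-z0-9]+", doc.lower()):
            j = column.get(token)
            if j is not None:  # words outside the dictionary are ignored
                row[j] += 1
        matrix.append(row)
    return matrix


common = ["install", "password", "printer", "reset"]
docs = ["printer driver fails to install",
        "reset password, then reset again"]
matrix = count_matrix(docs, common)
print(matrix)
```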
According to another aspect of the present invention, the centroid of each class is generated by generating a vector having a plurality of entries, each entry corresponding to a word in the common dictionary and having a value equal to an average, over the documents in the class, of the values of the entries in the matrix corresponding to the word in the common dictionary.
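A small sketch of the centroid computation follows, assuming the average for each word is taken over the documents belonging to the class. The name `class_centroids` and the parallel `labels` list (one class name per matrix row) are hypothetical conventions chosen for the example.

```python
from collections import defaultdict


def class_centroids(matrix, labels):
    # Group the document rows by class name.
    rows_by_class = defaultdict(list)
    for row, label in zip(matrix, labels):
        rows_by_class[label].append(row)
    # Average each column (word) over the rows (documents) of the class.
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in rows_by_class.items()
    }


matrix = [[1, 0, 1, 0],
          [0, 1, 0, 2],
          [3, 0, 1, 0]]
labels = ["hardware", "accounts", "hardware"]
centroids = class_centroids(matrix, labels)
print(centroids)
```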
According to another aspect of the present invention, the nearest centroid in the second classification for each centroid in the first classification is determined by, for each centroid in the first classification, determining a distance between the centroid in the first classification and each centroid in the second classification; and selecting a centroid in the second classification having a least distance from the centroid in the first classification.
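The nearest-centroid search and the sorted comparison list can be sketched as below. The distance function is passed in as a parameter, and the example uses Euclidean distance (`math.dist`) purely as a stand-in. The name `comparison_list` and the ascending sort order (most similar pairs first) are assumptions.

```python
import math


def comparison_list(first, second, distance):
    # first and second map class names to centroid vectors.
    rows = []
    for name, centroid in first.items():
        # Pick the class in the second classification with the least distance.
        nearest = min(second,
                      key=lambda other: distance(centroid, second[other]))
        rows.append((name, nearest, distance(centroid, second[nearest])))
    # Assumption: smaller distance means more similar, so sort ascending.
    rows.sort(key=lambda r: r[2])
    return rows


first = {"hardware": [2.0, 0.0], "accounts": [0.0, 1.0]}
second = {"devices": [2.0, 0.5], "logins": [0.0, 1.0]}

result = comparison_list(first, second, math.dist)
for name, nearest, d in result:
    print(name, "->", nearest, d)
```

Each class in the first classification appears exactly once in the list, while a class in the second classification may be matched by several first-classification classes or by none.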
According to another aspect of the present invention, the distance between centroids is determined using a distance function of: $$d(X, Y) = -\frac{X \cdot Y}{\|X\| \cdot \|Y\|},$$
wherein X is the centroid in the first classification, Y is the centroid in the second classification, and d(X,Y) is the distance between the centroid in the first classification and centroid in the second classification.
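This distance function is the negative of the cosine of the angle between the two centroid vectors, so centroids pointing in the same direction have the least distance. A minimal sketch follows; the name `cosine_distance` and the handling of an all-zero centroid are assumptions, since the text does not address that case.

```python
import math


def cosine_distance(x, y):
    # d(X, Y) = -(X . Y) / (||X|| * ||Y||)
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a zero vector is treated as orthogonal to everything.
        return 0.0
    dot = sum(a * b for a, b in zip(x, y))
    return -dot / (norm_x * norm_y)


print(cosine_distance([1, 0], [2, 0]))  # same direction
print(cosine_distance([1, 0], [0, 3]))  # orthogonal
```

Because word counts are non-negative, the value falls between -1 (identical direction) and 0 (no shared words) for centroids built from count matrices.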