The following relates to the document organization, retrieval, and storage arts. It particularly relates to cleanup or adjustment of probabilistic categorization or clustering models generated by machine learning techniques, and is described with illustrative reference thereto. More generally, however, the following relates to cleanup or adjustment of categorization or classification models of any kind, and to runtime evaluation of how well a given document fits into the classification scheme.
The ability to store documents electronically has led to an information explosion. Information bases such as the Internet, corporate digital data networks, electronic government record warehouses, and so forth store vast quantities of information, which motivates development of effective information organization systems. Two commonly used approaches are categorization and clustering. In categorization, a set of classes is pre-defined, and documents are grouped into those classes based on content similarity measures. Clustering is similar, except that no classes are pre-defined; rather, documents are grouped or clustered based on similarity, and the resulting groups of similar documents define the set of classes.
In an illustrative probabilistic approach, each document is represented by a bag-of-words storing counts of occurrences of keywords, words, tokens, or other chunks of text, possibly excluding certain frequent and typically semantically uninteresting words such as “the” or “an”. Document similarities and differences are measured in terms of the word counts, ratios, or frequencies. In a supervised approach, a model is generated by supervised training based on a set of annotated training documents. In an unsupervised approach, the training documents are partitioned into various classes based on similarities and differences. The training or partitioning generates probabilistic model parameters indicative of word counts, ratios, or frequencies characterizing the classes. Categorization is similar to clustering, except that, rather than being grouped into classes, the training documents are pre-assigned to classes based on their pre-annotated class identifications. Categorization is also sometimes called “supervised learning”.
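By way of illustration only, the bag-of-words representation and the derivation of per-class word frequencies from annotated training documents might be sketched as follows. This is a minimal sketch and not a definitive implementation: the stop-word list, the function names `bag_of_words` and `class_word_probabilities`, the whitespace tokenization, the additive smoothing, and the toy documents are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical stop-word list; a practical list would be much longer.
STOP_WORDS = {"the", "an", "a", "of", "and"}


def bag_of_words(text):
    """Count token occurrences in a text, skipping stop words."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)


def class_word_probabilities(docs, labels, smoothing=1.0):
    """Estimate smoothed per-class word probabilities from annotated documents.

    Additive smoothing is an assumption of this sketch; it keeps words
    that never occur in a class from receiving zero probability.
    """
    vocab = set()
    totals = {}
    for text, label in zip(docs, labels):
        bag = bag_of_words(text)
        vocab.update(bag)
        totals.setdefault(label, Counter()).update(bag)

    model = {}
    for label, counts in totals.items():
        denom = sum(counts.values()) + smoothing * len(vocab)
        model[label] = {w: (counts[w] + smoothing) / denom for w in vocab}
    return model


# Toy annotated training set (hypothetical).
docs = ["the cat chased the mouse", "stocks and bonds fell", "the cat slept"]
labels = ["pets", "finance", "pets"]
model = class_word_probabilities(docs, labels)
```

In this sketch the class labels are supplied by annotation, which corresponds to categorization; in a clustering approach the labels would instead be produced by partitioning the training documents.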
In automated categorization or clustering approaches, the resulting model typically fits most documents well, but some documents may not fit well into any of the classes. This may or may not indicate that the document is inappropriate for the document classification scheme. For example, the document may be inappropriate in that it relates to a subject that is not intended to be covered by the document classification scheme. On the other hand, the document may relate to a subject that is to be covered, but the subject may be underrepresented in the set of training documents, so that the document does not match the parameters of any of the classes as derived from the training documents. In the case of categorization, some documents may seem to fit better into a class other than the class to which the document is assigned based on its annotation. This may or may not indicate an erroneous class annotation.
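The two situations described above, namely a document that fits no class well and a document that fits a class other than its annotated class, might be detected along the following lines. This is a hedged sketch under stated assumptions rather than a description of any particular method: the per-class word probabilities in `MODEL`, the floor probability `UNSEEN`, the threshold value, the function names, and the use of average per-token log-likelihood as the fit score are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical per-class word probabilities (each class sums to 1).
MODEL = {
    "pets":    {"cat": 0.5, "mouse": 0.3, "stocks": 0.1, "bonds": 0.1},
    "finance": {"cat": 0.1, "mouse": 0.1, "stocks": 0.4, "bonds": 0.4},
}
UNSEEN = 1e-6  # assumed floor probability for out-of-vocabulary words


def avg_log_likelihood(bag, word_probs):
    """Mean per-token log-probability of a bag of words under one class."""
    n = sum(bag.values())
    if n == 0:
        return float("-inf")  # an empty document fits no class
    total = sum(c * math.log(word_probs.get(w, UNSEEN)) for w, c in bag.items())
    return total / n


def assess(text, annotated=None, threshold=-3.0):
    """Return (best_class, best_score, fits_poorly, label_disagrees).

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    bag = Counter(text.lower().split())
    scores = {label: avg_log_likelihood(bag, p) for label, p in MODEL.items()}
    best = max(scores, key=scores.get)
    fits_poorly = scores[best] < threshold
    label_disagrees = annotated is not None and annotated != best
    return best, scores[best], fits_poorly, label_disagrees


# A document annotated "finance" whose words match the "pets" class better.
mislabel_candidate = assess("cat mouse cat", annotated="finance")

# A document whose words match no class.
outlier_candidate = assess("quantum entanglement")
```

Consistent with the discussion above, either flag in this sketch marks a document for review only; a poor fit or a label disagreement does not by itself establish that the document is inappropriate for the scheme or that its annotation is erroneous.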