The following relates to the information arts. It is described with example reference to providing keywords, semantic descriptions, or other characterization of classes of probabilistic clustering or categorization systems. However, the following is amenable to other like applications.
Automated clustering and categorization systems classify documents into classes on the basis of automated analysis of content. Categorization and clustering systems can employ soft or hard classification. In soft classification, a single document can be assigned to more than one class. In hard classification, also called partitioning, each document is assigned to a single class.
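By way of illustration only, the distinction between hard and soft classification can be sketched as follows. The sketch assumes that some classifier has already produced a probability for each class; the class names, the probability values, and the 0.3 threshold are hypothetical and are not taken from any particular system.

```python
def hard_assign(class_probs):
    """Hard classification (partitioning): pick the single most probable class."""
    return max(class_probs, key=class_probs.get)


def soft_assign(class_probs, threshold=0.3):
    """Soft classification: keep every class whose probability meets the threshold.

    The threshold is an illustrative assumption; a real system may use a
    different rule for deciding how many classes a document receives.
    """
    return sorted(c for c, p in class_probs.items() if p >= threshold)


# Hypothetical class probabilities for one document.
doc_probs = {"hardware": 0.55, "software": 0.40, "networking": 0.05}

print(hard_assign(doc_probs))  # hardware
print(soft_assign(doc_probs))  # ['hardware', 'software']
```

With the hard rule the document lands in exactly one class, whereas the soft rule allows it to belong to both of the classes that it plausibly matches.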
In probabilistic categorization systems, a set of training documents is typically annotated with class assignment information with respect to a pre-defined structure of classes. Probabilistic model parameters are computed that profile the usage of selected vocabulary words or word combinations in training documents, in training documents assigned to classes, or in other content groupings of interest. An algorithm is developed for assigning unclassified documents to classes based on application of the probabilistic model parameters. Thereafter, the developed algorithm is used to assign received unclassified documents to suitable classes.
Probabilistic clustering systems operate similarly, except that in clustering the pre-defined class structure and training document class annotations are not provided. Rather, training documents are clustered during the training to define classes based, for example, on similar usage of vocabulary words or word combinations, or other common document characteristics.
As some illustrative examples, Naïve Bayes type probabilistic classifiers employ Bayesian conditional vocabulary word probabilities, assuming statistical independence of the conditional probabilities. Another type of probabilistic classifier is the probabilistic latent classifier (PLC), which is described, for example, in Goutte et al., U.S. Publ. Appl. 2005/0187892 A1, and in Gaussier et al., “A hierarchical model for clustering and categorizing documents”, in “Advances in Information Retrieval Proceedings of the 24th BCS-IRSG European Colloquium on IR Research”, vol. 2291 of Lecture Notes in Computer Science, pages 229-47 (Springer, 2002), Fabio Crestani, Mark Girolami, and Cornelis Joost van Rijsbergen, editors.
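As a minimal sketch of the Naïve Bayes approach, the following trains per-class word probabilities from annotated documents and then assigns an unclassified document to the most probable class. The function names, the add-one (Laplace) smoothing parameter alpha, the choice to ignore words outside the training vocabulary, and the toy documents are all assumptions made for illustration; the sketch does not reproduce the probabilistic latent classifier of the cited references.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs, alpha=1.0):
    """Estimate log P(class) and log P(word | class) from (words, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)

    n_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}

    log_likelihood = {}
    for c in class_doc_counts:
        # Smoothed denominator: tokens in the class plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def classify(words, log_prior, log_likelihood):
    """Return the class with the highest posterior score.

    Word probabilities are treated as conditionally independent, so their
    logarithms are simply summed. Words unseen in training are skipped.
    """
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in words if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical annotated training documents.
training = [
    (["printer", "driver", "file"], "hardware"),
    (["keyboard", "screen", "printer"], "hardware"),
    (["compiler", "file", "code"], "software"),
    (["code", "editor", "screen"], "software"),
]
log_prior, log_likelihood = train_naive_bayes(training)
print(classify(["printer", "keyboard"], log_prior, log_likelihood))  # hardware
print(classify(["code", "compiler"], log_prior, log_likelihood))     # software
```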
A problem arises in characterizing the classes of an automated clustering or categorization system. The classes are advantageously characterized by keywords or by semantically descriptive phrases, sentences, or paragraphs. Such characterization enables, for example, user browsing of the classification system by semantic description or automated searching of the classification system using keywords. Such class characterization can substantially improve the efficiency and usefulness of the categorization or clustering system. However, deriving class-distinguishing keywords or semantic class descriptions is challenging.
One existing approach for characterizing classes of probabilistic categorizing or clustering systems is to identify as keywords for a class those vocabulary words that are most commonly used in the class. That is, vocabulary words that are most frequently used, or have the highest probability, in a class are assigned as keywords for that class. In a variant approach, the keywords for a class are limited to vocabulary words that are common to all documents in the class. In that case, the user is assured that every document in the class contains the keyword.
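The two existing approaches just described can be sketched as follows, with raw word counts standing in for the class-conditional word probabilities. The function names, the cutoff k, the alphabetical tie-breaking, and the toy documents are hypothetical and serve only to make the sketch concrete.

```python
from collections import Counter


def top_frequency_keywords(class_docs, k=3):
    """Return the k most frequently used words across the documents of one class."""
    counts = Counter(word for doc in class_docs for word in doc)
    # Sort by descending count, then alphabetically so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def common_to_all_keywords(class_docs):
    """Variant: return only the words that appear in every document of the class."""
    if not class_docs:
        return []
    common = set(class_docs[0])
    for doc in class_docs[1:]:
        common &= set(doc)
    return sorted(common)


# Hypothetical documents assigned to a single class.
printer_docs = [
    ["file", "screen", "printer", "toner"],
    ["file", "printer", "screen", "driver"],
    ["screen", "file", "toner", "printer"],
]
print(top_frequency_keywords(printer_docs))  # ['file', 'printer', 'screen']
print(common_to_all_keywords(printer_docs))  # ['file', 'printer', 'screen']
```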
These approaches have the disadvantage that common generic vocabulary words that occur frequently in more than one related class may be assigned as keywords for those related classes. For example, common computer-related words such as “keyboard”, “screen”, or “file” may be assigned as keywords for several different computer-related classes. Although these common generic keywords may help the user locate the group of computer-related classes, they typically provide little or no guidance to the user in refining the class selection to those computer-related classes that are of special interest to the user. The variant approach in which a keyword must be so common that it appears in every document of the class can further bias the class characterization toward selecting common generic terms as keywords.
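The disadvantage can be seen in a small sketch that applies frequency-based keyword selection to two related classes and reports the keywords assigned to more than one of them. The class names, documents, function names, and the cutoff k are hypothetical; raw counts again stand in for word probabilities.

```python
from collections import Counter


def top_keywords(class_docs, k):
    """Return the k most frequent words in a class (ties broken alphabetically)."""
    counts = Counter(word for doc in class_docs for word in doc)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def shared_keywords(classes, k=2):
    """Return keywords that frequency-based selection assigns to more than one class."""
    assignments = Counter()
    for docs in classes.values():
        assignments.update(set(top_keywords(docs, k)))
    return sorted(word for word, n in assignments.items() if n > 1)


# Two hypothetical computer-related classes that share generic vocabulary.
classes = {
    "printers": [["file", "screen", "printer"], ["file", "screen", "toner"]],
    "monitors": [["screen", "file", "pixel"], ["screen", "file", "hdmi"]],
}
print(shared_keywords(classes))  # ['file', 'screen']
```

In this toy example both classes receive the same two generic keywords, while the words that actually distinguish them ("printer", "toner", "pixel", "hdmi") are not selected.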