With modern advances in computer technology, modem speeds and network and Internet technology, vast amounts of information have become readily available in homes, businesses and educational and government institutions throughout the world. Many people rely on computer-accessible information on a daily basis. This global popularity has further increased the demand for even greater amounts of computer-accessible information. However, as the total amount of accessible information increases, locating specific items of information within the totality becomes increasingly difficult.
Common practices for managing such information complexity on the Internet or in database structures typically involve some ordering structure comprising a plurality of topics to which the information is assigned in order to be easily located by a user. Such an ordering structure might be, for example, hierarchical, linear, or structured in some other way.
Such topic ordering structures are referred to herein as “taxonomies”. Such taxonomies can provide a means for designing vastly enhanced searching, browsing, and filtering systems. Querying with respect to a specific topic can be more reliable than depending only on the presence or absence of specific words in documents. The danger in querying or filtering by keywords alone is that the keywords may have many aspects, and often different interpretations, many of which are irrelevant to the subject matter that the searcher intended to find.
Thus, prior art categorization systems are important in order to put a single document or a piece of information into the “box” where it belongs and where a user expects to find it.
Categorization systems need to be ‘trained’ by providing sets of typical documents, referred to herein as training documents, for each category before they can be used to assign categories to documents. Some systems allow a training document to belong to different categories. In the following, we use the term ‘training base’ to refer to a taxonomy and its set of training documents.
A well-established prior art method to measure the quality of categorization systems is to calculate ‘precision’ and ‘recall’ values that represent the degree to which documents from a test set with category information are assigned to the appropriate categories by the system. This test set is typically established by splitting the set of training documents for each category into a new training set and a test set according to a fixed proportion (for example 80% training, 20% test). Calculating precision and recall values is done by counting how many documents from the test set are assigned to the categories to which they belong and how many cannot be assigned to a category by the system. By repeating this with different randomly selected documents, the measurement becomes less dependent on the actual choice of documents, and its quality can thus be improved.
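The per-category split described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `split_training_base`, the dictionary layout (category name mapped to a list of document identifiers) and the `seed` parameter are hypothetical and are not taken from any particular prior art system.

```python
import random

def split_training_base(docs_by_category, test_fraction=0.2, seed=None):
    """Split each category's training documents into a new training set
    and a test set according to a fixed proportion (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)          # copy, so the input is not modified
        rng.shuffle(shuffled)          # random selection of test documents
        n_test = round(len(shuffled) * test_fraction)
        test[category] = shuffled[:n_test]
        train[category] = shuffled[n_test:]
    return train, test

# Assumed example data: document identifiers per category.
training_base = {
    "sports": [f"s{i}" for i in range(10)],
    "politics": [f"p{i}" for i in range(5)],
}

# Repeating the split with different seeds yields different random
# selections, so the measurement depends less on one particular choice.
for run in range(3):
    train, test = split_training_base(training_base, test_fraction=0.2, seed=run)
    print(run, {c: len(d) for c, d in test.items()})
```

In each run, the categorizer would be trained on `train` and evaluated on `test`, and the resulting precision and recall values could then be averaged over the runs.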
The following definitions of ‘precision’ and ‘recall’ are used:
Precision(c) = (number of documents assigned to category c which belong to c) / (number of all documents assigned to c)
Recall(c) = (number of documents assigned to category c which belong to c) / (number of documents belonging to c),
whereby “assigned to category c” means an assignment that results from applying the categorizer, whereas “belonging to category c” refers to a pre-assignment which is assumed herein to be available for all training documents, independently of the application of the categorizer. The pre-assignment is usually done manually.
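The two definitions can be illustrated with a short computation. The following is a minimal sketch, assuming that both the categorizer output and the manual pre-assignment are available as mappings from a document identifier to a set of category names. The function name `precision_recall` and the example data are hypothetical.

```python
def precision_recall(assigned, belonging, category):
    """Return (precision, recall) for one category.

    assigned:  doc id -> set of categories produced by the categorizer
    belonging: doc id -> set of categories from the (manual) pre-assignment
    A value of None is returned where the denominator would be zero.
    """
    assigned_to_c = {d for d, cats in assigned.items() if category in cats}
    belonging_to_c = {d for d, cats in belonging.items() if category in cats}
    correct = assigned_to_c & belonging_to_c   # assigned to c and belonging to c
    precision = len(correct) / len(assigned_to_c) if assigned_to_c else None
    recall = len(correct) / len(belonging_to_c) if belonging_to_c else None
    return precision, recall

# Assumed example: four test documents, one of which belongs to two categories.
belonging = {
    "d1": {"sports"},
    "d2": {"sports"},
    "d3": {"politics"},
    "d4": {"sports", "politics"},
}
assigned = {
    "d1": {"sports"},
    "d2": {"politics"},
    "d3": {"sports"},
    "d4": {"sports"},
}

for c in ("sports", "politics"):
    print(c, precision_recall(assigned, belonging, c))
```

In this example, three documents are assigned to "sports", of which two belong to it, and three documents belong to "sports" overall, so both precision and recall for "sports" are 2/3. For "politics", the only assigned document does not belong to the category, so both values are 0.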
In the following we use the term ‘training base’ to refer to the taxonomy and the sets of training documents per category.
Though precision and recall can be used to provide an overall assessment of the quality of the output of a categorization system, they provide only very limited information about where “problematic categories or training documents” reside within the training base, and how they could be improved.
Problematic categories are assumed herein to be basically those category constellations and training documents that have a negative effect on the training process and correspondingly decrease the quality of the output of a categorization system.
An example of such a prior art categorization system is disclosed in U.S. Pat. No. 6,233,575. Precision/recall values are used therein to provide feedback on the categorization system. The present invention is applicable to any categorization system of this type.
Taxonomies tend to be dynamic, as they typically need to be adapted to varying business environments. It is therefore important to note that neither creating a taxonomy nor evaluating the quality of a categorization scheme is a step that needs to be performed only once. Instead, both must be revisited often in practice. Since categorization systems use a mathematical model, learned in the training step, to map documents to categories, a change of a taxonomy may have a significant impact on the overall quality of the categorization system, even if the change seemed to affect only a small part of the taxonomy. Thus, categorization systems must be checked for quality after such a modification. It is desirable to do that job without major human involvement, as each human interaction is error-prone and, due to its monotony, laborious.
Though precision/recall-based feedback about the quality of a categorization system may help to show whether the result of the training phase is useful, this prior art approach provides only very limited information about which areas of the taxonomy should be improved if the result of the training phase is not deemed useful.