This section provides background information related to the present disclosure which is not necessarily prior art.
Computer-implemented or machine learning-based clustering or classifier systems are often used to separate a corpus of documents (text-based documents, images, etc.) into classes or clusters of like documents, to which business use case labels may (or may not) be applied. For example, a first classifier might identify all documents relating to topic A, a second classifier those relating to topic B, and so on. Once the documents are classified in this automated fashion, the respective classes can be used to assist a computer-implemented search algorithm or to extract business information relevant to a particular user.
Clustering and classifier systems are often implemented using statistical models. Generally, a classification system will include several individual classifiers, whereas a clustering system may often employ just one clusterer. Classifiers are sometimes said to be “trained” using training data. A classifier system may thus be fed a set of training data that are digested by a machine learning algorithm to define a set of trained models, each model being associated with a class represented in the training data. The training data are pre-labeled as belonging to a particular class. Hence, classification is sometimes called supervised learning, because the learning system is told what each set of training data represents. Clusterers are not trained in this fashion but instead generate clusters based on automatic discovery of the underlying structure, so clustering is sometimes called unsupervised learning.
By way of example, consider a supervised learning system designed to separate a corpus of email messages (documents) into different classes depending on what topic is discussed in the message. In supervised learning, a subject matter expert supplies the classifier with sample documents (training data) known to represent email messages relating to topic A, and the classifier stores parameters computed from those training data in a predetermined model for subsequent use in classifying later submitted documents. The supervised learning process would then be repeated for messages relating to topics B, C, and D, if desired, resulting in a set of trained models associated with a set of classifiers, each designed to recognize one of the topics A, B, C, D, etc.
Once trained, the set of classifiers may be used to identify whether a test document belongs to one of the trained topics. The test document is submitted to each of the classifiers (A, B, C, D, etc.), and each is asked whether the test document belongs to its class. For example, if a test document about topic A is supplied, classifier A would respond with “yes,” while the remaining classifiers would respond with “no.” In some instances a classifier may also supply a likelihood score indicating how certain it is of its decision.
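By way of illustration only, the per-topic arrangement described above might be sketched as follows. The sketch assumes a deliberately simple word-overlap model; the class name TopicClassifier, the 0.5 threshold, and the sample messages are hypothetical and are not drawn from any particular classifier system.

```python
from collections import Counter


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return text.lower().split()


class TopicClassifier:
    """Toy single-topic classifier (hypothetical, for illustration only)."""

    def __init__(self, topic, threshold=0.5):
        self.topic = topic
        self.threshold = threshold  # assumed cut-off for a "yes" answer
        self.vocab = Counter()      # the "trained model" for this topic

    def train(self, documents):
        """Store parameters (word counts) computed from pre-labeled documents."""
        for doc in documents:
            self.vocab.update(set(tokenize(doc)))

    def score(self, document):
        """Fraction of the document's tokens seen during training."""
        tokens = tokenize(document)
        if not tokens:
            return 0.0
        known = sum(1 for token in tokens if token in self.vocab)
        return known / len(tokens)

    def classify(self, document):
        """Return (yes/no decision, likelihood-style score)."""
        s = self.score(document)
        return s >= self.threshold, s


# Training data pre-labeled by topic (made-up sample messages).
training = {
    "A": ["invoice payment overdue", "payment received for invoice"],
    "B": ["team meeting agenda", "meeting rescheduled to friday"],
}

classifiers = {}
for topic, docs in training.items():
    classifiers[topic] = TopicClassifier(topic)
    classifiers[topic].train(docs)

# Submit one test document to every classifier in turn.
test_document = "overdue invoice payment reminder"
results = {t: c.classify(test_document) for t, c in classifiers.items()}
```

In this sketch, each classifier answers independently, so a document could in principle receive “yes” from more than one classifier or from none.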
Clustering systems work in essentially the same way, except that the clusters do not have labels previously supplied by a human rater.
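A minimal sketch of such unlabeled grouping is given below, assuming a greedy single-pass scheme based on Jaccard word overlap. The function names, the 0.3 similarity threshold, and the sample messages are assumptions made for illustration; practical clusterers generally use more robust statistical models.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(documents, min_similarity=0.3):
    """Greedy single-pass clustering; no labels are supplied.

    Each document joins the most similar existing cluster if the
    similarity reaches min_similarity; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"terms": set of words, "members": [indices]}
    for index, text in enumerate(documents):
        terms = set(text.lower().split())
        best, best_similarity = None, min_similarity
        for cluster in clusters:
            similarity = jaccard(terms, cluster["terms"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append({"terms": set(terms), "members": [index]})
        else:
            best["members"].append(index)
            best["terms"] |= terms
    return [cluster["members"] for cluster in clusters]


# Made-up sample messages; note that no topic labels are provided.
documents = [
    "invoice payment overdue",
    "team meeting agenda",
    "payment for invoice received",
    "meeting agenda updated",
]
groups = cluster_documents(documents)
```

The resulting groups carry only document indices; any business label would have to be attached afterward.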
Over time, a classifier or clustering system's accuracy may drift, as the corpus of test documents evolves. Perhaps a new topic will be added, or perhaps some underlying document feature has changed, necessitating new training. In addition, when a particular cluster gets too large, it may be necessary to subdivide it, again necessitating new training. In all these scenarios, the system designer needs a way of assessing how well the classifier or clustering system is performing, to know when it is time to retrain the models or when to build new ones.
Assessing the performance of a classifier or clustering system has traditionally involved a great deal of human labor. Typically, a human rater would look at each document within the assigned class or cluster individually and determine whether it does or does not belong. While there are some statistical techniques that can be used to reduce the workload, the review process still involves a human looking at potentially hundreds or thousands of documents before the performance quality of the classifier or clustering system can be ascertained.
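One example of such a statistical technique, offered here as an assumption rather than as the specific technique contemplated above, is to have the rater review only a random sample of a class or cluster and to estimate its precision with a confidence interval. The sketch below uses the normal-approximation interval, which is only a rough guide for small samples or for precision near 0 or 1. The function name estimate_precision and the rater callback are hypothetical.

```python
import math
import random


def estimate_precision(doc_ids, rater, sample_size, seed=0, z=1.96):
    """Estimate the fraction of documents that belong in a class or cluster.

    doc_ids: list of documents assigned to the class or cluster.
    rater: callable standing in for the human rater's judgment; returns
        True if the document belongs (an assumption for illustration).
    Returns (estimate, lower bound, upper bound) using a
    normal-approximation interval (z=1.96 for roughly 95% confidence).
    """
    n = min(sample_size, len(doc_ids))
    if n == 0:
        raise ValueError("cannot estimate precision from an empty sample")
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, n)
    correct = sum(1 for doc_id in sample if rater(doc_id))
    p = correct / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Simulated rater: pretend every tenth document was misassigned.
doc_ids = list(range(1000))
estimate, low, high = estimate_precision(
    doc_ids, lambda doc_id: doc_id % 10 != 0, sample_size=100
)
```

Even under this approach, the number of documents to review grows quickly as a narrower interval is required, so the human effort remains substantial.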