Many computational problems are too complex for a human to work out and explicitly code a solution. Examples include machine recognition of human facial characteristics, speech recognition, the classification of a corpus of documents into a taxonomy, and the extraction of information from documents. In an attempt to solve these problems, a class of algorithms has been developed that effectively trains a computer to perform a specific task by providing it with example data. This class of algorithms comes under the broad heading of Machine Learning, as the computer running such algorithms is attempting to “learn” how to solve a posed problem from example solutions.
Thus Machine Learning algorithms typically require a collection of human-labeled training examples as input, from which a solution is inferred. By way of an illustrative example, suppose the posed problem is to recognize, from among all web pages published on the Internet, those that are corporate “about-us” pages. The machine learning algorithm would be provided with labeled “positive” examples of “about-us” pages, and with “negative” examples of other kinds of web pages not having this characteristic. The algorithm would then infer the features of the positive class (i.e. “about-us” pages) that are necessary to distinguish it automatically from the negative class (i.e. not “about-us” pages). Once trained sufficiently, the algorithm can classify new web pages automatically. This approach can be generalized and applied to multi-class problems where there are multiple positive classes. Following on from the “about-us” example described above, this might involve further classifying web pages into several additional categories, such as pages having the characteristic of being a “contact” page or a “product” page.
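By way of illustration only, the training and classification steps just described can be sketched with a small word-count (naive Bayes) classifier. The choice of naive Bayes, the function names and the four toy training pages are all assumptions made for this sketch; none of them is taken from any particular system, and the sketch assumes that at least one example of each class is supplied.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a real system would do far more.
    return text.lower().split()

def train_naive_bayes(labeled_pages):
    """labeled_pages: list of (text, label) pairs, label True for "about-us"."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, label in labeled_pages:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocabulary

def positive_log_odds(model, text):
    """Log-odds that `text` belongs to the positive class (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    score = math.log(class_counts[True] / class_counts[False])
    pos_total = sum(word_counts[True].values())
    neg_total = sum(word_counts[False].values())
    vocab_size = len(vocabulary)
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # words never seen in training carry no evidence here
        p_pos = (word_counts[True][word] + 1) / (pos_total + vocab_size)
        p_neg = (word_counts[False][word] + 1) / (neg_total + vocab_size)
        score += math.log(p_pos / p_neg)
    return score

def is_about_us(model, text):
    return positive_log_odds(model, text) > 0.0

# Hypothetical human-labeled examples: True = "about-us", False = anything else.
training_pages = [
    ("about us our company history and team", True),
    ("who we are our mission and our founders", True),
    ("buy now product price add to cart", False),
    ("contact phone email address form", False),
]
model = train_naive_bayes(training_pages)
```

With this toy training set, `is_about_us(model, "our company team and mission")` evaluates to true and `is_about_us(model, "product price cart")` to false. A multi-class variant would keep one word-count table per category and pick the highest-scoring class.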
For the Machine Learning algorithm to perform well on new examples, the training data must contain sufficiently representative examples of both the positive and negative class (or classes, for a multi-class problem). This requirement leads to one serious disadvantage of these types of algorithms in the case where the positive class is under-represented in the natural distribution over the data of interest. Turning once again to the “about-us” web page example, pages of this nature comprise only a small fraction of all web pages. Thus one has to label a large quantity of web pages to obtain enough representative examples of the positive class. As the labeling procedure is performed by humans, it can be a labour-intensive and hence expensive process.
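The scale of this labeling burden can be estimated with a short calculation. If pages are drawn at random and a fraction p of them is positive, the expected number of pages that must be labeled to collect k positives is k / p. The function name and the figures in the usage note are hypothetical and chosen only to illustrate the arithmetic.

```python
def expected_labels_needed(positives_wanted, positive_rate):
    """Expected number of randomly drawn examples a human must label
    to obtain `positives_wanted` positive examples, when each draw is
    positive with probability `positive_rate` (mean of a negative
    binomial count of trials: k / p)."""
    if not 0.0 < positive_rate <= 1.0:
        raise ValueError("positive_rate must be in (0, 1]")
    if positives_wanted < 0:
        raise ValueError("positives_wanted must be non-negative")
    return positives_wanted / positive_rate
```

For instance, under the assumed figure that one page in a hundred is an “about-us” page, collecting 100 positive examples would require labeling about 10,000 pages on average.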
One attempt to address this disadvantage is to modify the Machine Learning algorithm to actively select examples for a human to subsequently label, thereby reducing the amount of human labeling required to train the algorithm. These refined Machine Learning algorithms are termed Active Learning algorithms. They all share the feature of attempting to reduce the overall labeling burden by actively selecting the most “informative” examples from the total data set for the human to label, rather than having the human label large numbers of relatively uninformative negative examples. Thus the Active Learning algorithm must, in some sense, characterize the maximally informative unlabeled examples from the total data set, given the labeled examples seen thus far and the class of classifiers available to the learning algorithm.
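One widely used way of characterizing “informative” examples is uncertainty sampling, in which the examples whose predicted probability of being positive lies closest to 0.5 are sent to the human for labeling. The sketch below illustrates that heuristic only; it is not presented as the selection rule of any particular Active Learning algorithm, and the names `select_most_uncertain` and `predict_positive_probability` are assumptions introduced here.

```python
def select_most_uncertain(unlabeled_pool, predict_positive_probability, batch_size):
    """Return the `batch_size` examples the current classifier is least sure about.

    unlabeled_pool: iterable of unlabeled examples.
    predict_positive_probability: callable mapping an example to the current
        classifier's estimated probability (0.0 to 1.0) that it is positive.
    """
    # Uncertainty is measured as distance from 0.5; smaller means less certain.
    ranked = sorted(
        unlabeled_pool,
        key=lambda example: abs(predict_positive_probability(example) - 0.5),
    )
    return ranked[:batch_size]
```

Note that this sketch scores every example in the pool on each round, so its cost grows with the size of the total data set.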
However, Active Learning algorithms of this type do not address the often fundamentally limiting practical problem of how to efficiently search the total data set for the proposed better labeling candidates. Once again referring to the “about-us” web page example, whilst the Active Learning algorithm may be able to generate criteria for the candidate web pages for labeling, these candidates must still be sought from the total data set of all web pages. As most practical problems of any utility involve extremely large data sets, this search can seriously reduce the effectiveness of an Active Learning system.
It is an object of the invention to provide an improved machine learning system that reduces the amount of training data required.