A variety of machine learning systems and applications are well known in the art. In many machine learning systems, the ability of the system to understand how to process (e.g., classify, cluster, etc.) newly presented data is determined according to a model (referred to herein as a learned model) that is itself developed based on input data previously provided to the machine. Stated another way, such models attempt to discover and mimic the patterns found in the input data (sometimes referred to herein simply as “input”) such that outcomes may be properly predicted for subsequently processed inputs. A variety of techniques are known in the art for developing such models, which are often tailored to the specific nature of the input data being processed and the desired outcomes to be achieved. For example, a significant body of literature has been developed concerning machine learning systems for developing understanding of real-world domains based on the analysis of text. As used herein, a domain may be thought of as a particular subject/topic/thing of interest.
Generally, the performance of learned models improves as the relative breadth of the domain is restricted. That is, the accuracy of the learned model is likely to be better if the breadth of the input data is relatively narrow. For example, the semantic content of text relating to the domain of “digital cameras” is likely to include fewer patterns to be discovered, and therefore more likely to be accurate, than the broader domain of “image capture devices.” Conversely, while more restricted domains may present the opportunity for more accurate learned models, the narrow scope of a given domain may result in a situation where there is a relative lack of information from which the system can develop a learned model in the first instance.
Additionally, some machine learning systems relying on so-called supervised learning wherein the system is provided with a quantity of training data from which the learned model is at least initially developed. Such training data typically comprises input data (e.g., natural language text samples) where the desired outcome is known and provided to the machine learning system. For example, in the case of a learning system implementing a classification-based spam filter, the system may be provided with examples of text for which the determination of “spam” or “not spam” has already been made. Based on this training data, the learning system can develop a learned model reflecting those characteristics of the text that best predict when something will be classified or labeled as “spam” or “not spam” such that subsequent input text may be classified according to the learned model in order to predict the outcome, i.e., whether or not the new input should be classified as spam. While such systems have proven successful, the relative cost of obtaining accurate and useful training data can be quite expensive, particularly, for example, where human subject matter experts are required to develop the training data.
Techniques that permit the accurate and cost-effective development of learned models for use in machine learning systems would represent a welcome addition to the art.