The present invention generally pertains to machine learning classifiers. More specifically, the present invention pertains to methods and apparatus for analyzing classifiers and for facilitating clean-up of anomalies in classifier training data.
Machine learning classifiers are increasingly used in commercial software systems. One example is the search domain, in which users enter short natural language strings (queries) and expect the system to predict their intent. The search domain is highly competitive. Users are primarily drawn in by the relevance (accuracy) of the results. Advertising revenue is in turn related to the number of users, and can therefore be considered indirectly related to relevance. It is thus highly desirable to have the machine learning classifier perform as well as possible.
Machine learning classifiers typically require training data to learn. The ability to learn from data is the main benefit of machine learning, but it also makes this technology prone to data errors introduced either maliciously or by accident. The accuracy of a machine learning classifier is inextricably dependent upon the quality of the data used to train it. An example of malicious errors, in the case of classifiers trained using user feedback, is purposeful action by users to corrupt the data (e.g., by fraudulent clicks or “Google bombing”). Examples of accidental errors are human mistakes in the data labeling process.
Manual data labeling for use in training a machine learning classifier is expensive. To reduce the labor and corresponding costs, candidate mappings can sometimes be generated with unsupervised methods. However, human correction of these automatically generated mappings is also expensive in the absence of good data cleanup tools. Due to these high costs, it is common for commercial systems to minimize or forgo human review of their training data. The result is that many systems use only standard accuracy tests or metrics as shipping criteria, without more in-depth analysis of the data and accuracy.
Such an approach leads to the shipping of systems that are less accurate than they could be. Moreover, systems that obtain data from external sources are left exposed to malicious data attacks. The impact of such attacks is not visible when only standard accuracy metrics are used as shipping criteria.
The present invention provides solutions to one or more of the above-described problems and/or provides other advantages over the prior art.