The present invention is a computer-assisted/implemented tool that allows a person who is not a machine learning expert to build text classifiers. The present invention is also directed to the task of building Internet message relevancy filters.
The full end-to-end process of building a new text classifier is traditionally an expensive and time-consuming undertaking. One prior approach was to divide the end-to-end process into a series of steps managed by people with different levels of expertise. Typically, the process goes as follows: (1) a domain expert/programmer/machine-learning expert (DEPMLE) collects unlabeled communications (such as, for example, text messages posted on an Internet message board); (2) the DEPMLE writes a document describing the labeling criteria; (3) hourly workers with minimal computer expertise label a set of communications; (4) a data quality manager reviews the labeling to ensure consistency; and (5) the DEPMLE takes the labeled communications and custom-builds a text classifier and gives reasonable bounds on its accuracy and performance. This process typically takes several weeks to perform.
Traditional text mining software simplifies the process by removing the need for a machine learning expert. The software allows a tool expert to provide labeled training communications to a black box that produces a text classifier with known bounds on its accuracy and performance. This approach does not cover the complete end-to-end process because it skips entirely over the cumbersome step of collecting the communications and labeling them in a consistent fashion.
The traditional approach for labeling data for training a text classifier presents sets of randomly selected, unlabeled training communications to the user for labeling. Some of the user-labeled communications (the “training set”) are then used to “train” the text classifier through machine learning processes. The rest of the user-labeled communications (the “test set”) are then automatically labeled by the text classifier, and the resulting labels are compared to the user-provided labels to determine known bounds on the classifier's accuracy and performance. This approach suffers in two ways. First, it is inefficient, because better results can be achieved by labeling smaller but cleverly selected training and test sets. For example, if a classifier is already very sure of the label of a specific unlabeled training example, it is often a waste of time to have a human label it. The traditional approach to solving this problem is called Active Learning, in which an algorithm selects which examples are labeled by a person. Second, human labeling is inaccurate. Even the most careful labelers make an astonishingly high number of errors, and these errors are usually quite damaging to the training of a classifier. For example, when building message relevancy filters, a very significant fraction of time may be spent relabeling the messages selected by a prior art Active Learning tool.
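By way of illustration only, the traditional training-set/test-set evaluation and the Active Learning selection step described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method of any particular prior art tool: the function names (`split_labeled`, `accuracy_with_bounds`, `select_most_uncertain`) are hypothetical, the accuracy bounds use a simple normal approximation, and the selection rule shown is uncertainty sampling, which is only one common Active Learning strategy. The classifier itself is assumed to exist elsewhere and to output, for each unlabeled message, a probability that the message is relevant.

```python
import math
import random


def split_labeled(labeled, test_fraction=0.25, seed=0):
    """Shuffle user-labeled items and split them into (training set, test set)."""
    items = list(labeled)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def accuracy_with_bounds(predicted, actual, z=1.96):
    """Test-set accuracy with an approximate interval (normal approximation).

    z=1.96 corresponds to roughly a 95% interval; this is an assumption,
    not a bound prescribed by the text above.
    """
    n = len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    acc = correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


def select_most_uncertain(probabilities, k):
    """Uncertainty sampling: pick the k items the classifier is least sure of.

    `probabilities[i]` is the (assumed) predicted probability that item i is
    relevant; values nearest 0.5 indicate the least confident predictions.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical classifier outputs for five unlabeled messages.
    probs = [0.99, 0.52, 0.10, 0.45, 0.80]
    print(select_most_uncertain(probs, 2))
```

In this sketch, messages scored 0.99 or 0.10 are skipped because the classifier is already confident about them, while the messages scored 0.52 and 0.45 are presented to the human labeler first. Note that nothing in this selection step addresses the second problem, labeling errors made by the human.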