Many applications involve some type of sorting or filtering of data such as documents, files, email messages, and feedback. Most often the filtering or sorting is based on text included therein, and in particular, natural language text. Text processing relates to assigning labels to text or parts of text and can be performed for various purposes such as sentiment detection, classification, etc. Unfortunately, many applications in text processing require significant human effort for either labeling large document collections (e.g., when learning statistical models) or extrapolating rules from them (e.g., when using knowledge engineering). Regardless of the scenario, classification problems frequently arise in natural language processing, since a word or group of words extracted from a document, file, or message, for example, can have different meanings or connotations depending on its context.
There are currently two traditional approaches to solving such classification problems. The first involves machine learning, where a large set of documents is labeled by hand and then sent through a statistical machine learning engine. This engine can abstract from the labeled documents what the presumed correct labeling procedure is. When setting up a classification system in this manner, the main overhead lies in the manual examination of large text corpora required to assign the labels. In contrast, many vendors now ship machine learning engines that can be customized to many tasks, so that typically little overhead is required for the instantiation of the machine learning engine itself.
The second approach sets up a rule system instead of using an off-the-shelf machine learning technique. An example of a rule is as follows: every time this pattern is seen, apply this label. Such rules have to be formulated by hand, presumably by a human. A rule-based engine of this kind is not very sophisticated and, moreover, there is again a significant human cost, since a human is tasked with looking at all the documents, abstracting the rules from them, and then testing and verifying those rules. For each specific task or domain, such as a spam filter (e.g., legitimate vs. spam) or a specific email filter (e.g., route messages from Jerry to a folder), a different engine or rule system must be built, which adds up to a substantial amount of human effort and cost. That is, machine learning models or rule engines can be re-used, but every task requires new training data, and updating a current engine requires either new training data or new manual construction of the relevant rule set. Thus, the ratio of human effort to accuracy is undesirably high.
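By way of illustration only, a rule system of the kind described above may be sketched as a list of hand-written pattern and label pairs that are applied in order. The sketch below is a minimal example under stated assumptions and is not taken from any particular engine: the names `RULES` and `label_text`, the specific patterns, and the label strings are all hypothetical.

```python
import re

# Hypothetical hand-written rules, each a (pattern, label) pair that encodes
# "every time this pattern is seen, apply this label".
RULES = [
    (re.compile(r"\bfree money\b", re.IGNORECASE), "spam"),
    (re.compile(r"^from:\s*jerry\b", re.IGNORECASE | re.MULTILINE), "jerry-folder"),
]


def label_text(text, rules=RULES, default="unlabeled"):
    """Return the label of the first rule whose pattern matches, else `default`."""
    for pattern, label in rules:
        if pattern.search(text):
            return label
    return default


if __name__ == "__main__":
    for message in (
        "Claim your FREE MONEY now",
        "From: Jerry\nLunch tomorrow?",
        "Quarterly report attached",
    ):
        print(repr(message), "->", label_text(message))
```

In such a sketch, each new task or domain requires a person to write, test, and verify a new list of rules, which is where the human cost noted above arises.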