Data classification systems are useful in many applications. One application is in filtering data, as might be done as part of a search over a corpus of data. While many data structures might be used with a data classification system, a typical example is a corpus containing a very large number of data items organized as units such as records or documents. Although a document is used herein as an example of a data item, it should be understood that the statements might be equally applicable to data items that are not normally referred to as documents.
A data classification system might be used to filter documents from a large corpus, flagging or otherwise identifying relevant documents distinctly from less relevant documents. As an example, a company or an analyst might want to review news items from a large corpus of news items, but only those news items that relate to a particular company or set of companies. They could use a data classification system to flag news items that relate to the companies of interest and provide those relevant documents for further processing, such as manual review.
In the general case, a data classification system classifies documents as being “in” or “not in” a particular class, or classifies documents as being in one or more of two or more classes. In an extremely simple data classification system, a class might be “all documents containing phrase P” and the simple data classification system classifies each document as either being in the class or not being in the class (binary decision). In other simple, but slightly more involved data classification systems, the class might be “all documents mentioning phrase P or its synonyms” or the class might be “all documents apparently relating to topic T”.
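The two simplest classes described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `in_class` and `in_class_with_synonyms` are hypothetical, matching is assumed to be case-insensitive substring matching, and the synonym list is assumed to be supplied by the caller.

```python
def in_class(document: str, phrase: str) -> bool:
    """Binary decision: is the document in the class
    'all documents containing phrase P'? (case-insensitive, assumed)"""
    return phrase.casefold() in document.casefold()


def in_class_with_synonyms(document: str, phrases: list[str]) -> bool:
    """Binary decision for 'all documents mentioning phrase P or its synonyms'.
    The caller supplies P together with its synonyms (assumed input)."""
    return any(in_class(document, p) for p in phrases)


news_item = "Acme Corp reports record quarterly profit"
print(in_class(news_item, "acme corp"))
print(in_class_with_synonyms(news_item, ["earnings", "profit"]))
```

A class such as “all documents apparently relating to topic T” cannot be captured by a fixed rule of this kind, which is where feature generation and training become relevant.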
A conventional data classification system might first convert documents into enumerated features through a process of feature generation. One way that this can be done is to tokenize text into a distinct dictionary of features with associated enumerated values. Advanced techniques may pre-process text with grammatical knowledge to enrich tokens in a way that aids the classification task (e.g., part-of-speech (POS) tagging, negation prefixing, etc.). “Stop” words (“a”, “the”, “but”, “and”, etc.) are often removed to improve efficiency. With each document distilled to a set of enumerated features, the data classification system can then perform feature selection, selecting a subset of features that either enhances, or at least minimizes loss of, the information content of the document. Arguably, feature selection is primarily performed for efficiency reasons, as the cost of many machine learning algorithms grows non-linearly with the number of distinct features.
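Feature generation as described above might be sketched as follows. The stop-word list is the short illustrative one from the text, the tokenizing regular expression is an assumption, and the names `tokenize`, `build_dictionary`, and `to_features` are hypothetical.

```python
import re

# Illustrative stop-word list taken from the examples in the text.
STOP_WORDS = {"a", "the", "but", "and"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_dictionary(documents: list[str]) -> dict[str, int]:
    """Assign each distinct token an enumerated value in order of first appearance."""
    dictionary: dict[str, int] = {}
    for doc in documents:
        for token in tokenize(doc):
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def to_features(text: str, dictionary: dict[str, int]) -> set[int]:
    """Distill a document to its set of enumerated features (unknown tokens are ignored)."""
    return {dictionary[t] for t in tokenize(text) if t in dictionary}


corpus = ["The cat and the dog", "A dog barked"]
dictionary = build_dictionary(corpus)
print(dictionary)
print(to_features("the dog and a bird", dictionary))
```

Feature selection would then keep only a subset of the dictionary entries; that step is omitted here because the text does not specify a selection criterion.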
The selected features can be weighted (which can also be thought of as a “soft” feature selection, where some features are selected strongly and other features are selected weakly) to enhance a machine learning algorithm. An example of feature weighting is the use of Inverse Document Frequency (IDF), wherein a term gets more weight if it appears in few documents of a wider corpus and less weight if it appears in many of them. IDF is commonly combined with the term's frequency within the document, so that terms occurring more often in a document than their general rate in the wider corpus are weighted up, and terms occurring less often are weighted down.
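As a minimal sketch of IDF-based feature weighting, the following assumes the common logarithmic form idf(t) = log(N / df(t)), where N is the number of documents in the wider corpus and df(t) is the number of documents containing term t. The function names are hypothetical, and other IDF variants (for example, smoothed forms) exist.

```python
import math
from collections import Counter


def inverse_document_frequency(documents: list[list[str]]) -> dict[str, float]:
    """Compute idf(t) = log(N / df(t)) over a corpus of tokenized documents."""
    n = len(documents)
    # Count each term once per document in which it appears.
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weight_features(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight each term by (count in this document) * idf; unseen terms get weight 0."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


corpus = [["stock", "rises"], ["stock", "falls"], ["merger", "stock"]]
idf = inverse_document_frequency(corpus)
print(idf)
print(weight_features(["merger", "merger", "stock"], idf))
```

In this toy corpus, "stock" appears in every document and so receives zero weight, while "merger" appears in only one document and receives the largest weight.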
The above processes can be done on documents in a training corpus as well as documents in the corpus that are to be classified. Training might involve providing the data classification system with a corpus and classifications for each document in the training corpus. Thus, for a simple binary classification process, some of the documents in the training corpus are tagged as being examples of members of the class while the others are tagged as being counterexamples.
The data classification system then operates a training process wherein discriminating patterns are preferably discovered in the training corpus between the examples and the counterexamples. Techniques for pattern discrimination have been studied in considerable detail. Examples of machine learning classification techniques include, but are not limited to, Naïve Bayes, Support Vector Machines, Maximum Entropy, and k-nearest neighbor. Others might be found in use or in literature on the topic.
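To illustrate one of the techniques named above, the following is a minimal sketch of a binary Naïve Bayes classifier trained on tagged examples and counterexamples. It assumes a multinomial model with add-one (Laplace) smoothing, assumes the training corpus contains at least one example and one counterexample, and uses hypothetical function names; it is not presented as the method of any particular system.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs; label is True for class members
    and False for counterexamples. Both labels are assumed to be present."""
    doc_counts = Counter()
    term_counts = {True: Counter(), False: Counter()}
    vocabulary = set()
    for tokens, label in examples:
        doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, term_counts, vocabulary


def log_scores(tokens, model):
    """Return the log prior plus smoothed log likelihood for each label."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_terms = sum(term_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for term in tokens:
            if term in vocabulary:  # terms never seen in training are skipped
                smoothed = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
                score += math.log(smoothed)
        scores[label] = score
    return scores


def classify(tokens, model) -> bool:
    """Hard binary decision: is the document a member of the class?"""
    scores = log_scores(tokens, model)
    return scores[True] > scores[False]


training = [
    (["acme", "profit"], True),
    (["acme", "merger"], True),
    (["weather", "rain"], False),
    (["rain", "storm"], False),
]
model = train_naive_bayes(training)
print(classify(["acme", "shares"], model))
print(classify(["storm", "rain"], model))
```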
More complex data classification systems have been developed. For example, instead of simply classifying an input document as being an example of a member of the class or a counterexample (a binary classification), the input document might be classified into one of more than just two possibilities (M-ary classification into M classes). For example, when evaluating news stories, a simple data classification system might just make a binary decision as to whether a particular news story refers to topic T or not, while a more complex data classification system might define each class as relating to a particular topic and would classify the input document into one or more of two or more classes.
Data classification systems might make hard decisions as to how to classify a given input document. Some data classification systems might make soft decisions, wherein an input document is not necessarily classified into a class with absolute certainty, but it is tagged with one or more value(s) indicating the degree(s) to which the document would be associated with each of one or more classes.
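The difference between hard and soft decisions can be sketched as follows. The sketch assumes the classifier has already produced a real-valued score per class, and it uses a softmax to turn those scores into degrees of association; softmax is one common choice and is an assumption here, as are the function names and the example scores.

```python
import math


def soft_decision(scores: dict[str, float]) -> dict[str, float]:
    """Map per-class scores to degrees of association in [0, 1] that sum to 1
    (softmax, computed relative to the largest score for numerical stability)."""
    peak = max(scores.values())
    exps = {cls: math.exp(score - peak) for cls, score in scores.items()}
    total = sum(exps.values())
    return {cls: value / total for cls, value in exps.items()}


def hard_decision(scores: dict[str, float]) -> str:
    """Pick the single class with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for one news story across three topic classes.
scores = {"finance": 2.0, "sports": 0.0, "weather": 0.0}
print(hard_decision(scores))
print(soft_decision(scores))
```

A hard decision discards the information that the story is also weakly associated with the other topics, whereas the soft decision retains it for downstream processing.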
One problem with existing data classification systems is that real-world examples might be more involved, in that the appropriate classification of an item might differ depending on other considerations. Hence, there is a considerable need in the art for a more sophisticated classification system capable of classifying items based on multiple inputs into multidimensional categories.