1. Field of Technology
The disclosure relates generally to machine learning and classification systems.
2. Glossary
The following definitions are provided merely to help readers generally to understand commonly used terms in machine learning, statistics, and data mining. The definitions are not designed to be completely general but instead are aimed at the most common case. No limitation on the scope of the invention (see claims section, infra) is intended, nor should any be implied.
“Data set” shall mean a schema and a set of “records” matching the schema. A “labeled data set” (or “training data set”) has each record explicitly assigned to a class. A single “record” is also sometimes referred to as a “data item,” an “example,” a “document” or a “case.” A “label” is recorded knowledge about which class or data source the record belongs to.
A “feature” is a measurable attribute of a data record. The “feature value” is the specific value of a feature for a given record. For example, the feature representing “whether the word ‘free’ occurs within a text record” may have the value 0 or 1. A “feature vector” or “tuple” of a given record is a list of feature values corresponding to a selected list of features describing a given “record.” The feature vectors of a whole database often are represented as a matrix, with one row per record and one column per feature. “Feature selection” is a process that involves determining which of the feature columns to retain and which to discard.
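By way of illustration only, the terms defined above can be sketched in a few lines of Python. The record texts, the word list `feature_words`, the helper `feature_vector`, and the retained column indices `keep` are all hypothetical and chosen for the example; they are not taken from any embodiment.

```python
from typing import List

# Hypothetical toy records; the texts are illustrative only.
records = [
    "free offer click now",
    "meeting agenda for monday",
    "claim your free prize",
]

# Selected list of features: each one is "whether the word occurs in the record".
feature_words = ["free", "meeting", "prize"]


def feature_vector(text: str, words: List[str]) -> List[int]:
    """Return the list of 0/1 feature values for one record."""
    tokens = set(text.lower().split())
    return [1 if w in tokens else 0 for w in words]


# Feature vectors of the whole data set: one row per record, one column per feature.
feature_matrix = [feature_vector(r, feature_words) for r in records]

# Feature selection, shown here as simply retaining a chosen subset of columns.
keep = [0, 2]  # assumed choice: retain "free" and "prize", discard "meeting"
reduced_matrix = [[row[j] for j in keep] for row in feature_matrix]
```

In this sketch each row of `feature_matrix` is the feature vector of one record, and `reduced_matrix` is the same matrix after the “meeting” column has been discarded.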
“Knowledge discovery” shall mean the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
“Machine learning” (a sub-field of artificial intelligence) is the field of scientific study that concentrates on “induction algorithms” and other algorithms that can be said to learn; generally, it shall mean the application of “induction algorithms,” which is one step in the “knowledge discovery” process.
“Model” shall mean a structure and corresponding interpretation that summarizes or partially summarizes a “data set” for description or prediction.
3. General Background
The volume of machine-readable data that currently is available, for example, on the Internet, is growing at a rapid rate. In order to realize the potentially huge benefits of computer access to this data, the data may be classified into categories (or classes). Traditionally, such data has been classified manually by humans. As the amount of data has increased, however, manual data interpretation has become increasingly impractical. Recently, machine learning has been implemented to classify data automatically into one or more potential classes.
Machine learning encompasses a vast array of tasks and goals. Document categorization, news filtering, document routing, personalization, and the like, constitute an area of endeavor where machine learning may greatly improve computer usage. As one example, when merging with a new company, managers may wish to similarly organize each company's database. Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing and personalization.
“Induction algorithms” (hereinafter “inducers”) are algorithms that take as input specific feature vectors (hereinafter “feature vectors”) labeled with their class assignments (hereinafter “labels”) and produce a model that generalizes data beyond the training data set. Most inducers build a “model” from a training data set (hereinafter “training data”); such models can then be used as classifiers, regressors, patterns for human consumption, and input to subsequent stages of “knowledge discovery” and “data mining.”
A “classifier” provides a function that maps (or classifies) data into one of several predefined potential classes. In particular, a classifier predicts one attribute of a set of data given one or more attributes. The attribute being predicted is called the label, and the attributes used for prediction are called descriptive attributes (hereinafter “feature vectors”). After a classifier has been built, its structure may be used to classify unlabeled records as belonging to one or more of the potential classes. Many different classifiers have been proposed.
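As a minimal sketch of the relationship between an inducer and a classifier, the following Python example takes labeled feature vectors as input and returns a function that maps an unlabeled feature vector to one of the predefined classes. The choice of a Bernoulli naive Bayes model with add-one smoothing is an assumption made for the example, as are the function name `induce`, the class names, and the training data; many other inducers could be substituted.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

LabeledRecord = Tuple[Sequence[int], str]


def induce(training_data: Sequence[LabeledRecord]) -> Callable[[Sequence[int]], str]:
    """Hypothetical inducer: fit a Bernoulli naive Bayes model and return a classifier."""
    class_counts = Counter(label for _, label in training_data)
    n_features = len(training_data[0][0])
    total = len(training_data)

    # ones[c][j] = number of class-c records whose feature j has the value 1
    ones = {c: [0] * n_features for c in class_counts}
    for vec, label in training_data:
        for j, v in enumerate(vec):
            ones[label][j] += v

    def classify(vec: Sequence[int]) -> str:
        best_label, best_score = "", -math.inf
        for c, n_c in class_counts.items():
            score = math.log(n_c / total)  # log prior of class c
            for j, v in enumerate(vec):
                p1 = (ones[c][j] + 1) / (n_c + 2)  # add-one smoothed P(feature j = 1 | c)
                score += math.log(p1 if v else 1.0 - p1)
            if score > best_score:
                best_label, best_score = c, score
        return best_label

    return classify


# Hypothetical labeled training data: (feature vector, label).
training_data: List[LabeledRecord] = [
    ([1, 0, 0], "spam"),
    ([1, 0, 1], "spam"),
    ([0, 1, 0], "ham"),
    ([0, 1, 0], "ham"),
]

classifier = induce(training_data)
prediction = classifier([1, 0, 1])  # classify an unlabeled record
```

Here `induce` plays the role of the inducer, the counts it accumulates constitute the model, and the returned `classify` function is the classifier applied to unlabeled records.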
The potential is great for machine learning to categorize, route, filter and search for relevant text information. Feature selection is a pre-processing step wherein a subset of features or attributes is selected for use by the induction step. Well-chosen features may substantially improve classification accuracy or, equivalently, reduce the amount and quality of training data needed to obtain a desired level of performance, and may conserve computation, storage and network resources needed for future use of the classifier.
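One common way to carry out such a pre-processing step is to score each feature against the labels and retain the highest-scoring features. The sketch below uses information gain as the score; that metric, the function names, the value of `k`, and the toy data are assumptions chosen for illustration and are not asserted to be the method of the present disclosure.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a non-empty list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column: Sequence[int], labels: Sequence[str]) -> float:
    """Reduction in label entropy obtained by splitting on one feature column."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def select_features(matrix: Sequence[Sequence[int]], labels: Sequence[str],
                    k: int) -> Tuple[List[int], List[float]]:
    """Return the indices of the k highest-scoring columns and all column scores."""
    n_features = len(matrix[0])
    scores = [information_gain([row[j] for row in matrix], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical labeled feature matrix: one row per record, one column per feature.
matrix = [[1, 1, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0]]
labels = ["spam", "spam", "ham", "ham"]

selected, scores = select_features(matrix, labels, k=2)
```

In this toy data the first column is unrelated to the label and is discarded, while the second and third columns are retained for use by the induction step.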
When machine learning is used to build a classifier based on a provided training dataset, but then is used to make predictions on a target dataset that differs somewhat in nature from the training dataset, the classifier produced by machine learning may be poorly suited to the target task. The present invention addresses this problem.
In the following description, like reference numbers are used to identify like elements. Furthermore, the drawings are intended to illustrate major features of exemplary embodiments in a diagrammatic manner. The drawings are not intended to depict every feature of every implementation nor relative dimensions of the depicted elements, and are not drawn to scale.