Classifiers are statistical models, typically implemented as computer programs executed on computer systems, used to classify real world events based on a set of features of a real world event. A real world event is an instance of any entity or event in the real world. An instance of a person and an instance of a hockey game are both real world events. Real world events can also be works of imagination, such as a book of fiction, a fake news story, an abstract painting, or a computer-generated digital image. Each of these events is still an instance of its respective type.
An event has various features, which can be attributes or elements of the event. An attribute is a numerical or qualitative aspect of the event; for example, a digital image can have attributes such as a color histogram, an average luminance, a texture parameter, or the like. An element refers to a sub-part of the event that can take a value; for example, an inning of a baseball game is a distinct sub-part of the baseball game that takes a value of a score. The segmentation of video data provides another example of elements of events. Video data is often segmented into a sequential set of shots or frames, each of which is a distinct sub-part or element of a video that can take values representing audio, visual, and temporal aspects of the shot.
In computational classification, statistical models are generated which reflect the probability that an event belongs to a class based on its set of features. For example, a real world event such as an instance of a flower can be classified as a daisy based on features of the flower such as petal length, number of petals, leaf shape and stem length. To generate these statistical models, classifiers are trained on a set of real world event data with known classes, herein referred to as a training set. A corpus of all real world event data with known classes can be used as a training set to exhaustively train the classifier. In practice, the training set is a selected subset of the corpus of available real world event data.
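As a rough illustration of a statistical model that reflects class probabilities, the following minimal sketch trains a Gaussian naive Bayes classifier on a tiny training set. The technique, the feature values, the class names, and the function names `train` and `class_probabilities` are all assumptions chosen for illustration; the text does not prescribe a particular model.

```python
import math
from collections import defaultdict

# Hypothetical training set: (petal_length_cm, petal_count) with a known class.
TRAINING_SET = [
    ((1.5, 34), "daisy"),
    ((1.7, 30), "daisy"),
    ((1.6, 38), "daisy"),
    ((4.5, 6), "tulip"),
    ((5.0, 6), "tulip"),
    ((4.8, 5), "tulip"),
]

def train(training_set):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for features, label in training_set:
        by_class[label].append(features)
    model = {}
    total = len(training_set)
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-6))  # small floor avoids zero variance
        model[label] = (len(rows) / total, stats)
    return model

def class_probabilities(model, features):
    """Return normalized P(class | features), treating features as independent."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

model = train(TRAINING_SET)
probabilities = class_probabilities(model, (1.6, 33))
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

With only six training examples the estimated probabilities are illustrative at best; a practical training set would be far larger.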
A large problem in training classifiers is the assumption that the training set is representative of real world data. That is, the training set as a subset of the real world data is representative of the set of all real world events, in terms of having substantially the same types and distribution of features and attributes. If the real world event data is sampled from all possible real world events correctly, the real world event data is assumed to be independent and identically distributed. This condition is called IID.
A classic example used to illustrate the concept of independent identically distributed data is rolling a fair die. For each roll of the die, each outcome is independent of other outcomes (e.g., the probability of rolling a six is the same each time the die is rolled, regardless of earlier rolls); therefore the rolls are independent. The real world events of rolling the die are also identically distributed; that is, every roll follows the same distribution, in which each outcome (i.e., number) has the same probability of being rolled.
Rolling a die can also be used to illustrate sampling error due to chance. If the goal is to sample the set of all possible outcomes (roll the die) to approximate an identical distribution of the data (an equal number of rolls for each possible outcome), many die rolls would be necessary to obtain approximately equal numbers of rolls for each possible outcome.
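The effect of sample size on sampling error can be sketched with a short simulation. The roll counts, the seed, and the helper name `max_deviation` are arbitrary choices made for this illustration.

```python
import random
from collections import Counter

def max_deviation(num_rolls, rng):
    """Roll a fair six-sided die num_rolls times and return the largest
    gap between an observed face frequency and the ideal 1/6."""
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return max(abs(counts[face] / num_rolls - 1 / 6) for face in range(1, 7))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
deviations = {n: max_deviation(n, rng) for n in (60, 6000, 60000)}
for n, dev in deviations.items():
    print(f"{n:>6} rolls: largest deviation from 1/6 = {dev:.4f}")
```

The deviation tends to shrink as the number of rolls grows, although any single run can fluctuate by chance.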
In the case of real world events associated with large sets of features, the problem of selecting training set data to approximate independent and identically distributed data is complicated by many factors aside from chance. These factors include over-representation, which causes non-identical distribution of data. Using the flower example, a specific variety of daisy may be over-represented in the training data, leading to poor classifier performance. Data may also be skewed due to dependencies between the real world events, such as the duplication of data. For example, when training a medical image classifier to identify a specific type of tumor based on cell morphology, multiple pictures of the same tumor may be included in a corpus of images.
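One way such skew might be inspected is sketched below for the tumor image example. The record layout, the identifiers, and the helpers `source_shares` and `one_per_source` are hypothetical, and keeping one image per tumor is only one possible policy, not a method described here.

```python
from collections import Counter

# Hypothetical corpus records: (image_hash, tumor_id, label).
corpus = [
    ("a1", "tumor-01", "malignant"),
    ("b2", "tumor-01", "malignant"),
    ("c3", "tumor-01", "malignant"),
    ("d4", "tumor-02", "benign"),
    ("e5", "tumor-03", "malignant"),
]

def source_shares(records):
    """Fraction of records contributed by each underlying tumor."""
    counts = Counter(tumor_id for _, tumor_id, _ in records)
    return {tumor_id: n / len(records) for tumor_id, n in counts.items()}

def one_per_source(records):
    """Keep the first record seen for each tumor, dropping dependent repeats."""
    seen = set()
    kept = []
    for record in records:
        tumor_id = record[1]
        if tumor_id not in seen:
            seen.add(tumor_id)
            kept.append(record)
    return kept

shares = source_shares(corpus)
deduplicated = one_per_source(corpus)
print(shares)
print(deduplicated)
```

In this toy corpus a single tumor supplies most of the images, so the images are not independent samples of tumors in general.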
The assumption of an independent and identical distribution of the set of features associated with the training set of real world events creates a similar bias in training classifiers. Often features are heavily correlated, leading to redundancy in the feature set. For example, if the feature “diagnosis of Alzheimer's disease” is heavily correlated with the feature “age”, including both features to train a classifier for classifying medical records can be redundant and can result in a biased classifier.
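Correlation between two features can be measured with, for example, the Pearson correlation coefficient. The sketch below uses made-up medical record values, and the 0.8 cutoff for calling a feature pair redundant is an assumption for illustration, not a value taken from the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical medical records: patient age and Alzheimer's diagnosis (1 = yes).
ages = [52, 61, 67, 74, 79, 85, 88, 91]
alzheimers = [0, 0, 0, 0, 1, 1, 1, 1]

REDUNDANCY_THRESHOLD = 0.8  # assumed cutoff, chosen only for this example

r = pearson(ages, alzheimers)
redundant = abs(r) >= REDUNDANCY_THRESHOLD
print(f"correlation = {r:.3f}, treat as redundant: {redundant}")
```

A pair flagged this way is a candidate for dropping one of the two features before training.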
The removal of redundant features enhances the capability for generalization of the classification model. If all features in the model are assumed to be independent, then all features are typically assigned an equal weight. Consequently, the inclusion of heavily correlated features leads to over-fitting of the model to the data. Additionally, elimination of redundant features can be necessary for feature sets which are too large to process efficiently.
A common method of attempting to compensate for error due to training data that is not independently and identically distributed (non-IID data) is to evaluate the accuracy of a classifier by inputting several random subsets of training data and evaluating the classifier's classification output for these subsets with respect to real world event data with known classes. This technique is called cross-validation. Cross-validation does not compensate or correct for non-IID data, because random sampling or partitioning of skewed data simply results in a subset of skewed data. Therefore, these methods of cross-validation only serve to evaluate the performance distribution of a classification model based on best and worst sets of randomly sampled data.
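The point that random partitioning preserves skew can be sketched as follows. The 80/20 split between two varieties, the fold count, the seed, and the helper `k_fold_partitions` are assumptions made for this illustration.

```python
import random

def k_fold_partitions(items, k, rng):
    """Shuffle the items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Hypothetical skewed training set: 80% of examples come from one variety.
training_set = ["variety_a"] * 800 + ["variety_b"] * 200

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
folds = k_fold_partitions(training_set, k=5, rng=rng)
fold_shares = [fold.count("variety_a") / len(fold) for fold in folds]

for index, share in enumerate(fold_shares):
    print(f"fold {index}: share of variety_a = {share:.2f}")
```

Each fold ends up with roughly the same over-representation as the full training set, so evaluating on the folds cannot reveal the skew relative to the real world distribution.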