Classification is one of the most important operators used for phenomenal (or similarity) searches in various image, video, and data mining applications. In a phenomenal search, a target pattern is usually classified according to a set of predefined classes. The target pattern can include, for instance, the spectral signature of a pixel from an image or video frame; the spatial signature of a block of an image or video frame defined by its texture features; the frequency signature of a time series such as stock index movement; or the spatial signature of 3D seismic data.
To achieve high classification accuracy, it is usually necessary to train a classifier with sufficient training data from each individual class. However, gathering reliable training data is usually difficult, if it is feasible at all. As an example, the current United States land cover/land use maps were developed around the late 1960s by the United States Geological Survey (USGS). These maps are not completely accurate due to errors in the photointerpretation of the images used to create them, their limited resolution, and inaccuracies in geolocation. Additional errors arise when these maps are used as a source of ground truth in conjunction with more recent images to train the classifier, because of various natural and artificial land cover transformations. As a result, the accuracy of the classifier suffers.
Similarly, the classification of video, time series, and 3D seismic data can also suffer from unreliable training data.
One way of generating more reliable training data involves clustering the data using an unsupervised classifier or a vector quantization method. A human expert then labels the clusters manually. This methodology is appropriate, however, only for generating a small set of training data, since it requires human intervention. Furthermore, it makes no automatic use of preexisting classified data, imperfect though those data may be.
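The cluster-then-label workflow described above can be illustrated with a minimal sketch. It assumes plain k-means (Lloyd's algorithm) as the vector quantization step; the sample points, the class names "water" and "forest", and the expert_label function that stands in for the human expert are all hypothetical and chosen only for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

def expert_label(centroid):
    """Hypothetical stand-in for the human expert who inspects a cluster."""
    return "water" if centroid[0] < 5.0 else "forest"

# Hypothetical two-band spectral signatures forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (10.0, 10.1), (9.8, 10.2), (10.1, 9.9)]

centroids, assignments = kmeans(points, k=2)

# One manual decision per cluster, propagated to every member of the cluster.
cluster_names = {c: expert_label(centroids[c]) for c in range(len(centroids))}
training_set = [(pt, cluster_names[a]) for pt, a in zip(points, assignments)]
labels = [label for _, label in training_set]
```

The expert makes one decision per cluster rather than one per sample, but every cluster still needs a manual decision, which is why the approach does not scale beyond small training sets.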
Other techniques for generating training data involve discarding outliers. These approaches invariably address those samples that appear to be statistical anomalies. However, they cannot handle situations in which the training set is either mislabeled or has changed.
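As a rough illustration of outlier discarding and of the limitation just noted, the sketch below drops samples that lie more than a chosen number of standard deviations from their class mean. The z-score rule, the threshold of 2.0, and the one-dimensional sample values are assumptions made for this example, not a method taken from any particular prior technique.

```python
from statistics import mean, pstdev

def discard_outliers(samples, threshold=2.0):
    """Keep (value, label) pairs within `threshold` std devs of their class mean."""
    by_class = {}
    for value, label in samples:
        by_class.setdefault(label, []).append(value)
    stats = {label: (mean(vals), pstdev(vals)) for label, vals in by_class.items()}
    kept = []
    for value, label in samples:
        mu, sigma = stats[label]
        # A class with zero spread has no outliers by this rule.
        if sigma == 0 or abs(value - mu) <= threshold * sigma:
            kept.append((value, label))
    return kept

# Hypothetical data. "water" contains one isolated anomaly (9.0).
water = [(v, "water") for v in (1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 9.0)]
# "forest" contains a coherent group near 8.0 that, by assumption, no longer
# matches its label (for example, land cover that changed after labeling).
forest = [(v, "forest") for v in (5.0, 5.1, 4.9, 8.0, 8.1, 7.9)]

kept = discard_outliers(water + forest)
kept_forest = [s for s in kept if s[1] == "forest"]
```

In this example the isolated anomaly is removed, but the group of stale "forest" labels is retained in full: the stale samples are numerous enough to shift the class mean and inflate its spread, so none of them looks anomalous.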
Based on the foregoing, a need exists for a training set that is reliable and fully usable. Additionally, a need exists for a technique that allows the modification of an unreliable training set.