Modern acquisition systems can often collect large amounts of data that can potentially be used to assign subsets of that data to one of a finite set of classes. These classes may or may not be known at the time the data is collected.
However, once the classes of interest are specified, only part of the data may be relevant for differentiating those classes; much of the data may be irrelevant to that specific classification. The irrelevance can result from an absence of any information that could by itself contribute to the classification, or from the presence of information that could contribute to the classification but is redundant with other information already selected for inclusion in the classification operation.
Even individual data values may contain irrelevant information in their representation; for example, high-order positions may remain invariant or nearly invariant across all the classes of interest, while low-order positions may consist largely or entirely of random noise.
In many cases, there are advantages to differentiating between data that is relevant to the classification and data that is irrelevant. These include savings in bandwidth and storage resulting from the ability to retain the most relevant data while discarding, or not transmitting, data that is known to be irrelevant. In other cases the advantage may lie in the ability not to collect, or even not to generate, data that has been determined to be unnecessary for the classification.
One example of where such differentiation can be advantageous is in the interpretation of multispectral and hyperspectral “images”. The normal human eye can combine images from three different spectral bands, “red”, “green”, and “blue”, and interpret the combination as a “color”. Multispectral and hyperspectral images portray the same image scene in four or more bands, sometimes hundreds, and therefore can be very difficult for humans to interpret or even visualize.
Hyperspectral or multispectral images may be created in a variety of ways. The most common method is illuminating the scene with a broad spectrum source and using an imaging device that separates the incoming light so that light from each spectral band forms a separate image. Another method of creating such images is to illuminate a scene successively at different wavelengths using a tunable laser or other variable wavelength source. In this case the imaging device captures multiple images or “bands” in succession, one for each illumination wavelength band.
However hyperspectral images are generated, they contain large amounts of data. The problems with current hyperspectral image analysis resulting from high cost and complexity in dealing with such large volumes of data are well summarized by Jinchang Ren, Timothy Kelman and Stephen Marshall, “Adaptive clustering of spectral components for band selection in hyperspectral imagery”, Proceedings of University of Strathclyde's Second Annual Academic Hyperspectral Imaging Conference, 17-18 May 2011 (hereinafter Ren et al. [2011]). To solve these problems, one practical solution is band selection, which aims to use fewer bands to represent the whole image while maintaining a good performance of analysis, i.e. removal of redundant information.
Ren et al. [2011] categorize the existing techniques for band selection into three main groups. The first group contains subspace-based approaches, which aim to project the original data onto certain subspace and extract the most discriminative components. Examples in this group include principal component analysis (PCA) and its variations, orthogonal subspace projection, and wavelet analysis.
The second group comprises optimization-based feature selection approaches, using techniques such as: Neyman-Pearson detection-theory-based thresholding of signal energy; Jeffries-Matusita distance-based statistical significance tests; constrained energy minimization; mutual-information-based selection, as disclosed by B. Guo, S. R. Gunn, R. I. Damper and J. D. B. Nelson, “Band selection for hyperspectral image classification using mutual information,” IEEE Geoscience and Remote Sensing Letters, 3(4): 522-526, October 2006 (hereinafter Guo et al. [2006]) and B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, “A fast separability-based feature-selection method for high-dimensional remotely sensed image classification,” Pattern Recognition, 41(5): 1653-1662, May 2008 (hereinafter Guo et al. [2008]); minimization of dependent information, as disclosed in J. M. Sotoca, F. Pla, and J. S. Sanchez, “Band selection in multispectral images by minimization of dependent information”, IEEE Trans. Systems, Man, and Cybernetics (Part C), 37(2): 258-267, March 2007 (hereinafter Sotoca et al. [2007]); classification-based inter-class distance, as disclosed by Sebastiano B. Serpico and Gabriele Moser, “Extraction of Spectral Channels From Hyperspectral Images for Classification Purposes”, IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 2, February 2007 (hereinafter Serpico and Moser [2007]); and minimum estimated abundance covariance.
In the third group, clustering of bands is applied for dimensionality reduction, such as top-down hierarchical clustering using Ward's linkage criterion and Kullback-Leibler divergence measurement; minimizing mutual information in separating band ranges, as disclosed in C. Cariou, K. Chehdi, and S. Le Moan, “BandClust: an unsupervised band reduction method for hyperspectral remote sensing,” IEEE Geoscience and Remote Sensing Letters, 8(3): 564-568, May 2011 (hereinafter Cariou et al. [2011]); and correlation-based subset clustering.
For those interested in selecting a small subset of the existing image data to represent the whole image while maintaining good analysis performance, i.e., removal of redundant information, the first group of techniques mentioned by Ren et al. [2011] is not relevant. Except for the method of Sotoca et al. [2007], the second group of methods generally makes use of available information regarding the specific classes that need to be differentiated. It stands to reason that these supervised methods, which make use of class information, are the ones most likely to be successful in differentiating the specified classes.
The third group of techniques mentioned above is generally concerned with unsupervised methods for removal of redundant information, methods that do not make use of available known class information. The latter are therefore less likely to be successful in selecting a data subset optimized for differentiating a prespecified set of classes.
The methods and systems described herein are concerned with an improved method of supervised data selection, a method that makes use of available information regarding the specific classes that need to be differentiated. As such, they belong primarily to the second group of techniques identified above.
The advantages of information theoretic measures, specifically mutual information, for measuring correlation have become widely recognized in recent years. Imaging applications include image registration, as disclosed by M. R. Sabuncu, “Entropy-based Image Registration”, Ph.D. Thesis, Princeton University, November 2006 (hereinafter Sabuncu [2006]); P. Viola and W. Wells, “Alignment by maximization of mutual information”, International Journal of Computer Vision, 24(2): 137-154, 1997 (hereinafter Viola and Wells [1997]); and U.S. Pat. No. 7,639,896 to Z. Sun et al., as well as the second and third groups of feature selection techniques. Ren et al. [2011] describe, in detail, a method belonging to the third group of techniques, unsupervised classification. Their method uses measurement of mutual information between adjacent bands to identify clusters of similar bands so that redundant information can be removed.
Serpico and Moser [2007] emphasize the difference between feature selection and feature extraction. The latter attempts to reduce the number of features by constructing new features that combine the original ones. The methods described by Ren et al. [2011] would generally be considered unsupervised feature selection. Serpico and Moser, by contrast, construct new features to reduce the dimensionality. They demonstrate supervised feature extraction methods for optimizing correlations between the new features, treated as continuous variables, and classes that can also be treated as continuous variables. As such, their methods are not appropriate for differentiating classes based only on qualitative descriptions. They do not make use of mutual information.
Xuexing Zeng, Suvitha Karthick, Tariq S. Durrani, and John Gilchrist, “Exploiting Copulas for Optimising Band Selection for Hyperspectral Image Data Sequences”, Proceedings of University of Strathclyde's Second Annual Academic Hyperspectral Imaging Conference, 17-18 May 2011 (hereinafter Zeng et al. [2011]) cite the need to develop an automated process for selecting bands that carry significant information, and rejecting bands that carry redundant information, for effective hyperspectral image analysis. They mention a joint histogram method of estimating mutual information.
In working with information theoretic methods it is common to organize data and other information as histograms. As used herein, the term “histogram” is intended broadly to include arrays of discrete elements representing a frequency distribution versus class interval or range, not necessarily normalized.
After mentioning the joint histogram method of estimating mutual information, Zeng et al. [2011] dismiss the method because of the difficulty of reliably determining the optimum number of histogram cells (also known as bins). They then describe a method of estimating mutual information among spectral bands treated as continuous variables, in order to identify bands with the strongest correlation with a reference image and minimum correlation with other bands. Whatever the potential advantages of their method for unsupervised band selection, when applied to supervised band selection it suffers from the need to approximate the true joint probability distribution function by an assumed one.
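For reference, the joint histogram estimate that Zeng et al. [2011] dismiss is straightforward to state. The following sketch (in Python with NumPy; the function name and default bin count are illustrative choices, not taken from any of the cited works) shows the basic computation, including the free bin-count parameter whose reliable selection they identify as the difficulty:

```python
import numpy as np

def mutual_information_hist(x, y, bins=16):
    """Estimate mutual information I(X;Y) in bits from a joint histogram.

    The number of bins is a free parameter; as noted above, choosing it
    reliably is the main objection raised against this approach.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint probability estimate
    px = pxy.sum(axis=1, keepdims=True)     # marginal distribution of X
    py = pxy.sum(axis=0, keepdims=True)     # marginal distribution of Y
    nz = pxy > 0                            # skip empty cells: 0*log(0) = 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

With too few bins the estimate blurs real dependence; with too many, most cells stay empty and the estimate is unreliable, which is precisely the bin-selection dilemma discussed above.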
Guo et al. [2008] select and rank a number of bands based on approximate measures of mutual information between pairs of bands that are well correlated with the class but poorly correlated with one another. One mutual information calculation is required for each possible two-band combination. Therefore the number of required mutual information calculations increases very rapidly, roughly in proportion to the square of the number of bands present.
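A greedy relevance-minus-redundancy selection of this general kind can be sketched as follows. This is an illustration of the pairwise criterion and its roughly quadratic cost, not the exact algorithm of Guo et al. [2008]; the weighting factor beta, the averaging of redundancy terms, and the histogram-based estimator are assumptions made for the sketch:

```python
import numpy as np

def mi(a, b, bins=8):
    # Histogram-based mutual information estimate in bits.
    h, _, _ = np.histogram2d(a, b, bins=bins)
    p = h / h.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

def select_bands(bands, labels, k, beta=0.5):
    """Greedily pick k bands with high MI to the class labels and low
    average MI to the bands already selected.

    Each candidate is compared pairwise against every selected band, so
    the number of MI evaluations grows roughly with the square of the
    number of bands, as noted above.
    """
    selected = []
    remaining = list(range(len(bands)))
    relevance = [mi(b, labels) for b in bands]   # MI with the class
    while len(selected) < k and remaining:
        def score(i):
            redundancy = (np.mean([mi(bands[i], bands[j]) for j in selected])
                          if selected else 0.0)
            return relevance[i] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note also that, as discussed below, a criterion of this pairwise form yields a ranking but no measure of the uncertainty remaining after each selection.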
Guo et al. [2008] also emphasize the well-known fact that, for a direct implementation of mutual information when multiple bands are involved, the number of required histogram cells can quickly become unmanageable. Moreover, with a large cell count, the amount of available data may be insufficient to produce a representative population of the cells.
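The growth is exponential: with B histogram bins per band, a direct joint histogram over k bands requires B^k cells. A trivial sketch (with hypothetical bin and band counts) makes the point:

```python
# Joint-histogram cell count for a direct multi-band mutual information
# estimate: B bins per band across k bands gives B**k cells, quickly
# dwarfing the number of pixel samples available to populate them.
def joint_cells(bins_per_band, n_bands):
    return bins_per_band ** n_bands

print(joint_cells(16, 2))   # 256 cells: easily populated
print(joint_cells(16, 10))  # 1,099,511,627,776 cells: hopelessly sparse
```

A one-megapixel image supplies only about a million samples, so even a modest number of jointly histogrammed bands leaves nearly all cells empty.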
An additional disadvantage of the method described by Guo et al. [2008] is that it does not provide a measure of the remaining uncertainty after each new band has been selected into the subset to be retained.
Mutual information has also been applied to a wide variety of classification problems. For example U.S. Pat. No. 7,007,001 to Oliver et al. describes a method of using mutual information between observations and hidden states to minimize classification errors.
Therefore, it is desirable to identify, simply and efficiently, the data relevant for class differentiation, and to identify more precisely the parameters of that relevant data. It is also desirable to reduce the amount of data that needs to be, inter alia, generated, collected, transmitted, stored, and/or processed for future classification. It is further desirable to reduce the cost, time, and effort of selecting identifying input for class differentiation, and to improve the predictive quality of that input.