Conventional near/nearest neighbor matching methods are useful for solving classification problems in compression schemes using vector quantization or similar approaches. Near/nearest neighbor methods match a data vector to an exemplar vector from a set of exemplar vectors.
In a hyperspectral image data set, each spatial pixel is associated with a high-dimensional vector, or spectrum. The dimensionality of the vector is equal to the number of spectral bands, and can be in the hundreds or more. Compression is achieved by choosing a subset of the data, known as exemplars, to represent the full data set. Each of the observed spectra is replaced by a reference to a member of the exemplar set that “matches” it, in a sense to be made precise. The compressed data set then need only include the exemplar vectors, which are a small subset of the total data set, plus a codebook, i.e., an array that contains the index of the matching exemplar for each spatial location. In addition, the magnitude of the original vector may also be recorded. In general, it may be desirable to record other salient features of the vector if doing so improves reconstruction of the data.
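The exemplar-plus-codebook representation described above can be illustrated with a minimal Python sketch. The function names `compress` and `reconstruct`, the toy three-band spectra, and the use of squared Euclidean distance as the matching criterion are all assumptions made for illustration; the text leaves the precise sense of a "match" to be defined later.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (an assumed stand-in for the match criterion)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def compress(spectra, exemplars):
    """Build the codebook: one exemplar index per observed spectrum."""
    codebook = []
    for s in spectra:
        best = min(range(len(exemplars)), key=lambda i: sq_dist(s, exemplars[i]))
        codebook.append(best)
    return codebook


def reconstruct(codebook, exemplars):
    """Approximate the original data by looking up each referenced exemplar."""
    return [exemplars[i] for i in codebook]


# Hypothetical toy data: two exemplars, three observed 3-band spectra.
exemplars = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
spectra = [[0.9, 0.1, 0.0], [0.1, 0.9, 1.1], [1.1, 0.0, 0.1]]

codebook = compress(spectra, exemplars)
approx = reconstruct(codebook, exemplars)
print(codebook)
```

Only `exemplars` and `codebook` would need to be stored; the per-pixel spectra are replaced by small integer indices, which is where the compression comes from in this sketch.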
There are a number of methods for finding a near or even nearest neighbor of a data vector from a set of exemplar vectors. When both the vector dimensionality and the number of data points to be searched are small, an exhaustive search through the set of exemplar vectors is a practical approach. However, exhaustive searches quickly become prohibitive as either the number of data vectors or the dimensionality of the data increases, and alternative methods must be used. For vectors of two or three dimensions, Voronoi diagrams are useful in finding near/nearest neighbors. For vectors of moderate dimensionality, e.g., around 20 dimensions or fewer, a k-d tree data structure provides sufficient efficiency.
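As a point of reference, an exhaustive nearest neighbor search can be sketched as follows. This is an illustrative Python version, not any particular system's implementation; the name `nearest_exhaustive` and the choice of squared Euclidean distance are assumptions. Every query touches every exemplar and every dimension, so the work grows with the product of the two, which is why the approach becomes prohibitive as either increases.

```python
def nearest_exhaustive(query, exemplars):
    """Scan every exemplar and return (index, squared distance) of the closest one."""
    best_index, best_dist = -1, float("inf")
    for i, e in enumerate(exemplars):
        # One full pass over all dimensions for each exemplar.
        d = sum((a - b) ** 2 for a, b in zip(query, e))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical two-dimensional data for illustration.
exemplars = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
index, dist = nearest_exhaustive([0.9, 1.2], exemplars)
print(index, dist)
```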
These conventional methods are inadequate for many applications involving near/nearest neighbor matching of a large number of vectors in higher dimensional space, e.g., data sets containing tens or hundreds of thousands of vectors with up to several hundred dimensions. This is particularly true in a real-time system, where it may be necessary to repeat the search many times as new spectra are observed. A typical hyperspectral imager produces 50,000 spectra/s with 128 bands measured. As technology improves, both the number of spectra/s and the number of bands measured will increase. Therefore, alternative search/matching methods have been developed for high-dimensional data.
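To give a rough sense of scale, the following Python sketch estimates the arithmetic load of an exhaustive search at the stated sensor rate. The exemplar count of 10,000 is a hypothetical figure chosen only for illustration, and the estimate counts one multiply-add per band per comparison while ignoring all other overhead.

```python
def exhaustive_ops_per_second(spectra_per_s, n_exemplars, n_bands):
    """Multiply-adds per second if every spectrum is compared with every exemplar."""
    return spectra_per_s * n_exemplars * n_bands


# 50,000 spectra/s and 128 bands come from the text; 10,000 exemplars is assumed.
ops = exhaustive_ops_per_second(50_000, 10_000, 128)
print(f"{ops:.1e} multiply-adds per second")
```

Under these assumptions the load is on the order of tens of billions of operations per second, and it scales linearly with each of the three factors.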
Two previous methods for finding near neighbors in hyperspectral data are the Intelligent Hypersensor Processing System (IHPS) optimized for hyperspectral data, described in U.S. Pat. No. 6,038,344, and the CHOMPS compression system used in ORASIS (Optical Real-time Adaptive Signature Identification System), described in U.S. Pat. No. 6,167,156.
The search method in CHOMPS with the ORASIS prescreener accommodates the speed requirements needed to process hyperspectral data. CHOMPS with the ORASIS prescreener is a near neighbor selection method in which the first exemplar vector, i.e., the near neighbor vector, found within a pre-specified distance from the data vector is selected. As explained in more detail below, because of this selection process the near neighbor vector is not necessarily the nearest neighbor vector within the set of exemplar vectors.
IHPS provides a means for rapid detection of small, weak or hidden objects, substances, or patterns embedded in complex backgrounds, by providing a fast adaptive processing system for demixing and recognizing patterns or signatures in the data provided by certain types of “hypersensors.” The IHPS hypersensor produces as its output a high dimensional vector or matrix consisting of many separate elements, each of which is a measurement of a different attribute of the system or scene under observation. A hyperspectral imaging system, in which the sensed attributes are different wavelength bands, is one example of such a hypersensor.
The IHPS system eliminates redundancy in the incoming sensor data stream, for learning purposes, as will be discussed in much greater detail below, by using a prescreener module that compares each newly observed spectrum, i.e., data vector, with a set of “exemplar” or “survivor” vectors chosen from earlier data vectors. The new data vector is ignored, for signature learning purposes, if it is sufficiently similar to some exemplar; otherwise the new vector is added to the exemplar set. Two vectors are “sufficiently similar” if the difference between them, as measured using some appropriate metric, is less than a pre-specified value, ε. A similarity measure or distance metric appropriate when the data vectors are reflection spectra is obtained by normalizing the two vectors to unit magnitude and subtracting their dot product from 1. If the resulting number is less than ε, the vectors are considered to “match.”
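The prescreening rule and the similarity measure just described can be sketched in Python as follows. This is a simplified illustration rather than the IHPS implementation: the names `spectral_distance` and `prescreen`, the toy two-band spectra, and the value of ε are assumptions, and the sketch assumes all vectors are nonzero so that normalization is defined.

```python
import math


def normalize(v):
    """Scale a nonzero vector to unit magnitude."""
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v]


def spectral_distance(u, v):
    """1 minus the dot product of the two unit-normalized vectors."""
    un, vn = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(un, vn))


def prescreen(spectra, epsilon):
    """Keep a spectrum as a new exemplar only if it matches no existing exemplar."""
    exemplars = []
    for s in spectra:
        matched = any(spectral_distance(s, e) < epsilon for e in exemplars)
        if not matched:
            exemplars.append(s)
    return exemplars


# Hypothetical stream: the 2nd and 4th spectra nearly parallel the 1st and 3rd.
spectra = [[1.0, 0.0], [2.0, 0.02], [0.0, 1.0], [0.01, 3.0]]
exemplars = prescreen(spectra, epsilon=0.01)
print(exemplars)
```

Because the vectors are normalized before comparison, the measure depends only on spectral shape and not on overall brightness, which is why the second spectrum matches the first despite having twice the magnitude.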
ORASIS is an implementation of IHPS that uses many of the algorithms from the CHOMPS compression system and is optimized for hyperspectral data. CHOMPS provides automatic compression of hyperdata with high compression ratios and low information loss. CHOMPS matches the data vector, or candidate, with the first exemplar vector found in the search process that satisfies the match error condition, even though there may be another exemplar that is more similar to the candidate. The matching exemplar vector is thus a near neighbor of the data vector, but not necessarily the nearest neighbor, as indicated above.
A problem with current near neighbor matching methods, such as ORASIS, is that a nearer neighbor exemplar vector (i.e., an exemplar vector that better fits/matches the data vector) may exist within the set of exemplar vectors. Current methods, such as ORASIS, search the set of exemplars until a match is found; once a match occurs, the search is stopped. However, as indicated above, it is possible that a better matching exemplar vector is present in the set of exemplars, yet because a “match” has been determined, the search is stopped, and the better matching exemplar vector is not found. Consequently, the compressed data may not reproduce the original data vector with the highest fidelity that is possible with the current set of exemplars. As a result, the compressed data may not be acceptable where a more precise reproduction of the data vector is desired or necessary.
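The difference between stopping at the first match and finding the best match can be made concrete with a small Python sketch. The functions `first_match` and `best_match` are hypothetical names, the exemplars are scanned in simple list order (an assumption; the actual search order used by CHOMPS is not described here), and the distance is the normalized dot-product measure described earlier.

```python
import math


def spectral_distance(u, v):
    """1 minus the dot product of the two unit-normalized (nonzero) vectors."""
    mu = math.sqrt(sum(x * x for x in u))
    mv = math.sqrt(sum(x * x for x in v))
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (mu * mv)


def first_match(candidate, exemplars, epsilon):
    """Return the index of the first exemplar within epsilon, then stop searching."""
    for i, e in enumerate(exemplars):
        if spectral_distance(candidate, e) < epsilon:
            return i
    return None


def best_match(candidate, exemplars):
    """Return the index of the closest exemplar (full scan)."""
    return min(range(len(exemplars)),
               key=lambda i: spectral_distance(candidate, exemplars[i]))


# Hypothetical data: exemplars 0 and 1 are both within epsilon of the candidate.
exemplars = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0]]
candidate = [1.0, 0.19]

near = first_match(candidate, exemplars, epsilon=0.01)
nearest = best_match(candidate, exemplars)
print(near, nearest)
```

In this toy case the first-match search returns exemplar 0 even though exemplar 1 is considerably closer to the candidate, which is the fidelity loss described above; the full scan finds exemplar 1 but at the cost of comparing against every exemplar.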
A serious problem with current nearest neighbor search methods, such as exhaustive searches that are carried out until the nearest neighbor is determined, is that these methods are too inefficient to be effective for large, high-dimensional data sets. Nearest neighbor methods that employ exhaustive searching simply take too long to accommodate such data sets.