Database systems and search and retrieval from such databases are known. For example, U.S. Pat. No. 6,778,995 to Gallivan describes a system and method for efficiently generating cluster groupings in a multi-dimensional concept space. A plurality of terms are extracted from documents of a collection of stored, unstructured documents. A concept space is built over the collection and terms correlated between documents such that a vector may be mapped for each correlated term. Referring to FIG. 14 of the '995 patent, a cluster is populated with documents having vector differences falling within a predetermined variance such that a view may be generated of overlapping clusters.
Much research has been conducted for military and related purposes in the field of target signature recognition via sonar and radar or by related spectral data identification. U.S. Pat. Nos. 4,992,797 (Gjessing et al.), 5,012,252 (Faulkner), 5,867,118 (McCoy et al.), 6,337,654 (Richardson et al.), and 6,943,724 (Brace et al.) are examples of target acquisition of mobile objects such as ships, aircraft and missiles. By way of example, U.S. Pat. No. 7,046,192 (Nagel) is directed to a radar process for classifying or identifying helicopters, and U.S. Pat. No. 8,049,659 (Sullivan et al.) is directed to firearm threat detection, classification and location using wideband radar. U.S. Pat. No. 6,580,388 (Stoyanov et al.) is directed to a calculation methodology for complex target signatures of a sample object such as an antenna array which are used to extrapolate a more complex, for example, ship size system. U.S. Pat. No. 5,828,334 (Deegan) expects frequencies between 10 and 50 Hz for cyclotron radiation, 100 to 5000 Hertz for jet turbulence and above ten kHz for, for example, turbine blade frequencies for vehicle, ship, missile and aircraft classification. U.S. Pat. No. 7,920,088 is directed to an apparatus and method for identifying targets through opaque barriers. The apparatus may be worn and identify a target utilizing transceiver circuitry in the 300 MHz to 1100 MHz range.
Target recognition and identification must be immediate so that an enemy target does not acquire target recognition first, for example, in military applications. Targets may intentionally change their signatures to attempt to foil recognition. J. D. Birdwell (one of the named inventors of the present application) and B. C. Moore wrote the chapter “Condensation of Information from Signals for Process Modeling and Control,” at pp. 45-63 of Hybrid Systems II, published by Springer of Berlin/Heidelberg, 1995 in which the importance of collecting historical data over time is identified and demonstrated as valuable, for example, in process modeling and control. For example, historically collected data for a process may trigger recognition of an irregularity in collected, expected data such that the imminent failure of a machine component may be detected before the component fails. Birdwell and Moore recognize that when historical data are used to characterize behavior it is often not appropriate to model such behavior as a curve because too much information may be lost. As an example, classification can occur by comparison of a spectrum to a region or cluster of classified or known spectra and is not limited to comparison to a centroid or other small set of statistics derived from the known spectra. When applied to target recognition, the target's attempt at disguising its true identity may be determined and, perhaps, more importantly, with the classification of the target, the target's means for disguising its identity determined for future encounters with the classified target. The limitations of the method disclosed by Birdwell and Moore are the difficulties inherent in the representation of regions or clusters obtained from classified or known spectra, and in the comparison of a spectrum to a region. 
Birdwell and Moore offer an approach to representation of signals and portions of regions using interval arithmetic but do not address representation of the entire region in a manner that enables efficient comparison for large volumes of data. Although spectra are used as an illustrative example herein, the approach described herein is not limited to spectra.
U.S. Pat. No. 7,127,372 to Boysworth describes an improved regression-based qualitative analysis algorithm in which a mixture that is not in a library of spectra, and is therefore an “unknown,” is subjected to regression analysis of “peaks” in a residual error computed between an estimated spectrum and a measured spectrum. The process is repeated using information from a retro-regression.
Research into data clustering and indexing in regard to DNA analysis began in the 1990's at the University of Tennessee. U.S. Pat. Nos. 6,741,983; 7,272,612; 7,454,411; 7,769,803; 7,782,106; 8,060,522 and 8,099,733 are representative of disclosures issued to Birdwell et al. for a method of indexed storage and retrieval of multidimensional information. These methods provide an opportunity to address any limitations of the method disclosed by Birdwell and Moore discussed above. Also, a method and apparatus for allele peak fitting and attribute extraction from DNA sample data is known from U.S. Ser. No. 11/913,098 filed Oct. 30, 2007 and published as US 2009/022845 on Sep. 10, 2009, still pending. Other patents and pending applications of the University of Tennessee will be referenced herein.
Spectral data may comprise time and frequency-based spectral data that may be sonic, for example, ultrasonic, or radio frequency, at any frequency from 0 Hz (or direct current) to the terahertz range. (There is no sound or vibration at 0 Hz.) For example, voice sounds are known to have most information content in the frequency band between 200 and 1600 Hertz. Higher voice or music frequencies provide fidelity and other properties making the voice or music sound recognizable as a particular human (or particular animal or particular musical instrument or collection of musical instruments). Ultrasound transducers (ultrasound used to be known as frequencies above the sound frequencies capable of being heard by the human ear) are now known that vibrate in the 100 MHz range and higher. The radio frequency spectrum is practically unlimited.
Underwater submarine data transmission is typically accomplished at very low frequencies, but skilled sonar operators can utilize underwater acoustic signal measurements that have information content at frequencies from subsonic through the audible spectrum to distinguish ships and fish of various types. Radio operators can utilize return signals in the radio spectrum (up through the microwave range) from aircraft and birds in a similar fashion. Spectral data are utilized in astronomy to detect and identify solar, interstellar and extragalactic processes and objects over frequency (or wavelength) ranges from radio frequencies through X-rays. Both the representation and the analysis of signals such as these, identified herein as “spectra,” can be accomplished using either frequency (or wavelength or energy) or time domain methods, and it is understood that the terms “spectra” and “spectrum” as used herein can refer to either type of signal or analysis.
Optical fiber now carries light frequencies modulated with data at very high frequencies. Black body radiation is passive and is radiated by any specimen without the connection of electrical leads or any electrical or electronic stimulation. Spectral frequencies are intended to include all these spectral data and spectral data is not intended to be limited in this disclosure. For example, spectral data may include data for visible and invisible light frequencies, acoustic vibration (including ultrasound) and X-rays or gamma radiation.
Electrical impedance is defined as “A measure of the complex resistive and reactive attributes of a component in an alternating-current circuit” by the Institute of Electrical and Electronics Engineers (IEEE). Impedance measurements can be represented in several different forms. The most popular methods involve representing the phase angle and magnitude, in a manner similar to measurements in polar coordinates, or as a complex number, in Cartesian coordinates. While spectral data may take many forms, spectral data will be represented in the present application, by way of example, as spectral impedance data. The impedance may be represented, by way of example, as a complex number involving real and imaginary numbers or in alternative form. An example of an alternate equivalent form of specifying spectral data is by magnitude and phase angle. In some applications, spectral data may be represented by the magnitude or the phase or a similar partial capture of the entire spectral information content.
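The equivalence of the Cartesian (complex number) form and the magnitude and phase angle form can be illustrated with a minimal Python sketch. The sample impedance value and the helper names `to_polar` and `to_rect` are hypothetical and are used here only for illustration.

```python
import cmath
import math


def to_polar(z):
    """Return (magnitude, phase angle in degrees) of a complex impedance."""
    magnitude, phase_rad = cmath.polar(z)
    return magnitude, math.degrees(phase_rad)


def to_rect(magnitude, phase_deg):
    """Return the complex impedance for a magnitude and a phase angle in degrees."""
    return cmath.rect(magnitude, math.radians(phase_deg))


# Hypothetical impedance sample: 3 ohms resistive, 4 ohms reactive.
z = complex(3.0, 4.0)
magnitude, phase_deg = to_polar(z)
z_back = to_rect(magnitude, phase_deg)
```

Converting to magnitude and phase and back recovers the original complex value to within floating point error, so either form carries the full spectral information; keeping only the magnitude or only the phase is the partial capture mentioned above.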
Spectral data collected passively as black body radiation may be influenced by received radiation that is man-produced such as transmissions on radio frequency channels. Any unabsorbed radiation is reflected and included in the black body radiation. For example, a biological or other specimen receiving sunlight may exhibit different passive black body and reflected radiation characteristics at night or when influenced by the weather. As suggested by U.S. Pat. Nos. 7,724,134 and 8,044,788, for example, the “noise” influence on passively received black body radiation emission may be avoided by utilizing antennae tuned to frequencies not utilized for any radio frequency transmission such as frequencies reserved for listening for transmissions from the stars (astronomy uses).
By spectrum analyzer as used herein is intended a device for obtaining frequency spectrum data generally which may be optical, acoustic, electromagnetic, radiation and other frequency spectrum data of an unknown specimen or media and includes, but is not limited to, a network analyzer. Network analyzers are known that measure the impedance of a system over a range of frequencies and are capable of taking measurements from ten MHz (10⁶) to over one THz (10¹²) with over 20,000 sample points. Analysis of all 20,000 points may be time consuming and unneeded because initial investigations show that the similarity of two impedance curves can be determined by their general shape, which can be represented with far fewer measurements at selected frequencies. Spectrum analyzers may capture frequency spectrum data over a range of frequencies (that does not include zero frequency or DC) and may use methods such as modulation to capture high resolution spectral data over a narrow band of frequencies. Windowing methods, such as the use of a Hamming window, can be used to improve the accuracy of spectral data measured as a function of frequency or wavelength. Spectrum analyzers are known for receiving passive black body radiation or ultrasound transmission via directed antenna or microphone respectively. A spectrum analyzer may also be used that can capture or measure a signal as a function of time. Filtering methods are known in the art that can be used to modify the spectral characteristics of the captured or measured signal. For example, a band pass or low pass filter may be used. Such filters are not limited in their application to electrical signals; gratings, zone plates, band pass, and low pass filters, for example, may be used to filter optical and electromagnetic signals.
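As an illustration of the windowing step mentioned above, the following Python sketch applies a Hamming window to a captured time-domain signal before any transform is taken. The function names and the sample signal are hypothetical; the window coefficients follow the standard definition w[n] = 0.54 − 0.46 cos(2πn/(N−1)).

```python
import math


def hamming_window(n_samples):
    """Return Hamming coefficients w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if n_samples < 1:
        return []
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(samples):
    """Multiply each time-domain sample by the matching window coefficient."""
    window = hamming_window(len(samples))
    return [s * w for s, w in zip(samples, window)]


# Hypothetical captured signal: eight samples of constant amplitude.
windowed = apply_window([1.0] * 8)
```

The window tapers the ends of the record toward 0.08 of their original amplitude, which reduces spectral leakage when the record is subsequently transformed to the frequency domain.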
For analysis, the spectral impedance data used by way of example herein may be represented as vectors of complex floating point values, for instance, a vector of 20,000 elements representing the real and imaginary parts of the impedance at 20,000 frequencies and measured over time. If correlation exists among the data variables, then one of several known techniques can be used to reduce the dimensionality of a set of spectral data while still retaining most of the information represented in the data. Different dimension reduction techniques will be reviewed and analyzed. Using knowledge of the data and an analysis of the loss of information, one may be able to reduce the spectral data to less than 1/1000th of the original size, allowing much more efficient calculation of results, avoiding instability of solution results, and rendering more possibilities for application. Two known methods of dimension reduction are now reviewed: principal component analysis and peak binning.
Dimensionality reduction is known in the context of principal component analysis and binning, for example. Principal component analysis (PCA) identifies the principal components of the data that exhibit the most variability. With PCA, the data are represented with respect to a new ordered set of orthogonal bases that capture successively less variability in the data. In many cases, 90% of the variability in the data can be represented in the first few principal components. One method of performing PCA is with singular value decomposition (SVD). With SVD, the data matrix X of size m×n with rank r is factored into three matrices that have unique properties:
$$X = U \Sigma V' \qquad (1)$$
The V matrix is of size n×r, and the columns of V are right singular vectors of X. The columns of V represent a set of basis vectors, ordered so that the sum of squares of the projections of the rows of X onto each successive column is of decreasing magnitude. The columns of V are orthonormal. The U matrix is of size m×r and the columns of U are orthonormal, and are the left singular vectors of X. The Σ matrix is a diagonal r×r matrix. The values in the Σ matrix are referred to as the singular values (of SVD) and are the positive square roots of the eigenvalues of both XX′ and X′X. The singular values are in decreasing order and can be used to determine the real, or effective, rank of the original data matrix by looking at the number of non-zero singular values. To determine the effective rank, the ratio of each singular value to the maximum singular value is calculated, and ratios below a threshold are typically taken to be zero. The number of singular values above the threshold may represent the ‘effective rank’.
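The effective rank determination and the resulting dimension reduction can be sketched in Python with NumPy as follows. The function names, the threshold value of 1e-8, and the small data matrix are assumptions made for illustration; they are not taken from any of the cited references.

```python
import numpy as np


def effective_rank(X, ratio_threshold=1e-8):
    """Count singular values whose ratio to the largest exceeds the threshold."""
    singular_values = np.linalg.svd(X, compute_uv=False)  # decreasing order
    if singular_values.size == 0 or singular_values[0] == 0.0:
        return 0
    ratios = singular_values / singular_values[0]
    return int(np.sum(ratios > ratio_threshold))


def reduce_dimension(X, k):
    """Project the rows of X onto the first k right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T


# Hypothetical 4x3 data matrix whose third column is the sum of the first two,
# so only two directions carry information.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 3.0, 4.0]])
rank = effective_rank(X)
reduced = reduce_dimension(X, rank)
```

In this example the third singular value is zero to within rounding error, so each three-element row can be represented by two coordinates without loss. With measured data the threshold would be chosen from knowledge of the noise level.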
If the data to be dimensionally reduced are comprised of samples of a stochastic process, it is well-known in the field of mathematics that the Karhunen-Loève theorem can be employed to represent the stochastic process as a linear combination (a series) of a (typically infinite) set of orthogonal functions, where the orthogonal functions are deterministic and the coefficients in the terms of the linear combination are random variables. The terms of the linear combination or series are ordered such that a finite truncation of the series is a best fit to the characteristics of the stochastic process in the sense that it minimizes a squared error measure of the difference between the truncated series and the original stochastic process. In this manner the stochastic process can be optimally approximated in an n-dimensional space by retaining the first n terms of the series. When applied to samples of a stochastic process this approach is equivalent to principal component analysis and yields both optimally selected (with respect to a specific type of error criterion) orthogonal functions and reduced dimension representations of the samples.
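In symbols, and assuming a zero-mean process X_t observed on an interval [a, b] (notation introduced here for illustration only), the expansion, its n-term truncation, and the random coefficients may be sketched as:

```latex
X_t = \sum_{k=1}^{\infty} Z_k \, e_k(t), \qquad
\hat{X}_t^{(n)} = \sum_{k=1}^{n} Z_k \, e_k(t), \qquad
Z_k = \int_a^b X_t \, e_k(t) \, dt
```

Here the e_k(t) are the deterministic orthonormal functions and the Z_k are the random coefficients, ordered by decreasing variance so that the truncated series minimizes the mean squared error among all n-term approximations of this form.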
Binning was also studied and is known from Puryear et al., where data were binned and multiple peaks falling in one bin were combined. Other methods of data reduction are known; see, for example, the projection search method used to determine classifiers of data, as discussed in United States Patent Application Publication 2010/0332475 dated Dec. 30, 2010, by Birdwell et al.
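A minimal Python sketch of peak binning is given below. It is not the procedure of Puryear et al.; the function name, the fixed bin width, and the rule of summing the amplitudes of peaks that share a bin are all assumptions made for illustration.

```python
from collections import defaultdict


def bin_peaks(peaks, bin_width):
    """Combine (frequency, amplitude) peaks that fall in the same bin.

    Amplitudes of peaks sharing a bin are summed; this combining rule is
    an assumption made for illustration.
    """
    bins = defaultdict(float)
    for frequency, amplitude in peaks:
        index = int(frequency // bin_width)
        bins[index] += amplitude
    return dict(bins)


# Hypothetical peaks: the first two fall in the bin covering 100 to 200 Hz.
peaks = [(120.0, 1.5), (180.0, 0.5), (450.0, 2.0)]
binned = bin_peaks(peaks, bin_width=100.0)
```

Three peaks are reduced to two bin values, which shows how binning trades frequency resolution for a shorter representation.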
Now the following known metrics will be individually discussed as known in the art: inner product, Euclidean distance, Mahalanobis distance, Manhattan distance, average, squared chord, Canberra, coefficient of divergence, modified Boolean correlation, average weight of shared terms, overlap, cosine, similarity index and Tanimoto's (fourteen different similarity metric possibilities).
The inner product has some interesting geometric features that make it a very good basis for many of the equations used here. The inner product of a vector with itself is equivalent to the length of the vector squared. The inner product between two orthogonal vectors is equal to zero. The inner product between two vectors X and Y having components Xi and Yi respectively is given by
$$\sum_{i} X_i Y_i \qquad (2)$$
For vectors of fixed length, the inner product between a vector and another vector grows as the angle between the two vectors shrinks, for angles less than or equal to ninety degrees. If two vectors are being compared to a query vector and they both have the same angle of separation from the query vector but are of different lengths, the longer vector will have a larger inner product value.
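The geometric properties of the inner product noted above can be checked with a short Python sketch. The function name and the sample vectors are hypothetical.

```python
def inner_product(x, y):
    """Return the sum of the products of corresponding components."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(xi * yi for xi, yi in zip(x, y))


# Hypothetical vectors: short_vec and long_vec make the same 45 degree angle
# with the query vector but differ in length.
query = [1.0, 0.0]
short_vec = [1.0, 1.0]
long_vec = [2.0, 2.0]
orthogonal = [0.0, 3.0]

squared_length = inner_product(short_vec, short_vec)
ip_orthogonal = inner_product(query, orthogonal)
ip_short = inner_product(query, short_vec)
ip_long = inner_product(query, long_vec)
```

The longer vector scores higher than the shorter one at the same angle, which is why the inner product alone favors large-magnitude spectra unless the vectors are normalized.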
Euclidean distance is a standard, known metric used in most distance measurements because, in R², the distance can be measured with any standard ruler. An equation for Euclidean distance is
$$\sqrt{\sum_{i} (X_i - Y_i)^2} \qquad (3)$$
The contour graphs for equivalent Euclidean distance resemble circles around points in the graphs.
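A minimal Python sketch of the Euclidean distance follows; the function name and sample points are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Return the square root of the summed squared component differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical points forming a 3-4-5 right triangle with the origin.
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```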
The Mahalanobis distance was first introduced by Prasanta Chandra Mahalanobis in 1936 with his publication, “On the generalized distance in statistics.” The metric is very similar to the Euclidean distance with the modification that it takes into account the density and dispersion of known members in a group. This metric is different from many of the examined metrics in that it will calculate the distance to the center of a cluster of points based on the dispersion of existing members in the group. Mahalanobis distance scores may also be calculated using the singular value decomposition (SVD).
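The effect of dispersion on the Mahalanobis distance can be illustrated in two dimensions with the Python sketch below. It uses the sample covariance of the group and an explicit 2×2 matrix inverse rather than the SVD; the function name and the sample group are hypothetical.

```python
import math


def mahalanobis_2d(point, group):
    """Distance from point to the center of a 2-D group, scaled by its covariance."""
    n = len(group)
    mean_x = sum(p[0] for p in group) / n
    mean_y = sum(p[1] for p in group) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((p[0] - mean_x) ** 2 for p in group) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in group) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in group) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean_x
    dy = point[1] - mean_y
    # Quadratic form d' * inverse(S) * d written out for the 2x2 case.
    squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical group spread twice as widely along x as along y.
group = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
d_wide = mahalanobis_2d((2.0, 0.0), group)
d_narrow = mahalanobis_2d((0.0, 1.0), group)
```

A point two units from the center along the widely dispersed axis receives the same score as a point one unit away along the narrow axis, whereas their Euclidean distances to the center differ by a factor of two.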
The Manhattan distance measure is often called the city block, or taxicab, distance because it measures the distance between points in space if the path of travel is only able to follow directions parallel to the coordinate space, as a taxicab driver in Manhattan, N.Y. would have to do when traveling between two points in the city where the streets are only North-South or East-West. The advantages and disadvantages of using the Manhattan distance are similar to those of the Euclidean distance with the exception that vectors of equal similarity to a query point vector form a diamond shape around the query point.
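The diamond-shaped contour can be seen in a small Python sketch; the function name and sample points are hypothetical.

```python
def manhattan_distance(x, y):
    """Return the sum of absolute component differences (city block distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Hypothetical points lying on one edge of the unit diamond around the origin.
query = [0.0, 0.0]
on_diamond = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
distances = [manhattan_distance(query, p) for p in on_diamond]
```

All three points are at distance one from the query point even though they lie on a straight edge rather than on a circle.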
The average distance is a known metric and is defined as
$$\frac{1}{M} \sum_{i} (X_i - Y_i) \qquad (4)$$
where M is the number of coordinates in X. Y contains the same number of points as X. The equation calculates an average distance.
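Equation (4) can be sketched in Python as follows. The sketch follows the equation as written, using signed differences, so that positive and negative differences may cancel; the function name and sample vectors are hypothetical.

```python
def average_distance(x, y):
    """Return (1/M) * sum(X_i - Y_i), with M the number of coordinates in X."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    m = len(x)
    return sum(xi - yi for xi, yi in zip(x, y)) / m


# Hypothetical vectors.
avg = average_distance([3.0, 5.0, 7.0], [1.0, 2.0, 3.0])
cancelled = average_distance([1.0, 3.0], [3.0, 1.0])
```

The second call returns zero for two different vectors, which shows that contours of equal average distance are straight lines rather than closed curves around the query point.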
The squared chord distance has been actively used in palynology for pollen assemblage comparison based on the work by Gavin et al. and Overpeck et al. in comparing distance metrics with respect to pollen assemblages. The equation only allows comparisons of vectors with positive elements and produces a shape similar to Euclidean distance, but stretched in the direction of the X axis.
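The passage above does not reproduce the equation. The Python sketch below assumes the commonly used form of the squared chord distance, the sum of squared differences of the square roots of corresponding elements, which is consistent with the restriction on the sign of the elements. The function name and sample vectors are hypothetical.

```python
import math


def squared_chord_distance(x, y):
    """Return sum((sqrt(X_i) - sqrt(Y_i))**2); elements must not be negative."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    if any(v < 0.0 for v in x) or any(v < 0.0 for v in y):
        raise ValueError("squared chord distance requires non-negative elements")
    return sum((math.sqrt(xi) - math.sqrt(yi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical vectors with perfect-square elements for easy checking.
sc = squared_chord_distance([4.0, 9.0], [1.0, 4.0])
```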
The Canberra distance was first published in 1966 and then refined in 1967 by the same authors, Lance and Williams. In two dimensions, contours of equal Canberra distance may be graphed, such that all points on a contour share the same similarity distance value.
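The passage above does not reproduce the equation. The Python sketch below assumes the commonly used form of the Canberra distance, in which each absolute component difference is divided by the sum of the absolute component values, and a term whose denominator is zero is skipped. The function name and sample vectors are hypothetical.

```python
def canberra_distance(x, y):
    """Return sum(|X_i - Y_i| / (|X_i| + |Y_i|)), skipping 0/0 terms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    total = 0.0
    for xi, yi in zip(x, y):
        denominator = abs(xi) + abs(yi)
        if denominator > 0.0:
            total += abs(xi - yi) / denominator
    return total


# Hypothetical vectors: only the first component differs.
cd = canberra_distance([1.0, 2.0], [3.0, 2.0])
```

Because each term is normalized by the size of the components, a given absolute difference counts for more when the components are small.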
The coefficient of divergence was introduced by Sneath and Sokal and studied by McGill; contour graphs of equivalent coefficient of divergence distances may be derived.
The modified Boolean correlation was introduced in Sager and is almost the same as the arithmetic mean of two vectors. It is modified from the arithmetic mean to include another term that takes a value of zero or one depending on whether the terms in the vector are positive or negative (if either term is negative, Xi and Yi equal zero; Xi and Yi equal one otherwise). After graphing, the contours of similar distances reveal a structure very similar to the average distance measure but shifted with a different slope and y-intercept.
The average weight of shared terms metric was introduced in Reitsma and Sagalyn's report in 1967 and is equivalent to the average value of all of the terms in both vectors, excluding any dimensions with negative values. An analysis of this metric in two dimensions reveals that contours of equal distance d to the vector (X1, Y1) have a −1 slope and a y-intercept at 4d−Y1−Y2.
The overlap measure of similarity between vectors X and Y is defined by Σmin (Xi, Yi)/min (ΣXi, ΣYi). If all members of vector X are less than, or greater than, the members of vector Y, the two vectors are considered to have maximum similarity. If the members overlap, the vectors will have a similarity of a magnitude between 0.5 and 1.
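The overlap measure as defined above can be sketched in Python as follows; the function name and sample vectors are hypothetical, and non-negative elements are assumed.

```python
def overlap_similarity(x, y):
    """Return sum(min(X_i, Y_i)) / min(sum(X_i), sum(Y_i))."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    denominator = min(sum(x), sum(y))
    if denominator == 0:
        raise ValueError("overlap is undefined when a vector sums to zero")
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / denominator


# Hypothetical vectors. In the first pair every member of X is less than the
# corresponding member of Y; in the second pair the members cross over.
dominated = overlap_similarity([1.0, 2.0], [3.0, 4.0])
crossed = overlap_similarity([1.0, 4.0], [3.0, 2.0])
```

The first pair reaches the maximum similarity of one, while the crossed pair scores lower.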
The cosine similarity distance metric is equivalent to the cosine of the angle between two vectors and is the inner product of the two vectors divided by the product of their norms (lengths). This metric has been used in many areas due to the easy and intuitive interpretation of the similarity. For vectors with non-negative elements, the metric is bounded on the interval from zero to one, with a value of zero indicating the vectors are perpendicular and a value of one indicating the vectors are collinear. Noreault and McGill cite Torgerson's 1958 book as the origin of the metric; however, many linear algebra texts show the proof of this metric. This metric has a benefit of being scale independent.
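A minimal Python sketch of the cosine metric, computed as the inner product of the two vectors divided by the product of their lengths, is given below. The function name and sample vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine of the angle between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for a zero-length vector")
    return dot / (norm_x * norm_y)


# Hypothetical vectors.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
collinear = cosine_similarity([1.0, 2.0], [2.0, 4.0])
original = cosine_similarity([1.0, 2.0], [3.0, 1.0])
rescaled = cosine_similarity([10.0, 20.0], [3.0, 1.0])
```

Multiplying one vector by ten leaves the score unchanged, which is the scale independence noted above.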
The similarity index is a metric introduced by Lay, Gross, Zwinselman, and Nibbering in 1983 and later refined by Wan, Vidavsky, and Gross in 2002. The similarity index is an unbounded metric where a value of 0 indicates an exact match and the value increases as the two vectors become less and less similar.
The final known metric evaluated is that of Tanimoto. The Tanimoto coefficient is an extension of the cosine similarity distance that is documented in Tanimoto's internal IBM memos and Rogers. The calculation is equivalent to the Jaccard coefficient when all elements of the vectors are binary values. A Partial Bibliography provides citations for all references for the similarity metrics examined in the Detailed Description of the Preferred Embodiments.
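The relationship between the Tanimoto coefficient and the Jaccard coefficient for binary vectors can be checked with the Python sketch below. The passage above does not reproduce the equation; the sketch assumes the commonly used form, the inner product divided by the sum of the squared lengths less the inner product. The function names and sample vectors are hypothetical.

```python
def tanimoto_coefficient(x, y):
    """Return x.y / (|x|^2 + |y|^2 - x.y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - dot
    if denominator == 0:
        raise ValueError("coefficient is undefined for two zero vectors")
    return dot / denominator


def jaccard_coefficient(x, y):
    """Return |intersection| / |union| of the positions set to one."""
    set_x = {i for i, v in enumerate(x) if v == 1}
    set_y = {i for i, v in enumerate(y) if v == 1}
    return len(set_x & set_y) / len(set_x | set_y)


# Hypothetical binary vectors sharing two of four occupied positions.
a = [1, 1, 0, 1]
b = [1, 0, 1, 1]
tanimoto = tanimoto_coefficient(a, b)
jaccard = jaccard_coefficient(a, b)
```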
Given the several known metrics for clustering data, for example, into groups of similar spectral data for known specimens and media; the desirability of evaluating spectral data in multiple dimensions and of building a database of the known spectral data that may be accumulated; and the potential applications for identifying properties of unknown specimens and media by classifying unknown spectral data, using preferred known metrics, against the classified known spectral data in the database, a method and apparatus for identifying unknown specimens and media and predicting their properties becomes desirable in view of the prior art.