It is known that a spectral analysis can be performed on a sample to identify its type. For example, a spectral analysis of an unknown sample under test can include identifying one or more peak wavelength values. Careful matching of the peak wavelengths of the unknown sample to a peak signature of a known reference sample can indicate whether the unknown sample is likely the same matter as the reference sample.
In modern multi-channel instrumentation, searching a reference database to match an unknown sample to a known sample can require a substantial amount of processing and storage resources. For example, it is not uncommon for a spectral analysis to yield a continuum of a thousand or more data points for each measurement of an unknown sample.
Furthermore, in conventional spectrum searching applications, it is not uncommon for a reference database to include more than 10,000 sets of spectral information, one set for each reference sample.
Each set of spectral information for a respective reference sample can consist of a thousand or more data points defining peaks, valleys, etc. Matching the spectral information of an unknown sample (e.g., a thousand or more data points) against each of more than 10,000 sets of reference spectral information (e.g., a thousand or more data points per set) may be computationally challenging. For example, when a 2000-point unknown spectrum is searched against a library of 10,000 reference spectra (each of which contains 2000 channels of information), 20,000,000 or more operations (point-by-point comparisons) must be executed if no provisions are taken to use a computationally efficient search methodology.
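The operation count in the example above follows from multiplying the number of channels by the number of reference spectra. A minimal sketch of that arithmetic is shown below; the function name `brute_force_comparisons` is hypothetical and is used only for illustration.

```python
def brute_force_comparisons(points_per_spectrum: int, library_size: int) -> int:
    """Count point-by-point comparisons for an exhaustive library search.

    Assumes one comparison per channel per reference spectrum, with no
    pre-filtering or compression of the data.
    """
    return points_per_spectrum * library_size


# Figures taken from the example in the text: a 2000-point unknown
# spectrum searched against 10,000 reference spectra.
total = brute_force_comparisons(2000, 10_000)
print(total)
```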
One approach to making the search methodology more efficient is to compress the spectral data into a binary format prior to analysis. In accordance with conventional spectroscopy, one prior approach to binarizing a spectrum is to make an assessment regarding the presence or absence of a peak. When comparing two spectra, such as spectrum A and spectrum B, one approach would be to look at a table of peaks for spectrum A and assign a value of one to any location where spectrum B also contains a feature within n wavenumbers (or other suitable units, e.g., pixels, m/z, etc.) of that peak. A value of zero is assigned to any location of the peak table for spectrum A where spectrum B does not contain a peak. This approach was described by Clerc et al. in the 1980s.
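The peak-table binarization described above can be sketched as follows. This is an illustrative reading of the description and not a reproduction of the published method; the function name `binarize_against`, the inclusive tolerance window, and the sample peak positions are all assumptions made for the example.

```python
from bisect import bisect_left


def binarize_against(peaks_a, peaks_b, tolerance):
    """Return one bit per peak in spectrum A's peak table.

    A bit is 1 when spectrum B has a peak within `tolerance` units
    (inclusive) of the corresponding peak in A, and 0 otherwise.
    """
    sorted_b = sorted(peaks_b)
    bits = []
    for peak in peaks_a:
        # Index of the first peak in B at or above the lower edge of the window.
        i = bisect_left(sorted_b, peak - tolerance)
        matched = i < len(sorted_b) and sorted_b[i] <= peak + tolerance
        bits.append(1 if matched else 0)
    return bits


# Hypothetical peak positions in wavenumbers, chosen only for illustration.
peaks_a = [1000.0, 1450.0, 1720.0, 2950.0]
peaks_b = [1002.0, 1600.0, 1718.5, 3100.0]
bits = binarize_against(peaks_a, peaks_b, tolerance=4.0)
print(bits)  # [1, 0, 1, 0]
```

The resulting bit vector has one entry per peak in spectrum A, so its length depends on which spectrum supplies the peak table.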
Although this conventional approach provides a general similarity ranking, it does not afford any kind of probabilistic interpretation. Also worth noting (and noted by Clerc) is that, in accordance with this conventional approach, the scores of candidate matches depend on the direction of the search (i.e., the results are non-symmetric). For example, suppose spectrum A contains 10 peaks, spectrum B contains 12 peaks, and 8 of the peaks are found to be in common. Since respective scores are normalized based on the number of peaks present, a value of either 8/10 or 8/12 can be generated, depending on which spectrum's peak table serves as the basis of the comparison.
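The asymmetry in the example above can be illustrated with a short calculation. The sketch assumes the score is simply the number of common peaks divided by the number of peaks in the table being walked; the function name `directional_score` is hypothetical, and the exact normalization used in the published approach may differ.

```python
from fractions import Fraction


def directional_score(n_common: int, n_peaks_in_table: int) -> Fraction:
    """Fraction of the walked peak table that found a match.

    Assumed normalization: common peaks divided by the number of peaks
    in the spectrum whose peak table drives the comparison.
    """
    return Fraction(n_common, n_peaks_in_table)


# Figures from the text: A has 10 peaks, B has 12 peaks, 8 are in common.
score_a_to_b = directional_score(8, 10)  # peak table of A is walked
score_b_to_a = directional_score(8, 12)  # peak table of B is walked

print(float(score_a_to_b))  # 0.8
print(score_a_to_b == score_b_to_a)  # False
```

Under this assumed normalization, the same pair of spectra yields two different scores, so a ranking of candidate matches can change with the direction of the search.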