Comprehensive chemical characterization of a complex microstructure is a daunting task. Recently, full spectrum imaging instruments have become available that can collect a complete spectrum at each point in a spatial array and that promise to provide the data needed to perform such a characterization. A remaining hurdle is the computational difficulty of reducing the very large quantities of raw spectral data to meaningful chemical information.
Analysis of full spectrum images having sizes that can be collected with current instrumentation can outstrip the computational resources typically found in the laboratory. For example, a typical commercial energy dispersive x-ray imaging system can produce a 1024×1024 pixel (one megapixel) image with each pixel being represented by a complete 1024 channel x-ray spectrum. Four Gbytes of computer memory are required just to hold a single precision floating point representation of this data set. To put this number in context, 4 Gbytes equals the full physical address space of a modern 32-bit microprocessor. This data overload precludes the use of standard analysis algorithms, which require that all of the data be in the computer's physical memory at one time.
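The memory figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `spectrum_image_bytes` is hypothetical and introduced only for illustration, and 4 bytes per value is assumed for single precision.

```python
# Memory needed to hold a full spectrum image as single precision floats.
# The function name is illustrative only; 4 bytes per value is assumed.
def spectrum_image_bytes(rows, cols, channels, bytes_per_value=4):
    """Return the number of bytes for a rows x cols x channels array."""
    return rows * cols * channels * bytes_per_value


n_bytes = spectrum_image_bytes(1024, 1024, 1024)
print(n_bytes)             # 4294967296
print(n_bytes == 2 ** 32)  # True: equal to a 32-bit address space
```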
A related issue is the computational complexity of the analysis algorithms. While the time required to fully read a given data set obviously scales linearly with the size of the data set, the analysis algorithms may scale up more quickly, leading to unacceptable computation times. To take one example, given an m×p matrix with m≥p, Singular Value Decomposition (SVD), a popular method for effecting Principal Components Analysis (PCA), requires computing time proportional to m×p².
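To give a feel for this scaling, the sketch below evaluates the m×p² proportionality for the one megapixel, 1024 channel example. It assumes the data are unfolded with pixels as rows (m) and channels as columns (p); the function name `svd_cost_units` is hypothetical, and the result is a relative count, not a timing.

```python
# Relative SVD cost under the m x p^2 proportionality stated in the text.
# Assumption: pixels are rows (m) and spectral channels are columns (p).
def svd_cost_units(m, p):
    """Return m * p**2, the SVD cost up to an unspecified constant factor."""
    if m < p:
        raise ValueError("this estimate assumes m >= p")
    return m * p ** 2


m = 1024 * 1024  # pixels in a one megapixel image
p = 1024         # channels per spectrum

print(svd_cost_units(m, p) == 2 ** 40)                       # True
print(svd_cost_units(4 * m, p) // svd_cost_units(m, p))      # 4: linear in pixels
print(svd_cost_units(m, 2 * p) // svd_cost_units(m, p))      # 4: quadratic in channels
```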
Researchers have taken a number of different approaches in attempting to overcome the foregoing computational problems. Current approaches to analyzing very large spectrum images range from downsampling to simply omitting data to various data compression schemes. In the context of spectroscopic data, the general idea behind data compression is to represent a high dimensional spectrum in terms of a lower dimension basis. Convenient bases include B-splines and wavelets. A principal component basis derived from PCA can yield good compression, provided the PCA can be computed. While these approaches are effective at reducing the size of the problem to be solved, they suffer the potential drawback that they work with an approximation to the data rather than the data itself. Whether or not the approximation is acceptable depends on the details of the data and the problem at hand. See, e.g., J. Andrew and T. Hancewicz, “Rapid Analysis of Raman Image Data Using Two-Way Multivariate Curve Resolution,” Applied Spectroscopy 52, 797 (1998); B. Alsberg and O. Kvalheim, “Speed improvement of multivariate algorithms by the method of postponed basis matrix multiplication Part I. Principal component analysis,” Chemometrics Intell. Lab. Syst. 24, 31 (1994); H. Kiers and R. Harshman, “Relating two proposed methods for speedup of algorithms for fitting two- and three-way principal component and related multilinear models,” Chemometrics Intell. Lab. Syst. 36, 31 (1997); F. Vogt and M. Tacke, “Fast principal component analysis of large data sets,” Chemometrics Intell. Lab. Syst. 59, 1 (2001); and F. Vogt and M. Tacke, “Fast principal component analysis of large data sets based on information extraction,” J. Chemometrics 16, 562 (2002).
The current invention overcomes these limitations by combining compression strategies with image analysis algorithms that operate directly on the compressed data. Spectral compression is described herein using an exemplary PCA-factored representation of the data. Furthermore, a block algorithm can be used to perform common operations more efficiently. A key advantage of the block algorithm is that at no point must all of the data reside simultaneously in the computer's main memory, so the operations can be performed on larger-than-memory data sets. For example, the block algorithm is suitable for out-of-core kernel PCA and for a Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) procedure that directly employs a PCA representation of the data. The crossproduct matrix that is formed during the PCA compression using the block algorithm is mathematically identical to the one that would be computed if the entire data set could be held in memory. Consequently, data sets that are larger than memory are readily analyzed. The performance of the algorithm is characterized for data sets ranging in size upward of one billion individual data elements. For sufficiently large data sets, an out-of-core implementation of the algorithm exacts only a minor performance penalty relative to a fully in-core solution.
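The blockwise formation of the crossproduct matrix can be illustrated with a short NumPy sketch. This is not the implementation described herein, only one plausible reading of the idea: it assumes the data are unfolded as a pixels × channels matrix, read in row blocks, and that the channel × channel crossproduct is accumulated block by block. The helper names `iter_row_blocks` and `blockwise_crossproduct` are hypothetical, and a small synthetic in-memory array stands in for data that would normally be read from disk.

```python
import numpy as np


def iter_row_blocks(data, block_rows):
    """Yield consecutive row blocks; a stand-in for reading blocks from disk."""
    for start in range(0, data.shape[0], block_rows):
        yield data[start:start + block_rows]


def blockwise_crossproduct(blocks):
    """Accumulate D^T D from row blocks of D (pixels x channels)."""
    total = None
    for block in blocks:
        block = np.asarray(block, dtype=np.float64)
        contribution = block.T @ block  # channels x channels, small
        total = contribution if total is None else total + contribution
    return total


# Synthetic example (assumed sizes): 1000 pixels, 16 channels.
rng = np.random.default_rng(0)
data = rng.random((1000, 16))
cross = blockwise_crossproduct(iter_row_blocks(data, block_rows=128))

# Eigenvectors of the crossproduct matrix give the spectral loadings.
eigvals, eigvecs = np.linalg.eigh(cross)
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
loadings = eigvecs[:, order[:3]]       # keep the first three components
```

Only one block and the small channels × channels accumulator are needed at any time. In floating point arithmetic the blockwise sum agrees with the in-memory product to within rounding error, which is the practical counterpart of the mathematical identity noted above.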
Multiresolution spatial processing, or spatial compression, can be used separately or in combination with the spectral compression algorithm and image analysis techniques to provide speed improvements without loss of spatial or spectral resolution. Spatial compression is described herein using an exemplary Haar wavelet transform. Multiresolution techniques have been used in traditional image processing and, in fact, serve as the basis of the recently introduced JPEG2000 image compression standard. Similar ideas can be applied to spatially compress high-spectral-dimension images. The basic idea is to reduce the number of pixels analyzed to a smaller number of coefficients that capture all of the spatial information of interest. Using orthogonal wavelets to perform this compression allows standard multivariate curve resolution techniques to be applied to the coefficients themselves rather than the raw data. This can lead to a tremendous reduction in both the computational resources and time required to complete the analysis. In addition to improved computational performance, the multiresolution approach can also yield improved sensitivity to minor constituents for data having low signal-to-noise ratios. This is particularly true if the multiresolution spatial filters are matched to the features of interest in the sample.
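The role of orthogonality can be seen in a small NumPy sketch of a single level of a two-dimensional Haar transform applied over the spatial axes of a spectrum image. It is an illustration under stated assumptions, not the procedure described herein: the cube is assumed to be stored as rows × columns × channels with even spatial dimensions, and the names `haar_level` and `unfold` are hypothetical.

```python
import numpy as np


def haar_level(cube):
    """One level of an orthonormal 2-D Haar transform over the spatial axes.

    cube has shape (rows, cols, channels) with even rows and cols.
    Returns the approximation band and a tuple of three detail bands,
    each of shape (rows // 2, cols // 2, channels).
    """
    s = np.sqrt(2.0)
    lo_r = (cube[0::2] + cube[1::2]) / s   # pairwise sums along rows
    hi_r = (cube[0::2] - cube[1::2]) / s   # pairwise differences along rows
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / s
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / s
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / s
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / s
    return ll, (lh, hl, hh)


def unfold(cube):
    """Reshape (rows, cols, channels) into a (pixels, channels) matrix."""
    return cube.reshape(-1, cube.shape[-1])


# Synthetic example (assumed sizes): 8 x 8 pixels, 5 channels.
rng = np.random.default_rng(1)
cube = rng.random((8, 8, 5))
approx, details = haar_level(cube)

# Stack all coefficients: same count as the original pixels.
coeffs = np.vstack([unfold(approx)] + [unfold(d) for d in details])

# Orthonormality preserves the spectral crossproduct matrix.
same = np.allclose(coeffs.T @ coeffs, unfold(cube).T @ unfold(cube))
print(approx.shape, same)
```

Because the transform is orthonormal, the full set of coefficients carries the same spectral crossproduct as the raw pixels, which is why factor analysis can work on coefficients directly. Retaining only the approximation band cuts the number of spatial elements by a factor of four per level; whether the discarded detail bands matter depends on the data.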