The collection of data from pluralities of two-dimensional sample data sets of the same data type, modality, submodality, etc., generates rich repositories of information. Such is the case with regard to the data obtained from mass spectroscopy, which is an analytical technique for the resolution of the chemical composition of a subject compound or molecular sample based upon the mass to charge (m/Z) ratio of the component particles. Briefly, a chemical or biological sample is fragmented into charged particles, or ions, by an ion source, and the resultant ions are passed through electric and magnetic fields where they are sorted by their respective mass to charge ratios. A detector then measures the value of an indicator quantity of the ions in the given fragmented sample, and this value is used to calculate the relative abundance of each ion fragment present in the given sample. The product of this chemical analysis is a mass spectrum having peaks (i.e., signals, points, loci, intersections, vertices) of data that can be presented as a graphical plot of m/Z (i.e., X-values in a two-dimensional coordinate plane system) against intensity or abundance values (i.e., Y-values in a two-dimensional coordinate plane system) of the component fragments or ions.
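By way of illustration only, the conversion of raw detector values into relative abundances can be sketched as follows. The sketch assumes the common convention of scaling the most intense peak (the base peak) to 100; the function name and the sample values are hypothetical and are not taken from any of the references discussed herein.

```python
def relative_abundances(raw_spectrum):
    """Scale raw detector values so that the base peak equals 100.

    raw_spectrum maps m/Z (X-value) to raw detector reading (Y-value).
    Assumption: abundances are expressed as a percentage of the base peak.
    """
    base_peak = max(raw_spectrum.values())
    return {mz: 100.0 * value / base_peak for mz, value in raw_spectrum.items()}


# Hypothetical raw readings for three ion fragments.
raw = {18.0: 250.0, 28.0: 1000.0, 44.0: 500.0}
spectrum = relative_abundances(raw)
```

Each resulting (m/Z, relative abundance) pair corresponds to one point of the two-dimensional plot described above.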
Historically, the amount of time and energy (in the form of both human and machine hours) required to sift through the volumes of mass spectroscopy information, decipher and extract the important or relevant peaks, normalize or align peaks from across multiple samples, compare said peaks in an effort to elucidate commonalities or differences between and among the samples, and eventually formulate conclusions about or hypotheses from said data was cost-prohibitive. However, there have been many advances in data pre-processing techniques that have made the former dilemmas much more manageable.
U.S. Pat. No. 6,147,344 by Annis, et al., teaches a method for peak identification in which detection errors are reduced through the elimination of, inter alia, background noise, system resolution inaccuracies, sample contamination, multiply charged ions, and isotope substitutions, all of which commonly plague mass spectroscopy data sets. The method as described therein generates two groups of output values resulting from the performance of the same operation on a control sample and a test sample. The first m/Z value for a material or compound that is expected to be present in the mixture (as obtained from a previously established library of output spectra) is selected, and the difference between the value of the control sample at this expected output value and the value of the test sample at the same output value is calculated. This difference is compared to a predetermined value, and a resultant difference that is greater than the predetermined value indicates that the peak, or signal, in question exists above the background noise level. This operation can be repeated multiple times in an effort to eliminate random noise and background contamination and can be further enhanced to delimit peaks resulting from proper retention time in accordance with the separation method used, those from multiply charged ions, and those related to atomic isotopic substitution.
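The control-versus-test comparison summarized above can be illustrated with a minimal sketch. It is not the implementation of the cited patent: the function name, the dictionary representation of the spectra, the sign convention (test value minus control value), and the sample values are all assumptions made for illustration.

```python
def peaks_above_background(control, test, expected_mz, predetermined_value):
    """Return the expected m/Z values at which the test sample exceeds the
    control sample by more than the predetermined value.

    control, test: dicts mapping m/Z to measured output value.
    Assumption: the difference is taken as test minus control, and a
    missing m/Z entry is treated as a value of zero.
    """
    detected = []
    for mz in expected_mz:
        difference = test.get(mz, 0.0) - control.get(mz, 0.0)
        if difference > predetermined_value:
            detected.append(mz)
    return detected


# Hypothetical output values at three expected m/Z positions.
control = {100.0: 5.0, 150.0: 4.0, 200.0: 6.0}
test = {100.0: 6.0, 150.0: 40.0, 200.0: 7.0}
hits = peaks_above_background(control, test, [100.0, 150.0, 200.0], 10.0)
```

In this example only the signal at m/Z 150.0 differs from the control by more than the predetermined value, so only that peak is reported as existing above the background noise level.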
U.S. Pat. No. 6,449,584 by Bertrand, et al., describes a method for peak extraction wherein intensity values of a measurement signal, which can be characterized by a series of peaks mixed with substantially regular background noise, are processed as a function of a discrete variable (e.g., time) in an effort to detect said peaks through noise attenuation. The method comprises the formation of an intensity histogram vector, which represents a frequency distribution from the intensity values of a measurement signal; the zeroing of a portion of the data corresponding to the intensity values below an intensity threshold value derived from shape characteristics of the distribution; and the subtraction of the intensity threshold value from the remaining portion(s) of the data to obtain processed data representing the measurement signal in which each peak exhibits an enhanced signal-to-noise ratio.
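The three steps summarized above (histogram formation, zeroing below a threshold, and subtraction of the threshold) can be sketched as follows. The cited patent derives the threshold from shape characteristics of the distribution; the sketch substitutes a simple stand-in, namely the upper edge of the most populated histogram bin, and that choice, the bin width, and all names are assumptions made for illustration.

```python
def attenuate_noise(intensities, bin_width=1.0):
    """Zero values below a histogram-derived threshold and subtract the
    threshold from the remaining values.

    Assumption: the threshold is the upper edge of the most populated
    histogram bin, a stand-in for a shape-derived threshold.
    """
    # Step 1: form the intensity histogram (bin index -> count).
    counts = {}
    for value in intensities:
        bin_index = int(value // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1

    # Most populated bin; ties are resolved toward the lower bin.
    mode_bin = max(counts, key=lambda b: (counts[b], -b))
    threshold = (mode_bin + 1) * bin_width

    # Steps 2 and 3: zero the sub-threshold data, subtract from the rest.
    processed = [v - threshold if v >= threshold else 0.0 for v in intensities]
    return threshold, processed


# Hypothetical signal: regular background near 1.x with two peaks.
signal = [1.2, 1.5, 1.1, 9.0, 1.4, 1.3, 6.5, 1.6]
threshold, processed = attenuate_noise(signal)
```

With these values the background falls in a single bin, so the threshold is 2.0 and only the two peaks survive, each reduced by the threshold.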
U.S. Pat. No. 7,087,896 by Becker, et al., teaches a method for spectra normalization to yield peak intensity values that accurately reflect concentrations of the responsible species. The method first calculates a normalization factor from peak intensities of those inherent components whose concentration remains constant across a series of samples. Relative concentrations of a component occurring in different samples can be estimated from the normalized peak intensities.
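A minimal sketch of normalization against constant-concentration components follows. The choice of the mean reference-peak intensity as the normalization factor, the function name, and the sample values are assumptions made for illustration and are not drawn from the cited patent.

```python
def normalize_spectrum(spectrum, reference_mz):
    """Divide every peak intensity by a factor computed from reference peaks.

    spectrum: dict mapping m/Z to peak intensity.
    reference_mz: m/Z values of components assumed to have constant
    concentration across samples.
    Assumption: the normalization factor is the mean reference intensity.
    """
    reference = [spectrum[mz] for mz in reference_mz]
    factor = sum(reference) / len(reference)
    return {mz: intensity / factor for mz, intensity in spectrum.items()}


# Hypothetical samples; sample_b was acquired at twice the overall signal.
sample_a = {100.0: 50.0, 200.0: 150.0, 300.0: 80.0}
sample_b = {100.0: 100.0, 200.0: 300.0, 300.0: 40.0}
norm_a = normalize_spectrum(sample_a, [100.0, 200.0])
norm_b = normalize_spectrum(sample_b, [100.0, 200.0])
```

After normalization the component at m/Z 300.0 has an intensity of 0.8 in the first sample and 0.2 in the second, suggesting a fourfold lower relative concentration in the second sample even though its raw intensity was only halved.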
U.S. Pat. No. 6,642,059 by Chait, et al., presents a method for accurately comparing the levels of components present in different samples that comprises culturing a first sample in a first medium and a second sample of the same matter in a second medium, wherein at least one isotope in the second medium has a different abundance than the abundance of the same isotope in the first medium; modulating one sample by treatment with a bacterium, virus, etc.; combining said samples and removing at least one component; subjecting the removed component to mass spectroscopy to yield a mass spectrum; and computing a ratio between the peak intensities of at least one closely spaced pair of peaks to determine the relative abundance of the component in each sample.
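The final ratio computation can be sketched as follows. The sketch assumes singly charged ions, a known isotopic mass shift between the two members of a pair, and a matching tolerance; these parameters, the function name, and the sample values are hypothetical and are not specified by the cited patent.

```python
def paired_ratios(peaks, mass_shift, tolerance=0.01):
    """Find pairs of peaks separated by mass_shift and compute the ratio of
    the lighter peak's intensity to the heavier peak's intensity.

    peaks: list of (m/Z, intensity) tuples.
    Assumptions: singly charged ions, so the m/Z spacing equals the mass
    shift; mass_shift and tolerance are supplied by the analyst.
    """
    results = []
    for mz_light, intensity_light in peaks:
        for mz_heavy, intensity_heavy in peaks:
            spacing = mz_heavy - mz_light
            if abs(spacing - mass_shift) <= tolerance and intensity_heavy > 0:
                results.append((mz_light, mz_heavy, intensity_light / intensity_heavy))
    return results


# Hypothetical spectrum of the removed component.
peaks = [(500.0, 120.0), (501.0, 60.0), (620.0, 30.0)]
ratios = paired_ratios(peaks, mass_shift=1.0)
```

Here the peaks at m/Z 500.0 and 501.0 form the only pair, and the ratio of 2.0 would indicate that the component is twice as abundant in the sample cultured in the first medium.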
U.S. Pat. No. 6,925,389 by Hitt, et al., teaches a method for peak classification that uses pattern discovery methods and algorithms to detect subtle patterns in the expression of certain molecules in potentially diagnostic, biological samples. The pattern, which is made up of an optimal set of features (i.e., peaks in mass spectroscopy data), can be defined as a vector of three or more values, obtained from a subset of the data stream or from the total data stream, whose position in an N-dimensional space is discriminatory. This method couples a genetic algorithm directly to an adaptive pattern recognition algorithm to derive the optimal feature set characterizing a given biological state or data stream. First, a vector that is characteristic of the given data stream is calculated; it is then determined within which, if any, of the previously determined data clusters the vector rests.
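The final step, determining within which known cluster a feature vector rests, can be sketched as follows. The sketch does not implement the genetic algorithm or the adaptive pattern recognition algorithm of the cited patent; it assumes that the features have already been selected and that each cluster is a hypersphere described by a centroid and a radius, which is an illustrative simplification. All names and values are hypothetical.

```python
import math


def find_cluster(vector, clusters):
    """Return the name of the first cluster containing the vector, or None.

    clusters: dict mapping a cluster name to (centroid, radius).
    Assumption: cluster membership means that the Euclidean distance from
    the vector to the centroid does not exceed the radius.
    """
    for name, (centroid, radius) in clusters.items():
        if math.dist(vector, centroid) <= radius:
            return name
    return None


# Hypothetical clusters in a three-dimensional feature space.
clusters = {
    "state_a": ((1.0, 1.0, 1.0), 0.5),
    "state_b": ((4.0, 4.0, 4.0), 0.5),
}
match = find_cluster((1.1, 0.9, 1.2), clusters)
no_match = find_cluster((2.5, 2.5, 2.5), clusters)
```

The first vector lies within the radius of the "state_a" centroid and is classified accordingly, whereas the second vector rests in no known cluster and yields None.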
While each of the aforementioned works demonstrates clear advances in peak identification, extraction, normalization, and classification within multi-sample, two-dimensional data, the latter dilemmas of illuminating patterns between and among the pluralities of sample data sets and subsequently deriving accurate conclusions as to what these patterns may indicate are not so thoroughly managed or resolved.