Principal component analysis (PCA) is a well-known multivariate statistical technique for reducing the number of correlated variables to a smaller number of independent variables, known as principal components. PCA transforms the original set of variables into a smaller set of principal components that account for most of the variance of the original data set, thereby reducing the dimensionality of the data. The components are rank ordered in terms of the variability they represent with respect to the original variables. PCA has traditionally been used with a group of closely related data as a training set to generate a principal component defined model of the correlated variables, which is in turn used to predict membership of an unknown entity based on its relationship to the PCA-based model. The independent principal components are used in place of the original dependent variables for plotting, regression, clustering, and the like.
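By way of illustration only, the transformation described above can be sketched in Python. This sketch is not drawn from any of the references discussed here. The function names, the sample data, and the choice of power iteration with deflation to obtain the eigenvectors of the covariance matrix are all assumptions made for the example; a numerical library would ordinarily be used instead.

```python
import math


def center(rows):
    """Subtract the mean of each variable (column) from every observation."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered observations."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration.

    Assumption: the uniform starting vector is not orthogonal to the
    leading eigenvector, which holds for the sample data used here.
    """
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


def pca(rows, n_components):
    """Return (components, explained variance ratios, scores)."""
    centered = center(rows)
    cov = covariance(centered)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, explained = [], []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        explained.append(eigenvalue / total_variance)
        # Deflation: remove the variance already captured by this component,
        # so the next pass finds the next-ranked component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    # Scores: each centered observation projected onto each component.
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, explained, scores


if __name__ == "__main__":
    # Hypothetical data: three strongly correlated variables, six observations.
    data = [
        [1.0, 2.1, -0.9],
        [2.0, 3.9, -2.1],
        [3.0, 6.2, -2.9],
        [4.0, 7.8, -4.2],
        [5.0, 10.1, -4.8],
        [6.0, 12.0, -6.1],
    ]
    components, explained, scores = pca(data, n_components=2)
    print("explained variance ratios:", explained)
    print("scores of first observation:", scores[0])
```

In this sketch the components are returned in rank order of the variance they account for, and the scores are what would be used in place of the original variables for plotting, regression, or clustering.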
Nuclear magnetic resonance (NMR) is a phenomenon based on the magnetic properties of nuclei such as hydrogen-1, carbon-13, and phosphorus-31. When these nuclei are placed in a static magnetic field and subjected to electromagnetic radiation, they absorb the radiation's energy at certain frequencies characteristic of each nucleus. Pulsed NMR is a well-known technique that uses a burst or pulse of energy to excite the nuclei of a target atom in an essentially static magnetic field. After the application of the pulse of radio frequency (RF) radiation, all of the excited nuclei re-emit RF radiation at their respective resonance frequencies. The emission over time, known as free induction decay (FID), is measured, and the frequencies are extracted from the FID by a Fourier transform of the time-based data.
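The extraction of frequencies from an FID by Fourier transform can likewise be sketched in Python. The example below is an illustration under stated assumptions, not a description of any particular spectrometer: the FID is simulated as a sum of exponentially decaying complex sinusoids, the resonance frequencies, amplitudes, decay constant, and sampling interval are hypothetical values, and a direct discrete Fourier transform is used for clarity where a fast Fourier transform routine would normally be preferred.

```python
import cmath
import math


def simulate_fid(freqs_hz, amplitudes, t2, dt, n_points):
    """Simulated FID: decaying complex sinusoids sampled every dt seconds.

    t2 is an assumed common decay constant (seconds) for all resonances.
    """
    fid = []
    for k in range(n_points):
        t = k * dt
        decay = math.exp(-t / t2)
        total = sum(a * cmath.exp(2j * math.pi * f * t)
                    for f, a in zip(freqs_hz, amplitudes))
        fid.append(total * decay)
    return fid


def dft(signal):
    """Direct discrete Fourier transform (O(n^2); adequate for small n)."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * math.pi * m * k / n)
                for k in range(n))
            for m in range(n)]


def peak_frequencies(fid, dt, n_peaks):
    """Return the frequencies (Hz) of the n_peaks strongest spectral peaks."""
    n = len(fid)
    magnitudes = [abs(x) for x in dft(fid)]
    # A peak is a bin whose magnitude exceeds both neighbours (with wraparound).
    peaks = [m for m in range(n)
             if magnitudes[m] > magnitudes[m - 1]
             and magnitudes[m] > magnitudes[(m + 1) % n]]
    peaks.sort(key=lambda m: magnitudes[m], reverse=True)
    # Bins at or above n/2 correspond to negative frequencies.
    return sorted((m if m < n // 2 else m - n) / (n * dt)
                  for m in peaks[:n_peaks])


if __name__ == "__main__":
    # Hypothetical resonances at 100 Hz and 250 Hz relative to the carrier.
    fid = simulate_fid([100.0, 250.0], [1.0, 0.5], t2=0.1, dt=0.001, n_points=256)
    print("recovered frequencies (Hz):", peak_frequencies(fid, dt=0.001, n_peaks=2))
```

The recovered values are limited by the frequency resolution 1/(n_points × dt), so a resonance that falls between two bins is reported at the nearer bin rather than at its exact frequency.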
NMR has been widely used for molecular structure determination. Because the resonance frequency of each NMR-active nucleus is typically determined by its surrounding environment in the molecular structure, structural information of a molecule can be determined by correlating NMR spectral features of the NMR-active nuclei in the molecule. See, for example, R. M. Silverstein and F. X. Webster, “Spectrometric Identification of Organic Compounds,” John Wiley & Sons, Inc. (sixth edition), 1998.
PCA techniques have been used to analyze NMR data obtained from mixtures of substances in order to compare an unknown mixture to a standardized mixture. Such techniques have been used to assure the standardization of juices, oils, and plant material. As an example, International Patent Application WO 00/47992, assigned to Oxford Natural Products PLC, discloses the use of NMR spectroscopy combined with computer-based pattern recognition statistical procedures to analyze mixtures of medicinal plant material for consistency in content and bioactivity with a reference mixture. The spectrum of a known standard sample of the material (possessing the desired property) is compared with the spectrum of an unknown sample to determine the similarity of the two materials.
U.S. Pat. No. 5,446,681 ('681), assigned to Exxon Research and Engineering Company, describes a method of estimating physical property and/or composition data of a mixture via on-line spectral measurement using a computer-controlled spectrometer, followed by statistical analysis of the resultant data compared with a statistical model using sample calibration data. This comparison permits automatically classifying a sample based upon statistical and rule-based criteria.
These methods rely on spectral data derived from samples having known compositions, which are then compared to those of an unknown composition in order to estimate an identity or property of the unknown composition. See also:
P. S. Belton, I. J. Colquhoun, E. K. Kemsley, I. Delgadillo, P. Roma, M. J. Dennis, M. Sharman, E. Holmes, J. K. Nicholson and M. Spraul, “Application of chemometrics to the 1H NMR spectra of apple juices: discrimination between apple varieties,” Food Chemistry, 1998, 61, 207-213 (PCA and linear discriminant analysis to predict membership amongst apple varieties); and
E. Holmes, A. W. Nicholls, J. C. Lindon, S. C. Connor, J. C. Connelly, J. N. Haselden, S. J. P. Damment, M. Spraul, P. Neidig and J. K. Nicholson, “Chemometric Models for Toxicity Classification Based on NMR Spectra of Biofluids,” Chem. Res. Toxicol., 2000, 13, 471-478 (1H-NMR spectroscopic and pattern recognition-based methods, including PCA, were used to compare rat urine samples).
E. Holmes, J. K. Nicholson, A. W. Nicholls, J. C. Lindon, S. C. Connor, S. Polley, and J. Connelly, in “The identification of novel biomarkers of renal toxicity using automatic data reduction techniques and PCA of proton NMR spectra of urine,” Chemometrics and Intelligent Laboratory Systems, 1998, 44, 245-255, describe a technique that applies PCA to 1H-NMR spectra to predict drug toxicity. The method analyzes urine samples by comparing their NMR data to that of reference urine samples having standardized toxicity spectra. The presence or absence of key regions, or markers, of region-specific toxicity is determined by comparing test urine samples with the standards to assess whether a potential drug may be toxic.
See also M. Spraul, M. Hofmann, M. Ackermann, A. W. Nicholls, S. J. P. Damment, J. M. Haselden, J. P. Shockcor, J. K. Nicholson, and J. C. Lindon, “Flow Injection Proton Nuclear Magnetic Resonance Spectroscopy Combined With Pattern Recognition Methods: Implications for Rapid Structural Studies and High Throughput Biochemical Screening,” Analytical Communications, November 1997, 34, 339-341 (high-throughput analysis of urine samples to identify drug toxicity).
PCA techniques have been used in analyses of NMR data relating to wood processing. One technique examines aliphatic and phenolic hydroxyl groups in the lignin of wood liquors to confirm the cleavage of β-aryl ethers in native lignin during kraft pulping. NMR data from both carbon-13 and phosphorus-31, along with additional data, are used to predict the overall effects of kraft pulping using multivariate techniques including PCA. This technique, which does not analyze molecular structure, is described by P. Malkavaara, R. Alen, and E. Kolehmainen in “Chemometrics: An Important Tool for the Modern Chemist, an Example from Wood-Processing Chemistry,” J. Chem. Inf. Comput. Sci. 2000, 40, 438-441.
PCA techniques are also used to calibrate NMR spectrometers in order to assure consistency across trials. U.S. Pat. No. 5,420,508 ('508), assigned to Auburn International, Inc., describes a pulsed NMR analysis system and process comprising an on-line system that extracts a sample and establishes digitized FID curves. Curve component functions are determined from these curves using regression techniques including PCA, in order to correlate the curve components to the target nuclei, crystalline or amorphous, and to analyze other material characteristics, such as flow rates in plastic. This technique, while assuring proper calibration of the pulsed NMR analysis system, does not examine chemical structure.
U.S. Pat. No. 5,121,337, assigned to Exxon Research and Engineering Company, describes both calibration and correction of spectral data and the analysis of an unknown sample using statistical techniques including Principal Component Regression (PCA followed by regression analysis). Data correction deals with baseline variations or ex-sample chemical contamination. The analysis method predicts mixture properties such as component concentrations, API gravity, cetane number for petroleum mid-distillates, and hydrogen content of mid-distillates. It also covers calibration of the apparatus with reference to mixture spectra and component estimation of an unknown composition, which is compared to a known standard mixture.
Another calibration technique is disclosed in U.S. Pat. No. 5,610,836, assigned to Eastman Chemical Company, which utilizes PCA in connection with spectrum analysis to compensate for sample volume discrepancies or other interferences that prevent correct quantitative analysis of samples.
NMR methods coupled with statistical analysis have been used to reveal the protein counterpart of a pharmacophore. U.S. Pat. No. 6,027,941 ('941) assigned to CuraGen Corporation discloses a method for obtaining distance measurements of known proteins/chemical compounds using solid-state NMR data subjected to statistical analysis methods to provide information for the elucidation of structures of pharmaceutical lead compounds, drug molecules, or their targets. This technique requires labeling of the known proteins/chemical compounds tested in order to produce a highly accurate three dimensional analysis thereof, but does not provide an automated method to identify whether or not a chemical compound is a potential pharmaceutical lead compound.
Analytical methods have used PCA coupled with other techniques in order to generate information pertaining to the structure of organic compounds. C. Ebert, T. Gianferrara, P. Linda and P. Masotti, in “Multivariate Investigation of 1H and 13C NMR Shifts of 2- and 3-Substituted Furans, Thiophenes, Selenophenes and Tellurophenes,” Magnetic Resonance in Chemistry, 1990, 28, 397-407, indicate that PCA alone is appropriate only for classification problems and not for prediction of chemical shifts (or identification of chemical structure). In that reference, PCA coupled with a partial least squares (PLS) analysis was used to predict the chemical shift values of different ring structures having the same substituents. The PCA was used to demonstrate possible groupings of objects, and the PLS analysis was used to predict chemical shift values within the groupings.
As should be noted, none of the above techniques are designed to readily evaluate structural and functional similarity or diversity by identifying a substructure of an unknown compound; classifying membership of a compound in a family of compounds; analyzing a compound with respect to a computer generated model of a pharmacophore; or quantifying diversity or similarity within a set of compounds.