The present invention generally relates to methods of matching peaks in datasets from a plurality of liquid chromatography-mass spectroscopy runs, and to apparatuses for the same.
Liquid chromatography-mass spectroscopy (LC-MS) is an analytical chemistry technique that combines the physical separation capabilities of liquid chromatography with the mass analysis capabilities of mass spectrometry. Optionally, LC-MS may employ tandem mass spectroscopy (MS/MS), in which multiple mass spectroscopy steps are performed with at least one fragmentation step between them. Liquid chromatography in combination with tandem mass spectroscopy is typically referred to as liquid chromatography-tandem mass spectroscopy (LC-MS/MS), and is a subset of LC-MS.
Data from a liquid chromatography-mass spectroscopy run is typically generated as “features” in a multi-dimensional space including a mass-to-charge ratio of a detected material as one axis and a retention time of the detected material as another axis. The retention time is the time it takes for a material to travel through a capillary column that leads into a vacuum environment in which the material is ionized for detection by a mass spectrometer. The mass-to-charge ratio is the ratio of the mass of the material to the electrical charge of the material as detected by a mass spectrometer after the material is ionized in a vacuum environment. In its simplest form, a feature is a peak in the LC-MS chromatogram, but a feature may also be a monoisotopic mass deduced from an isotope series, with a corresponding retention time and, optionally, an intensity.
Multiple LC-MS runs result in multiple datasets, in which each dataset includes a list of peaks from one LC-MS run. The list of peaks is represented in the multi-dimensional space of a mass-to-charge ratio, a retention time, and optionally, an intensity of the peak. It is a challenge to compare proteomics data from different LC-MS experiments because not all the peaks coincide with corresponding peaks from other runs in the multi-dimensional space of the mass-to-charge ratio and the retention time.
There is an increasing need for computational methods to compare protein expression measured by LC-MS or LC-MS/MS proteomics experiments. Public domain proteomic databases such as the Open Proteomic Database and PeptideAtlas have accumulated thousands of LC runs from various laboratories, and the numbers continue to increase. Comparisons of multiple proteomic experiments based on identified proteins and peptides are feasible, but limited, because most LC-MS or LC-MS/MS peaks are unidentified and therefore overlooked. In addition, many peaks in LC-MS/MS are unidentified because peptide identification by MS/MS ion search is still a low-percentage sampling process with imperfect reproducibility.
Without sequence information, the common practice is to match peptides between different runs based solely on similarity in mass and normalized retention time. However, this method is prone to mismatches because different peptides may share similar masses and normalized retention times by chance. Thus, peptide matching based on mass and retention time similarity should be accompanied by error rate estimation, especially for complex protein mixtures.
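By way of illustration only, matching on mass and normalized retention time similarity may be sketched as follows. This is a minimal sketch rather than any particular tool's implementation: the `Peak` type, the `match_peaks` function, the 10 ppm mass tolerance, and the 0.02 retention time tolerance are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mass: float  # monoisotopic mass in Da
    rt: float    # normalized retention time, assumed scaled to [0, 1]


def match_peaks(run_a, run_b, mass_tol_ppm=10.0, rt_tol=0.02):
    """Return (i, j) index pairs where run_a[i] and run_b[j] agree
    within both the mass tolerance and the retention time tolerance.

    The tolerances are illustrative assumptions, not recommended values.
    """
    matches = []
    for i, a in enumerate(run_a):
        for j, b in enumerate(run_b):
            mass_ok = abs(a.mass - b.mass) <= a.mass * mass_tol_ppm * 1e-6
            rt_ok = abs(a.rt - b.rt) <= rt_tol
            if mass_ok and rt_ok:
                matches.append((i, j))
    return matches


# Two small hypothetical runs.
run_a = [Peak(1000.500, 0.30), Peak(1500.750, 0.55)]
run_b = [Peak(1000.505, 0.31), Peak(1500.750, 0.90), Peak(2000.000, 0.10)]
matches = match_peaks(run_a, run_b)
```

In this example only the first peak of each run is paired. The second peak of `run_a` has the same mass as a peak in `run_b` but a very different retention time, so it is rejected. Nothing in such a procedure indicates whether an accepted pair is in fact the same peptide, which is why an error rate estimate is needed.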
The error rate in matching is largely overlooked in the literature. Some of the few references that consider the error rate in matching include Jaffe, J. D. et al., “PEPPeR, a platform for experimental proteomic pattern recognition,” Mol. Cell. Proteomics 5, 1927-1941 (2006), Monroe, M. E. et al., “VIPER: an advanced software package to support high-throughput LC-MS peptide identification,” Bioinformatics 23, 2021-2023 (2007), and Anderson, K. K., Monroe, M. E. & Daly, D. S., “Estimating probabilities of peptide database identifications to LC-FTICR-MS observations,” Proteome Sci 4, 1 (2006). The PEPPeR pipeline estimates the mismatching rate by bootstrapping, while VIPER estimates the probability of correct matching by Expectation Maximization (EM). VIPER uses Accurate Mass and Time Tag (AMT) peptide identification, which matches mass and retention time pairs to a database of identified peptides. VIPER estimates the mismatching rate by searching against the database of identified peptides where every mass is shifted by a constant amount, such as 7 Dalton (Da). However, for both PEPPeR and VIPER, the accuracy of the estimated mismatching rates is unclear, and both require some peptides to be identified. More importantly, both are limited to comparison among similar proteomic experiments.
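The shifted-mass search described above may be illustrated by the following minimal sketch. It is not VIPER's implementation. The function names, the absolute mass tolerance of 0.01 Da, the retention time tolerance of 0.02, and the use of the ratio of shifted-database hits to unshifted hits as the estimate are all assumptions made for the example.

```python
def count_matches(peaks, database, mass_shift=0.0, mass_tol=0.01, rt_tol=0.02):
    """Count peaks that match at least one database entry after every
    database mass has been shifted by mass_shift (in Da).

    peaks and database are lists of (mass, normalized_rt) tuples.
    """
    count = 0
    for mass, rt in peaks:
        for db_mass, db_rt in database:
            mass_ok = abs(mass - (db_mass + mass_shift)) <= mass_tol
            rt_ok = abs(rt - db_rt) <= rt_tol
            if mass_ok and rt_ok:
                count += 1
                break  # count each peak at most once
    return count


def estimate_mismatch_rate(peaks, database, mass_shift=7.0):
    """Assumed estimator: hits against the shifted database can only be
    chance matches, so their number relative to the unshifted hits is
    taken as the mismatch rate."""
    target_hits = count_matches(peaks, database)
    decoy_hits = count_matches(peaks, database, mass_shift=mass_shift)
    return decoy_hits / target_hits if target_hits else 0.0


# Hypothetical data: four observed peaks and a five-entry database.
peaks = [(1000.5, 0.30), (1200.6, 0.45), (1500.75, 0.60), (1800.9, 0.75)]
database = peaks + [(1193.6, 0.45)]
rate = estimate_mismatch_rate(peaks, database)
```

Here all four peaks match the unshifted database, and one peak also matches a shifted entry purely by coincidence, giving an estimated mismatch rate of one in four. Such an estimate depends on the availability of a database of identified peptides, consistent with the limitation noted above.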