Proteomics is one of the post-genomic sciences with the greatest impact on modern biotechnology, since it comprises the identification and quantification of large numbers of proteins in extremely complex matrices (biological fluids, tissues or cell cultures, among others). Currently, the most successful and academically and industrially relevant techniques used in proteomics are those based on tandem mass spectrometry (MS/MS). These consist of extracting the proteins from the sample to be tested; digesting these proteins with enzymes or other chemicals to obtain peptides, which are easier to analyze; separating these peptides, usually by chromatographic techniques; introducing them into a mass spectrometer in ionized form to measure their mass; and fragmenting them within the mass spectrometer to obtain structural information that permits the identification of the proteins from which the analyzed peptides derive.
Current research in proteomics based on tandem mass spectrometry generates large volumes of data, typically containing thousands to millions of mass spectra. These spectra are assigned to peptide sequences recorded in databases using software known as search engines. Over the historical development of MS/MS-based proteomics, the high number of spectra involved in each analysis made manual validation of spectrum-peptide assignments impracticable within a reasonable time, so it became necessary to develop automatic procedures, requiring no user intervention, to identify the analyzed peptides and discard spurious matches (known as false positives or false detections). These procedures include algorithms based on statistical scoring systems that score each spectrum analyzed in a sample, such that the higher the score, the greater the probability that the spectrum-peptide assignment is correct.
Currently, the differences among the various search engines on the market derive from the pre-processing and standardization of the MS/MS spectra analyzed, as well as from the different statistical models and numerical methods used in each engine's scoring system. These differences pose a major problem when analyzing MS/MS spectra with multiple search engines, since some peptide sequences identified correctly by one engine may not be identified by others, a fact well known to experienced mass spectrometrists. The present invention comprises a method of combined searching using multiple engines (hereinafter defined as meta-search) aimed at solving this problem, together with optimization techniques for analyzing the spectra obtained by MS/MS. The method also provides a general criterion score (which we define as a meta-score) for the results obtained by the different database engines, using sufficiently robust statistical modeling to lead to a unique spectrum-peptide assignment.
Despite the potential benefits of a meta-search method using multiple engines, few attempts have been made in this direction so far. Among the most relevant are the works developed by Rohrbough et al. [1], Higgs et al. [2], Searle et al. [3] and Alves et al. [4]. On the other hand, within the state of the art in proteomics research, commercial products offering comparative search options across several engines (a concept that differs from meta-search) are more abundant. Examples of such software applications found on the market include the "InChorus" option of the PEAKS search engine (distributed by Bioinformatics Solutions Inc.), the Rosetta Elucidator data analysis system (distributed by Rosetta Biosoftware), the Proteome Discoverer analysis platform (distributed by Thermo Fisher Scientific Inc.) and the Phenyx engine (distributed by Geneva Bioinformatics SA).
Another application in this technical field is the embodiment of search methods in analytical devices for peptides and proteins that combine hardware and software and are marketed as stand-alone "plug-and-play" workstations or as servers that can be used simultaneously by multiple users. Examples of such devices are the Sorcerer 2 workstation, sold by Sage-N Research, Inc., and the configurable server distributed jointly by IBM and Thermo Electron Corporation. To date, these devices do not integrate the simultaneous use of several engines through a meta-search method.
While the present invention shares some approaches and objectives with each of the aforementioned techniques, it is the only one of these methods that presents the following set of advantages:
The meta-search method and meta-scoring system add information that cannot be obtained by searching with a single engine.
It uses robust statistical modeling that allows the selection of a unique combination of peptide sequence, electric charge and chemical composition per spectrum (as opposed to the methods used by PEAKS, Rosetta Elucidator, Proteome Discoverer and Phenyx, which only use the results of multiple engines for comparative purposes, without the possibility of applying a common statistical model and a common meta-scoring system).
This method can be fully generalized to any number of search engines (as opposed to the methods proposed in References [1] and [2], whose generalization to more than two engines is not feasible).
It uses a standard method, applicable to the results of any search engine, to obtain the statistical distribution functions, unlike the method described in Reference [3] and its commercial embodiment in the Scaffold application (distributed by Proteome Software Inc.), whose extension beyond the three engines studied would require deriving a satisfactory distribution for each new search engine used.
It integrates, in its formulation, the use of matching parameters, defined as the number of other search engines that have supplied the same peptide candidate as a given engine. The use of matching parameters is not covered by the method contemplated in Reference [4], which, by omitting them, discards a valuable part of the information, one that contributes significantly to increasing the number of identified peptides.
It automatically optimizes the values of all parameters involved in the process through statistical modeling, without the need to define any type of filter, arbitrary scoring mechanism or preset values for the latter's coefficients, unlike the methods based on multiple arbitrary filters or predefined scoring mechanisms described in References [4] and [5].
As for protein detection, it uses a rigorous, unbiased statistical method that applies a filter defined by the error rates of the peptide-sequence assignments.
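As a minimal illustration of filtering by assignment error rates (a sketch of the standard target/decoy estimate, not of the claimed statistical model itself; the function name, data layout and threshold are assumptions for illustration), the false discovery rate among peptide-spectrum matches above a score threshold can be estimated from the decoy hits described later, since with a 1:1 target/decoy database the number of decoy matches approximates the number of false target matches:

```python
def target_decoy_fdr(psms, threshold):
    """Estimate the false discovery rate (FDR) among peptide-spectrum
    matches (PSMs) scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs; with a 1:1
    target:decoy database, decoy hits estimate false target hits.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets

# Illustrative data: 8 target and 1 decoy hit above a threshold of 40
psms = ([(42.0, False)] * 8 + [(41.0, True)]
        + [(10.0, False)] * 5 + [(9.0, True)] * 5)
print(target_decoy_fdr(psms, 40.0))  # 0.125
```

Lowering the threshold admits more decoy matches and therefore raises the estimated error rate, which is what the filtering criterion controls.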
Additionally, the claimed method is flexible enough to incorporate, beyond inter-engine consistency, other sources of additional information, such as: filtering through the mass error of the precursor ion (defined as the difference between the theoretical mass of a peptide ion and the mass measured by the spectrometer, using either its molecular mass or its mass/charge ratio, m/z); the error in the retention time (the characteristic retention time during chromatographic separation); the prediction error of the isoelectric point (similar to the previous factor, when the peptides are fractionated using isoelectric focusing separation techniques); ionic mobility (in mass spectrometers incorporating such analysis, based on the accumulation of ionized chemical species under the action of an electric field); the specificity of the enzymatic digestion used (i.e., the characteristics of protein segmentation depending on the type of enzymes used for digestion); the detection of multiple isotopic patterns for the same peptide (common in stable isotopic labelling experiments used in quantitative proteomics applications); or the consistency with the sequence information obtained from the MS/MS spectrum without using a search engine (known as de novo sequencing). This flexibility makes it possible for the meta-search method to integrate data obtained using different sample preparations, different methods of protein digestion and different mechanisms of ion fragmentation, which makes it a suitable tool for large-scale identification of proteins.
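The first of these additional criteria, the precursor mass error, can be sketched as follows. This is an illustrative computation under common conventions (monoisotopic masses, protonation as the charging mechanism); the function name and the example values are assumptions, not part of the claimed method:

```python
PROTON_MASS = 1.007276466  # mass of a proton in daltons (Da)

def precursor_mass_error_ppm(theoretical_mass, observed_mz, charge):
    """Relative precursor mass error in parts per million (ppm):
    the difference between the theoretical mass of the peptide ion
    and the neutral mass inferred from the measured m/z and charge."""
    observed_mass = observed_mz * charge - charge * PROTON_MASS
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

# Example: a peptide of theoretical mass 1500.60 Da observed as a
# doubly charged ion at m/z 751.3100
err = precursor_mass_error_ppm(1500.60, 751.3100, 2)
```

Assignments whose absolute error exceeds the instrument's expected mass accuracy can then be penalized or discarded as part of the filtering described above.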
The present invention is based on a meta-search method using the results of spectrum-peptide assignment obtained in different search engines on hybrid target/decoy databases containing a 1:1 ratio of real proteins to false proteins. These false proteins are usually obtained by reversing the sequence of each of the real proteins. As a preliminary step to the assignment of meta-results, the results of each of the engines studied are analyzed separately using the technique developed by Ramos-Fernández et al. [6] (developed for use with a single search engine), which is based on generalized lambda distributions (GLDs). Said GLDs are extremely flexible four-parameter functions that can represent with great precision most of the important families of continuous probability distributions used in the statistical modeling of histograms. The GLD model (described, for example, in the work of Karian et al. [7]) has not previously been used to perform combined searches on multiple sequence database engines, and provides the theoretical framework of the statistical model on which the meta-search and meta-scoring method claimed here operates. Unlike the model of Reference [7], the invention claimed here is presented as a method that can be implemented automatically, providing objective criteria for selecting the GLD that best fits the observed results without the need to personally supervise each of the candidate models.
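Two of the elements just described can be sketched briefly: the construction of reversed-sequence decoys for the 1:1 target/decoy database, and the four-parameter GLD quantile function. The code below assumes the Ramberg-Schmeiser parameterization of the GLD, which is one common choice; the source does not specify which parameterization is used, so this is an illustrative assumption:

```python
def reversed_decoy(protein_sequence):
    """Build a 'false' (decoy) protein by reversing the real protein
    sequence, yielding the 1:1 target/decoy database described above."""
    return protein_sequence[::-1]

def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the generalized lambda distribution (GLD)
    in the Ramberg-Schmeiser parameterization:

        Q(u) = lam1 + (u**lam3 - (1 - u)**lam4) / lam2,   0 < u < 1.

    Its four parameters (location, scale, and two shape parameters)
    make it flexible enough to mimic many common continuous
    distributions, which is why it can model the score histograms
    produced by the individual search engines."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2

print(reversed_decoy("MKWVTFISLL"))  # LLSIFTVWKM
```

With symmetric shape parameters (lam3 == lam4) the distribution is symmetric about lam1, so for instance `gld_quantile(0.5, 0.0, 1.0, 1.0, 1.0)` is 0.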