An unknown biological molecule can be identified by comparing the mass data of the unknown biological molecule with mass data of known biological molecules.
For example, the rapid growth of available high quality DNA sequence data has made mass spectrometry (MS) combined with genome database searching a popular and potentially accurate method to identify proteins. Protein identification by mass spectrometry has proven to be a powerful tool to elucidate biological function and to find the composition of protein complexes and entire organelles.
In protein identification experiments, proteins are typically separated by gel electrophoresis, subjected to a protease having high digestion specificity (e.g. trypsin) and the resulting mixture of peptides is extracted from the gel and subjected to MS-analysis. The distribution of proteolytic peptide masses (peptide map) is compared with theoretical proteolytical peptide masses calculated for each protein stored in a protein/DNA sequence database.
There are various algorithms that attempt to identify the protein with the highest degree of similarity to the experimentally obtained peptide map. These algorithms yield the protein identified and an identification score. Due to imperfections in the protein separation and to incomplete extraction of the proteolytic peptides from the gel, the peptide map is typically incomplete with respect to the protein identified, and also contains a background of proteolytic peptide masses from one or several other proteins. Even if separation and extraction were perfect, posttranslational modifications of proteins would cause a proteolytic peptide mass distribution different from that predicted by the genome. Mass spectrometry determines a peptide mass mi to an accuracy xc2x1xcex94mi, with xcex94mi/mi typically  greater than 30 ppm. Within the mass range mixc2x1xcex94mi proteolytic peptide masses of several proteins in the genome can match. For these reasons, a database search using the information in a peptide map will not always identify a protein unambiguously.
Methods for evaluating the quality of a protein identification result have recently been provided. However, such methods may be computationally intensive, may not always be readily integrated with search programs and may need to set different standards for different databases. As increasingly complex biological problems are explored, simplified methods to evaluate the quality of a protein identification result are critical.
The object of the present invention is to provide a method for evaluating the quality of a biological molecule identification which is substantially less computationally intensive than prior methods. In one embodiment the present invention provides an evaluation of the quality of a protein identification score in a fraction of a second. Additionally, the present invention provides a criterion which indicates the quality of a particular protein identification result that will be the same level of significance regardless of the size of the database.
This and other objects, as will be apparent to those having ordinary skill in the art, have been met by providing a method for determining the probability that a biological molecule identification is incorrect for a chosen significance level and for a particular experimental condition, the method comprising: a)generating theoretical mass data for biological molecules; b) generating an experimental mass data for an unknown biological molecule; c) comparing the experimental mass data generated in step (b) with each theoretical mass data generated in step (a); d) calculating a score for each comparison in step (c), wherein the score is a function of the similarity between each of the data generated in step (a) and the data generated in step (b); e) selecting at least two scores from the scores in step (d) to form a primary data set, wherein the scores correspond to a comparison that denotes a degree of similarity between each of the data generated in step (a) and the data generated in step (b); f) generating a sufficient quantity of artificial data sets from the primary data set in step (e); g) calculating a sample mean for each artificial data set in step (f); h) estimating population mean and population standard deviation from the sample means generated in step (g); wherein the population is based on the distribution underlying the primary dataset; i) computing a Z score from the population mean and population standard deviation for each score calculated in step (d) to standardize the scores; j) choosing a significance level; and k) comparing a test Z score to a Z score of the chosen significance level to determine the probability that the biological molecule identification is incorrect. No particular order is required for the performance of these steps.
The invention further provides a computer usable medium for determining a probability that a biological molecule identification is incorrect for a chosen significance level and for a particular experimental condition, the computer usable medium comprising: a) a means for generating theoretical mass data for biological molecules; b) a means for generating experimental mass data for an unknown biological molecule; c) a means for comparing the experimental mass data generated in step (b) with each theoretical mass data generated in step (a); d) a means for calculating a score for each comparison in step (c), wherein the score is a function of the similarity between each of the data generated in step (a) and the data generated in step (b); e) a means for selecting at least two scores from the scores in step (d) to form a primary data set, wherein the scores correspond to a comparison that denotes a degree of similarity between each of the data generated in step (a) and the data generated in step (b); f) a means for generating a sufficient quantity of artificial data sets from the primary data set in step (e); g) a means for calculating a sample mean for each artificial data set in step (f); h) a means for using the sample means generated in step (g) to estimate population mean and population standard deviation; wherein the population is based on the distribution underlying the primary data set; i) a means for computing a Z score from the population mean and population standard deviation for each score calculated in step (d) to standardize the scores, j)a means for choosing a significance level; and k) a means for comparing a test Z score to the Z score of the chosen significance level to determine the probability that the identification is incorrect. No particular order is required for the performance of these steps.
The invention further provides a computer program product comprising: a computer usable medium having computer readable program code means embodied in said medium for determining a probability that a biological identification is incorrect for a chosen significance level and for a particular experimental condition, said computer program product including: computer readable program code means for causing a computer to generate theoretical mass data for known biological molecules, the biological molecules having been cleaved into constituent parts by a method that produces constituent parts; computer readable program code means for causing a computer to generate experimental mass data for an unknown biological molecule, the unknown biological molecule having been cleaved into constituent parts by a method that produces constituent parts; computer readable program code means for causing the computer to compare the mass data of the unknown biological molecule with mass data generated for the experimental condition for known biological molecules; computer readable program code means for causing the computer to calculate scores for each mass data comparison, wherein the scores are a function of similarity between mass data of the unknown biological molecule and mass data generated from the biological molecule database; computer readable program code means for causing the computer to select at least two scores from the calculated scores to form a primary data set, wherein the selected scores corresponds to a comparison which denotes a high degree of similarity; computer readable program code means for causing the computer to generate a sufficient quantity of artificial data sets from the primary data set; computer readable program code means for causing the computer to calculate a sample mean for each artificial data set; computer readable program code means for causing the computer to estimate population mean and standard deviation; wherein the population is based on the distribution underlying the primary data set; computer readable program code means for causing the computer to calculate a Z score from the population mean and population standard deviation for each score; computer readable program code means for causing the computer to choose a significance level; computer readable program code means for causing the computer to compare a test Z score to a Z score of the chosen significance level to determine the probability that the identification is incorrect. No particular order is required for the performance of these steps.