An unknown biological molecule can be identified by comparing the mass data of the unknown biological molecule with mass data of known biological molecules.
For example, the rapid growth of available high-quality DNA sequence data has made mass spectrometry (MS) combined with genome database searching a popular and potentially accurate method to identify proteins. Protein identification by mass spectrometry has proven to be a powerful tool to elucidate biological function and to find the composition of protein complexes and entire organelles.
In protein identification experiments, proteins are typically separated by gel electrophoresis and subjected to a protease having high digestion specificity (e.g. trypsin), and the resulting mixture of peptides is extracted from the gel and subjected to MS analysis (1998). The distribution of proteolytic peptide masses (peptide map) is compared with theoretical proteolytic peptide masses calculated for each protein stored in a protein/DNA sequence database.
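The calculation of theoretical proteolytic peptide masses can be illustrated with a minimal sketch. It assumes the commonly used trypsin rule (cleavage after K or R unless the next residue is P), complete digestion with no missed cleavages, and unmodified peptides; the function names `tryptic_peptides`, `peptide_mass` and `theoretical_peptide_map` are hypothetical and are not taken from the method described here.

```python
# Monoisotopic amino acid residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the summed residue masses of a free peptide


def tryptic_peptides(sequence):
    """Split a sequence after K or R, except when the next residue is P.

    Assumption: complete digestion, no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def theoretical_peptide_map(sequence):
    """Sorted list of theoretical tryptic peptide masses for one protein."""
    return sorted(peptide_mass(p) for p in tryptic_peptides(sequence))
```

Applying `theoretical_peptide_map` to every sequence in a database would give one theoretical peptide map per protein, which is the quantity the experimental map is compared against.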
There are various algorithms that attempt to identify an unknown protein by determining the database protein which has a peptide map with the highest degree of similarity to the experimentally obtained peptide map of the unknown protein. These algorithms yield the protein identified and an identification score. Due to imperfections in the protein separation and to incomplete extraction of the proteolytic peptides from the gel, the peptide map is typically incomplete with respect to the protein identified, and also contains a background of proteolytic peptide masses from one or several other proteins. Even if separation and extraction were perfect, posttranslational modifications of proteins would cause a proteolytic peptide mass distribution to be different from that predicted by the genome. Mass spectrometry determines a peptide mass m_i to an accuracy ±Δm_i, with Δm_i/m_i typically greater than 30 ppm. Within the mass range m_i ± Δm_i, proteolytic peptide masses of several proteins in the genome can match. For these reasons, a database search using the information in a peptide map will not always identify a protein unambiguously.
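The effect of finite mass accuracy on matching can be sketched as follows. This is only an illustration: the 30 ppm default is taken from the figure quoted above, while the function names and the choice of a simple match count as the similarity measure are assumptions, not the scoring scheme of any particular algorithm.

```python
def within_tolerance(observed, theoretical, ppm=30.0):
    """True if the observed mass lies within +/- ppm of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6


def count_matches(observed_masses, theoretical_masses, ppm=30.0):
    """Number of observed peptide masses that match at least one theoretical mass."""
    return sum(
        any(within_tolerance(m, t, ppm) for t in theoretical_masses)
        for m in observed_masses
    )
```

At 1000 Da a 30 ppm window is 0.03 Da wide on each side, so `within_tolerance(1000.02, 1000.0)` holds while `within_tolerance(1000.05, 1000.0)` does not. Any theoretical peptide from any database protein that falls inside that window counts as a match, which is why several proteins can match the same measured mass.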
Despite the momentum mass spectrometric protein identification has given to protein research, the problem of objectively assessing the significance of a protein identification result has been overlooked. As increasingly complex biological problems are explored, knowledge of the significance of each protein identification result is likely to become critical.
The object of the present invention is to provide a method for assessing the significance of a biological molecule identification.
This and other objects, as will be apparent to those having ordinary skill in the art, have been met by providing a method of determining the statistical significance of a biological molecule identification score. The method comprises a) selecting a significance level that represents a level of confidence in a biological molecule identification; b) calculating a score associated with an unknown biological molecule, wherein the score is a function of similarity between mass data of the unknown biological molecule and mass data generated for known biological molecules of a biological molecule database; c) comparing the score with a score frequency distribution, wherein the distribution is generated by comparing mass data of a hypothetical biological molecule with mass data generated for known biological molecules of a biological molecule database, and wherein the frequency distribution has associated therewith the significance level; and d) determining whether the score associated with the unknown biological molecule identification is within the significance level.
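One way to read steps (c) and (d) is as an empirical tail test: the score is considered within the significance level if the fraction of random-identification scores that reach or exceed it is no larger than that level. The sketch below assumes this reading; `null_scores` is a hypothetical list of scores obtained from random identifications, and the function names are illustrative.

```python
def tail_fraction(null_scores, score):
    """Fraction of random-identification scores that are >= the given score."""
    return sum(s >= score for s in null_scores) / len(null_scores)


def is_significant(score, null_scores, significance_level):
    """True if the score falls within the chosen significance level."""
    return tail_fraction(null_scores, score) <= significance_level
```

For example, with ten random scores `[1, 2, 2, 3, 3, 3, 4, 5, 6, 9]` and a significance level of 0.1, a score of 9 passes (one score in ten reaches it) while a score of 6 does not (two in ten reach it). In practice many more random scores would be needed for the tail estimate to be stable.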
The invention further provides a method of generating a frequency distribution of scores for a particular experimental condition, wherein the scores relate to random identifications of biological molecules. The method comprises a) generating mass data for the particular experimental condition for known biological molecules in a biological molecule database; b) generating mass data of a hypothetical biological molecule for the experimental condition; c) comparing the data generated in step (b) with the data generated for each known biological molecule in step (a); d) calculating a score for each comparison in step (c), wherein the score is a function of similarity between the data generated in step (a) which corresponds to a particular known biological molecule and the data generated in step (b); e) selecting a score from the scores calculated in step (d), wherein the selected score corresponds to the comparison which denotes a high degree of similarity between the data generated in step (a) and the data generated in step (b); f) repeating steps (b) through (e) with different hypothetical biological molecules until a sufficient quantity of scores are selected; and g) determining the frequency of selecting each score and generating therefrom a frequency distribution of scores.
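Steps (b) through (g) can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the hypothetical molecule is represented by uniformly random peptide masses, the score is a simple count of matches within 30 ppm, "a high degree of similarity" is taken to mean the top score, and the database is a plain dictionary of precomputed mass lists standing in for step (a).

```python
import random
from collections import Counter


def count_matches(observed, theoretical, ppm=30.0):
    """Observed masses that match at least one theoretical mass within ppm."""
    return sum(
        any(abs(m - t) <= t * ppm * 1e-6 for t in theoretical) for m in observed
    )


def random_peptide_map(rng, n_peptides=10, low=800.0, high=3000.0):
    """Step (b): mass data of a hypothetical molecule (uniform masses, an assumption)."""
    return [rng.uniform(low, high) for _ in range(n_peptides)]


def null_score_distribution(database, n_trials=1000, seed=0):
    """Frequency distribution of top scores from random identifications.

    `database` maps a molecule name to its list of theoretical masses (step (a)).
    """
    rng = random.Random(seed)
    top_scores = Counter()
    for _ in range(n_trials):                       # step (f): repeat
        hypothetical = random_peptide_map(rng)      # step (b)
        scores = [count_matches(hypothetical, masses)
                  for masses in database.values()]  # steps (c) and (d)
        top_scores[max(scores)] += 1                # step (e): keep the top score
    # step (g): relative frequency of each selected score
    return {s: c / n_trials for s, c in sorted(top_scores.items())}
```

The number of trials that counts as "a sufficient quantity" is left open in the text; `n_trials` is exposed as a parameter so that it can be increased until the tail of the distribution stops changing appreciably.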
The invention provides another method of generating a frequency distribution of scores for a particular experimental condition, wherein the scores relate to random identifications of biological molecules. The method comprises a) generating mass data for the particular experimental condition for known biological molecules in a biological molecule database; b) randomly selecting a biological molecule from the database; c) comparing the mass data of the randomly selected biological molecule with the mass data of each known biological molecule; d) calculating a score for each comparison in step (c), wherein the score is a function of similarity between the data; e) selecting a score from the scores calculated in step (d), wherein the selected score corresponds to the comparison which denotes a degree of similarity between the data which is lower than the highest degree of similarity; f) repeating steps (b) through (d) with different randomly selected biological molecules until a sufficient quantity of scores are selected; and g) determining the frequency of selecting each score and generating therefrom a frequency distribution of scores.
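This second approach draws the query from the database itself, so the highest score normally belongs to the query's own entry and is discarded. The sketch below assumes that the score selected in step (e) is the runner-up, which is only one possible reading of "lower than the highest degree of similarity"; the match-count score, the 30 ppm tolerance and all names are likewise illustrative assumptions.

```python
import random
from collections import Counter


def count_matches(observed, theoretical, ppm=30.0):
    """Observed masses that match at least one theoretical mass within ppm."""
    return sum(
        any(abs(m - t) <= t * ppm * 1e-6 for t in theoretical) for m in observed
    )


def runner_up_distribution(database, n_trials=1000, seed=0):
    """Frequency distribution of second-best scores for randomly chosen entries.

    `database` maps a molecule name to its list of masses and must hold
    at least two entries.
    """
    rng = random.Random(seed)
    names = list(database)
    selected = Counter()
    for _ in range(n_trials):
        query = rng.choice(names)                       # step (b)
        scores = sorted(
            (count_matches(database[query], masses)     # steps (c) and (d)
             for masses in database.values()),
            reverse=True,
        )
        selected[scores[1]] += 1                        # step (e): runner-up score
    # step (g): relative frequency of each selected score
    return {s: c / n_trials for s, c in sorted(selected.items())}
```

Because the query masses come from real database entries rather than from a synthetic generator, this variant inherits the mass distribution of the database without having to model it.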
The invention also provides a method of identifying an unknown biological molecule for a particular experimental condition and a particular significance level. The method comprises a) selecting a significance level that represents a level of confidence in a biological molecule identification; b) cleaving the unknown biological molecule into constituent parts by a method that produces constituent parts; c) generating mass data for these constituent parts; d) comparing the mass data generated in step (c) with mass data generated for the experimental condition from known biological molecules of a biological molecule database; e) calculating scores for each comparison in step (d), wherein the scores are a function of similarity between mass data of the unknown biological molecule and mass data generated from the biological molecule database; f) selecting a score generated in step (e) wherein the score corresponds to a comparison which denotes a high degree of similarity and wherein the score corresponds to a particular known biological molecule in the biological molecule database; and g) determining whether the score selected in step (f) is equal to or larger than the critical score that corresponds to the significance level selected in step (a).
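Steps (d) through (g) can be sketched end to end, starting from already measured masses. The sketch assumes a match-count score with a 30 ppm tolerance and a critical score derived from a table of random-identification score counts; `null_counts`, `critical_score` and `identify` are hypothetical names, and the rule used to derive the critical score is one plausible choice rather than a prescribed one.

```python
def count_matches(observed, theoretical, ppm=30.0):
    """Observed masses that match at least one theoretical mass within ppm."""
    return sum(
        any(abs(m - t) <= t * ppm * 1e-6 for t in theoretical) for m in observed
    )


def critical_score(null_counts, significance_level):
    """Lowest observed score whose upper-tail frequency is <= significance_level.

    `null_counts` maps a score to how often it was selected in random
    identifications. If even the highest observed score is too frequent,
    a value just above it is returned.
    """
    total = sum(null_counts.values())
    tail = 0
    critical = max(null_counts) + 1
    for score in sorted(null_counts, reverse=True):
        tail += null_counts[score]
        if tail / total > significance_level:
            break
        critical = score
    return critical


def identify(unknown_masses, database, null_counts, significance_level=0.05):
    """Return (best candidate, its score, whether the score reaches the critical score)."""
    scores = {name: count_matches(unknown_masses, masses)
              for name, masses in database.items()}        # steps (d) and (e)
    best = max(scores, key=scores.get)                      # step (f)
    threshold = critical_score(null_counts, significance_level)
    return best, scores[best], scores[best] >= threshold    # step (g)
```

With random-identification counts `{0: 50, 1: 30, 2: 15, 3: 4, 4: 1}` and a significance level of 0.05, the critical score is 3, since scores of 3 or more occurred in 5 of 100 random identifications. A best candidate scoring 3 or higher would then be reported as significant.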
In another embodiment the invention comprises a computer program product which comprises a computer usable medium having computer readable program code means embodied in said medium for generating a frequency distribution of scores, wherein the scores relate to random identifications of biological molecules. The computer program product includes: a computer readable program code means for causing a computer to generate mass data for each known biological molecule in a biological molecule database for a particular experimental condition; computer readable program code means for causing the computer to generate mass data of a hypothetical biological molecule for the experimental condition; computer readable program code means for causing the computer to compare the mass data of the hypothetical biological molecule with the mass data generated for each known biological molecule in the biological molecule database for the particular experimental condition; computer readable program code means for causing the computer to calculate a score for each mass data comparison, wherein the score is a function of similarity between the mass data corresponding to a particular known biological molecule and the mass data corresponding to the hypothetical biological molecule; computer readable program code means for causing the computer to select a score from the calculated scores, wherein the selected score corresponds to the comparison which denotes a high degree of similarity between the mass data corresponding to the particular known biological molecule and the mass data corresponding to the hypothetical biological molecule; computer readable program code means for causing the computer to repeatedly generate mass data of different hypothetical biological molecules, compare the mass data of each of the hypothetical molecules with the mass data generated for each known biological molecule in the biological molecule database, calculate a score for each of the mass data comparisons and select a score from the calculated scores until a sufficient quantity of scores are selected; and computer readable program code means for causing the computer to determine the frequency of selecting each score and to generate therefrom a frequency distribution of scores.
In another embodiment the invention comprises a computer program product which comprises a computer usable medium having computer readable program code means embodied in said medium for identifying an unknown biological molecule for a particular experimental condition and a particular significance level. The computer program product includes: computer readable program code means for causing a computer to generate mass data of an unknown biological molecule, the unknown biological molecule having been cleaved into constituent parts by a method that produces constituent parts; computer readable program code means for causing the computer to compare the mass data of the unknown biological molecule with mass data generated for the experimental condition from known biological molecules of a biological molecule database; computer readable program code means for causing the computer to calculate scores for each mass data comparison, wherein the scores are a function of similarity between mass data of the unknown biological molecule and mass data generated from the biological molecule database; computer readable program code means for causing the computer to select a score from the calculated scores, wherein the selected score corresponds to a particular known biological molecule in the biological molecule database, and wherein the selected score corresponds to a comparison which denotes a high degree of similarity; computer readable program code means for causing the computer to compare the selected score with a frequency distribution of scores for the experimental condition, wherein the distribution is generated by comparing mass data of a hypothetical biological molecule with mass data generated from a biological molecule database, and wherein the frequency distribution has associated therewith a critical score which corresponds to the significance level; and computer readable program code means for causing the computer to determine whether the selected score is equal to or larger than the critical score.
In another embodiment the invention comprises a computer program product which comprises a computer usable medium having computer readable program code means embodied in said medium for determining statistical significance of a biological molecule identification score. The computer program product includes: a computer readable program code means for causing a computer to calculate a score associated with an unknown biological molecule, wherein the score is a function of similarity between mass data of the unknown biological molecule and mass data generated from a biological molecule database; computer readable program code means for causing the computer to compare the score with a score frequency distribution, wherein the distribution is generated by comparing mass data of a hypothetical biological molecule with mass data generated from a biological molecule database, and wherein the frequency distribution has associated therewith a significance level determined to represent a confident biological molecule identification; and computer readable program code means for causing the computer to determine whether the score associated with the unknown biological molecule identification is within the significance level.