1. Field of the Invention
The present invention relates to data processing methods that store, search, retrieve, and process cellular and biochemical information efficiently. The invention is a method that offers substantial benefits over prior methods by allowing extremely accurate analysis of complex proteomic data produced by test equipment. More specifically, the method uses the data output of mass spectrometry equipment to produce refined measurements of protein functions and to infer protein interactions, including functions over a complex network of protein interactions within and between biological cells. The resulting information identifies and describes the significant functional relationships for each protein within a group of proteins that are components of a larger biological system. The biological system may be a single cell, a group of cells, or an entire organism. This unique method includes the following functions:
(a) accurate calculation of the relative abundance and activity of each protein for each functionally relevant categorical grouping, based on the measurements from laboratory test equipment, gene and protein sequence data from an external database, and the standard software that is used with the mass spectrometry equipment.
(b) multiple screening to reject errors, sorting by specified criteria, and specification of the functional relevance of proteins within a biological system. Based on sequential measurements over time or following a controlled change in conditions, the calculations include the amount of incremental change in the activity of proteins, and the correlation of activity and patterns of activity to identify and measure the functionally relevant classifications, for all possible combinations and permutations between all observed peptides, proteins, and functional categories.
(c) detailed description of the calculated results in a manner that shows the structure of the complex relationships, as a graphic pattern that can be readily understood by the equipment operator.
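The calculations named in functions (a) and (b) can be illustrated with a minimal sketch. It assumes that each protein already carries a measured intensity and one or more functional category labels, and it uses a plain Pearson coefficient as the correlation measure. The function names, the data layout, and the choice of Pearson correlation are assumptions made for illustration; they are not taken from the disclosed method.

```python
from collections import defaultdict
from math import sqrt


def relative_abundance_by_category(intensities, categories):
    """Return each category's share of the summed intensity.

    intensities: dict mapping protein id -> measured intensity (arbitrary units)
    categories:  dict mapping protein id -> list of functional category labels
    A protein assigned to several categories contributes to each of them.
    """
    totals = defaultdict(float)
    for protein, value in intensities.items():
        for category in categories.get(protein, []):
            totals[category] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {category: 0.0 for category in totals}
    return {category: value / grand_total for category, value in totals.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / sqrt(sxx * syy)


def category_correlations(time_series, categories):
    """Correlate category abundances across sequential measurements."""
    per_time = [relative_abundance_by_category(t, categories) for t in time_series]
    labels = sorted({c for snapshot in per_time for c in snapshot})
    series = {c: [snap.get(c, 0.0) for snap in per_time] for c in labels}
    return {
        (a, b): pearson(series[a], series[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical data: three proteins, three categories, three time points.
    cats = {"P1": ["kinase"], "P2": ["kinase", "signaling"], "P3": ["transport"]}
    runs = [
        {"P1": 10.0, "P2": 10.0, "P3": 20.0},
        {"P1": 15.0, "P2": 12.0, "P3": 18.0},
        {"P1": 22.0, "P2": 11.0, "P3": 15.0},
    ]
    for pair, r in category_correlations(runs, cats).items():
        print(pair, r)
```

The sketch evaluates every pair of observed categories, which mirrors the "all possible combinations" wording of function (b); a production implementation would also need the error screening that function (b) describes, which is omitted here.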
2. Description of the Related Art
As a result of the advances in genomic sequencing technologies, complete genomic sequences have been derived for several species. Before these technological advances, it was not possible to determine the complete sequence of deoxyribonucleic acids (DNA) in an organism and organize the sequence into functional genes. However, in recent years the technology for sequencing genes has advanced so that it is not only feasible to determine complete genomic sequences but also to quantitatively measure the abundance of every expressed gene, based on mRNA levels for every gene found in a cell.
Nevertheless, analyses of gene sequence and abundance do not provide sufficient information to explain the mechanism, functions, and activity of biological processes. Proteins are essential for the control and execution of virtually every biological process. Accurate correlations between gene sequences and protein functions are limited by the degree of similarity between sequences and the availability of prior experimental results that demonstrate correlation or causation for protein function under specified conditions. Genomic data fails to provide correlations between biological processes and protein activities. The state of protein activity in a cell cannot be determined by gene sequence or the expression level of the corresponding mRNA transcript. Therefore, novel methods are required to monitor biological processes in terms of protein function.
Determining the complete sequence of DNA and mRNA for an organism is only a partial solution to the larger issue of how to understand basic biological functions. Advances in genomic research were based in large part on developments in computer technology and insightful software design. Using new computational approaches, researchers were able to generate, store, organize, and analyze large amounts of sequencing data. The investigation of protein function at a similar scale also requires advances in computation and software design so that large amounts of complex data can be accurately analyzed. Software designed for the analysis of protein function at a cellular and organismal scale faces different challenges from those considered in genomic research.
The critical unsolved issue is to accurately describe the functions of all proteins that are derived from the genome, in terms of protein activity and functions over time or following controlled changes in conditions. This protein activity must also be understood with regard to the interactions of multiple proteins within a total system. The identity of each protein is based on messenger RNA transcribed from the DNA sequence of a gene. The function of each protein is determined by changing conditions within and around each cell. Accordingly, the measurement of protein functions and the description of protein interactions and complex interrelationships are separate from sequence identity.
Proteomics is the study of protein activity, functions and interactions. The scope of protein interactions depends on the extent of the cellular signaling network. Accordingly, the effects of protein activity may include interactions within a cell, interrelated interactions among an integrated group of cells, or complex interactions within an entire biological entity. With current technology, the critical unsolved issues include accurate measurement of protein function and activity, description of the protein-to-protein interactions, and the effects of selected compounds for modulation of protein activity. Specialized equipment and methods are a practical necessity to approach new challenges in the field of proteomics.
The process of inter- and intracellular signaling involves a complex network of protein interactions that change rapidly in response to different stimuli. Despite the critical importance of protein activity and protein interactions, the accurate measurement of incremental changes in function over time has remained an unresolved issue. The accurate measurement of protein function is made particularly difficult by the frequent modulation of post-translational modifications that significantly alter protein function. Changes to protein function are associated with critical human diseases, such as sepsis, emphysema, and various forms of cancer. To address the mechanistic cause and progression of these diseases, it is essential to measure biological activity in terms of quantitative changes in protein function.
3. Overview of Proteomics Technology
Proteome analysis is typically based on the separation of complex protein samples by one-dimensional or two-dimensional gel electrophoresis (2DE) and liquid chromatography, followed by identification of the individual protein species (Gygi and Aebersold 1999). Spectrometric techniques and basic computer algorithms have been designed to rapidly identify proteins by matching peptide mass spectra data to protein and translated nucleic acid sequence databases (Eng, McCormack et al. 1994; Yates III, Eng et al. 1995).
The prior art is shown in the listing of patents and relevant technical publications. The failures of prior art are demonstrated by the omissions in prior patents. The methods proposed by prior patents focus on the accurate identification of proteins by amino acid sequence and modifications. For example, prior art recognizes the need to consider the statistical significance of possibly erroneous matches (Sachs, 2005, page 18, lines 49-51) and potential errors caused by incomplete sequence databases, incomplete splicing, protein modifications, or protein polymorphism (Emili, 2005, paragraph 0199). It is recognized that reliance on the standard software contained in an automatic search engine (e.g., Mascot) and a protein database results in a great number of errors, termed pseudo-positive results, but the proposed solutions do not provide a standard method to prevent or correct the identified errors (Oda, 2008, paragraph 2065). The establishment of a custom database to correct errors in a single run (Oda, 2008, paragraphs 2065-2068) merely compounds the problem because the custom database is not shared with or verified by other unbiased independent investigators.
Quantitative methods for proteomic research assign measures of abundance to identified proteins. To date, quantitative proteomic methods lag behind methods for the identification of proteins from complex samples. Currently applied quantitative methods fail to provide statistical confidence intervals or to correct for sources of measurement error. For example, the subjective exclusion of outliers from measured data (Sachs, 2005, page 18, lines 54-60) yields results that reflect investigator selection of preferred data, as contrasted to empirical and unbiased measurement. A recent patent describes a method for measurement of protein phosphorylation with mass spectrometry, but no method to correct the underlying software errors (Hunt, 2006, page 6, lines 5-15). Similarly, a recent patent describes methods to provide a baseline for quantitative comparison through internal controls, but no method to screen out erroneous equipment measurements or incorrect software calculations (Aebersold, 2009, page 7, lines 12-22).
Prior methods note that keyword categories are useful to selectively focus on biological functions of interest within a database (Yamashita, 2010, page 8, lines 10-15; James, 2007, paragraph 0013). However, the prior methods do not include quantitative measures of abundance for keyword categories, nor do they calculate statistical correlations between all observed categories. Protein sequence similarity alone has also been used to infer functional similarity and molecular interactions (Mallal, 2006, paragraphs 0007-0012). However, this method merely compares selected sequences and does not provide a quantitative measure of the degree of shared functionality between all possible proteins within and between samples.
Significantly, the widely used standard methods and software for mass spectrometry analyses are based on the original formats and codes designed over a decade ago. These early developments contained fundamental errors. Accordingly, it is not surprising that the current software and associated methods fail to correct critical calculation and measurement errors. Recent studies by several investigators demonstrate that current methods exhibit significant errors in repeatability and reproducibility, so that typical results cannot be reliably reproduced even with the same machine, same sample, and same operator (Tabb, Vega-Montoto et al. 2010). The existing computer software, test protocol, and screening processes have not kept pace with the current need for analytical details that precisely describe protein function within an interaction network.
Accurate analysis of protein functions presents difficult technical obstacles. Protein functions are interrelated, as shown by the complex signaling within and among cells. Accordingly, measurement of protein functions over time is a continuing challenge. The required measurements include accurate identification, physical count of abundance, extent of activity, and the extent of interaction between any two proteins. The widely accepted equipment, consisting of liquid chromatography combined with tandem mass spectrometry (LC-MS/MS), is adequate to provide the essential input data. However, substantial improvements in software and computational methods are necessary to correct errors that result from the use of standard but outdated software to analyze data obtained by mass spectrometry.
4. Current Test Procedures for Mass Spectrometry Equipment
Continuing technical developments have allowed improvements in mass spectrometry equipment and procedures. The following is a typical procedure for the identification of proteins using mass spectrometry. Samples are prepared from cell lysates that contain tens of thousands of distinct protein species. The sample can be simplified by separating proteins based on size using gel electrophoresis and isolating slices of the gel that contain only several hundred proteins per slice. Handling each slice separately allows the identification of a more complete portion of the original sample. Sample proteins are broken up into short segments termed peptides using a proteolytic enzyme. This resulting mixture of peptides is then separated by liquid chromatography (LC).
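The digestion step described above can be sketched in software as an in silico digest. The text names only "a proteolytic enzyme"; the sketch below assumes trypsin, whose commonly cited rule is cleavage on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). The function name `digest` and its parameters are hypothetical, and missed cleavages are not modeled.

```python
def digest(sequence, cleave_after="KR", blocked_by="P"):
    """Split a protein sequence into peptides using a simple cleavage rule.

    Defaults model trypsin (an assumption): cut after K or R,
    except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        # Empty string when the current residue is the last one.
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in cleave_after and next_residue != blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in a cleavage residue.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence fragment used only for illustration.
    print(digest("MKWVTFISLLR"))
```

The resulting peptide list is what a search program would compare against measured spectra; real digests also contain peptides with missed cleavage sites, which this sketch leaves out.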
This separated mixture is injected into the mass spectrometer, which measures the mass/charge ratio (m/z) for ionized peptides. Then, the MS equipment selects individual peptide ions, fragments them using low-energy collisions, and measures the mass of the ion fragments to obtain amino acid sequence information. The observed m/z ratio of the intact and fragmented peptide ions allows inference of the amino acid sequence. External software accomplishes this by matching fragmentation data to sequence databases. Meanwhile, the intensity of the intact peptide ions allows measurement of relative abundance for species that share the same ionization potential. Internal standards have been developed for the purpose of providing a means for accurate quantitation. A typical MS/MS test procedure takes several hours and may produce hundreds of thousands of line items.
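The relation between peptide sequence and the observed m/z values can be shown with a short worked sketch. It assumes standard monoisotopic residue masses, protonated ions, and the conventional b-ion and y-ion series for fragments; the function names are hypothetical, and modifications and isotope patterns are ignored.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residues of an intact peptide
PROTON = 1.007276   # mass of the charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(neutral_mass, charge):
    """m/z of an ion carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


def b_ions(sequence):
    """Singly charged b-ion m/z values (N-terminal fragments)."""
    masses, running = [], 0.0
    for aa in sequence[:-1]:
        running += RESIDUE_MASS[aa]
        masses.append(running + PROTON)
    return masses


def y_ions(sequence):
    """Singly charged y-ion m/z values (C-terminal fragments)."""
    masses, running = [], 0.0
    for aa in reversed(sequence[1:]):
        running += RESIDUE_MASS[aa]
        masses.append(running + WATER + PROTON)
    return masses


if __name__ == "__main__":
    seq = "GAK"  # hypothetical peptide used only for illustration
    print(peptide_mass(seq), mz(peptide_mass(seq), 2))
    print(b_ions(seq), y_ions(seq))
```

Database search software works in the opposite direction: it computes such theoretical values for candidate peptides and scores them against the measured fragment spectrum.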
Specialized software is required to interpret the data produced by mass spectrometry equipment, with emphasis on matching the measured sequence to a database that allows accurate identification of each protein. Over the past 16 years, several data analysis programs have been developed for protein identification (Xu and Ma 2006). Typical commercially available software includes: Sequest (Eng, McCormack et al. 1994; MacCoss, Wu et al. 2002; Sachs, Wiener et al. 2005), Mascot (Matrix Science) (Perkins, Pappin et al. 1999), and Peaks (Ma, Zhang et al. 2003). Many additional programs have been developed, typically using different scoring functions, and different methods for error correction and interpretation of the MS/MS results.
However, despite the many alternative mass spectrometry software programs that are available, these programs exhibit serious deficiencies that prevent accurate measurement and detailed analyses. Based on the identified deficiencies in the prior methods, there is a basic need for a novel method to meet the requirements. The novel method must correct errors in the mass spectrometry measurements, correct errors in the software used by the MS/MS equipment, and screen out results that fail to meet accuracy criteria. The novel software must allow accurate measurement of protein functions and the full range of multiple protein-protein interactions.
With prior methods, test results were typically inconclusive, due to the inherent complexity in the identification of major biological trends and errors in the measurement of relative protein abundance.
5. The Need for New Mass Spectrometry Data Analysis Methods
Significantly, the prior methods and software exhibit serious unresolved issues, such as multiple identification errors, counting errors, and even basic mathematical errors in the original software code. Importantly, the existing methods do not produce the details necessary to derive protein function, or to measure statistical correlations that allow inference of protein-protein interaction for each protein pair in the sample. Accordingly, the existing methods fail to provide the information required for important protein analysis decisions, such as the formulation and design of new compounds for diagnosis or treatment of disease.
Thorough use of mass spectrometry equipment and related methods has resulted in clear identification of the requirements for new methods. Accurate measurement and complete disclosure of all measured data are necessary. The measurements must allow precise and unambiguous identification of each protein from the peptide samples. As a practical necessity, the new methods must be able to use the data output of existing software and testing methods, so that each existing laboratory is not required to purchase new equipment or to learn entirely different methods. The transition from the old methods to the new method should not present severe obstacles.
There is a need for an enhanced spectrum of information. Mere identification and classification are not sufficient. The data must support calculation of protein functions and protein interactions over time or following controlled changes in conditions. Detailed measurements of protein activity are necessary to describe and understand cellular signaling, deficiencies in the immune system, and the effects of modulation of signaling through inhibitors. This detailed information as to protein function is necessary to design effective treatments for critical diseases, such as cancer or sepsis.