This invention relates to the field of protein separation and proteomics.
A goal of genomics research and differential gene expression analysis is to develop correlations between gene expression and particular cellular states (e.g., disease states, particular developmental stages, states resulting from exposure to certain environmental stimuli and states associated with therapeutic treatments). Such correlations have the potential to provide significant insight into the mechanism of disease, cellular development and differentiation, as well as in the identification of new therapeutics, drug targets, and disease markers. Correlations of patterns of gene expression can also be used to provide similar insights into disease and organism metabolism that can be used to speed the development of agricultural products, transgenic species, and for metabolic engineering of organisms to increase bioproduct yields or desirable metabolic activities.
Many functional genomic studies focus on changes in mRNA levels as being indicative of a cellular response to a particular condition or state. Recent research, however, has demonstrated that often there is a poor correlation between gene expression as measured by mRNA levels and actual active gene product formed (i.e., protein encoded by the mRNA) [4]. This finding is not surprising since many factors (including differences in translational efficiency, turnover rates, extracellular expression or compartmentalization, and post-translational modification) affect protein levels independently of transcriptional controls. Thus, the evidence indicates that functional genomics is best accomplished by measuring actual protein levels (i.e., utilizing proteomic methods) rather than with nucleic acid based methods. The successful use of proteins for functional genomic analyses, however, requires reproducible quantification and identification of individual proteins expressed in cell or tissue samples.
It is at the protein level that metabolic control is exercised in cells and tissues. Comparison of the levels of protein expression between healthy and diseased tissues, or between pathogenic and nonpathogenic microbial strains, can speed the discovery and development of new drug compounds or agricultural products. Analysis of the protein expression pattern in diseased tissues or in tissues excised from organisms undergoing treatment can also serve as diagnostics of disease states or the efficacy of treatment strategies, as well as provide prognostic information regarding suitable treatment modalities and therapeutic options for individual patients.
Many proteins are expressed at varying levels in different cells. Proteins extracted from tissue or cell samples, using conventional techniques, must first be separated into individual proteins by gel or capillary electrophoresis or affinity techniques before the individual protein levels can be compared both within a sample and across samples obtained from different tissue sources. Because of the number of proteins expressed by a cell at any given time, multiple electrophoretic techniques (e.g., isoelectric focusing followed by electrophoresis through a polyacrylamide gel) are often applied to isolate all the individual proteins contained in a given sample.
Several techniques have been used to quantify the relative amounts of each protein present after the separation, including: staining proteins separated in a polyacrylamide gel with dyes (e.g., Brilliant Blue and Fast Green), with colloidal metals (e.g., gold or silver staining), or by prior labelling of the proteins during cellular synthesis by the addition of radioactive compounds (e.g., with 35S-methionine or 14C-amino acids, or 3H-leucine). Staining techniques yield poorly quantitative results because varying amounts of stain are incorporated into each protein and the stained protein must be resolved against the stained background of the gel or electroblotting substrate. Since radioactive labels are applied only to the proteins prior to separation, they overcome the background problem of staining techniques. However, feeding radioactive compounds to human subjects or handling radioactive materials in an uncontrolled field environment (e.g., crop plants) restricts the usefulness of this approach. Both staining and radiolabelling techniques also require inordinately long times to achieve detection. Staining and destaining of gels is a diffusion-limited process requiring hours. Radiolabels must be quantified by exposing the labelled gel to photographic film or a phosphor screen for several hours to days while waiting for the radioactive decay process to produce a quantitative image. Direct infrared spectrophotometric interrogation of the proteins in a gel has also been used previously as a method for providing quantitative protein expression data. However, the quantitative resolution possible from this approach is adversely affected by variations in gel thickness and differential spreading of the protein spot between gels (changing the local concentration). Furthermore, the comparatively low absorption cross-section of proteins in the infrared limits the detection sensitivity.
Analysis of the protein expression pattern does not provide sufficient information for many applications.
Several methods have also been proposed for the identification of proteins once they are resolved. The most common methods involve referencing the separation coordinates of individual proteins (e.g., isoelectric point and apparent molecular weight) to those obtained from archived separation coordinate data (e.g., annotated 2-D gel image databases) or control samples, performing a chemilytic or enzymatic digestion of a protein coupled with determination of the mass of the resulting peptide fragments and correlating this peptide mass fingerprint with that predicted to arise from the predicted genetic sequence of a set of known proteins (see James, P., M. Quandroni, E. Carafoli, and G. Gonnet, Biochem. Biophys. Res. Commun., 195:58-64 (1993); Yates, J. R., S. Speicher, P. R. Griffin, and T. Hunkapiller, Anal. Biochem., 214:397-408 (1993)), the generation of a partial protein sequence that is compared to the predicted sequences obtained from a genomic database (see Mann, M., paper presented at the IBC Proteomics conference, Boston, Mass. (Nov. 10-11, 1997); Wilm, M., A. Shevchenko, T. Houthaeve, S. Breit, L. Schweiger, T. Fotsis and M. Mann, Nature, 379:466-469 (1996); Chait, B. T, R. Wang, R. C. Beavis and S. B. H. Kent, Science, 262:89-92 (1993)), or combinations of these methods (see Mann, M., paper presented at the IBC Proteomics conference, Boston, Mass. (Nov. 10-11, 1997); Wilm, M., A. Shevchenko, T. Houthaeve, S. Breit, L. Schweiger, T. Fotsis and M. Mann, Nature, 379:466-469 (1996); Chait, B. T, R. Wang, R. C. Beavis and S. B. H. Kent, Science, 262:89-92 (1993)). Recent work indicates that proteins can only be unambiguously identified through the determination of a partial sequence, called a protein sequence tag (PST), that allows reference to the theoretical sequences determined from genomic databases (see Clauser, K. R., S. C. Hall, D. M. Smith, J. W. Webb, L. E. Andrews, H. M. Tran, L. B. Epstein, and A. L. Burlingame, Proc. Natl. Acad. Sci. (USA), 92:5072-5076 (1995); Li, G., M. Walthan, N. L. Anderson, E. Unworth, A. Treston and J. N. Weinstein, Electrophoresis, 18:391-402 (1997)).
However, between 8 and 18 hours are currently required to generate a PST for a single protein sample by conventional techniques, with a substantial fraction of this time devoted to recovery of the protein sample from the separation method in a form suitable for subsequent sequencing (see Shevchenko, A., et al., Proc. Natl. Acad. Sci. (USA), 93:14440-14445 (1996); Mark, J., paper presented at the PE/Sciex Seminar Series, Protein Characterization and Proteomics: Automated high throughput technologies for drug discovery, Foster City, Calif. (March, 1998)). This makes the identification of all separated proteins from a tissue prohibitive in both time and cost. It has also restricted more widespread use of proteomic methods, despite their advantages for functional genomics, and inhibited the development of proteomic databases analogous to the genome databases now available (e.g., Genbank and the Genome Sequence Database).
Thus, current methods for identifying and quantitating the protein expression patterns ("protein fingerprints") of cells, tissues, and organs lack sufficient resolution, precision, and/or sensitivity. The present invention addresses these features lacking in the methods known in the art.
Two-dimensional (2-D) gel electrophoresis is currently the most widely adopted method for separating individual proteins isolated from cell or tissue samples [5, 6, 7]. Evidence for this is seen in the proliferation (more than 20) of protein gel image databases, such as the Protein-Disease Database maintained by the NIH [8]. These databases provide images of reference 2-D gels to assist in the identification of proteins in gels prepared from various tissues.
Capillary electrophoresis (CE) is a different type of electrophoresis, and involves resolving components in a mixture within a capillary to which an electric field is applied. The capillary used to conduct electrophoresis is filled with an electrolyte, and a sample is introduced into one end of the capillary using various methods such as hydrodynamic pressure, electroosmotically-induced flow, and electrokinetic transport. The ends of the capillary are then placed in contact with an anode solution and a cathode solution and a voltage is applied across the capillary. Positively charged ions are attracted towards the cathode, whereas negatively charged ions are attracted to the anode. Species with the highest mobility travel the fastest through the capillary matrix. However, the order of elution of each species, and even from which end of the capillary a species elutes, depends on its apparent mobility. Apparent mobility is the sum of a species' electrophoretic mobility in the electrophoretic matrix and the mobility of the electrophoretic matrix itself relative to the capillary. The electrophoretic matrix may be mobilized by hydrodynamic pressure gradients across the capillary or by electroosmotically-induced flow (electroosmotic flow).
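The relationship between apparent mobility and elution described above can be illustrated with a minimal Python sketch. The function names, the sign convention (positive mobility means motion toward the cathode, where the detector is assumed to sit), and all numerical values are illustrative assumptions and are not taken from the references. Migration time uses the standard relation t = L_d / (mu_app * E), with field strength E = V / L_t.

```python
def apparent_mobility(mu_ep, mu_matrix):
    """Apparent mobility: the species' electrophoretic mobility plus the
    mobility of the matrix itself (e.g., electroosmotic flow).
    Units: cm^2 / (V * s); positive means motion toward the cathode."""
    return mu_ep + mu_matrix


def migration_time(mu_app, voltage, total_length_cm, length_to_detector_cm):
    """Seconds needed to reach a detector placed toward the cathode end.
    Returns None when the species does not move toward the cathode
    (it stays put or elutes from the anode end instead)."""
    if mu_app <= 0:
        return None
    field = voltage / total_length_cm      # V / cm
    velocity = mu_app * field              # cm / s
    return length_to_detector_cm / velocity


# Hypothetical values chosen only for illustration.
MU_EOF = 5e-4  # electroosmotic mobility of the matrix
species = {"cation": 2e-4, "weak anion": -2e-4, "strong anion": -7e-4}

for name, mu_ep in species.items():
    t = migration_time(apparent_mobility(mu_ep, MU_EOF), 20_000, 50.0, 40.0)
    if t is None:
        print(f"{name}: does not reach the cathode-side detector")
    else:
        print(f"{name}: reaches the detector after {t:.1f} s")
```

In this sketch the weak anion still elutes at the cathode end because the matrix flow outweighs its own mobility, while the strong anion moves the other way, which is the sense in which the end of elution depends on apparent mobility.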
A number of different electrophoretic methods exist. Capillary isoelectric focusing (CIEF) involves separating analytes such as proteins within a pH gradient according to their isoelectric point (i.e., the pH at which the analyte has no net charge). A second method, capillary zone electrophoresis (CZE), fractionates analytes on the basis of their intrinsic charge-to-mass ratio. Capillary gel electrophoresis (CGE) is designed to separate proteins according to their molecular weight. (For reviews of electrophoresis generally, and CIEF and CZE specifically, see, e.g., Palmieri, R. and Nolan, J. A., "Protein Capillary Electrophoresis: Theoretical and Experimental Considerations for Methods Development," in CRC Handbook of Capillary Electrophoresis: A Practical Approach, CRC Press, chapter 13, pp. 325-368 (1994) (electrophoresis generally); Kilar, F., "Isoelectric Focusing in Capillaries," in CRC Handbook of Capillary Electrophoresis: A Practical Approach, CRC Press, chapter 4, pp. 325-368 (1994); and McCormick, R. M., "Capillary Zone Electrophoresis of Peptides," in CRC Handbook of Capillary Electrophoresis: A Practical Approach, CRC Press, chapter 12, pp. 287-323 (1994). All of these references are incorporated by reference in their entirety for all purposes).
While 2-D gel electrophoresis is widely practiced, several limitations restrict its utility in functional genomics research. First, because 2-D gels are limited to spatial resolution, it is difficult to resolve the large number of proteins that are expressed in the average cell (1000 to 10,000 proteins). High abundance proteins can distort carrier ampholyte gradients in capillary isoelectric focusing electrophoresis and result in crowding in the gel matrix of size sieving electrophoretic methods (e.g., the second dimension of 2-D gel electrophoresis and CGE), thus causing irreproducibility in the spatial pattern of resolved proteins [20, 21 and 22]. High abundance proteins can also precipitate in a gel and cause streaking of fractionated proteins [20]. Variations in the cross-linking density and electric field strength in cast gels can further distort the spatial pattern of resolved proteins [23, 24]. Another problem is the inability to resolve low abundance proteins neighboring high abundance proteins in a gel because of the high staining background and limited dynamic range of gel staining and imaging techniques [25, 22]. Limitations with staining also make it difficult to obtain reproducible and quantifiable protein concentration values. In some recent experiments, for example, investigators were only able to match 62% of the spots formed on 37 gels run under similar conditions [21; see also 28, 29]. Additionally, many proteins are not soluble in buffers compatible with acrylamide gels, or fail to enter the gel efficiently because of their high molecular weight [26, 27].
Thus, currently used methods of capillary electrophoresis have significant limitations with regard to their usefulness in providing a detailed protein expression fingerprint of a cell or tissue sample.
In contrast to characterizing proteins on the basis of their electrophoretic mobility or isoelectric point, an approach to identifying the protein species that are expressed in a tissue or cell sample is to obtain partial or complete peptide sequence information from proteins purified from the sample. This approach is laborious and of limited sensitivity, as it requires extensive and often problematic purification steps to isolate individual protein species to allow for unambiguous sequence determination, and in many cases it is simply not feasible for proteins which are not highly abundant and/or are not readily purifiable free from contaminant protein species.
It is also important that the primary amino acid sequence or a partial sequence (i.e., a protein sequence tag, "PST") be determined so that the reason underlying changes in the protein expression pattern related to proteins appearing at different separation coordinates can be determined. Proteins may appear at more than one separation coordinate, depending on the degree of post-translational modification exercised on that protein by the cell or tissue. The separation coordinate for a protein may also change due to genetic mutations. Changes in the relative abundance of a protein at any given separation coordinate may also be due to changes in the regulation of gene expression. Only by unambiguously identifying each of the proteins resolved can the reason underlying any variations in protein expression across different samples be deduced.
Several methods have previously been proposed for determining the sequence or a protein sequence tag of separated proteins. These include: sequential rounds of N-terminal or C-terminal labeling followed by liberation and determination of the labeled amino acid, exoproteolytic digestion of the protein one amino acid at a time, endoproteolytic digestion of larger proteins into smaller peptides followed by N- and C-terminal labeling and amino acid determination, and mass spectrometric fragmentation pattern recognition. Sequential labeling and digestion techniques (e.g., Edman chemistry) are time consuming, even when automated, because the process must be repeated through many cycles before a sufficiently large protein sequence tag can be accumulated. Propagation of errors (i.e., from incomplete labeling on each round, incomplete liberation of the labeled amino acid, or both) also limits the length of protein sequence that can be determined using these techniques. While a more complete protein sequence can be obtained by first using endoproteases to cleave the protein into smaller fragments prior to application of the sequential labeling and digestion chemistry, this also introduces the time and labor intensive step of reseparating and purifying the protein fragments, usually by reapplication of an electrophoretic separation technique. Determining the sequence order of these peptide fragments in the original protein can also present additional problems. Carboxy-terminal methoxy labeling of cyanogen bromide digests has been used to distinguish the C-terminal peptide fragment from other fragments formed by cyanogen bromide digestion of a larger protein.
Mass spectrometric techniques are increasingly being applied to protein identification because of their speed advantage over the more traditional methods. Electrospray and matrix assisted laser desorption ionization (MALDI) are the most common mass spectrometric techniques applied to protein analysis because they are best able to ionize large, low-volatility molecular species. Two basic strategies have been proposed for the MS identification of proteins after separation: 1) mass profile fingerprinting ('MS fingerprinting') and 2) sequencing of one or more peptide domains by MS/MS ('MS/MS sequencing'). MS fingerprinting is achieved by accurately measuring the masses of several peptides generated by a proteolytic digest of the intact protein and searching a database for a known protein with that peptide mass fingerprint. MS/MS sequencing involves actual determination of one or more PSTs of peptides derived from the protein digest by generation of sequence-specific fragmentation ions in the quadrupole of an MS/MS instrument. Refinements in both of these techniques have also reduced the amount of individual proteins needed to achieve signature detection.
In one approach, a protein is chemilytically (e.g., cyanogen bromide) or enzymatically (e.g., trypsin) digested at sequence specific sites to form peptides. The specificity of the cleavage yields peptides of reproducible masses that can subsequently be determined by MS. The mass spectrometric peptide pattern detected from an individual protein is then compared to a database of similar patterns generated from purified proteins with known sequences or predicted from the theoretical protein sequence based on the expected digestion pattern. The identity of the unknown protein is then inferred to be that of the known protein that best matches its peptide mass fingerprint.
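The peptide mass fingerprinting approach described above can be sketched in a few lines of Python. This is a simplified illustration only: the cleavage rule (cut after K or R unless the next residue is P) is the commonly used simplified trypsin rule, the two database sequences are hypothetical, the scoring is a bare count of matching masses, and missed cleavages and modifications are ignored. Residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_digest(seq):
    """Cleave on the C-terminal side of K or R, except before P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def fingerprint(seq):
    """Predicted peptide mass fingerprint of a protein sequence."""
    return sorted(peptide_mass(p) for p in tryptic_digest(seq))


def count_matches(observed, theoretical, tol=0.2):
    """Number of observed masses within tol Da of any predicted mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def best_match(observed, database, tol=0.2):
    """Name of the database entry whose fingerprint matches best."""
    return max(database,
               key=lambda name: count_matches(observed, fingerprint(database[name]), tol))


# Hypothetical sequences, used only to exercise the functions.
DATABASE = {"protA": "MKWVTFISLLRSAYSRGVFRR", "protB": "GASPKVTLLRCDEFK"}
# Simulated measurement: protB's peptides with a small mass error.
OBSERVED = [m + 0.05 for m in fingerprint(DATABASE["protB"])]
print(best_match(OBSERVED, DATABASE))
```

A production search would weight matches statistically rather than count them, but the inference step is the same: the unknown is assigned the identity of the entry whose predicted digest best explains the measured masses.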
Historically, techniques such as Edman degradation have been extensively used for protein sequencing. However, sequencing by collision-induced dissociation (CID) MS methods (MS/MS sequencing) has rapidly evolved and has proved to be faster and to require less protein than Edman techniques. MS sequencing is accomplished either by using higher voltages in the ionization zone of the MS to randomly fragment a single peptide isolated from a protein digest, or more typically by tandem MS using collision-induced dissociation in the ion trap (quadrupole). However, the application of CID methods to protein sequencing requires that the protein first be chemilytically or enzymatically digested.
Several techniques can be used to select the peptide fragment used for MS/MS sequencing, including accumulation of the parent peptide fragment ion in the quadrupole MS unit, capillary electrophoretic separation coupled to ES-TOF MS detection, or other liquid chromatographic separations. The amino acid sequence of the peptide is deduced from the molecular weight differences observed in the resulting MS fragmentation pattern of the peptide using the published masses associated with individual amino acid residues in the MS, and has been codified into a semi-autonomous peptide sequencing algorithm. In this approach the peptide to be sequenced is typically accumulated in the quadrupole of a mass spectrometer. CID is then accomplished by injecting a neutral collision gas, typically Ar, into this ion trap to force high energy collisions with the peptide ion. Some of these collisions result in cleavage of the peptide backbone and the generation of smaller ions that, by virtue of their different mass to charge ratio, leave the quadrupole and are detected. The majority of the peptide cleavage reactions occur in relatively few ways, resulting in a high abundance of certain types of cleavage ions. The peptide sequence is then deduced from the apparent masses of these high abundance peptide fragments detected.
Mass spectrometry has the additional advantage in that it can be efficiently coupled to electrophoretic separation techniques both with or without endoproteolytic (e.g., trypsin digestion) or chemilytic (e.g., cyanogen bromide) cleavage of the protein into smaller fragments. However, no mass spectrometric technique has previously been described that directly determines the protein sequence or a protein sequence tag of unknown proteins. Furthermore, no MS sequencing technique has previously been described that directly couples to electrophoretic methods used to separate large numbers of proteins from a mixed protein sample.
For example, in the mass spectrum of a 1425.7 Da peptide (HSDAVFTDNYTR) isolated in an MS/MS experiment acquired in positive ion mode, the difference between the full peptide 1425.7 Da and the next largest mass fragment (y11, 1288.7 Da) is 137 Da. This corresponds to the expected mass of an N-terminal histidine residue that is cleaved at the amide bond. For this peptide, complete sequencing is possible as a result of the generation of high-abundance fragment ions that correspond to cleavage of the peptide at almost every residue along the peptide backbone. The generation of an essentially complete set of positively-charged fragment ions that include either end of the peptide is a result of the basicity of both the N- and C-terminal residues (H and R, respectively). If a basic residue is located at the N- or C-terminus, especially R, most of the ions produced in the CID spectrum will contain that residue since positive charge is essentially localized at that site. This greatly simplifies the resulting spectrum since these basic sites direct the fragmentation into a limited series of specific daughter ions. Peptides that lack basic residues tend to fragment into a more complex mixture of fragment ions that makes sequence determination more difficult.
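The mass-difference reasoning in this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions: it uses standard monoisotopic residue masses and models only singly charged y-type ions, so the computed values differ slightly from the rounded figures quoted in the text, and the function names are hypothetical. Leucine and isoleucine have identical masses and cannot be told apart this way.

```python
# Monoisotopic residue masses (Da). L is listed before I so that the
# isobaric pair is always reported as "L".
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def y_ions(seq):
    """Singly charged y-ion m/z values, y1 .. yn.
    yn equals the protonated intact peptide."""
    series, running = [], WATER + PROTON
    for aa in reversed(seq):
        running += MONO[aa]
        series.append(running)
    return series


def read_sequence(y_series, tol=0.02):
    """Read residues, N-terminus first, from the gaps between
    consecutive y-ion masses. Unmatched gaps are reported as '?'."""
    ladder = sorted(y_series, reverse=True) + [WATER + PROTON]
    residues = []
    for heavier, lighter in zip(ladder, ladder[1:]):
        delta = heavier - lighter
        aa, mass = min(MONO.items(), key=lambda kv: abs(kv[1] - delta))
        residues.append(aa if abs(mass - delta) <= tol else "?")
    return "".join(residues)


ys = y_ions("HSDAVFTDNYTR")
print(f"protonated peptide (y12): {ys[-1]:.2f}")
print(f"y11: {ys[-2]:.2f}")
print(f"y12 - y11: {ys[-1] - ys[-2]:.2f} (histidine residue)")
print(read_sequence(ys))
```

With a complete y-ion ladder, as in the peptide discussed here, every gap corresponds to exactly one residue and the full sequence can be read back; a missing fragment ion would show up as a gap that matches no single residue.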
Extending this idea, others demonstrated that attaching a hard positive charge to the N-terminus is an effective approach for directing the production of a complete series of N-terminal fragment ions from a parent peptide in CID experiments regardless of the presence of a basic residue at the N-terminus. Theoretically, all fragment ions are produced by charge-remote fragmentation directed by the fixed-charged group. Peptides have now been modified with several classes of fixed-charged groups, including dimethylalkylammonium, substituted pyridinium, quaternary phosphonium, and sulfonium derivatives. The characteristics of the most desirable labels are that they are easily synthesized, increase the ionization efficiency of the peptide, and (most importantly) direct the formation of a specific fragment ion series with minimal unfavorable label fragmentation. The most favorable derivatives that satisfy these criteria are those of the dimethylalkylammonium class, with quaternary phosphonium derivatives being less favorable only because of their more difficult synthesis. Substituted pyridinium derivatives are better suited for high-energy CID than alkylammonium derivatives.
Despite some progress in peptide analysis, protein identification remains a major bottleneck in the field of proteomics, with up to 18 hours being required to generate a protein sequence tag of sufficient length to allow the identification of a single purified protein from its predicted genomic sequence. Unambiguous protein identification is attained by generating a protein sequence tag (PST), which is now preferentially accomplished by collision-induced dissociation in the quadrupole of an MS/MS instrument. Limitations on the ionization efficiency of larger peptides and proteins restrict the intrinsic detection sensitivity of MS techniques and inhibit the use of MS for the identification of low abundance proteins. Limitations on the mass accuracy of time of flight (TOF) detectors can also constrain the usefulness of MS/MS sequencing, requiring that proteins be digested by proteolytic and chemilytic means into more manageable peptides prior to sequencing. Clearly, rapid and cost effective protein sequencing techniques would improve the speed and lower the cost of proteomics research. Finally, the separation agents and buffers used in traditional protein separation techniques are often incompatible with MS identification methods.
The present invention provides such methods.
Although the limited usefulness of existing protein expression profiling techniques has yielded fairly small and incomplete datasets of protein expression information, the art has been considering theoretical uses of higher resolution protein expression datasets, should they become available in view of new or improved techniques.
If high-resolution, high-sensitivity protein expression profiling methods and datasets were to become available to the art, significant progress in the areas of diagnostics, therapeutics, drug development, biosensor development, and other related areas would be possible. For example, multiple disease markers could be identified and utilized for better confirmation of a disease condition or stage (see U.S. Pat. Nos. 5,672,480; 5,599,677; 5,939,533; and 5,710,007). Subcellular toxicological information could be generated to better direct drug structure and activity correlations (see Anderson, L., "Pharmaceutical Proteomics: Targets, Mechanism, and Function," paper presented at the IBC Proteomics conference, Coronado, Calif. (Jun. 11-12, 1998)). Subcellular toxicological information can also be utilized in a biological sensor device to predict the likely toxicological effect of chemical exposures and likely tolerable exposure thresholds (see U.S. Pat. No. 5,811,231).
The present invention provides compositions, methods, apparatus, and computer-based databasing systems for high-throughput, high-resolution, and sensitive protein expression profiling from samples containing a plurality of polypeptide species, such as for example cells, tissues, and organs of bacteria, plants, and animals, and related aspects and uses thereof.
The literature citations discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention.
The present invention provides electrophoretic methods and devices for separating biological macromolecules (including polypeptides), methods for determining the partial or complete sequence of a polypeptide using mass spectroscopy, methods combining electrophoretic methods with polypeptide sequencing by mass spectroscopy, methods using the above to generate protein expression fingerprint datasets from a sample or a plurality of samples, and computer-based database query and retrieval systems for utilizing said protein expression fingerprint datasets for various uses, including but not limited to diagnostics, therapeutics, drug discovery, drug development, environmental monitoring by bioassay, toxin quantitation, biosensor development, gene therapy, pharmacological monitoring, illicit drug testing, transgenics, metabolic engineering, and related uses described herein or evident to the ordinarily-skilled artisan in view of the present teaching of the specification. The invention also provides the use of each of these methods, apparatuses, compositions, and computerized database query and retrieval systems.
In an aspect, the invention provides a method for separating a polypeptide species from a sample solution containing a plurality of polypeptide species and identifying said polypeptide species, the method comprising electrophoresing said sample solution containing a plurality of polypeptide species in a capillary electrophoresis device to separate and elute polypeptide species and thereby resolving said protein species based on at least one first biophysical parameter which discriminates protein species; and obtaining, by mass spectrographic fragmentation of eluted polypeptide species, a polypeptide sequence tag ("PST") identifying at least one resolved protein species. In a variation of the method, at least two capillary electrophoresis methods are used sequentially prior to mass spectrographic fragmentation of one or more eluted polypeptide species. In a variation of the method, a suitable mass spectrometry label is covalently attached to polypeptide species prior to mass spectrographic fragmentation. In a variation of the method, the PST comprises at least 2, and preferably 3 or 4 amino acid residues of the carboxy and/or amino terminal sequence of the eluted polypeptide species. In an embodiment of the method, at least 75 percent of polypeptide species present in the sample solution are separated and identified by PST determination. In an embodiment of the method, at least 5,000 unique polypeptide species present in the sample solution are separated and identified by PST determination; preferably at least 7,500 or more unique polypeptide species can be separated and identified in this method. In an embodiment of the method, the polypeptide species in the sample solution are naturally-occurring polypeptides obtained from a sample of a tissue, organ, or cell population.
In an aspect, the invention provides a method of obtaining a protein expression profile from a sample containing a cell population or a protein containing extract thereof, the method comprising: electrophoresing in a first capillary electrophoresis apparatus a solution containing a plurality of protein species obtained from a cell population and thereby resolving said protein species based on at least one first biophysical parameter which discriminates protein species, eluting fractions from said first electrophoresis apparatus and electrophoresing said fractions, separately, in a second capillary electrophoresis apparatus and thereby resolving said protein species based on at least one second biophysical parameter which discriminates protein species, and eluting the protein species and identifying the PSTs of a plurality of protein species from the sample by mass spectroscopy fragmentation. In an embodiment, at least 1,000 resolved proteins from the sample are identified by PST determination; in an embodiment at least 5,000 to 7,500 or more resolved proteins from the sample are identified by PST determination. In a variation, two samples are employed, a first sample from a standard (control or normal) cell population and a second sample from a test cell population; test cell populations can be, for example and not limitation, cells of a different histological type than the standard cell population, pathological cells of the same histological type as the standard cells, treated cells that have been exposed to a toxicological or pharmacological agent but which are of the same histological type as the standard cells, cells of a different passage level or age or replicative potential than the standard cells, or any other variation apparent to those skilled in the art seeking to ascertain protein expression profile differences between a first cell sample and a second cell sample. 
In an embodiment the test cell population is a biopsy of a putative neoplastic lesion and the standard cell population is a biopsy of surrounding apparently non-neoplastic tissue of the same histological origin, both obtained from a human patient, animal, or plant (e.g., crown gall tumor).
The present invention provides a variety of electrophoretic methods and apparatus for separating mixtures of proteins. The methods involve conducting multiple capillary electrophoresis methods in series, wherein samples for each method other than the initial method contain only a subset of the proteins from the preceding step (e.g., from fractions containing resolved protein from the preceding method). By using a variety of techniques to control elution during electrophoresis, the methods are capable of resolving proteins in even complex mixtures such as those obtained from tissues and native cells. Utilizing various labeling schemes and detection methods, certain methods can provide quantitative information on the amount of each of the separated proteins. Such information can be used in the development of protein databases in which proteins expressed under certain conditions are characterized and catalogued. Comparative studies to identify proteins that are differentially expressed between different types of cells or tissues can also be conducted with the methods of the present invention. The methods can also be used in diagnostic, structure-activity and metabolic engineering studies.
In general, the methods involve performing a plurality of electrophoretic methods in series. Each method in the series includes electrophoresing a sample containing multiple proteins to obtain a plurality of resolved proteins. The sample that is electrophoresed contains only a subset of the plurality of resolved proteins from the immediately preceding method in the series (except the first method of the series in which the sample is the initial sample that contains all the proteins). The resolved proteins from the final electrophoretic method are then detected using various techniques.
The electrophoretic methods typically are capillary electrophoresis methods, such as capillary isoelectric focusing electrophoresis (CIEF), capillary zone electrophoresis (CZE) and capillary gel electrophoresis (CGE), although the methods are amenable to other capillary electrophoresis methods as well. The particular order of the methods can vary. Typically, the methods utilize combinations of electrophoretic methods which separate proteins on the basis of different characteristics (e.g., size, charge, isoelectric point).
In certain methods, the proteins are labeled so that the resolved proteins are more easily detected and to increase the signal-to-noise ratio. Labeling also enables certain methods to be conducted such that the resolved proteins obtained from the final electrophoretic method are quantitated. Quantitation allows the relative abundance of proteins within a sample, or within different samples, to be determined. In certain methods, the time at which proteins are labeled is selected to precede electrophoresis by capillary zone electrophoresis. By selectively labeling certain residues, resolution of proteins during capillary zone electrophoresis can be increased.
Resolution, quantitation and reproducibility are enhanced by utilizing a variety of techniques to control elution of proteins during an electrophoretic method. The particular elution technique employed depends in part upon the particular electrophoretic method. However, in general, hydrodynamic, salt mobilization, pH mobilization and electroosmotic flow are utilized to controllably elute resolved proteins at the end of each electrophoretic separation.
Some methods provide for additional analysis after the electrophoretic separation. The type of analysis can vary and include, for example, infra-red spectroscopy, nuclear magnetic resonance spectroscopy, UV/VIS spectroscopy and complete or partial sequencing. In certain methods, proteins in the final fractions are further analyzed by mass spectroscopy to determine at least a partial sequence for each of the resolved proteins (i.e., to determine a protein sequence tag).
Thus, certain other methods involve performing one or more capillary electrophoretic methods, each of the one or more methods involving: (i) electrophoresing a sample containing multiple proteins within an electrophoretic medium contained within a capillary, and (ii) withdrawing and collecting multiple fractions, each fraction containing proteins resolved during the electrophoresing step. Each method in the series is conducted with a sample from a fraction collected in the preceding electrophoretic method, except the first electrophoretic method which is conducted with a sample containing the original mixture of proteins. Prior to conducting the last electrophoretic method, the proteins are labeled, either by labeling the proteins in the initial sample (i.e., labeling precedes all the electrophoretic separations) or by labeling proteins contained in fractions collected prior to the last electrophoretic method. The final electrophoretic method is performed, and resolved protein within, or withdrawn from, the capillary utilized to conduct the final method is detected with a detector. Hence, the detector is adapted to detect resolved protein within the capillary used in the final method or is connected in line with the capillary to detect resolved proteins as they elute from the capillary. In some instances, the detected proteins are quantitated and further analyzed by mass spectroscopy to determine the relative abundance and to establish a protein sequence tag for each resolved protein.
In one aspect, the present invention provides a method for sequencing a portion of a protein, comprising:
(a) contacting a protein with a C-terminus or N-terminus labeling moiety to covalently attach a label to the C- or N-terminus of the protein and form a labeled protein; and
(b) analyzing the labeled protein using a mass spectrometric fragmentation method to determine the sequence of at least the two C-terminus or two N-terminus residues.
In one group of embodiments, the method further comprises:
(c) identifying the protein by using the sequence of the at least two C-terminus or two N-terminus residues to search predicted protein sequences from a database of gene sequence data.
In a variation, the method further comprises:
(d) further identifying the protein by using one or more of the separation coordinates (i.e., approximate values of the biophysical parameters used to separate the protein prior to sequencing), for example, the apparent molecular weight, isoelectric point, or electrophoretic mobility.
In another variation, the method further comprises:
(e) further identifying the protein by using other known biological or measurable biophysical parameters of the protein (e.g., cell or tissue type extracted from, subcellular localization, the total or partial amino acid composition, the masses of any peptides resulting from chemilytic or enzymatic digestion).
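By way of illustration only, the database search of steps (c) and (d) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `PredictedProtein` record, the function `search_by_terminal_tag`, the tolerance values, and the toy sequences are all hypothetical, and a practical search would run against predicted protein sequences translated from real gene sequence data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PredictedProtein:
    """Hypothetical database entry predicted from gene sequence data."""
    name: str
    sequence: str    # one-letter residue codes, N-terminus first
    pI: float        # predicted isoelectric point
    mass_kda: float  # predicted molecular weight in kDa


def search_by_terminal_tag(db: List[PredictedProtein], tag: str, terminus: str,
                           pI: Optional[float] = None, pI_tol: float = 0.5,
                           mass_kda: Optional[float] = None,
                           mass_tol: float = 0.15) -> List[PredictedProtein]:
    """Return entries whose N- or C-terminus matches the tag, optionally
    filtered by approximate separation coordinates (pI, apparent mass)."""
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    hits = []
    for p in db:
        matches = p.sequence.startswith(tag) if terminus == "N" else p.sequence.endswith(tag)
        if not matches:
            continue
        # Separation coordinates are approximate, so compare within a tolerance.
        if pI is not None and abs(p.pI - pI) > pI_tol:
            continue
        if mass_kda is not None and abs(p.mass_kda - mass_kda) > mass_tol * mass_kda:
            continue
        hits.append(p)
    return hits


# Toy database with made-up sequences and coordinates.
toy_db = [
    PredictedProtein("protA", "MKTAYIAKQR", 9.8, 12.0),
    PredictedProtein("protB", "MKTLLVAGGE", 4.5, 30.0),
    PredictedProtein("protC", "MSTAYIAKQR", 9.8, 12.5),
]

by_tag_only = search_by_terminal_tag(toy_db, "MKT", "N")
by_tag_and_pI = search_by_terminal_tag(toy_db, "MKT", "N", pI=4.7)
```

In this toy case the three-residue N-terminal tag alone leaves two candidates, and adding the approximate isoelectric point as a separation coordinate narrows the result to one.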
In a variation, the method further comprises assisted fragmentation of the labeled protein in the mass spectrometer through the use of reactive collision gases. Illustrative reactive gases may include hydrazine, cyanogen bromide, hydrogen peroxide, ozone, and peracetic acid. Other similar reactive gases will be obvious to those skilled in the art.
In another variation, the method further comprises assisted fragmentation of the labeled protein in the mass spectrometer through the injection of high energy materials in the ionization zone. High energy materials may include transient compounds formed in a plasma or corona discharge, high energy electrons from a beta emitter or electron beam, high energy photons from a laser or high intensity light source of a minimum wavelength of 560 nm. Other high energy materials will be obvious to those skilled in the art.
In another aspect, the present invention provides a method for sequencing a portion of a protein in a protein mixture, the method comprising:
(a) contacting the protein mixture with a C-terminus or N-terminus labeling moiety to covalently attach a label to the C- or N-terminus of the protein and form a labeled protein mixture;
(b) separating individual labeled proteins in the labeled protein mixture; and
(c) analyzing the labeled proteins from step (b) by a mass spectrometric method to determine the sequence of at least two C-terminus or two N-terminus residues.
In one group of embodiments, the method further comprises:
(d) identifying the protein by using the sequence of at least two C-terminus or two N-terminus residues in combination with a separation coordinate of the labeled protein and the protein terminus location of the sequence to search predicted protein sequences from a database of gene sequence data.
In each of the methods above, the use of nonproteolytic protein sequencing by in-source fragmentation provides advantages over conventional MS/MS sequencing approaches. One particular advantage is time savings due to elimination of protein digestion steps and elimination of the need to accumulate low volatility peptide ions in the quadrupole. Another advantage is that fewer sequence ambiguities result due to the improved absolute mass accuracy gained by working at the low end of the mass spectrum. Yet another advantage is that better ionization efficiency and corresponding detection sensitivity result from using more energetic ionization conditions and adding one or more charged groups on the labeled fragments. The charged group can carry a “hard” charge, that is, be a permanently ionized group such as a tetraalkyl- or tetraaryl-ammonium, tetraalkyl- or tetraaryl-phosphonium, N-substituted pyridinium, or tetraalkyl- or tetraaryl-borate species. The charged group can alternatively carry a “soft” charge, that is, be an ionizable group which accepts or donates a proton to become ionized, such as a carboxylate, phosphonate, sulfonate, alkyl ammonium, or pyridinium species. This method provides a contiguous protein sequence tag (PST) that can be used both for unambiguous protein identification, by query of a computer database containing genomic sequence information or mRNA sequence information to establish naturally-occurring encoding sequences corresponding to the PST, and to generate an N- or C-terminal nucleic acid probe useful for isolating the corresponding cDNA from native cell or tissue samples by polymerase chain reaction amplification or nucleic acid hybridization techniques.
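For illustration, the way a contiguous sequence tag can be read from a ladder of labeled terminal fragments may be sketched as follows. The sketch assumes singly charged, monoisotopic fragment peaks that each differ from the previous one by exactly one residue, and it uses standard monoisotopic residue masses; the function name `read_tag`, the tolerance, and the 300.0 starting offset (standing in for the label plus first residue) are hypothetical. Leucine and isoleucine are isobaric and are reported together as "J"; real spectra would additionally require noise filtering and charge-state assignment.

```python
# Standard monoisotopic residue masses in Da; "J" denotes Leu or Ile (isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "J": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_tag(ladder_mz, tol=0.01):
    """Translate successive mass differences in a fragment ladder into residues.

    ladder_mz: ascending m/z values of singly charged labeled fragments.
    Returns one letter per gap; "X" marks a gap that matches no single residue.
    """
    tag = []
    for lighter, heavier in zip(ladder_mz, ladder_mz[1:]):
        delta = heavier - lighter
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]
        tag.append(matches[0] if len(matches) == 1 else "X")
    return "".join(tag)


# Build a synthetic ladder for the residues G, A, S on top of a hypothetical
# 300.0 m/z peak representing the labeled terminal fragment.
ladder = [300.0]
for residue in "GAS":
    ladder.append(ladder[-1] + RESIDUE_MASS[residue])

tag = read_tag(ladder)
```

Because the tag is read from mass differences rather than absolute masses, the mass of the label itself cancels out of every gap after the first peak.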
The invention further provides the identification and method of use of chemical labels suitable for enhanced quantitation of the proteins upon electrophoretic separation and subsequent sequencing of said proteins. In one embodiment a single chemical label contains groups that: (i) react with primary amino or carboxylic acid functionalities on the protein, including the N-terminus and C-terminus, (ii) enhance detection sensitivity, and (iii) provide a unique mass signature for the N- or C-terminal labeled peptide fragments generated during fragmentation in a mass spectrometer. In a variation, the label may consist of a mixture of isotopically distinct labels, such that the unique mass signature consists of two or more peaks for each peptide fragment that are separated by more than one amu at a single charge state in the mass spectrum. In another variation, the unique mass signature component and the detection enhancement component may be one and the same. In another embodiment, the chemical label may be modified by partial cleavage and/or addition subsequent to its use for protein quantitation and prior to its use for protein sequencing. In one variation, label addition or cleavage is conducted in solution during withdrawal and transport between the last capillary separation step and injection into the mass spectrometer. In another variation, label addition or cleavage is conducted in the gas phase during ionization in the mass spectrometer.
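As a hedged illustration of how a mixture of isotopically distinct labels yields a recognizable mass signature, the following sketch scans a peak list for pairs of peaks separated by the light/heavy label mass difference. The function name `find_label_doublets`, the 4.0 amu label difference, the tolerance, and the peak values are assumptions chosen for the example and are not taken from the specification.

```python
def find_label_doublets(peaks_mz, delta_amu, charge=1, tol=0.02):
    """Return (light, heavy) peak pairs whose spacing equals delta_amu / charge.

    peaks_mz:  observed m/z values (any order).
    delta_amu: mass difference between the heavy and light label, in amu.
    charge:    assumed charge state of the fragments.
    """
    spacing = delta_amu / charge
    peaks = sorted(peaks_mz)
    pairs = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            gap = heavy - light
            if gap > spacing + tol:
                break  # peaks are sorted, so later peaks are even farther away
            if abs(gap - spacing) <= tol:
                pairs.append((light, heavy))
    return pairs


# Made-up spectrum: two labeled fragments appear as doublets, two peaks do not.
spectrum = [357.02, 361.03, 400.10, 428.06, 432.06, 450.00]
doublets = find_label_doublets(spectrum, delta_amu=4.0)
```

Peaks that lack a partner at the expected spacing (here 400.10 and 450.00) can be set aside as unlabeled fragments or noise, which is the practical value of the doublet signature.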
The invention further provides a method incorporating volatile buffers and surfactants in the final capillary electrophoretic method to facilitate direct coupling of the separation and mass spectrometric detection methods. A volatile buffer is a salt composed of an anion and cation that readily accept or give up a proton to form a volatile organic compound that negligibly interferes with the ionization of proteins or peptides in the mass spectrometer. Illustrative examples include ammonium acetate, ammonium carbonate or bicarbonate, ammonium N-morpholinoethanesulfonate, triethylammonium acetate, pyridinium acetate, and pyridinium N-morpholinoethanesulfonate. Illustrative examples of volatile surfactants include ammonium, pyridinium, tetramethylammonium, and trimethylammonium salts of dodecylsulfate and partially fluorinated or perfluorinated carboxylic, sulfonic, or phosphonic acids of aliphatic hydrocarbons with at least 5 carbon atoms. Many other examples will be evident to those skilled in the art.
The present invention overcomes many of the difficulties associated with current MS-based protein sequencing technologies, including, for example, ionization inefficiency and inaccuracies in fragment mass. Because the methods of the invention preferably eliminate the need for proteolytic or chemolytic digestion of the protein, the present methods provide protein sequencing times that are significantly reduced from the times obtainable using prior methods. Moreover, because the proteins being sequenced are highly fragmented using the present methods, the ionization efficiency and the volatility of the resulting fragments are higher than those of the parent protein, thus leading to a detection sensitivity that is improved over prior methods.
The invention provides a method for identifying a high-resolution protein expression fingerprint for a cell type, tissue, or pathological sample, comprising obtaining a protein-containing extract of a cellular sample and electrophoresing said extract with a first capillary electrophoresis apparatus, eluting protein-containing fractions therefrom, electrophoresing said protein-containing fractions on a second capillary electrophoresis apparatus, or plurality thereof in parallel, and identifying the species of proteins by fragmentation mass spectroscopy sequencing to obtain PSTs for a plurality of protein species, and compiling a dataset (or fingerprint record) containing the collection of PSTs obtained thereby. A variation of the method comprises quantitative detection of protein species and compiling a dataset wherein the relative abundance and/or absolute amount of a plurality of protein species eluted from said second capillary electrophoresis is/are cross-tabulated with the PST identification. A typical embodiment comprises attachment of a mass spectroscopy label to the proteins in the protein-containing extract prior to the last capillary electrophoresis step. In a variation, more than two capillary electrophoresis steps are used; in an embodiment, capillary isoelectric focusing (CIEF) is the first capillary electrophoresis, and the second capillary electrophoresis is either capillary zone electrophoresis (CZE) or capillary gel electrophoresis (CGE).
A protein expression fingerprint comprises an array of at least 100 protein species each having a unique identifier (which may comprise PST and/or electrophoretic mobility data and/or pI and/or any other biophysical property ascertainable by capillary electrophoresis, and/or any other biophysical property known by virtue of the origin of the sample prior to electrophoresis, and/or any other measurable biophysical property), optionally including cross-tabulation with quantitative data indicating relative and/or absolute abundance of each species in the sample. A protein expression fingerprint record comprises a protein expression fingerprint cross-tabulated to data indicating sample source and optionally other bioinformational data (pathological condition, age, passage history, etc.).
In a variation, the invention provides a method for producing a computer database comprising a computer and software for storing in computer-retrievable form a collection of protein expression fingerprint records cross-tabulated with data specifying the source of the protein-containing sample from which each protein expression fingerprint record was obtained. In a variation, at least one of the sources is from a tissue sample known to be free of pathological disorders. In a variation, at least one of the sources is a known pathological tissue specimen, for example but not limitation a neoplastic lesion or a tissue specimen containing an infectious agent such as a virus, or the like. In a variation, the protein expression fingerprint records cross-tabulate at least the following parameters for each protein species in a sample: (1) a unique identification code, which can comprise a PST and/or characteristic electrophoretic separation coordinate; (2) sample source; optionally (3) absolute and/or relative quantity of the protein species present in the sample; optionally (4) presence or absence of amino- or carboxy-terminal post-translational modifications; and/or optionally (5) original electropherograms and/or mass spectra used to identify the proteins and PST. A database comprises a plurality of protein expression fingerprint records, each of which represents a protein expression fingerprint from one sample or a subfraction thereof.
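One possible in-memory layout for such a record, offered only as a sketch, is shown below. The class names `ProteinSpecies` and `FingerprintRecord`, the field names, and the example values are hypothetical; the specification does not prescribe any particular schema, and item (5), the raw electropherograms and mass spectra, is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProteinSpecies:
    """One resolved protein species within a fingerprint (hypothetical schema)."""
    identifier: str                               # (1) unique identification code
    pst: Optional[str] = None                     # protein sequence tag, if determined
    pI: Optional[float] = None                    # separation coordinate from CIEF
    mobility: Optional[float] = None              # separation coordinate from CZE/CGE
    relative_abundance: Optional[float] = None    # (3) optional quantity
    terminal_modification: Optional[bool] = None  # (4) optional modification flag


@dataclass
class FingerprintRecord:
    """A fingerprint cross-tabulated with its sample source and annotations."""
    sample_source: str                            # (2) sample source
    species: List[ProteinSpecies] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)

    def abundance_table(self) -> Dict[str, float]:
        """Map identifier to relative abundance for species that were quantitated."""
        return {s.identifier: s.relative_abundance
                for s in self.species if s.relative_abundance is not None}


# Made-up example record.
record = FingerprintRecord(
    sample_source="biopsy, apparently non-neoplastic tissue",
    species=[
        ProteinSpecies("sp-0001", pst="MKTA", pI=9.8, relative_abundance=0.12),
        ProteinSpecies("sp-0002", pst="GSPW", pI=4.5),
    ],
    annotations={"passage": "3"},
)
table = record.abundance_table()
```

Keeping the optional items as `None` by default lets a record hold species that were identified but not quantitated, consistent with the optional parameters (3) through (5).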
The invention also provides for the storage and retrieval of a collection of such polypeptide fingerprints in a computer data storage apparatus, which can include magnetic disks, optical disks, magneto-optical disks, DRAM, SRAM, SGRAM, SDRAM, magnetic bubble memory devices, and other data storage devices, including CPU registers and on-CPU data storage arrays. Typically, the polypeptide fingerprint records are stored as a bit pattern in an array of magnetic domains on a magnetizable medium or as an array of charge states or transistor gate states, such as an array of cells in a DRAM device (e.g., each cell comprised of a transistor and a charge storage area, which may be on said transistor). The invention provides such storage devices, and computer systems built therewith, comprising a bit pattern encoding a protein expression fingerprint record comprising unique identifiers for at least 100 protein species cross-tabulated with sample source. The invention provides a method for identifying related polynucleotide or polypeptide sequences, comprising performing a computerized comparison between a PST sequence stored in or retrieved from a computer storage device or database and at least one other sequence; such comparison can comprise a sequence analysis or comparison algorithm or computer program embodiment thereof (e.g., FASTA, TFASTA, GAP, BESTFIT) and/or the comparison may be of the relative amount of a PST sequence in a pool of sequences determined from a polynucleotide sample of a specimen. The invention provides a computer system comprising a storage device having a bit pattern encoding a database having at least 100 protein expression fingerprint records obtained by the methods of the invention, and a program for sequence alignment and comparison to predetermined genetic or protein sequences. 
The invention also provides a magnetic disk, such as an IBM-compatible (DOS, Windows, Windows95/98/2000, Windows NT, OS/2) or other format (e.g., Linux, SunOS, Solaris, AIX, SCO Unix, VMS, MV, Macintosh, etc.) floppy diskette or hard (fixed, Winchester) disk drive, comprising a bit pattern encoding a protein expression fingerprint record; often the disk will comprise at least one other bit pattern encoding a polynucleotide and/or polypeptide sequence other than a protein expression fingerprint record of the invention, typically in a file format suitable for retrieval and processing in a computerized sequence analysis, comparison, or relative quantitation method. The invention also provides a network, comprising a plurality of computing devices linked via a data link, such as an Ethernet cable (coax or 10BaseT), telephone line, ISDN line, wireless network, optical fiber, or other suitable signal transmission medium, whereby at least one network device (e.g., computer, disk array, etc.) comprises a pattern of magnetic domains (e.g., magnetic disk) and/or charge domains (e.g., an array of DRAM cells) composing a bit pattern encoding a protein expression fingerprint record of the invention. The invention also provides a method for transmitting a protein expression fingerprint record of the invention, which is uniquely determined by the methodology employed to generate it, comprising generating an electronic signal on an electronic communications device, such as a modem, ISDN terminal adapter, DSL, cable modem, ATM switch, or the like, whereby said signal comprises (in native or encrypted format) a bit pattern encoding a protein expression fingerprint record or a database comprising a plurality of protein expression fingerprint records obtained by the method of the invention, respectively.
The invention provides a computer system for comparing a query polypeptide sequence or query polynucleotide sequence to a database containing an array of PST sequences and other data structures of a protein expression fingerprint record obtained by the method of the invention, and ranking database sequences based on the degree of sequence identity and gap weight to query sequence. A central processor is initialized to load and execute computer program for alignment and/or comparison of amino acid sequences or nucleotide sequences. A query sequence comprising at least 4 amino acids or 12 nucleotides is entered into the central processor via I/O device. Execution of computer program results in central processor retrieving sequence data from data file, which comprises a binary description of a protein expression fingerprint record or portion thereof containing polypeptide sequence data for the record. Said sequence data or record and said computer program can be transferred to secondary memory, which is typically random access memory (e.g., DRAM, SRAM, SGRAM, or SDRAM). Sequences are ranked according to the degree of sequence identity to the query sequence and results are output via an I/O device. For example and not to limit the invention, a central processor can be a conventional computer (e.g., Intel Pentium, PowerPC, Alpha, PA-8000, SPARC, MIPS 4400, MIPS 10000, VAX, etc.); a program can be a commercial or public domain molecular biology software package (e.g., UWGCG Sequence Analysis Software, Darwin, blastn); a data file can be an optical or magnetic disk, a data server, a memory device (e.g., DRAM, SRAM, SGRAM, SDRAM, EPROM, bubble memory, flash memory, etc.); an I/O device can be a terminal comprising a video display and a keyboard, a modem, an ISDN terminal adapter, an Ethernet port, a punched card reader, a magnetic strip reader, or other suitable I/O device.
The invention provides a computer program for comparing query polypeptide sequence(s) or query polynucleotide sequence(s) or a query protein expression fingerprint to a protein expression fingerprint database obtained by a method of the invention and ranking database sequences based on the degree of similarity of protein species expressed and relative and/or absolute abundances in a sample. The initial step is input of a query polynucleotide or polypeptide sequence, or protein expression fingerprint record obtained by a method of the invention, via an I/O device. A data file is accessed to retrieve a collection of protein expression fingerprint records for comparison to the query; said collection comprises protein expression fingerprint records obtained by a method of the invention. Individually or collectively, sequences or other cross-tabulated information of the protein expression fingerprint collection are optimally matched to the query sequence(s) or query protein expression record, such as by the algorithm of Needleman and Wunsch or the algorithm of Smith and Waterman or other suitable algorithm obtainable by those skilled in the art. Once aligned or matched, the percentage of sequence or fingerprint similarity is computed for each aligned or matched sequence to generate a similarity value for each sequence or fingerprint in the protein expression fingerprint record collection as compared to the query sequence(s) or fingerprint(s). Sequences are ranked in order of greatest sequence identity or weighted match to the query sequence or query fingerprint, and the relative ranking of the sequence or fingerprint to the best matches in the collection of records is thus generated.
A determination is made: if more sequences or fingerprint records exist in the data file, the additional sequences/fingerprints or a subset thereof are retrieved and the process is iterated; if no additional sequences/fingerprints exist in the data file, the rank ordered sequences/fingerprints are output via an I/O device, thereby displaying the relative ranking of sequences/fingerprints among the sequences/fingerprints of the data file optimally matched and compared to the query sequence(s) or fingerprint(s).
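The match-and-rank procedure described above can be illustrated with a short sketch that scores each stored PST against a query using a Smith and Waterman style local alignment and then sorts the results. The scoring values (match +2, mismatch -1, linear gap -2), the function names, and the toy records are assumptions for illustration; a production system would more likely use a substitution matrix and an established alignment package.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_records(query, records):
    """Return (score, record_name) pairs sorted from best to worst match.

    records: mapping of record name to stored PST sequence.
    Ties are broken alphabetically by record name.
    """
    scored = [(smith_waterman_score(query, seq), name) for name, seq in records.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Made-up stored tags and query.
stored_tags = {"rec1": "MKTAY", "rec2": "MKTLY", "rec3": "GGSPW"}
ranking = rank_records("MKTAY", stored_tags)
```

The iteration over additional sequences in the data file corresponds here to the loop over `records`; for a data file too large to hold in memory, the same scoring could be applied batch by batch before a final sort.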
The invention also provides the use of a computer system described above, which comprises: (1) a computer, (2) a stored bit pattern encoding a collection of protein expression fingerprint records obtained by the methods of the invention, which may be located in said computer, (3) a comparison sequence or fingerprint, such as a query sequence or a data file containing fingerprint information, and (4) a program for alignment and comparison, typically with rank-ordering of comparison results on the basis of computed similarity values. In an embodiment, neural network pattern matching/recognition software is trained to identify and match fingerprint records based on backpropagation using empirical data input by a user. The computer system and methods described permit the identification of the relative relationship of a query protein expression fingerprint to a collection of protein expression fingerprints; preferably all protein expression fingerprints (query and database) are obtained by the methods of the invention.
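As one hedged example of a computed similarity value between whole fingerprints, the sketch below compares abundance profiles by cosine similarity and rank-orders a collection against a query. Cosine similarity is an assumption chosen for simplicity and is not named in the specification; the function names and the toy abundance values are likewise hypothetical.

```python
import math


def fingerprint_similarity(a, b):
    """Cosine similarity between two fingerprints.

    a, b: mappings of protein identifier to relative abundance. A species
    missing from one fingerprint is treated as having zero abundance.
    """
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_fingerprints(query, collection):
    """Return (similarity, record_name) pairs sorted from most to least similar."""
    scored = [(fingerprint_similarity(query, fp), name) for name, fp in collection.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Made-up query and collection.
query_fp = {"P1": 1.0, "P2": 2.0}
collection = {
    "normal": {"P1": 1.0, "P2": 2.0},
    "lesion": {"P1": 2.0, "P3": 1.0},
}
fp_ranking = rank_fingerprints(query_fp, collection)
```

A trained pattern-matching model, as contemplated in the embodiment above, could replace `fingerprint_similarity` without changing the ranking step.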
A further understanding of the nature and advantages of the invention will become apparent by reference to the remaining portions of the specification and drawings.