This invention concerns a method and apparatus for determining peptide sequences, and particularly automated sequencing by the method and apparatus.
The chemical process employed by protein/peptide sequencers is derived from a technique originated by Pehr Edman in the 1950s for the sequential degradation of peptide chains (Edman, Acta Chem. Scand. 4, 283, 1950; Edman and Begg, Eur. J. Biochem. 1, 80, 1967). The first step in this degradation is selective coupling of a peptide's amino-terminal amino acid with the Edman reagent, phenylisothiocyanate (PITC), a reaction catalyzed by an organic base delivered with the coupling reagent. The second step is cleavage of this derivatized amino acid from the remainder of the peptide, a reaction effected by treating the peptide with a strong organic acid. Each repeated coupling/cleavage cycle occurs at the newly-formed amino-terminal amino acid left by the previous cycle. Thus, repetitive cycles provide sequential separation of the amino acids which form the primary structure of the peptide.
The sequencing process is not completed by the Edman degradation alone. Once the amino acids are removed from the sample, they must be analyzed to determine their identity. Since the cleaved amino acid derivative, the anilinothiazolinone (ATZ), is not generally suitable for analysis, it is converted to the more stable phenylthiohydantoin (PTH) form before analysis is attempted. In modern sequencers (Wittmann-Liebold et al., Anal. Biochem. 75, 621, 1976; Hewick et al., J. Biol. Chem. 256, 7990, 1981), this conversion is accomplished automatically in a reaction vessel separate from that in which the Edman degradation occurs. The ATZ produced at each degradation cycle is extracted from the peptide with an organic solvent, transferred to the reaction vessel and treated with an aqueous solution of a strong organic acid to effect conversion to the PTH. The PTHs produced from each degradation cycle may be transferred to fraction collector vials until several are manually collected and prepared for analysis. Alternatively, the PTHs may be transferred directly and automatically from the sequencer conversion vessel to an on-line analysis system (Machleidt, W. and Hoffner, H., in Methods in Peptide and Protein Sequence Analysis, pp 35-47, Birr, ed., Elsevier (1980); Wittmann-Liebold and Ashman, in Modern Methods in Protein Chemistry, pp 303-327, Tschesche, ed., de Gruyter (1985); Rodriguez, J. Chromatography 350, pp 217-225, (1985)).
Although a variety of analytical procedures have been used to identify the amino acids released during the Edman degradation, only high performance liquid chromatography (HPLC) is currently in widespread use. In fact, HPLC on reverse phase, silica-based packings has revolutionized peptide sequencing. It provides rapid, sensitive and quantitative analysis of PTH amino acids and is presently the only technique used for PTH analysis that can reliably resolve all of the PTH amino acids in a single chromatographic run. Moreover, because it provides quantitative data at the picomole level, HPLC is the only analytical method suitable for microsequencing by automated Edman sequencers at the present time.
Ideally, each chromatogram would provide a simple qualitative answer, i.e., the presence of one and only one PTH. As a practical matter, this is never the case; each chromatogram contains some amount of all PTHs, and a quantitative evaluation of the relative amounts must be made in order to make the sequence assignment. Several factors give rise to this problem. First, protein or peptide samples are unlikely to be pure. They always contain some level of other peptides or free amino acids that give rise to PTH signals during sequencing. Second, repeated exposure of the sample to the cleavage acid during the Edman chemistry causes splitting of the peptide chain at sites other than the amino terminus. The newly exposed amino termini resulting from these internal splits then produce PTHs after subsequent coupling/cleavage cycles. As a result, each type of amino acid generally exhibits a background PTH level that slowly changes during the sequence run, typically rising through the early cycles and falling slowly during later ones. The absolute levels of the background are dependent on the amino acid composition of the peptide, the Edman chemistry conditions, and the molecular weight of the peptide. Third, removal of an amino terminal amino acid at any given cycle of the Edman chemistry is incomplete. Therefore, some of the amino acid that should have been released at that cycle will remain for the next coupling/cleavage cycle and be released then. This carryover, or lag, is cumulative; multiple failures on any single peptide molecule will result in a steadily increasing proportion of a population of molecules being out of phase with the expected release order. Fourth, the recovery of the expected PTH is slowly decreased during the run by side reactions that block the amino terminal group, physical loss of peptide from the reaction chamber, internal chain cleavage, and lag. This decrease in signal, measured as the repetitive cycle yield, occurs simultaneously with the increase in noise (due to the factors described above), making correct amino acid assignment ever more difficult as a sequencing run proceeds further into the peptide. Fifth, the relative recoveries of the PTH amino acids from the Edman chemistry vary. Some are recovered almost quantitatively, while others are largely destroyed before analysis.
Despite these problems, rigorous interpretation of the chromatographic data from a sequencer run in terms of an amino acid sequence has not received as much attention as the chemistry and instrumentation employed. Many, perhaps most, sequences are assigned by visual inspection of chromatograms to distinguish the specific increase in the PTH level of one amino acid at each cycle from the general background level of all the PTHs. This method is remarkably simple and effective, but it does have limitations. It relies on the scientist's pattern recognition abilities, skills that are largely subjective and limited to direct comparison of only two to three chromatograms at any one time.
Because of these limitations, an increasing number of scientists are using HPLC peak integration systems to translate the analog signals displayed on chart recorder traces into a simpler set of digital numbers. This allows the recovery of each PTH at each cycle to be plotted on a graph that more clearly shows the specific sequence signals superimposed on the background noise levels. Smithies et al. (Biochemistry 10, 4912, 1971) were the first to define the mathematics of the sequencing chemistry in terms of initial yield, repetitive yield, lag, and amino acid background and to attempt quantitative sequence analysis based on peak integration. Machleidt, W. and Hofner, F., (1981), in High Performance Chromatography in Protein & Peptide Chemistry, pp 245-258, Walter de Gruyter, Berlin, have also contributed to this process. All of the previous methods, however, have relied on the subjective grading of the integrated peak values by the skilled scientist performing the sequence analysis. The scientist's subjective interpretation of the relative importance of an elevated level of one amino acid versus another at any given cycle has still been required for the final sequence assignment.
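The interplay of initial yield, repetitive yield, and lag can be illustrated with a small numerical sketch. It is not the formulation of Smithies et al.; it assumes, purely for illustration, a constant repetitive yield and a constant lag fraction at every cycle, and it ignores amino acid background entirely. The function name and parameters are hypothetical.

```python
def simulate_release(n_residues, n_cycles, initial_yield, repetitive_yield, lag):
    """Return released[c][k]: amount of residue k released at cycle c.

    Illustrative assumptions (not from the source): every cycle retains the
    same fraction `repetitive_yield` of sequenceable peptide, and the same
    fraction `lag` of the retained peptide fails to cleave.
    """
    # pop[k] = amount of peptide from which k residues have been removed
    pop = [0.0] * (n_residues + 1)
    pop[0] = initial_yield
    released = []
    for _ in range(n_cycles):
        out = [0.0] * n_residues
        nxt = [0.0] * (n_residues + 1)
        for k in range(n_residues):
            surviving = pop[k] * repetitive_yield   # losses to blocking, washout
            cleaved = surviving * (1.0 - lag)       # in-phase release of residue k
            out[k] = cleaved
            nxt[k + 1] += cleaved                   # advances to the next residue
            nxt[k] += surviving - cleaved           # lags one cycle behind
        released.append(out)
        pop = nxt
    return released


if __name__ == "__main__":
    for cycle, row in enumerate(simulate_release(5, 5, 100.0, 0.9, 0.02), start=1):
        print(cycle, [round(x, 2) for x in row])
```

In each printed row the largest value falls at the residue expected for that cycle, while the smaller values to its left are lagging, out-of-phase signal that accumulates as the run proceeds.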
In addition to all of the above difficulties having to do with background PTH levels, cumulative lag, side reactions, etc., other important problems are associated with the chromatographic data itself. While most commercially available chromatography software works well with ideal data (i.e., with large, well-resolved peaks), it performs much less well with real-world data. With respect to analyses of amino acid derivatives, such non-ideal data is the rule rather than the exception. Generally, amino acid analyses involve separations of a complex mixture of closely-related compounds, frequently at such minute levels that conventional software fails to provide satisfactory results unless the user provides extensive manual input to correct the deficiencies in the software.
In concept, HPLC data systems collect chromatographic data by periodically sampling the output of the HPLC detector and then process this digitized data. Quantitation is then performed using peak integration, which requires locating the start and end points of a peak, measuring the total signal between these points, and subtracting any background signal. The center position of the peak (i.e., its retention time) is also required to identify it as a known component based on retention times obtained with standards. Then, the measured area of a sample peak can be converted to a molar amount based on the measured area of the corresponding standard. This conceptually simple process is, however, complicated by several factors, such as chromatographic noise, peak overlap, and retention time drift.
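The integration steps described above can be sketched as follows. This is a minimal illustration, assuming the peak start and end indices are already known, a straight-line background between them, and trapezoidal summation of the sampled signal. The function names are hypothetical and do not correspond to any particular HPLC data system.

```python
def integrate_peak(times, signal, start, end):
    """Return (background-corrected area, retention time) for one peak.

    Assumes `start` and `end` are indices of the peak boundaries and that the
    background is a straight line joining the signal at those two points.
    """
    total = 0.0
    for i in range(start, end):
        dt = times[i + 1] - times[i]
        total += 0.5 * (signal[i] + signal[i + 1]) * dt   # trapezoid rule
    # Area under the straight-line background between the boundaries
    background = 0.5 * (signal[start] + signal[end]) * (times[end] - times[start])
    # Retention time taken here as the time of the highest sample (apex)
    apex = max(range(start, end + 1), key=lambda i: signal[i])
    return total - background, times[apex]


def area_to_amount(sample_area, standard_area, standard_amount):
    """Convert a peak area to a molar amount using the matching standard."""
    return sample_area / standard_area * standard_amount


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0]
    signal = [1.0, 1.0, 3.0, 1.0, 1.0]
    area, retention = integrate_peak(times, signal, 1, 3)
    print(area, retention, area_to_amount(area, 4.0, 10.0))
```

Locating the start and end points reliably is the hard part in practice, and the factors discussed next are what make it so.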
The chromatographic noise arises from the detector electronics, incomplete mixing of solvents during gradient chromatography, passage of gas bubbles or particulates through the detector, refractive index changes due to solvent or temperature gradients, and the elution of solvent or column contaminants. At the present time, conventional HPLC systems deal rather imperfectly with both low and high frequency chromatographic noise. Most high frequency filtering relies on hardware implementations and is performed by analog filters built into the detector circuitry, and some HPLC systems attempt to remove low frequency noise (often called baseline drift) by using point-by-point subtraction of a blank chromatogram from the sample data. This latter technique is particularly troublesome since it introduces additional high frequency noise and because baselines can vary substantially from run to run. New and less cumbersome techniques are clearly needed for the reduction of chromatographic noise.
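One reason point-by-point blank subtraction can add high frequency noise is sketched below. The example assumes, for illustration only, that the sample and blank runs each carry independent random noise of the same magnitude on a flat baseline. Under that assumption the noise variances add, so the noise in the difference is larger by a factor of about the square root of two. The function name and parameters are hypothetical.

```python
import random
import statistics


def noise_after_blank_subtraction(n_points=20000, sigma=1.0, seed=0):
    """Return (noise before, noise after) point-by-point blank subtraction.

    Assumes independent Gaussian noise of equal standard deviation in the
    sample run and the blank run, with no baseline drift in either.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, sigma) for _ in range(n_points)]
    blank = [rng.gauss(0.0, sigma) for _ in range(n_points)]
    difference = [s - b for s, b in zip(sample, blank)]
    return statistics.pstdev(sample), statistics.pstdev(difference)


if __name__ == "__main__":
    before, after = noise_after_blank_subtraction()
    print("noise ratio after/before:", after / before)
```

The subtraction removes only what the two runs share. Any baseline variation between runs remains in the result, along with the added noise.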
Peak overlap, i.e., the presence of incompletely resolved peaks, is particularly troublesome to HPLC software. Small peaks that partially overlap larger ones may be missed by the slope threshold routines and hence incorrect chromatogram quantitation can result. When fused peaks are detected, several methods for splitting the total area can be used, which are distinguished by the method by which the baseline under the peak components is set. These include (i) a linear extrapolation between the beginning and end points of the multiplet with a linear drop from the valley between the peaks to the baseline, (ii) a similar extrapolation to set the baseline of a major component with a tangent skim to set the baseline of a minor component, and (iii) linear extrapolation between the beginning and end points of each separate component. The method which gives the most accurate peak measurements depends on both the degree of resolution between the peaks and the relative peak heights. It is, therefore, highly sample dependent and frequently requires user adjustment from one sample to another in a set of chromatograms in which these parameters are not constant.
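Method (i) above, a common baseline across the multiplet with a drop from the valley, might look like the following sketch. It assumes the start, valley, and end indices of a two-peak multiplet have already been found, and it uses trapezoidal summation. The function name is hypothetical.

```python
def split_by_valley_drop(times, signal, start, valley, end):
    """Split a fused doublet into two areas by a drop line at the valley.

    The baseline is a straight line from the multiplet start to its end;
    each component's area is the signal above that line on its side of
    the valley. Boundary indices are assumed to be known.
    """
    t0, t1 = times[start], times[end]
    s0, s1 = signal[start], signal[end]

    def above_baseline(i):
        baseline = s0 + (s1 - s0) * (times[i] - t0) / (t1 - t0)
        return signal[i] - baseline

    def area(a, b):
        return sum(
            0.5 * (above_baseline(i) + above_baseline(i + 1)) * (times[i + 1] - times[i])
            for i in range(a, b)
        )

    return area(start, valley), area(valley, end)


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0]
    signal = [1.0, 5.0, 4.0, 8.0, 3.0]
    print(split_by_valley_drop(times, signal, 0, 2, 4))
```

Methods (ii) and (iii) differ only in how the baseline under each component is drawn, so the same data can yield different areas depending on which method is chosen.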
Retention time drift is also a particular problem since once peaks have been located and quantitated, they must be identified by matching their retention times to those of known standards. This is simple if the variation in retention times from run-to-run is always less than the time separation between closely eluting peaks within a run. Typically, software routines are set to search an elution time "window" centered on the standard elution time to find the best match of an unknown peak with the standard. With complex separations that produce closely spaced peaks, this does not always work since elution time drift may move one peak outside its window and place another in it. This problem can be minimized, however, by using easily identified reference peaks to measure the drift and empirically correct the search windows for other peaks. The reference peaks must be well-separated from any neighboring peaks and present in all chromatograms so their search windows can be large enough to allow for the maximum observable drift.
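The window search with reference-peak correction can be sketched as below. As a simplifying assumption, the drift measured on a single reference peak is applied as the same additive shift to every other search window; a real correction may be more elaborate. The function and parameter names are hypothetical.

```python
def identify_peaks(peak_times, standards, reference_name, reference_window, window):
    """Match observed retention times to standards after drift correction.

    `standards` maps component names to their standard retention times.
    The reference peak is searched in a wide window; its observed offset is
    then used to re-center the narrow windows of all components.
    """
    ref_expected = standards[reference_name]
    candidates = [t for t in peak_times if abs(t - ref_expected) <= reference_window]
    if not candidates:
        return {}  # reference peak not found; no drift estimate available
    ref_observed = min(candidates, key=lambda t: abs(t - ref_expected))
    drift = ref_observed - ref_expected

    assignments = {}
    for name, expected in standards.items():
        center = expected + drift  # drift-corrected window center
        in_window = [t for t in peak_times if abs(t - center) <= window]
        if in_window:
            assignments[name] = min(in_window, key=lambda t: abs(t - center))
    return assignments


if __name__ == "__main__":
    standards = {"REF": 10.0, "A": 5.0, "B": 5.5}
    observed = [5.4, 5.9, 10.4]  # every peak has drifted later by 0.4
    print(identify_peaks(observed, standards, "REF", 1.0, 0.2))
```

In this example, uncorrected windows of the same width would place the drifted peak of component A inside the window for component B, which is the failure described above.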
What is needed is a method that resolves most of these problems with chromatogram quantitation and which can be used by a computer to evaluate the set of HPLC data derived from a peptide sequencing run to automatically arrive at an unequivocal call of the sequences, without having to rely on the subjective interpretations of especially skilled individuals.