Peptide mapping is a valuable approach for combining positional quantitative information with topographical and domain information about proteins. Annotated peptide mapping, particularly of the products of protein-coding genes, is a useful procedure and a critical goal of many genome sequencing projects and of biomedical and biopharmaceutical research efforts. Despite advances in computational gene finding, the comprehensive annotation of proteins, including clinically relevant proteins, remains challenging.
A tandem mass spectrum can be viewed as a collection of fragment masses from a single peptide (e.g., eight to 30 amino acids from an enzymatically digested protein). This set of mass values is a partial “fingerprint” that may be used to help identify the peptide. The spectra are usually not analyzed de novo. Instead, they are compared against peptides from a database of known proteins, and may be used in conjunction with other sources of analytic information to converge on an interpretation of the large amount of data provided by MS. For example, additional data may include chromatography (UV) data. Much research has been devoted to improving the accuracy of this search by refining scoring, improving search speed, and handling post-translational modifications.
Due to the complexity of proteins and their biological production, characterization of protein pharmaceuticals (“biologics”) poses much more demanding analytical challenges than characterization of small-molecule drugs. Biologics are prone to production problems such as sequence variation, misfolding, variant glycosylation, and post-production degradation, including aggregation and modifications such as oxidation and deamidation. These problems can lead to loss of safety and efficacy, so the biopharmaceutical industry would like to identify and quantify variant and degraded forms of the product down to low concentrations, and to obtain tertiary structure information.
In particular, in the pharmaceutical field, there is a need to characterize recombinantly produced protein molecules in new product development, biosimilar (generic) product development, and in quality assurance for existing products. Primary structure analyses can include total mass (as measured by MS), amino acid sequence (as measured by orthogonal peptide mapping with high resolution MS and MS/MS sequencing), disulfide bridging (as measured by non-reducing peptide mapping), free cysteines (as measured by Ellman's assay or peptide mapping), and thioether bridging (as measured by peptide mapping, SDS-PAGE, or CGE). Higher order structure can be analyzed using CD spectroscopy, DSC, H-D-exchange, and FT-IR. Glycosylation requires identification of glycan isoforms (by NP-HPLC-ESI-MS, exoglycosidase digestion, and/or MALDI TOF/TOF), sialic acid (by NP-HPLC, WAX, HPAEC, RP-HPLC) and aglycosylation (by CGE and peptide mapping). Heterogeneity analyses must take into consideration C- and N-terminal modifications, glycation of lysine, oxidation, deamidation, aggregation, disulfide bond shuffling, and amino acid substitutions, insertions and deletions. The large variety of assays and techniques gives some idea of the daunting analytical challenge. Mass spectrometry (MS) can cover most of the physicochemical properties required for molecular analysis, and may be powerfully combined with other sources of information, including other modalities such as UV data.
Unfortunately, MS data is often complex and difficult to interpret. MS generally relies on automatic data analysis, due to the huge numbers of spectra (often >10,000/hour), the high accuracy of the measurements (often in the 1-10 ppm range), and the complexity of spectra (hundreds of peaks spanning a dynamic range >1000). There are a large number of programs for “easy” MS-based proteomics, for example, SEQUEST, Mascot, and X!Tandem, but these programs were not designed for deep analysis of single proteins, and are incapable of difficult analytical tasks such as characterizing mutations, glycopeptides, or metabolically altered peptides. Moreover, the programs just named are all identification tools and must be coupled with other programs such as Rosetta Elucidator (now discontinued), Scaffold, or Thermo Sieve for differential quantification. There are also specialized tools such as PEAKS for de novo sequencing, along with a host of academic tools. The confusing array of software tools poses an obstacle to biotech companies adopting MS-based assays.
Described herein are methods and tools (including apparatuses) that may aid in the analysis of proteins, and in particular may allow protein mapping and particularly automatic and manual annotation of protein maps, in a manner that is accurate and efficient.