Current large-scale sequencing techniques are elucidating numerous novel genes in various eukaryotic and prokaryotic genomes. It is generally recognized that these efforts will be beneficial to the advancement of medicine and biology. To achieve this benefit, the identified novel genes must be classified and their biological functions determined.
Secreted and non-secreted proteins are important classes. Targeting protein drugs into body fluids or onto receptors located on the surface of the cellular membrane is much easier than transporting proteinaceous compounds into the cytosol or the organelles. This relative ease of targeting explains why most approved protein drugs are secreted proteins, a trend that is likely to continue for the foreseeable future. Known secreted proteins include all peptide hormones and their receptors, breast and colon cancer gene products, leptin and its receptor, serum albumin, superoxide dismutase, spliceosome proteins, 7-transmembrane proteins, immunoglobulins, deoxyribonuclease I and several families of serine proteinases.
Secreted proteins are synthesized by a host organism in the cytosol and, in eukaryotes, are exported cotranslationally into the endoplasmic reticulum and subsequently secreted into the extracellular space or inserted into the membrane. Proteins are transported either across a translocation pore or in direct interaction with lipids.
The vehicles for protein transportation are the mostly amino-terminal signal peptides, which extend over 12 to 50 residues in eukaryotes. Signal peptides are the most abundant functional domain in proteins. They consist of three short segments: a usually basic cytosolic part that generally starts with a methionine; a hydrophobic segment; and a cleavage site for the signal peptidase enzyme. In contrast to this sophisticated and effective mechanism, only sporadic natural pathways, such as pinocytosis and membrane punching by perforins, exist for transport in the opposite direction, i.e., from the extracellular space into the cytosol.
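The three-segment layout described above can be illustrated with a minimal sketch. It is not a method from the cited literature: the residue sets, the 50-residue window, the minimum run length of 7 and the function names are all assumptions chosen for illustration, and the cleavage site is not modeled at all.

```python
# Illustrative heuristic only. The residue sets and thresholds below are
# assumptions for this sketch, not values taken from the cited literature.
HYDROPHOBIC = set("AVLIMFWC")
BASIC = set("KR")


def longest_hydrophobic_run(seq):
    """Return (start, length) of the longest unbroken hydrophobic run."""
    best_start, best_len = 0, 0
    start = None
    for i, aa in enumerate(seq):
        if aa in HYDROPHOBIC:
            if start is None:
                start = i
            if i - start + 1 > best_len:
                best_start, best_len = start, i - start + 1
        else:
            start = None
    return best_start, best_len


def has_signal_peptide_layout(protein, window=50, min_run=7):
    """Check for: initial Met, a basic residue, then a hydrophobic run.

    Only the first `window` residues are examined. The cleavage site
    (the third segment) is deliberately not checked here.
    """
    head = protein[:window].upper()
    if not head.startswith("M"):
        return False
    start, length = longest_hydrophobic_run(head)
    if length < min_run:
        return False
    # The segment preceding the hydrophobic run should carry a basic residue.
    n_region = head[:start]
    return any(aa in BASIC for aa in n_region)
```

Because the check looks only at whether residues are hydrophobic and not at their order, shuffling the residues inside the hydrophobic run does not change the result. Both sequences used with this sketch, for example `has_signal_peptide_layout("MKKLLLVLLAALLASAQAEDG")`, are made-up toy inputs rather than real proteins.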
Traditional experimental methods for the identification of secreted proteins are labor- and cost-intensive. Exemplary methods include peptide sequencing, immunoelectron microscopy and membrane ultracentrifugation.
A more recently developed experimental method for identifying secreted proteins is the DiscoverEase® protocol of Genetics Institute (U.S. Pat. No. 5,536,637). This protocol utilizes a yeast mutant strain defective in the invertase signal peptide and has been used to identify thousands of secreted proteins.
Experimental techniques are indispensable for finding novel types of secretory signals. However, screening the hundreds of thousands of Expressed Sequence Tag (EST) assemblies generated by large-scale sequencing efforts is feasible only with extremely fast computational tools that can predict secreted proteins for large numbers of sequences at a reasonable cost. The difficulty is that signal peptides are a classic example of how an identical function is preserved across extremely variable sequences (1). In the hydrophobic segment, for instance, the order of leucine, valine, isoleucine and other hydrophobic residues seems to be irrelevant to function (2). As a result of this variation, signal peptides cannot be identified by the usual sequence analysis methods like BLAST (3), FASTA (4,5) or dynamic programming (6), all of which are based on the conservation of sequences.
Gunnar von Heijne (7) has created a profile-like method for the recognition of the cleavage site, the least variable segment of signal peptides, that works with an accuracy of about 70% on protein data. His method has been coded into computer programs by several authors, including Pascarella and Bossa (8), Folz and Gordon (9) and Popowicz and Dash (10). Attempts to improve the accuracy of this method by increasing the minimal score eliminate several false positives, but only at the expense of falsely rejecting bona fide signal peptides.
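The general idea of a profile-like (position weight matrix) scorer with a minimal-score cutoff can be sketched as follows. This is a generic illustration, not the matrix or window published in (7): the aligned windows in `TOY_WINDOWS` are fabricated toy data, and the window width, the uniform background, the pseudocount and all function names are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Fabricated toy alignment of windows around a cleavage site.
# These are NOT real data and NOT the published weight matrix.
TOY_WINDOWS = ["ASAQ", "AQAE", "VSAD", "ALAQ", "ASAE"]


def build_log_odds(aligned, pseudocount=1.0):
    """Build one log-odds column per aligned position (uniform background)."""
    width = len(aligned[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        counts = Counter(window[pos] for window in aligned)
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        matrix.append(column)
    return matrix


def best_site(seq, matrix):
    """Slide the matrix along seq; return (best score, start index)."""
    width = len(matrix)
    best = (float("-inf"), None)
    for start in range(len(seq) - width + 1):
        # Residues outside the 20-letter alphabet contribute a neutral 0.0.
        score = sum(matrix[i].get(seq[start + i], 0.0) for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best


def predict(seq, matrix, min_score):
    """Accept the sequence if its best window reaches the minimal score."""
    score, start = best_site(seq, matrix)
    return start is not None and score >= min_score
```

Raising `min_score` in `predict` can only turn acceptances into rejections, never the reverse. This is the trade-off noted above: a stricter cutoff removes some false positives, but sequences with genuine yet weakly scoring sites are rejected along with them.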
Neural networks are more accurate tools for the prediction of signal peptides. Ladunga et al. (11) applied the so-called tiling algorithm (12) to the amino-terminal 20 residues of signal peptides in combination with von Heijne's method. Nielsen et al. (13,14) used a standard feed-forward network trained by the back-propagation algorithm (15). In the production system created by Nakai, Kanehisa and coworkers (16-18), decision rules are applied in the form of binary decision trees. A fundamentally different approach (19) predicts signal peptides by modeling the angle of insertion into the membrane and the orientation of the signal peptide in the yeast invertase system. For a recent review of computational methods to predict secreted proteins, see (20).
Despite the current efforts, a need exists to increase the precision of computational methods for identifying secreted proteins. Since large-scale sequencing efforts produce nucleotide rather than amino acid sequence data, the accuracy of predicting secreted proteins will depend upon the sequencing accuracy, the accuracy of reading frame prediction and the accuracy of predicting the signal peptide/membrane anchor at the protein level. Taking into account the rate of sequencing errors in ESTs (including artificial frameshifts) and the uncertainty in finding the correct reading frame and translation initiation site, the identification of secreted proteins from EST data will require significantly more accurate computational predictions than the ones described above.
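The dependence on the reading frame can be made concrete with a short sketch that translates a nucleotide sequence in all six frames using the standard genetic code. The function names and the toy nucleotide string are assumptions for illustration. No start-codon or open-reading-frame selection is attempted, which in practice is a further source of error.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate from offset `frame` (0, 1 or 2); '*' marks a stop codon.

    Codons containing ambiguous bases (e.g. N) are rendered as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [
        translate(rc, f) for f in range(3)
    ]
```

With the made-up input `"ATGAAAAAACTG"`, frame 0 begins with methionine, while frame 1 starts with a stop codon. A single inserted or deleted base upstream of a signal peptide shifts the sequence into one of the other frames in the same way, so the amino-terminal residues seen by any downstream predictor are no longer those of the real protein.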