The sequencing of the human genome has created the promise and opportunity for understanding the function of all genes and proteins relevant to human biology and disease, Peltonen and McKusick, Science, 291: 1224-1229 (2001). However, several important hurdles must be overcome before this promise can be fully attained. First, the sequence signals that indicate the location of a gene in the genome and that control its expression are not well understood, so it is frequently difficult to predict the presence of a gene that is actually transcribed, e.g. Guigo et al, Genome Res., 10: 1631-1642 (2000); Rudd et al, Electrophoresis, 19:536-544 (1998). Second, although monitoring gene expression at the transcript level has become more robust with the development of microarray technology, numerous problems still exist including variability relating to probe hybridization differences and cross-reactivity, element-to-element differences within microarrays, and microarray-to-microarray differences, e.g. Audic and Claverie, Genome Res., 7: 986-995 (1997); Wittes and Friedman, J. Natl. Cancer Inst., 91: 400-401 (1999); Richmond et al, Nucleic Acids Research, 27: 3821-3835 (1999). Finally, because of the scale of human molecular biology (about a third of the estimated 30-40 thousand genes appear to give rise to multiple splice variants and most appear to encode protein products with a plethora of post-translational modifications), potentially many tens of thousands of genes and their expression products will have to be isolated and tested in order to understand their role in health and disease, Dawson and Kent, Annu. Rev. Biochem., 69: 923-960 (2000).
In regard to the issue of scale, the application of conventional recombinant methodologies for cloning, expressing, recovering, and isolating proteins is still a time-consuming and labor-intensive process, so that their use in screening large numbers of different gene products for determining function has been limited. Recently, a convergent synthesis approach has been developed which may address the need for facile access to highly purified research-scale amounts of protein for functional screening, Dawson and Kent (cited above); Dawson et al, Science, 266: 776-779 (1994). In its most attractive implementation, an unprotected oligopeptide intermediate having a C-terminal thioester reacts with an N-terminal cysteine of another oligopeptide intermediate under mild aqueous conditions to form a thioester linkage which spontaneously rearranges to a natural peptide linkage, Kent et al, U.S. Pat. No. 6,184,344. The approach has been used to assemble oligopeptides into active proteins both in solution phase, e.g. Kent et al, U.S. Pat. No. 6,184,344, and on a solid phase support, e.g. Canne et al, J. Am. Chem. Soc., 121: 8720-8727 (1999).
When the polypeptide to be synthesized by this approach exceeds 100-150 amino acids, it is necessary to join three or more fragments, as it is currently difficult to synthesize and purify oligopeptide intermediates longer than about 60 residues. In this case, the internal oligopeptide intermediates contain not only a C-terminal thioester moiety but also an N-terminal cysteine. During the assembly process, the cysteine of such an internal intermediate, if left free, will react with the C-terminal thioester of the same intermediate molecule or with that of a different intermediate molecule, thereby interfering with the desired ligation reaction by forming an undesired cyclic peptide or a concatemer of the intermediate. This problem can be circumvented by employing a protecting group for the N-terminal cysteine with the following properties: i) it must be stable to the conditions used to cleave the oligopeptide from the synthesis resin, ii) it must be removable after a native chemical ligation has been completed, and iii) preferably, its removal takes place in the same ligation reaction mixture before purification, so that the ligation reaction and cysteine deprotection can be conducted in one pot.
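The fragment-count reasoning above can be illustrated with a minimal sketch. It assumes a one-letter amino acid sequence, takes the 60-residue limit from the figure quoted above, and uses a simple greedy choice of cysteine ligation sites. The function names `plan_fragments` and `internal_fragments_needing_protection` are hypothetical and introduced only for illustration; this is not a method described in the cited references.

```python
from typing import List

# Assumption: practical upper limit on intermediate length, taken from the
# "about 60 residues" figure in the text.
MAX_FRAGMENT_LEN = 60


def plan_fragments(sequence: str, max_len: int = MAX_FRAGMENT_LEN) -> List[str]:
    """Split a sequence into fragments of at most max_len residues.

    Every fragment after the first begins with cysteine ('C'), because the
    ligation described in the text needs an N-terminal cysteine on the
    downstream fragment. Cut sites are chosen greedily: the cysteine
    farthest from the current start that still keeps the fragment within
    max_len.
    """
    fragments = []
    start = 0
    while len(sequence) - start > max_len:
        # Search for the last 'C' at an index in (start, start + max_len].
        cut = sequence.rfind("C", start + 1, start + max_len + 1)
        if cut == -1:
            raise ValueError(
                f"no cysteine within {max_len} residues of position {start}"
            )
        fragments.append(sequence[start:cut])
        start = cut
    fragments.append(sequence[start:])
    return fragments


def internal_fragments_needing_protection(fragments: List[str]) -> List[bool]:
    """Flag internal fragments (neither first nor last).

    Only these carry both a C-terminal thioester and an N-terminal cysteine,
    so only these can cyclize or concatemerize if the cysteine is left free.
    """
    last = len(fragments) - 1
    return [0 < i < last for i in range(len(fragments))]


if __name__ == "__main__":
    # Hypothetical 150-residue sequence with cysteines at positions 51 and 101.
    seq = "M" + "A" * 49 + "C" + "G" * 49 + "C" + "L" * 49
    frags = plan_fragments(seq)
    flags = internal_fragments_needing_protection(frags)
    for frag, protect in zip(frags, flags):
        print(len(frag), frag[0], "protect N-terminal Cys" if protect else "")
```

In this sketch a 150-residue target cannot be covered by two fragments of at most 60 residues, so at least one internal fragment results, and that fragment is the one whose N-terminal cysteine would need a protecting group.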