The production and screening of libraries of nucleotide sequences have been reported to be useful for identifying novel peptides, polypeptides, and proteins having a particular biological or chemical property [See Ballivet and Kauffman, PCT application WO 86/05803, published Oct. 9, 1986, incorporated herein by reference]. As explained more fully below, large numbers of diverse DNA and RNA sequences have been screened by various in vitro methods to identify functional biological or chemical molecules such as growth factors, enzymes, and antigens.
In the past, randomly selected, genomic DNA was utilized in screening for functional sequences [See Ma and Ptashne, Cell, 51:113-119 (1987); Kaiser, et al., Science, 235:312-317 (1987)]. More particularly, Ma and Ptashne described a class of yeast activators encoded by genes bearing random genomic DNA fragments fused to the coding sequence of the DNA-binding portion of GAL4. It was reported that the activating sequences discovered showed no obvious sequence homology when compared with one another, but manifested the same biological function.
Chemically synthesized random sequence DNA has also been screened for functional properties. A wide variety of functional molecules have been identified from libraries of such random sequences. For example, functional promoter elements have been isolated from populations of randomly synthesized DNA [See Horwitz and Loeb, J. Biol. Chem., 263:14724-14731 (1988); Oliphant and Struhl, Nucl. Acids Res., 16:7673-7683 (1988)].
Likewise, functional molecules have been identified in chemically synthesized random RNA sequence libraries. Affinity selection on dye columns of a library of 100-base, random RNA sequences has shown that approximately one in 10^10 such molecules can specifically bind a small ligand [Ellington and Szostak, Nature, 346:818-822 (1990)]. A random RNA sequence library has also been used to identify 8-base stretches which are recognized by T4 DNA polymerase [Tuerk and Gold, Science, 249:505-510 (1990)].
Fusion-phage systems have also been used to clone and express short, random sequence polypeptides as fusions with a phage coat protein [See Scott and Smith, Science, 249:386-390 (1990); Cwirla, et al., Proc. Natl. Acad. Sci., 87:6378-6382 (1990); Parmley and Smith, Gene, 73:305 (1988)]. Scott and Smith described construction of a library of approximately 4 × 10^7 different hexapeptide epitopes. The library was then screened to identify hexapeptides capable of binding to specific monoclonal antibodies. Likewise, Cwirla, et al., reported that randomly generated peptide sequences are a rich source of ligands. A library of 3 × 10^8 recombinants encoding millions of N-terminal hexapeptide sequences was constructed and then screened with a monoclonal antibody specific for the Tyr-Gly-Gly-Phe sequence present in β-endorphin.
Peptides have also been identified which bind to streptavidin, a protein with no previously known affinity for peptides [Devlin, et al., Science, 249:404-406 (1990)]. Devlin et al. described nine different streptavidin-binding peptide sequences selected from a library of random peptide sequences. The method involved production of a library of sequences by cloning synthetic DNA into E. coli expression vectors. The random sequences were then expressed in a filamentous phage system.
The random sequences and libraries of random sequences described above were produced using various techniques. For example, random sequences were produced by chemical mutagenesis or site-specific mutagenesis of segments of genomic DNA. Also, repeated cycles of solid-phase peptide synthesis were used to produce populations of amino acid sequences [See Geysen, et al., Proc. Natl. Acad. Sci., 81:3998-4002 (1984)].
Alternatively, synthetic random sequences have been produced by mixing together nucleotide precursors in random, undetermined quantities. Further, synthetic random sequences have been produced by mixing together nucleotide precursors in equimolar quantities prior to oligonucleotide or polynucleotide synthesis.
These prior methods for producing random sequences and libraries of sequences are generally inadequate, however. In particular, these methods typically have not designed or synthesized the sequences or libraries to contain particular nucleotide or amino acid compositions or to possess particular biological or chemical characteristics.
For instance, methods for producing sequences using equimolar proportions of nucleotides typically result in amino acid sequences of relatively short length. Only about 9% of the polypeptides translated from DNA synthesized from equimolar proportions of nucleotides will reach 50 residues in length. The shortened length of these polypeptides is primarily due to the presence of stop codons in the DNA sequence.
It is known in the art that nucleotides, and groups of nucleotides, in a gene sequence often have various functions in the reading frame of the gene. For example, there may be nucleotides having a regulatory function such as a promoter or start signal. Other nucleotides function in stopping transcription or translation. The nucleotide triplets, or "codons", that stop translation are typically referred to as termination or "stop" codons and are generally TAA, TGA, and TAG. In a DNA sequence synthesized from equimolar proportions of nucleotides, three of the sixty-four possible codons (about 4.7%) are stop codons.
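The figure of approximately 9% cited above can be reproduced with a short calculation. The following Python sketch is illustrative only and assumes that each codon is drawn independently from an equimolar mixture of the four nucleotides, and that "reaching 50 residues" means that no stop codon occurs among the first 50 codons. The function names are hypothetical and do not appear in the cited references.

```python
from itertools import product

# The three termination codons named in the text.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_fraction(bases="ACGT"):
    """Fraction of all triplets that are stop codons when every base
    is equally likely at every position (equimolar assumption)."""
    codons = ["".join(c) for c in product(bases, repeat=3)]
    return sum(c in STOP_CODONS for c in codons) / len(codons)

def prob_open_frame(n_codons, p_stop):
    """Probability that n_codons independent codons contain no stop."""
    return (1.0 - p_stop) ** n_codons

p_stop = stop_fraction()            # 3/64
p50 = prob_open_frame(50, p_stop)   # (61/64)**50

print(f"stop codon fraction: {p_stop:.4f}")
print(f"P(no stop in first 50 codons): {p50:.3f}")
```

Under these assumptions the stop codon fraction is 3/64 and the probability of an uninterrupted 50-codon frame is (61/64)^50, which is close to 0.09, in agreement with the approximately 9% stated above.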
Mandecki has described a method for generating a large pool of semi-random open reading frames ("ORFs") (200-900 residues) [Mandecki, Protein Engineering, 3:221-226 (1990)]. In particular, Mandecki described a method for constructing random DNA sequences using equimolar proportions of nucleotides. The DNA was designed to contain no stop codons by eliminating certain nucleotides in the third position of each codon. The DNA sequence design, however, failed to code for 2 of the 20 common amino acids and for 112 of the 400 possible amino acid pairs. Thus, although Mandecki's design of the sequences eliminated the presence of stop codons, the overall diversity of the sequences was limited. Furthermore, the sequences were cloned in an expression system which produced insufficient product to allow for its isolation.
Scott and Smith, supra, also described the use of equimolar proportions of nucleotides in producing random oligonucleotide sequences. Specifically, the sequences were synthesized using oligonucleotides with a three-residue repeating pattern of (NNK)_6, where N is a mixture of all four nucleotides and K is an equimolar mixture of T and G.
Likewise, Devlin et al., supra, produced random 15-residue peptide sequences using a three-residue repeating pattern. The frequency of termination codons and the variation in the number of codons for each amino acid residue were reduced by using (NNS)_15 to encode 15 random residues, where N is a mixture of G, A, T, and C, and S is a mixture of G and C.
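The effect of restricting the third codon position can be illustrated by enumerating the codons permitted under each scheme. The following Python sketch is a minimal illustration that assumes the standard genetic code and treats N, K, and S as the base sets given above. The helper name summarize is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Base sets for the mixture symbols used in the text.
MIXTURES = {"N": "ACGT", "K": "GT", "S": "CG"}

def summarize(scheme):
    """Return (number of codons, number of stop codons, number of distinct
    amino acids) for a three-letter scheme such as "NNK"."""
    codons = ["".join(c) for c in product(*(MIXTURES[s] for s in scheme))]
    encoded = [CODON_TABLE[c] for c in codons]
    stops = encoded.count("*")
    amino_acids = set(encoded) - {"*"}
    return len(codons), stops, len(amino_acids)

for scheme in ("NNN", "NNK", "NNS"):
    total, stops, n_aa = summarize(scheme)
    print(f"{scheme}: {total} codons, {stops} stop, {n_aa} amino acids")
```

Under the standard genetic code, both NNK and NNS permit 32 codons, of which only TAG is a stop codon, while still encoding all 20 common amino acids. The unrestricted NNN pattern permits 64 codons, including all three stop codons.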
Although the methods described by Mandecki, Scott and Smith, and Devlin et al. resulted in gene sequences having greater length, the restrictions imposed on the addition of nucleotides reduced the diversity of the sequences. Moreover, gene sequences synthesized from arbitrary or even equimolar quantities of nucleotides do not generally encode polypeptides having characteristics like those found in functional, naturally-occurring proteins.
The nucleotide composition of such synthesized nucleotide sequences may also affect the cloning of the sequences into vectors or other expression systems, particularly with respect to cloning junctions. Cloning junctions in DNA sequences are constant regions of a determined nucleotide sequence which serve as primers and restriction enzyme recognition sites. Such constant regions not only have the effect of potentially limiting the diversity of the gene sequences cloned in a vector but may also adversely affect the secondary structure of the peptide or polypeptide encoded by the gene sequence [See Kolaskar, et al., Int. J. Peptide Protein Res., 22:83-91 (1983); Vonderviszt, et al., Int. J. Peptide Protein Res., 27:483-492 (1986)].
Methods for synthesizing nucleotide sequences and libraries of sequences in the past have typically not addressed the problems associated with cloning junctions. For example, the random sequences described by Mandecki, supra, had a high frequency of glycine in the cloning junctions. Glycine is an amino acid which avoids both alpha helix and beta sheet in natural proteins. A repeating pattern of glycine residues can therefore have a negative impact on folding of the proteins by restricting the allowed patterns of secondary structure.
The nucleotide and amino acid composition of the synthesized sequences also affects the biological and chemical properties of the peptides or polypeptides encoded by the sequences. For example, the amino acid composition of a peptide or polypeptide will determine whether it is hydrophilic or hydrophobic and whether it will have a positive or negative electrical charge.
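The dependence of charge and hydrophobicity on amino acid composition can be illustrated with a simple calculation. The following Python sketch is an assumption-laden illustration and is not a method described in the cited references. It estimates net charge near neutral pH by counting lysine and arginine as +1 and aspartate and glutamate as -1, ignoring histidine and the chain termini, and it uses the Kyte-Doolittle hydropathy scale as one commonly used measure of hydrophobicity. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def net_charge(seq):
    """Crude net charge near neutral pH: +1 for K and R, -1 for D and E.
    Histidine and the N- and C-termini are ignored (simplifying assumption)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative

def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over all residues of seq."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

# Two made-up hexapeptides with very different compositions.
for peptide in ("KRKRGS", "LIVFAD"):
    print(peptide, net_charge(peptide), round(mean_hydropathy(peptide), 2))
```

In this sketch, a peptide rich in lysine and arginine has a positive net charge and a negative mean hydropathy (hydrophilic), whereas a peptide rich in leucine, isoleucine, and valine has a positive mean hydropathy (hydrophobic).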
The properties possessed by typical, naturally-occurring proteins have been studied, and statistical analyses of such protein sequences have been conducted. Naturally-occurring proteins have been described as having characteristic amino acid compositions [Klapper, Biochem. Biophys. Res. Com., 78:1018-1024 (1977)]. As an example, the high frequency of N-terminal methionine in bacterial proteins is well-known and is explained by its role as a chain initiator [Waller, J. Mol. Biol., 7:483-496 (1967)]. Accordingly, in producing diverse nucleotide sequences and libraries of such sequences, it is desirable and useful to consider the respective nucleotide and amino acid compositions.