The complexity of an active sequence of a biological macromolecule, such as a protein or DNA, has been called its information content ("IC"; 5-9). The information content of a protein has been defined as the resistance of the active protein to amino acid sequence variation, calculated from the minimum number of invariable amino acids (bits) required to describe a family of related sequences with the same function (9, 10). Proteins that are sensitive to random mutagenesis have a high information content. In 1974, when this definition was coined, protein diversity existed only as taxonomic diversity.
Molecular biology developments such as molecular libraries have allowed the identification of a much larger number of variable bases, and even the selection of functional sequences from random libraries. Most residues can be varied, although typically not all at the same time, depending on compensating changes in the context. Thus a 100 amino acid protein can contain only 2,000 different mutations, but 20^100 possible combinations of mutations.
Information density is the Information Content/unit length of a sequence. Active sites of enzymes tend to have a high information density. By contrast, flexible linkers in enzymes have a low information density (8).
Current methods in widespread use for creating mutant proteins in a library format are error-prone polymerase chain reaction (11, 12, 19) and cassette mutagenesis (8, 20, 21, 22, 40, 41, 42), in which the specific region to be optimized is replaced with a synthetically mutagenized oligonucleotide. Alternatively, mutator strains of host cells have been employed to increase mutational frequency (Greener and Callahan (1995) Strategies in Mol. Biol. 7: 32). In each case, a "mutant cloud" (4) is generated around certain sites in the original sequence.
Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. Error-prone PCR can be used to mutagenize a mixture of fragments of unknown sequence. However, computer simulations have suggested that point mutagenesis alone may often be too gradual to allow the block changes that are required for continued sequence evolution. The published error-prone PCR protocols are generally unsuited for reliable amplification of DNA fragments greater than 0.5 to 1.0 kb, limiting their practical application. Further, repeated cycles of error-prone PCR lead to an accumulation of neutral mutations, which, for example, may make a protein immunogenic.
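The random point mutagenesis described above can be illustrated with a minimal simulation. This is only a sketch under simplifying assumptions: the function name `error_prone_copy`, the uniform per-base error rate, the example template, and the equal probability of each substitution are all illustrative choices, not features of any published protocol.

```python
import random

BASES = "ACGT"

def error_prone_copy(template, error_rate, rng):
    """Copy a DNA string, substituting each base with probability error_rate.

    Assumption: errors are independent and uniform along the sequence.
    """
    product = []
    for base in template:
        if rng.random() < error_rate:
            # Replace with one of the three other bases, chosen uniformly.
            product.append(rng.choice([b for b in BASES if b != base]))
        else:
            product.append(base)
    return "".join(product)

def count_differences(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

rng = random.Random(0)  # fixed seed so the run is repeatable
template = "ATGGCTAGCAAAGGAGAAGAACTTTTCACT"  # hypothetical example sequence
mutant = error_prone_copy(template, 0.05, rng)
print(mutant, count_differences(template, mutant))
```

Each copy differs from its template only by scattered single-base changes, which is the sense in which point mutagenesis alone is gradual: no block of sequence is ever exchanged between molecules.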
In oligonucleotide-directed mutagenesis, a short sequence is replaced with a synthetically mutagenized oligonucleotide. This approach does not generate combinations of distant mutations and is thus not significantly combinatorial. The limited library size relative to the vast sequence length means that many rounds of selection are unavoidable for protein optimization. Mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round followed by grouping into families, arbitrarily choosing a single family, and reducing it to a consensus motif, which is resynthesized and reinserted into a single gene followed by additional selection. This process constitutes a statistical bottleneck; it is labor intensive and not practical for many rounds of mutagenesis.
Error-prone PCR and oligonucleotide-directed mutagenesis are thus useful for single cycles of sequence fine tuning but rapidly become limiting when applied for multiple cycles.
Error-prone PCR can be used to mutagenize a mixture of fragments of unknown sequence (11, 12). However, the published error-prone PCR protocols (11, 12) suffer from a low processivity of the polymerase. Therefore, the protocol is very difficult to employ for the random mutagenesis of an average-sized gene. This inability limits the practical application of error-prone PCR.
Another serious limitation of error-prone PCR is that the rate of down-mutations grows with the information content of the sequence. At a certain information content, library size, and mutagenesis rate, the balance of down-mutations to up-mutations will statistically prevent the selection of further improvements (statistical ceiling).
Finally, repeated cycles of error-prone PCR will also lead to the accumulation of neutral mutations, which can affect, for example, immunogenicity but not binding affinity.
Thus error-prone PCR was found to be too gradual to allow the block changes that are required for continued sequence evolution (1, 2).
In cassette mutagenesis, a sequence block of a single template is typically replaced by a (partially) randomized sequence. Therefore, the maximum information content that can be obtained is statistically limited by the number of random sequences (i.e., library size). This constitutes a statistical bottleneck, eliminating other sequence families which are not currently best, but which may have greater long term potential.
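The library-size limit can be made concrete with a rough calculation of how many fully randomized amino acid positions a library of a given size could contain at most once each. This is a sketch only: the function name `max_fully_sampled_positions` and the example library sizes are hypothetical, and the calculation ignores codon degeneracy and the oversampling needed in practice to actually observe every variant.

```python
def max_fully_sampled_positions(library_size, alphabet_size=20):
    """Largest n such that alphabet_size**n <= library_size.

    Assumption: every library member is a distinct variant.
    """
    n = 0
    while alphabet_size ** (n + 1) <= library_size:
        n += 1
    return n

# Hypothetical library sizes, chosen only for illustration.
for size in (10**6, 10**8, 10**10):
    print(size, max_fully_sampled_positions(size))
```

Under these assumptions, increasing the library a hundredfold adds only one or two fully randomized positions, which is why the obtainable information content is bounded by library size.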
Further, mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round (20). Therefore, this approach is tedious and is not practical for many rounds of mutagenesis.
Error-prone PCR and cassette mutagenesis are thus best suited and have been widely used for fine-tuning areas of comparatively low information content. An example is the selection of an RNA ligase ribozyme from a random library using many rounds of amplification by error-prone PCR and selection (13).
It is becoming increasingly clear that our scientific tools for the design of recombinant linear biological sequences such as protein, RNA and DNA are not suitable for generating the sequence diversity needed to optimize many desired properties of a macromolecule or organism. Finding better and better mutants depends on searching more and more sequences within larger and larger libraries, and increasing numbers of cycles of mutagenic amplification and selection are necessary. However, as discussed above, the existing mutagenesis methods that are in widespread use have distinct limitations when used for repeated cycles.
Evolution of most organisms occurs by natural selection and sexual reproduction. Sexual reproduction ensures mixing and combining of the genes of the offspring of the selected individuals. During meiosis, homologous chromosomes from the parents line up with one another and cross-over part way along their length, thus swapping genetic material. Such swapping or shuffling of the DNA allows organisms to evolve more rapidly (1, 2). In sexual recombination, because the inserted sequences were of proven utility in a homologous environment, the inserted sequences are likely to still have substantial information content once they are inserted into the new sequence.
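The swapping of material at a cross-over can be sketched in a few lines. This is an illustration only, assuming two pre-aligned parental sequences of equal length and a single, uniformly chosen cross-over point; the function name `single_crossover` and the example strings are hypothetical.

```python
import random

def single_crossover(parent_a, parent_b, rng):
    """Exchange the tails of two aligned, equal-length sequences.

    Returns both reciprocal products and the cross-over position.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("this sketch assumes aligned parents of equal length")
    if len(parent_a) < 2:
        raise ValueError("need at least two positions to cross over")
    # Choose a point strictly inside the sequence so both parents contribute.
    point = rng.randint(1, len(parent_a) - 1)
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2, point

rng = random.Random(0)  # fixed seed so the run is repeatable
print(single_crossover("AAAAAAAAAA", "BBBBBBBBBB", rng))
```

Unlike a point mutation, each product inherits a contiguous block from each parent, so the exchanged segment arrives with whatever function it already had in a homologous context.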
Marton et al. (27) describe the use of PCR in vitro to monitor recombination in a plasmid having directly repeated sequences. Marton et al. disclose that recombination will occur during PCR as a result of breaking or nicking of the DNA. This will give rise to recombinant molecules. Meyerhans et al. (23) also disclose the existence of DNA recombination during in vitro PCR.
The term Applied Molecular Evolution ("AME") means the application of an evolutionary design algorithm to a specific, useful goal. While many different library formats for AME have been reported for polynucleotides (3, 11-14), peptides and proteins (phage (15-17), lacI (18) and polysomes), in none of these formats has recombination by random cross-overs been used to deliberately create a combinatorial library.
Theoretically there are 2,000 different single mutants of a 100 amino acid protein. A protein of 100 amino acids has 20^100 possible combinations of mutations, a number which is too large to exhaustively explore by conventional methods. It would be advantageous to develop a system which would allow the generation and screening of all of these possible combination mutations.
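The scale of these two numbers can be checked directly. The sketch below uses hypothetical helper names; it counts single mutants strictly as 19 alternative residues per position, which gives 1,900, while counting all 20 residues per position gives the round figure of 2,000 used above.

```python
def single_mutant_count(length, alphabet_size=20):
    """Single substitutions: each position can change to any other residue."""
    return length * (alphabet_size - 1)

def sequence_space_size(length, alphabet_size=20):
    """All possible sequences of the given length."""
    return alphabet_size ** length

length = 100
print(single_mutant_count(length))   # 1900 (strict count)
print(length * 20)                   # 2000 (round figure used in the text)

total = sequence_space_size(length)
print(len(str(total)))               # 131 digits, i.e. about 1.27e130
```

The gap between a few thousand single mutants and a sequence space of roughly 10^130 is what makes exhaustive exploration impractical.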
Winter and coworkers (43,44) have utilized an in vivo site specific recombination system to combine light chain antibody genes with heavy chain antibody genes for expression in a phage system. However, their system relies on specific sites of recombination and thus is limited. Hayashi et al. (48) report simultaneous mutagenesis of antibody CDR regions in single chain antibodies (scFv) by overlap extension and PCR.
Caren et al. (45) describe a method for generating a large population of multiple mutants using random in vivo recombination. However, their method requires the recombination of two different libraries of plasmids, each library having a different selectable marker. Thus the method is limited to a finite number of recombinations equal to the number of selectable markers existing, and produces a concomitant linear increase in the number of marker genes linked to the selected sequence(s). Caren et al. do not describe the use of multiple selection cycles; recombination is used solely to construct larger libraries.
Calogero et al. (46) and Galizzi et al. (47) report that in vivo recombination between two homologous but truncated insect-toxin genes on a plasmid can produce a hybrid gene. Radman et al. (49) report in vivo recombination of substantially mismatched DNA sequences in a host cell having defective mismatch repair enzymes, resulting in hybrid molecule formation.
It would be advantageous to develop a method for the production of mutant proteins that allows for the development of large libraries of mutant nucleic acid sequences which are easily searched. The invention described herein is directed to the use of repeated cycles of point mutagenesis, nucleic acid shuffling and selection which allow for the directed molecular evolution in vitro of highly complex linear sequences, such as proteins, through random recombination.
Accordingly, it would be advantageous to develop a method which allows for the production of large libraries of mutant DNA, RNA or proteins and the selection of particular mutants for a desired goal. The invention described herein is directed to the use of repeated cycles of mutagenesis, in vivo recombination and selection which allow for the directed molecular evolution in vivo and in vitro of highly complex linear sequences, such as DNA, RNA or proteins through recombination.
Further advantages of the present invention will become apparent from the following description of the invention with reference to the attached drawings.