The invention relates to the use of protein design automation (PDA) to generate computationally prescreened secondary libraries of proteins, and to methods and compositions utilizing the libraries.
Directed molecular evolution can be used to create proteins and enzymes with novel functions and properties. Starting with a known natural protein, several rounds of mutagenesis, functional screening, and propagation of successful sequences are performed. The advantage of this process is that it can be used to rapidly evolve any protein without knowledge of its structure. Several different mutagenesis strategies exist, including point mutagenesis by error-prone PCR, cassette mutagenesis, and DNA shuffling. These techniques have had many successes; however, they are all handicapped by their inability to produce more than a tiny fraction of the potential changes. For example, there are 20^500 possible amino acid changes for an average protein approximately 500 amino acids long. Clearly, the mutagenesis and functional screening of so many mutants is impossible; directed evolution provides a very sparse sampling of the possible sequences and hence examines only a small portion of possible improved proteins, typically point mutants or recombinations of existing sequences. By sampling randomly from the vast number of possible sequences, directed evolution is unbiased and broadly applicable, but inherently inefficient because it ignores all structural and biophysical knowledge of proteins.
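The scale of this sampling problem can be checked with a few lines of arithmetic. The sketch below assumes 20 possible amino acids at each of 500 positions, as in the example above; the variable names are illustrative only.

```python
# Size of the sequence space for a protein of a given length,
# assuming 20 possible amino acids at every position.
AMINO_ACIDS = 20
LENGTH = 500

# Every position can hold any of the 20 residues independently.
total_sequences = AMINO_ACIDS ** LENGTH

# Single point mutants: one position changed to one of the 19 other residues.
single_point_mutants = (AMINO_ACIDS - 1) * LENGTH

# Number of decimal digits in the total, as a measure of magnitude.
digits = len(str(total_sequences))

print(f"total sequences has {digits} decimal digits")
print(f"single point mutants: {single_point_mutants}")
```

The full space is a number with hundreds of digits, while the set of single point mutants is only a few thousand sequences, which is why point mutagenesis samples the space so sparsely.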
In contrast, computational methods can be used to screen enormous sequence libraries (up to 10^80 in a single calculation), overcoming the key limitation of experimental library screening methods such as directed molecular evolution. There are a wide variety of methods known for generating and evaluating sequences. These include, but are not limited to, sequence profiling (Bowie and Eisenberg, Science 253(5016): 164-70, (1991)), rotamer library selections (Dahiyat and Mayo, Protein Sci 5(5): 895-903 (1996); Dahiyat and Mayo, Science 278(5335): 82-7 (1997); Desjarlais and Handel, Protein Science 4: 2006-2018 (1995); Harbury et al., PNAS USA 92(18): 8408-8412 (1995); Kono et al., Proteins: Structure, Function and Genetics 19: 244-255 (1994); Hellinga and Richards, PNAS USA 91: 5803-5807 (1994)); and residue pair potentials (Jones, Protein Science 3: 567-574, (1994)).
In particular, U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254 describe a method termed "Protein Design Automation", or PDA, that utilizes a number of scoring functions to evaluate sequence stability.
It is an object of the present invention to provide computational methods for prescreening sequence libraries to generate and select secondary libraries, which can then be made and evaluated experimentally.
In accordance with the objects outlined above, the present invention provides methods for generating a secondary library of scaffold protein variants comprising providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences. A list of primary variant positions in the primary library is then generated, and a plurality of the primary variant positions is then combined to generate a secondary library of secondary sequences.
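As a minimal sketch of this aspect, the code below lists the positions at which primary variants differ from the scaffold and then combines the residues observed at those positions. The scaffold sequence, the primary library, and the function names are hypothetical and chosen only for illustration; retaining the scaffold residue as an option at each variant position is likewise an assumption.

```python
from itertools import product

# Hypothetical scaffold and rank-ordered primary library (best first).
scaffold = "MKTAY"
primary_library = ["MKTAY", "MRTAY", "MKTVY", "MRTVF", "MKSAY"]

def variant_positions(scaffold, library):
    """Map each position that varies in the library to the residues seen
    there. The scaffold residue is kept as an option (an assumption)."""
    observed = {}
    for seq in library:
        for pos, (wt, aa) in enumerate(zip(scaffold, seq)):
            if aa != wt:
                observed.setdefault(pos, {wt}).add(aa)
    return {pos: sorted(residues) for pos, residues in sorted(observed.items())}

def combine(scaffold, positions):
    """Enumerate every combination of the listed residues on the scaffold."""
    slots = list(positions)
    sequences = []
    for choice in product(*(positions[p] for p in slots)):
        seq = list(scaffold)
        for pos, aa in zip(slots, choice):
            seq[pos] = aa
        sequences.append("".join(seq))
    return sequences

positions = variant_positions(scaffold, primary_library)
secondary_library = combine(scaffold, positions)
print(positions)
print(len(secondary_library), "secondary sequences")
```

Because the variant positions are recombined freely, the secondary library in this toy case contains sequences that do not appear anywhere in the primary library.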
In an additional aspect, the invention provides methods for generating a secondary library of scaffold protein variants comprising providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences, and generating a probability distribution of amino acid residues in a plurality of variant positions. The plurality of the amino acid residues is combined to generate a secondary library of secondary sequences. These sequences may then be optionally synthesized and tested, in a variety of ways, including multiplexing PCR with pooled oligonucleotides, error prone PCR, gene shuffling, etc.
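The probability-distribution route can be sketched in a similar way. The example below tabulates, for each position, the fraction of primary sequences carrying each residue, keeps residues whose fraction meets a cutoff, and combines the survivors. The toy library, the cutoff values, and the function names are assumptions for illustration; the text above does not prescribe a particular cutoff.

```python
from collections import Counter
from itertools import product

# Hypothetical primary library of equal-length variant sequences.
primary_library = ["ALKE", "ALRE", "AIKE", "ALKD", "VLRE"]

def residue_distribution(library):
    """Per-position fraction of library sequences carrying each residue."""
    n = len(library)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*library)
    ]

def secondary_library(library, cutoff):
    """Combine all residues whose per-position fraction is at least `cutoff`.
    Returns an empty list if any position has no residue meeting the cutoff."""
    distribution = residue_distribution(library)
    allowed = [
        sorted(aa for aa, p in column.items() if p >= cutoff)
        for column in distribution
    ]
    return ["".join(choice) for choice in product(*allowed)]

distribution = residue_distribution(primary_library)
strict = secondary_library(primary_library, cutoff=0.3)
permissive = secondary_library(primary_library, cutoff=0.1)
print(distribution)
print(len(strict), "sequences at cutoff 0.3;", len(permissive), "at cutoff 0.1")
```

Raising the cutoff shrinks the secondary library toward the most frequently observed residues, while lowering it admits rarer substitutions and enlarges the library combinatorially.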
In a further aspect, the invention provides compositions comprising a plurality of secondary variant proteins or nucleic acids encoding the proteins, wherein the plurality comprises all or a subset of the secondary library. The invention further provides cells comprising the library, particularly mammalian cells.
In an additional aspect, the invention provides methods for generating a secondary library of scaffold protein variants comprising providing a first library comprising a rank-ordered list of scaffold protein primary variants; generating a probability distribution of amino acid residues in a plurality of variant positions; and synthesizing a plurality of scaffold protein secondary variants comprising a plurality of the amino acid residues to form a secondary library. At least one of the secondary variants is different from the primary variants.
In an additional embodiment, the present invention provides methods executed by a computer under the control of a program, the computer including a memory for storing the program. The method comprises the steps of receiving a protein backbone structure with variable residue positions, establishing a group of potential rotamers for each of the variable residue positions, and analyzing the interaction of each of the rotamers with all or part of the remainder of the protein backbone structure to generate a set of optimized protein sequences. The methods further comprise classifying each variable residue position as either a core, surface or boundary residue. The analyzing step may include a Dead-End Elimination (DEE) computation. Generally, the analyzing step includes the use of at least one scoring function selected from the group consisting of a van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic scoring function. The methods further comprise altering the protein backbone prior to the analysis, comprising altering at least one supersecondary structure parameter value. The methods may further comprise generating a rank-ordered list of additional optimal sequences from the globally optimal protein sequence. Some or all of the protein sequences from the ordered list may be tested to produce potential energy test results. The methods may further comprise generating a secondary library and/or ranking a secondary library, using the techniques outlined herein. Thus devices comprising the computer code for running the programs are provided as well.
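The Dead-End Elimination computation mentioned above can be illustrated with a minimal sketch. It applies the original DEE criterion, under which a rotamer is eliminated when its best-case energy is still worse than the worst-case energy of an alternative rotamer at the same position. The self and pair energy tables, the three-position toy problem, and all function names are made up for illustration and are not taken from the PDA implementation; in practice the energies would come from scoring functions such as those listed above.

```python
from itertools import product

def pair_energy(pair_e, i, r, j, s):
    """Pair energy of rotamer r at position i with rotamer s at position j."""
    if i < j:
        return pair_e[(i, j)][r][s]
    return pair_e[(j, i)][s][r]

def bound(self_e, pair_e, alive, i, r, pick):
    """Self energy of rotamer r at position i plus, for every other position,
    the min or max (chosen by `pick`) pair energy over surviving rotamers."""
    return self_e[i][r] + sum(
        pick(pair_energy(pair_e, i, r, j, s) for s in alive[j])
        for j in range(len(self_e)) if j != i
    )

def dee_prune(self_e, pair_e):
    """Repeatedly remove rotamers that cannot be part of the global minimum."""
    alive = [set(range(len(rotamers))) for rotamers in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(len(self_e)):
            for r in sorted(alive[i]):
                best_r = bound(self_e, pair_e, alive, i, r, min)
                # r is dead-ending if some other rotamer t is better even
                # when t is in its worst context and r is in its best.
                dead = any(
                    best_r > bound(self_e, pair_e, alive, i, t, max)
                    for t in alive[i] if t != r
                )
                if dead:
                    alive[i].discard(r)
                    changed = True
    return alive

def total_energy(assignment, self_e, pair_e):
    """Energy of one full rotamer assignment: self terms plus pair terms."""
    n = len(assignment)
    energy = sum(self_e[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][assignment[i]][assignment[j]]
    return energy

def brute_force_minimum(self_e, pair_e):
    """Exhaustive search, feasible only for toy problems; used as a check."""
    choices = product(*(range(len(rotamers)) for rotamers in self_e))
    return min(choices, key=lambda a: total_energy(a, self_e, pair_e))

# Made-up energies: self_e[i][r], and pair_e[(i, j)][r][s] for i < j.
self_e = [[0.0, 0.5, 10.0], [0.2, 0.0], [1.0, 0.0, 8.0]]
pair_e = {
    (0, 1): [[-1.0, 0.5], [0.3, -0.2], [0.0, 0.0]],
    (0, 2): [[0.1, -0.5, 0.0], [-0.3, 0.2, 0.0], [0.0, 0.0, 0.0]],
    (1, 2): [[0.4, -0.6, 0.0], [-0.1, 0.3, 0.0]],
}

alive = dee_prune(self_e, pair_e)
best = brute_force_minimum(self_e, pair_e)
print("surviving rotamers per position:", [sorted(a) for a in alive])
print("global minimum assignment:", best)
```

The exhaustive search is included only to confirm, on this toy problem, that pruning never discards a rotamer belonging to the global minimum; for realistic numbers of positions and rotamers that search would be infeasible, which is the motivation for elimination criteria of this kind.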