The invention resides in the technical fields of bioinformatics and protein engineering.
Zinc finger proteins (ZFPs) are proteins that can bind to DNA in a sequence-specific manner. Zinc fingers were first identified in the transcription factor TFIIIA from the oocytes of the African clawed toad, Xenopus laevis. An exemplary motif characterizing one class of these proteins (C2H2 class) is -Cys-(X)2-4-Cys-(X)12-His-(X)3-5-His (SEQ ID NO: 1) (where X is any amino acid). A single finger domain is about 30 amino acids in length, and several structural studies have demonstrated that it contains an alpha helix containing the two invariant histidine residues and two invariant cysteine residues in a beta turn co-ordinated through zinc. To date, over 10,000 zinc finger sequences have been identified in several thousand known or putative transcription factors. Zinc finger domains are involved not only in DNA recognition, but also in RNA binding and in protein-protein binding. Current estimates are that this class of molecules will constitute about 2% of all human genes.
The x-ray crystal structure of Zif268, a three-finger domain from a murine transcription factor, has been solved in complex with a cognate DNA sequence and shows that each finger can be superimposed on the next by a periodic rotation. The structure suggests that each finger interacts independently with DNA over 3 base-pair intervals, with side-chains at positions −1, 2, 3 and 6 on each recognition helix making contacts with their respective DNA triplet subsites. The amino terminus of Zif268 is situated at the 3′ end of the DNA strand with which it makes most contacts. Recent results have indicated that some zinc fingers can bind to a fourth base in a target segment (Isalan et al., PNAS 94, 5617-5621 (1997)). If the strand with which a zinc finger protein makes most contacts is designated the target strand, some zinc finger proteins bind to a three base triplet in the target strand and a fourth base on the nontarget strand. The fourth base is complementary to the base immediately 3′ of the three base subsite.
The structure of the Zif268-DNA complex also suggested that the DNA sequence specificity of a zinc finger protein might be altered by making amino acid substitutions at the four helix positions (−1, 2, 3 and 6) on each of the zinc finger recognition helices. Phage display experiments using zinc finger combinatorial libraries to test this observation were published in a series of papers in 1994 (Rebar et al., Science 263, 671-673 (1994); Jamieson et al., Biochemistry 33, 5689-5695 (1994); Choo et al., PNAS 91, 11163-11167 (1994)). Combinatorial libraries were constructed with randomized side-chains in either the first or middle finger of Zif268 and then used to select for an altered Zif268 binding site in which the appropriate DNA subsite was replaced by an altered DNA triplet. Further, correlation between the nature of introduced mutations and the resulting alteration in binding specificity gave rise to a partial set of substitution rules for design of ZFPs with altered binding specificity.
Greisman and Pabo, Science 275, 657-661 (1997) discuss an elaboration of the phage display method in which each finger of Zif268 was successively randomized and selected for binding to a new triplet sequence. This paper reported selection of ZFPs for a nuclear hormone response element, a p53 target site and a TATA box sequence.
A number of papers have reported attempts to produce ZFPs to modulate particular target sites. For example, Choo et al., Nature 372, 645 (1994), report an attempt to design a ZFP that would repress expression of a bcr-abl oncogene. The target segment to which the ZFPs would bind was a nine base sequence (5′GCA GAA GCC3′) chosen to overlap the junction created by a specific oncogenic translocation fusing the genes encoding bcr and abl. The intention was that a ZFP specific to this target site would bind to the oncogene without binding to the abl or bcr component genes. The authors used phage display to screen a mini-library of variant ZFPs for binding to this target segment. A variant ZFP thus isolated was then reported to repress expression of a stably transfected bcr-abl construct in a cell line.
Pomerantz et al., Science 267, 93-96 (1995) reported an attempt to design a novel DNA binding protein by fusing two fingers from Zif268 with a homeodomain from Oct-1. The hybrid protein was then fused with a transcriptional activator for expression as a chimeric protein. The chimeric protein was reported to bind a target site representing a hybrid of the subsites of its two components. The authors then constructed a reporter vector containing a luciferase gene operably linked to a promoter and a hybrid site for the chimeric DNA binding protein in proximity to the promoter. The authors reported that their chimeric DNA binding protein could activate expression of the luciferase gene.
Liu et al., PNAS 94, 5525-5530 (1997) report forming a composite zinc finger protein by using a peptide spacer to link two component zinc finger proteins, each having three fingers. The composite protein was then further linked to a transcriptional activation domain. It was reported that the resulting chimeric protein bound to a target site formed from the target segments bound by the two component zinc finger proteins. It was further reported that the chimeric zinc finger protein could activate transcription of a reporter gene when its target site was inserted into a reporter plasmid in proximity to a promoter operably linked to the reporter.
Choo et al., WO 98/53058, WO 98/53059, and WO 98/53060 (1998) discuss selection of zinc finger proteins to bind to a target site within the HIV Tat gene. Choo et al. also discuss selection of a zinc finger protein to bind to a target site encompassing a site of a common mutation in the oncogene ras. The target site within ras was thus constrained by the position of the mutation.
None of the above studies provided criteria for systematically evaluating the respective merits of the different potential target sites within a candidate gene. The phage display studies by Rebar et al., supra, Jamieson et al., supra, and Choo et al., PNAS (1994), supra, all focused on alterations of the natural Zif268 binding site, 5′GCG TGG GCGc3′ (SEQ ID NO:11), and were not made with reference to a predetermined target gene. The selection of target site in Choo et al., Nature (1994), supra, was constrained solely by the intent that the site overlap the interface between bcr and abl segments and did not involve a comparison of different potential target sites. Likewise, Greisman and Pabo chose certain target sites because of their known regulatory roles and did not consider the relative merits of different potential target segments within a preselected target gene. Similarly, the choice of target site within ras in Choo et al. (1998), supra, was constrained by the position of a mutation. No criterion is provided for the selection of a target site in HIV Tat in Choo et al. (1998). Finally, both Pomerantz et al., supra, and Liu et al., supra, constructed artificial hybrid target sites for composite zinc fingers and then inserted the target sites into reporter constructs.
The invention provides methods of selecting a target site within a target sequence for targeting by a zinc finger protein. Some such methods comprise providing a target nucleic acid to be targeted by a zinc finger protein and outputting a target site within the target nucleic acid comprising 5′NNx aNy bNzc3′. Each of (x, a), (y, b) and (z, c) is (N, N) or (G, K), provided at least one of (x, a), (y, b) and (z, c) is (G, K). N and K are IUPAC-IUB ambiguity codes. In some methods, a plurality of segments within the target nucleic acid are selected and a subset of the plurality of segments comprising 5′NNx aNy bNzc3′ is output. Typically the target nucleic acid comprises a target gene. In some methods, at least two of (x, a), (y, b) and (z, c) are (G, K). In some methods, all three of (x, a), (y, b) and (z, c) are (G, K). Some methods further comprise identifying a second segment of the gene comprising 5′NNx aNy bNzc3′, wherein each of (x, a), (y, b) and (z, c) is (N, N) or (G, K); at least one of (x, a), (y, b) and (z, c) is (G, K); and N and K are IUPAC-IUB ambiguity codes. In some methods, in the second segment at least two of (x, a), (y, b) and (z, c) are (G, K). In some methods, in the second segment all three of (x, a), (y, b) and (z, c) are (G, K). In some methods, the first and second segments are separated by fewer than 5 bases in the target site.
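By way of illustration only, the motif search described above might be sketched as follows. The sketch treats the motif as a 10-base window read 5′ to 3′ on one strand, in which N is any base and K is G or T (the IUPAC meanings). The function names `gk_pairs` and `find_target_sites` and the parameter `min_pairs` are hypothetical and do not appear in the specification.

```python
K = ("G", "T")  # IUPAC ambiguity code K stands for G or T

# 0-based indices within 5'-N N x a N y b N z c-3':
# x=2, a=3, y=5, b=6, z=8, c=9
PAIR_INDICES = ((2, 3), (5, 6), (8, 9))


def gk_pairs(site):
    """Count how many of (x, a), (y, b), (z, c) equal (G, K) in a 10-base site."""
    return sum(1 for i, j in PAIR_INDICES if site[i] == "G" and site[j] in K)


def find_target_sites(seq, min_pairs=1):
    """Return (start, site, n_pairs) for each 10-base window with at least
    min_pairs (G, K) pairs. Only the strand given in seq is scanned."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 9):
        site = seq[start:start + 10]
        if not set(site) <= set("ACGT"):
            continue  # skip windows containing ambiguous or unknown bases
        n = gk_pairs(site)
        if n >= min_pairs:
            hits.append((start, site, n))
    return hits
```

Setting `min_pairs` to 2 or 3 corresponds to the narrower methods in which at least two, or all three, of the pairs are (G, K). Scanning the opposite strand would require a separate pass over the reverse complement.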
Some methods further comprise synthesizing a zinc finger protein comprising first, second and third fingers that bind to the bNz, aNy and NNx triplets, respectively. In some such methods, the synthesizing step comprises synthesizing a first zinc finger protein comprising three zinc fingers that respectively bind to the NNx, aNy and bNz triplets in the target segment and a second three fingers that respectively bind to the NNx, aNy and bNz triplets in the second target segment. In some methods, each of the first, second and third fingers is selected or designed independently. In some methods, a finger is designed from a database containing designations of zinc finger proteins, subdesignations of finger components, and nucleic acid sequences bound by the zinc finger proteins. In some methods, a finger is selected by screening variants of a zinc finger binding protein for specific binding to the target site to identify a variant that binds to the target site.
Some methods further comprise contacting a sample containing the target nucleic acid with the zinc finger protein, whereby the zinc finger protein binds to the target site revealing the presence of the target nucleic acid or a particular allelic form thereof. In some methods, a sample containing the target nucleic acid is contacted with the zinc finger protein, whereby the zinc finger protein binds to the target site thereby modulating expression of the target nucleic acid.
In some methods, the target site occurs in a coding region. In some methods, the target site occurs within or proximal to a promoter, enhancer, or transcription start site. In some methods, the target site occurs outside a promoter, regulatory sequence or polymorphic site within the target nucleic acid.
In another aspect, the invention provides alternate methods for selecting a target site within a polynucleotide for targeting by a zinc finger protein. These methods comprise providing a polynucleotide sequence and selecting a potential target site within the polynucleotide sequence, the potential target site comprising contiguous first, second and third triplets of bases at first, second and third positions in the potential target site. A plurality of subscores are then determined by applying a correspondence regime between triplets and triplet position in a sequence of three contiguous triplets, wherein each triplet has first, second and third corresponding positions, and each combination of triplet and triplet position has a particular subscore. A score is then calculated for the potential target site by combining subscores for the first, second, and third triplets. The selecting, determining and calculating steps are then repeated at least once on a further potential target site comprising first, second and third triplets at first, second and third positions of the further potential target site to determine a further score. Output is then provided of at least one potential target site with its score. In some methods, output is provided of the potential target site with the highest score. In some methods, output is provided of the n potential target sites with the highest scores, and the method further comprises providing user input of a value for n. In some methods, the subscores are combined by forming the product of the subscores. In some methods, the correspondence regime comprises 64 triplets, each having first, second, and third corresponding positions, and 192 subscores.
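The scoring procedure of the preceding paragraph might be sketched, purely as an illustration, as shown below. The correspondence regime is assumed to be a dictionary keyed by (triplet, position), and the subscores are combined by forming their product. The assignment of position 1 to the 3′-most triplet of a site written 5′ to 3′ is an assumption of this sketch, as are the names `score_site` and `top_sites` and the fallback value `default`.

```python
from math import prod


def score_site(site, regime, default=1.0):
    """Score a 9-base site (written 5'->3') as the product of three subscores.

    Assumption of this sketch: position 1 is the 3'-most triplet,
    position 3 the 5'-most. Missing regime entries fall back to `default`.
    """
    triplets = [site[6:9], site[3:6], site[0:3]]  # positions 1, 2, 3
    return prod(
        regime.get((triplet, pos), default)
        for pos, triplet in enumerate(triplets, start=1)
    )


def top_sites(seq, regime, n=1, default=1.0):
    """Score every 9-base window and return the n best as (score, start, site)."""
    seq = seq.upper()
    scored = [
        (score_site(seq[i:i + 9], regime, default), i, seq[i:i + 9])
        for i in range(len(seq) - 8)
    ]
    # Highest score first; ties are broken by the earlier start position.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return scored[:n]
```

A complete regime in this representation would hold 64 × 3 = 192 entries, one for each combination of triplet and position; the value of `n` would be supplied by the user.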
In some methods, the subscores in the correspondence regime are determined by assigning a first value as the subscore of a subset of triplets and corresponding positions, for each of which there is an existing zinc finger protein that comprises a finger that specifically binds to the triplet from the same position in the existing zinc finger protein as the corresponding position of the triplet in the correspondence regime; assigning a second value as the subscore of a subset of triplets and corresponding positions, for each of which there is an existing zinc finger protein that comprises a finger that specifically binds to the triplet from a different position in the existing zinc finger protein than the corresponding position of the triplet in the correspondence regime; and assigning a third value as the subscore of a subset of triplets and corresponding positions for which there is no existing zinc finger protein comprising a finger that specifically binds to the triplet.
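One way the three-value assignment just described could be realized is sketched below. The input `known` is assumed to be a collection of (triplet, finger position) pairs taken from existing zinc finger proteins. The numerical values 10.0, 8.0 and 1.0 are placeholders chosen for the sketch only; the paragraph above does not fix the first, second and third values. The name `build_regime` is likewise hypothetical.

```python
from itertools import product


def build_regime(known, same=10.0, other=8.0, none=1.0):
    """Build a 192-entry correspondence regime keyed by (triplet, position).

    known: iterable of (triplet, position) pairs, position in {1, 2, 3},
           each recording that an existing ZFP binds that triplet from a
           finger at that position.
    same / other / none: placeholder first, second and third values.
    """
    known = set(known)
    bound_anywhere = {triplet for triplet, _ in known}
    regime = {}
    for bases in product("ACGT", repeat=3):
        triplet = "".join(bases)
        for pos in (1, 2, 3):
            if (triplet, pos) in known:
                regime[(triplet, pos)] = same    # bound from the same position
            elif triplet in bound_anywhere:
                regime[(triplet, pos)] = other   # bound, but from another position
            else:
                regime[(triplet, pos)] = none    # no existing finger binds it
    return regime
```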
In some methods, a context parameter is combined with the subscore of at least one of the first, second and third triplets to give a scaled subscore of the at least one triplet. In some methods, the context parameter is combined with the subscore when the target site comprises a base sequence 5′NNGK3′, wherein NNG is the at least one triplet.
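A minimal sketch of the context adjustment follows. It assumes that the context parameter is combined with the subscore by multiplication and uses a placeholder value of 1.25; neither the manner of combination nor the value is fixed by the paragraph above. The function name `scaled_subscore` is hypothetical.

```python
def scaled_subscore(seq, start, subscore, context=1.25):
    """Scale the subscore of the triplet at seq[start:start+3] (5'->3')
    when the triplet is NNG and the next base on its 3' side is K (G or T).

    Multiplication and the value 1.25 are assumptions of this sketch.
    """
    triplet = seq[start:start + 3]
    next_base = seq[start + 3:start + 4]  # empty string at the end of seq
    # Tuple membership is used so that an empty next_base never matches.
    if len(triplet) == 3 and triplet[2] == "G" and next_base in ("G", "T"):
        return subscore * context
    return subscore
```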
In another aspect, the invention provides methods of designing a zinc finger protein. Such methods use a database comprising designations for a plurality of zinc finger proteins, each protein comprising at least first, second and third fingers; subdesignations for each of the three fingers of each of the zinc finger proteins; and a corresponding nucleic acid sequence for each zinc finger protein, each sequence comprising at least first, second and third triplets specifically bound by the at least first, second and third fingers respectively in each zinc finger protein, the first, second and third triplets being arranged in the nucleic acid sequence (3′-5′) in the same respective order as the first, second and third fingers are arranged in the zinc finger protein (N-terminal to C-terminal). A target site is provided for design of a zinc finger protein, the target site comprising contiguous first, second and third triplets in a 3′-5′ order. For the first, second and third triplets in the target site, first, second and third sets of zinc finger protein(s) in the database are identified, the first set comprising zinc finger protein(s) comprising a finger specifically binding to the first triplet in the target site, the second set comprising zinc finger protein(s) comprising a finger specifically binding to the second triplet in the target site, and the third set comprising zinc finger protein(s) comprising a finger specifically binding to the third triplet in the target site. Designations and subdesignations of the zinc finger proteins in the first, second, and third sets are then output. Some methods further comprise producing a zinc finger protein that binds to the target site comprising a first finger from a zinc finger protein from the first set, a second finger from a zinc finger protein from the second set, and a third finger from a zinc finger protein from the third set.
Some methods further comprise identifying subsets of the first, second and third sets. The subset of the first set comprises zinc finger protein(s) comprising a finger that specifically binds to the first triplet in the target site from the first finger position of a zinc finger protein in the database; the subset of the second set comprises zinc finger protein(s) comprising a finger that specifically binds to the second triplet in the target site from the second finger position in a zinc finger protein in the database; and the subset of the third set comprises zinc finger protein(s) comprising a finger that specifically binds to the third triplet in the target site from a third finger position in a zinc finger protein in the database. Designations and subdesignations of the subsets of the first, second and third sets are output. A zinc finger protein comprising a first finger from the first subset, a second finger from the second subset, and a third finger from the third subset is then produced. In some of the above methods of design, the target site is provided by user input. In some methods, the target site is provided by one of the target site selection methods described above.
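The database lookup described in the two preceding paragraphs might be sketched as follows. Each record is assumed to hold a designation, three finger subdesignations ordered N-terminal to C-terminal, and the three bound triplets listed in 3′-5′ order so that the triplet index matches the finger index. The record type `ZFP` and the function `find_fingers` are hypothetical names introduced for this sketch.

```python
from collections import namedtuple

# name: designation; fingers: subdesignations for fingers 1..3 (N- to C-terminal);
# site: the triplets bound by fingers 1..3, listed in 3'->5' order.
ZFP = namedtuple("ZFP", ["name", "fingers", "site"])


def find_fingers(db, target_triplets):
    """For each target triplet (listed 3'->5'), return two lists of
    (designation, subdesignation) pairs:

    sets[k]    - fingers that bind triplet k from any finger position
    subsets[k] - fingers that bind triplet k from finger position k
    """
    sets, subsets = [], []
    for pos, triplet in enumerate(target_triplets):
        any_position = [
            (z.name, z.fingers[i])
            for z in db
            for i, bound in enumerate(z.site)
            if bound == triplet
        ]
        same_position = [
            (z.name, z.fingers[pos]) for z in db if z.site[pos] == triplet
        ]
        sets.append(any_position)
        subsets.append(same_position)
    return sets, subsets
```

Choosing a finger from each subset, where one is available, keeps the finger in the same position it occupies in the existing protein; the full sets serve as a fallback when a subset is empty.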
The invention further provides computer program products for implementing any of the methods described above. One computer program product implements methods for selecting a target site within a polynucleotide for targeting by a zinc finger protein. Such a product comprises (a) code for providing a polynucleotide sequence; (b) code for selecting a potential target site within the polynucleotide sequence, the potential target site comprising first, second and third triplets of bases at first, second and third positions in the potential target site; (c) code for calculating a score for the potential target site from a combination of subscores for the first, second, and third triplets, the subscores being obtained from a correspondence regime between triplets and triplet position, wherein each triplet has first, second and third corresponding positions, and each corresponding triplet and position has a particular subscore; (d) code for repeating steps (b) and (c) at least once on a further potential target site comprising first, second and third triplets at first, second and third positions of the further potential target site to determine a further score; (e) code for providing output of at least one of the potential target sites with its score; and (f) a computer readable storage medium for holding the codes.
The invention further provides computer systems for implementing any of the methods described above. One such system for selecting a target site within a polynucleotide for targeting by a zinc finger protein comprises (a) a memory; (b) a system bus; and (c) a processor. The processor is operatively disposed to: (1) provide or receive a polynucleotide sequence; (2) select a potential target site within the polynucleotide sequence, the potential target site comprising first, second and third triplets of bases at first, second and third positions in the potential target site; (3) calculate a score for the potential target site from a combination of subscores for the first, second, and third triplets, the subscores being obtained from a correspondence regime between triplets and triplet position, wherein each triplet has first, second and third corresponding positions, and each corresponding triplet and position has a particular subscore; (4) repeat steps (2) and (3) at least once on a further potential target site comprising first, second and third triplets at first, second and third positions of the further potential target site to determine a further score; and (5) provide output of at least one of the potential target sites with its score.
A further computer program product for producing a zinc finger protein comprises: (a) code for providing a database comprising designations for a plurality of zinc finger proteins, each protein comprising at least first, second and third fingers; subdesignations for each of the three fingers of each of the zinc finger proteins; and a corresponding nucleic acid sequence for each zinc finger protein, each sequence comprising at least first, second and third triplets specifically bound by the at least first, second and third fingers respectively in each zinc finger protein, the first, second and third triplets being arranged in the nucleic acid sequence (3′-5′) in the same respective order as the first, second and third fingers are arranged in the zinc finger protein (N-terminus to C-terminus); (b) code for providing a target site for design of a zinc finger protein, the target site comprising at least first, second and third triplets; (c) for the first, second and third triplets in the target site, code for identifying first, second and third sets of zinc finger protein(s) in the database, the first set comprising zinc finger protein(s) comprising a finger specifically binding to the first triplet in the target site, the second set comprising zinc finger protein(s) comprising a finger specifically binding to the second triplet in the target site, and the third set comprising zinc finger protein(s) comprising a finger specifically binding to the third triplet in the target site; (d) code for outputting designations and subdesignations of the zinc finger proteins in the first, second, and third sets identified in step (c); and (e) a computer readable storage medium for holding the codes.
The invention further provides a system for producing a zinc finger protein. The system comprises (a) a memory; (b) a system bus; and (c) a processor. The processor is operatively disposed to: (1) provide a database comprising designations for a plurality of zinc finger proteins, each protein comprising at least first, second and third fingers; subdesignations for each of the three fingers of each of the zinc finger proteins; and a corresponding nucleic acid sequence for each zinc finger protein, each sequence comprising at least first, second and third triplets specifically bound by the at least first, second and third fingers respectively in each zinc finger protein, the first, second and third triplets being arranged in the nucleic acid sequence (3′-5′) in the same respective order as the first, second and third fingers are arranged in the zinc finger protein (N-terminus to C-terminus); (2) provide a target site for design of a zinc finger protein, the target site comprising at least first, second and third triplets; (3) for the first, second and third triplets in the target site, identify first, second and third sets of zinc finger protein(s) in the database, the first set comprising zinc finger protein(s) comprising a finger specifically binding to the first triplet in the target site, the second set comprising zinc finger protein(s) comprising a finger specifically binding to the second triplet in the target site, and the third set comprising zinc finger protein(s) comprising a finger specifically binding to the third triplet in the target site; and (4) output designations and subdesignations of the zinc finger proteins in the first, second, and third sets identified in step (3).