The present invention relates to nucleic acid binding proteins. In particular, the invention relates to a method for designing a protein which is capable of binding to any predefined nucleic acid sequence.
Protein-nucleic acid recognition is a commonplace phenomenon which is central to a large number of biomolecular control mechanisms which regulate the functioning of eukaryotic and prokaryotic cells. For instance, protein-DNA interactions form the basis of the regulation of gene expression and are thus one of the subjects most widely studied by molecular biologists.
A wealth of biochemical and structural information explains the details of protein-DNA recognition in numerous instances, to the extent that general principles of recognition have emerged. Many DNA-binding proteins contain independently folded domains for the recognition of DNA, and these domains in turn belong to a large number of structural families, such as the leucine zipper, the “helix-turn-helix” and zinc finger families.
Despite the great variety of structural domains, the specificity of the interactions observed to date between protein and DNA most often derives from the complementarity of the surfaces of a protein α-helix and the major groove of DNA [Klug, (1993) Gene 135:83-92]. In light of the recurring physical interaction of α-helix and major groove, the tantalising possibility arises that the contacts between particular amino acids and DNA bases could be described by a simple set of rules; in effect a stereochemical recognition code which relates protein primary structure to binding-site sequence preference.
It is clear, however, that no code will be found which can describe DNA recognition by all DNA-binding proteins. The structures of numerous complexes show significant differences in the way that the recognition α-helices of DNA-binding proteins from different structural families interact with the major groove of DNA, thus precluding similarities in patterns of recognition. The majority of known DNA-binding motifs are not particularly versatile, and any codes which might emerge would likely describe binding to a very few related DNA sequences.
Even within each family of DNA-binding proteins, moreover, it has hitherto appeared that the deciphering of a code would be elusive. Due to the complexity of the protein-DNA interaction, there does not appear to be a simple “alphabetic” equivalence between the primary structures of protein and nucleic acid which specifies a direct amino acid to base relationship.
International patent application WO 96/06166 addresses this issue and provides a “syllabic” code which explains protein-DNA interactions for zinc finger nucleic acid binding proteins. A syllabic code is a code which relies on more than one feature of the binding protein to specify binding to a particular base, the features being combinable in the forms of “syllables”, or complex instructions, to define each specific contact.
However, this code is incomplete, providing no specific instructions permitting the specific selection of nucleotides other than G in the 5′ position of each triplet. The method relies on randomisation and subsequent selection in order to generate nucleic acid binding proteins for other specificities. Even with the aid of partial randomisation and selection, however, neither the method reported in WO 96/06166 nor any other methods of the prior art have succeeded in isolating a zinc finger polypeptide based on the first finger of Zif268 capable of binding triplets wherein the 5′ base is other than G or T. This is a serious shortfall in any ability to design zinc finger proteins.
Moreover, this document relies upon the notion that zinc fingers bind to a nucleic acid triplet or multiples thereof, as does all of the prior art. We have now determined that zinc finger binding sites are determined by overlapping 4 bp subsites, and that sequence-specificity at the boundary between subsites arises from synergy between adjacent fingers. This has important implications for the design and selection of zinc fingers with novel DNA binding specificities.
The present invention provides a more complete code which permits the selection of any nucleic acid sequence as the target sequence, and the design of a specific nucleic acid-binding protein which will bind thereto. Moreover, the invention provides a method by which a zinc finger protein specific for any given nucleic acid sequence may be designed and optimised. The present invention therefore concerns a recognition code which has been elucidated for the interactions of classical zinc fingers with nucleic acid. In this case a pattern of rules is provided which covers binding to all nucleic acid sequences.
The code set forth in the present invention takes account of synergistic interactions between adjacent zinc fingers, thereby allowing the selection of any desired binding site.
According to a first aspect of the present invention, therefore, we provide a method for preparing a nucleic acid binding protein of the Cys2-His2 zinc finger class capable of binding to a nucleic acid quadruplet in a target nucleic acid sequence, wherein binding to base 4 of the quadruplet by an α-helical zinc finger nucleic acid binding motif in the protein is determined as follows:
a) if base 4 in the quadruplet is A, then position +6 in the α-helix is Glu, Asn or Val;
b) if base 4 in the quadruplet is C, then position +6 in the α-helix is Ser, Thr, Val, Ala, Glu or Asn.
Preferably, binding to base 4 of the quadruplet by an α-helical zinc finger nucleic acid binding motif in the protein is additionally determined as follows:
c) if base 4 in the quadruplet is G, then position +6 in the α-helix is Arg or Lys;
d) if base 4 in the quadruplet is T, then position +6 in the α-helix is Ser, Thr, Val or Lys.
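By way of illustration only, rules (a) to (d) above can be read as a lookup from base 4 of a quadruplet to the residues permitted at position +6 of the α-helix. The following is a minimal sketch under that reading; the names `POSITION_6_RULES` and `candidates_for_position_6` are hypothetical and form no part of the method as described, and the table encodes only the four rules stated above.

```python
from typing import Dict, Tuple

# Rules (a)-(d) as stated in the text: base 4 of the quadruplet mapped to
# the residues permitted at helix position +6. Names are illustrative only.
POSITION_6_RULES: Dict[str, Tuple[str, ...]] = {
    "A": ("Glu", "Asn", "Val"),                        # rule (a)
    "C": ("Ser", "Thr", "Val", "Ala", "Glu", "Asn"),   # rule (b)
    "G": ("Arg", "Lys"),                               # rule (c)
    "T": ("Ser", "Thr", "Val", "Lys"),                 # rule (d)
}


def candidates_for_position_6(base4: str) -> Tuple[str, ...]:
    """Return the residues permitted at position +6 for a given base 4."""
    key = base4.upper()
    if key not in POSITION_6_RULES:
        raise ValueError(f"base 4 must be one of A, C, G or T, got {base4!r}")
    return POSITION_6_RULES[key]


if __name__ == "__main__":
    for base in "ACGT":
        print(base, "->", ", ".join(candidates_for_position_6(base)))
```

The lookup returns every residue a rule permits rather than choosing one, since the rules as stated give alternatives and do not rank them.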
The quadruplets specified in the present invention are overlapping, such that, when read 3′ to 5′ on the − strand of the nucleic acid, base 4 of the first quadruplet is base 1 of the second, and so on. Accordingly, in the present application, the bases of each quadruplet are referred to by number, from 1 to 4, 1 being the 3′ base and 4 being the 5′ base. Base 4 is equivalent to the 5′ base of a classical zinc finger binding triplet.
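The overlap described above can be sketched as follows, assuming the input is the − strand already written 3′ to 5′ and that its length is 3n + 1 for n fingers (an assumption of this sketch, implied by the one-base overlap). The function name `overlapping_quadruplets` and the example sequence are hypothetical.

```python
from typing import List


def overlapping_quadruplets(minus_strand_3_to_5: str) -> List[str]:
    """Split a - strand, written 3' to 5', into overlapping quadruplets.

    Within each returned quadruplet, index 0 is base 1 (the 3' base) and
    index 3 is base 4 (the 5' base). Base 4 of one quadruplet is base 1
    of the next, so consecutive windows start three bases apart.
    """
    seq = minus_strand_3_to_5.upper()
    if len(seq) < 4 or (len(seq) - 1) % 3 != 0:
        raise ValueError("sequence length must be 3n + 1 with n >= 1")
    return [seq[i:i + 4] for i in range(0, len(seq) - 1, 3)]


if __name__ == "__main__":
    # Arbitrary 10-base example, giving three quadruplets for three fingers.
    for number, quad in enumerate(overlapping_quadruplets("ACGTACGTAC"), 1):
        print(f"quadruplet {number}: {quad} (base 1 = {quad[0]}, base 4 = {quad[3]})")
```

In the example, the last base of each quadruplet is the first base of the next, which is the shared base referred to in the text.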
All of the nucleic acid-binding residue positions of zinc fingers, as referred to herein, are numbered from the first residue in the α-helix of the finger, ranging from +1 to +9. “−1” refers to the residue in the framework structure immediately preceding the α-helix in a Cys2-His2 zinc finger polypeptide.
Residues referred to as “++2” are residues present in an adjacent (C-terminal) finger. They reflect the synergistic cooperation between position +2 on base 1 (on the + strand) and position +6 of the preceding (N-terminal) finger on base 4 of the preceding (3′) quadruplet, which is the same base due to the overlap. Where there is no C-terminal adjacent finger, “++” interactions do not operate.
Cys2-His2 zinc finger binding proteins, as is well known in the art, bind to target nucleic acid sequences via α-helical zinc metal atom coordinated binding motifs known as zinc fingers. Each zinc finger in a zinc finger nucleic acid binding protein is responsible for determining binding to a nucleic acid quadruplet in a nucleic acid binding sequence. Preferably, there are 2 or more zinc fingers, for example 2, 3, 4, 5 or 6 zinc fingers, in each binding protein. Advantageously, there are 3 zinc fingers in each zinc finger binding protein.
The method of the present invention allows the production of what are essentially artificial nucleic acid binding proteins. In these proteins, artificial analogues of amino acids may be used, to impart the proteins with desired properties or for other reasons. Thus, the term “amino acid”, particularly in the context where “any amino acid” is referred to, means any sort of natural or artificial amino acid or amino acid analogue that may be employed in protein construction according to methods known in the art. Moreover, any specific amino acid referred to herein may be replaced by a functional analogue thereof, particularly an artificial functional analogue. The nomenclature used herein therefore specifically comprises within its scope functional analogues of the defined amino acids.
The α-helix of a zinc finger binding protein aligns antiparallel to the nucleic acid strand, such that the primary nucleic acid sequence is arranged 3′ to 5′ in order to correspond with the N-terminal to C-terminal sequence of the zinc finger. Since nucleic acid sequences are conventionally written 5′ to 3′, and amino acid sequences N-terminus to C-terminus, the result is that when a nucleic acid sequence and a zinc finger protein are aligned according to convention, the primary interaction of the zinc finger is with the − strand of the nucleic acid, since it is this strand which is aligned 3′ to 5′. These conventions are followed in the nomenclature used herein. It should be noted, however, that in nature certain fingers, such as finger 4 of the protein GLI, bind to the + strand of nucleic acid: see Suzuki et al., (1994) NAR 22:3397-3405 and Pavletich and Pabo, (1993) Science 261:1701-1707. The incorporation of such fingers into nucleic acid binding molecules according to the invention is envisaged.
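The strand convention above can be illustrated with a short sketch, assuming a fully complementary double-stranded DNA target whose + strand is written 5′ to 3′ in the conventional way. Under that assumption, the − strand read 3′ to 5′ is the base-by-base complement of the + strand in the same left-to-right order, with no reversal. The function name `minus_strand_3_to_5` is hypothetical, and the sketch does not cover fingers which bind the + strand.

```python
# Watson-Crick complements for DNA; RNA targets are outside this sketch.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def minus_strand_3_to_5(plus_strand_5_to_3: str) -> str:
    """Return the - strand, written 3' to 5', for a + strand written 5' to 3'.

    The two strands are antiparallel, so reading the - strand 3' to 5'
    walks the duplex in the same direction as reading the + strand 5' to 3'.
    The result is therefore the complement without reversal.
    """
    try:
        return "".join(_COMPLEMENT[base] for base in plus_strand_5_to_3.upper())
    except KeyError as err:
        raise ValueError(f"unexpected base {err.args[0]!r}; expected A, C, G or T")


if __name__ == "__main__":
    plus = "GCGT"  # arbitrary example, written 5' to 3'
    print("+ strand 5'-" + plus + "-3'")
    print("- strand 3'-" + minus_strand_3_to_5(plus) + "-5'")
```

The returned string aligns left to right with the N-terminal to C-terminal order of the fingers, as described in the text.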
The invention provides a solution to a problem hitherto unaddressed in the art, by permitting the rational design of polypeptides which will bind nucleic acid quadruplets whose 5′ residue is other than G. In particular, the invention provides for the first time a solution for the design of polypeptides for binding quadruplets containing 5′ A or C.