Significant progress has been made in recent years in the way genomes can be investigated and modified in living cells. The main challenge is to transfect living cells with enzyme molecules that can process targeted genetic sequences in a sequence-specific manner without inducing toxicity. This goal has been reached using enzymes derived from natural proteins, for instance by creating variants of homing endonucleases, also called meganucleases (Stoddard, Monnat et al. 2007; Arnould, Delenda et al. 2011), but also by creating fusion proteins, such as the fusion of a TALE DNA binding domain with a catalytic domain (Christian, Cermak et al. 2010; Li, Huang et al. 2011).
Transcription Activator Like Effectors (TALE) have been widely used for several applications in the field of genome engineering. The sequence specificity of this family of proteins, which plant pathogens of the Xanthomonas genus use in the infection process, is driven by an array of repeat motifs of 33 to 35 amino acids that differ essentially at positions 12 and 13 (Boch, Scholze et al. 2009; Moscou and Bogdanove 2009). The recently solved high-resolution structures of TAL effectors bound to DNA showed that each base of one strand of the DNA target is contacted by a single repeat (Deng, Yan et al. 2012; Mak, Bradley et al. 2012), with the specificity resulting from the two polymorphic amino acids of the repeat, the so-called RVDs (repeat variable dipeptides). The modularity of these DNA binding domains has been confirmed by assembling repeats to design TALE-derived proteins with new sequence specificities.
TALE proteins have so far been described as containing: (i) an N-terminal domain including a translocation signal, (ii) a central DNA-binding domain, and (iii) a C-terminal domain including a nuclear localization signal (NLS) and an acidic activation domain (AD). A representative member of this family is AvrBs3 from Xanthomonas vesicatoria (SWISSPROT P14727), which has a 1164 amino acid sequence comprising an N-terminal domain of 288 amino acids (positions 1 to 288), a central domain of 593 amino acids (positions 289 to 881), and a C-terminal domain of 283 amino acids (positions 882 to 1164) comprising an NLS and an AD (transcription activation domain). The DNA-binding domain, which determines the target specificity of each TALE, consists of a variable number (generally 12 to 27) of tandem, nearly identical repeats of 33 to 35 amino acids, followed by a single truncated repeat. For example, the AvrBs3 DNA-binding domain (SEQ ID NO. 1) comprises 17 repeats of 34 amino acids and a truncated repeat of 15 amino acids. The “repeat-variable di-residue” (RVD), which represents the variable residues in the repeat, determines the specificity of interaction with the nucleotide base of the DNA target in a code-like fashion with some degeneracy. The four most common RVDs are HD for C, NI for A, NG for T and NN for G (Boch, Scholze et al. 2009; Moscou and Bogdanove 2009; Bogdanove and Voytas 2011; WO 2011/072246).
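As a quick consistency check on the AvrBs3 figures quoted above, the following sketch recomputes the domain lengths from the stated positions and compares the central domain with the stated repeat composition. The variable and function names are illustrative only; the numbers are those given in the text.

```python
# Domain boundaries of AvrBs3 as stated in the text (1-based, inclusive).
DOMAINS = {
    "N-terminal": (1, 288),
    "central": (289, 881),
    "C-terminal": (882, 1164),
}


def domain_length(start, end):
    """Length of an inclusive, 1-based residue range."""
    return end - start + 1


lengths = {name: domain_length(*span) for name, span in DOMAINS.items()}
total_length = sum(lengths.values())

# 17 full repeats of 34 residues plus one truncated repeat of 15 residues.
repeat_region_length = 17 * 34 + 15

print(lengths)               # per-domain lengths
print(total_length)          # should equal the stated 1164 residues
print(repeat_region_length)  # should equal the central domain length
```

The three domains sum to the stated 1164 residues, and the 17 full repeats plus the truncated repeat account for all 593 residues of the central domain.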
This straightforward sequence relationship between RVDs and nucleotide bases allows the production of custom TAL effectors that bind DNA sequences of interest by assembling an array of repeats that corresponds to the intended target site. Such engineered TALE proteins have improved gene-editing technology (Baker 2012). A variety of rapid construction methods for custom TALE fusion proteins have recently been developed, based on the protein scaffold of AvrBs3-like proteins, by adding catalytic protein domains to the C-terminus (US 2011/0145940; Cermak, Doyle et al. 2010; Weber, Gruetzner et al. 2011; Zhang, Cong et al. 2011; Doyle, Booher et al. 2012). TAL effectors have, for instance, been fused to a nuclease catalytic head to form specific nucleases (TALE-Nuclease), thereby creating new tools, especially for genome engineering applications, that have proven efficient in cell-based assays in yeast, mammalian cells and plants (Cermak, Doyle et al. 2010; Christian, Cermak et al. 2010; Geissler, Scholze et al. 2011; Huang, Xiao et al. 2011; Li, Huang et al. 2011; Mahfouz, Li et al. 2011; Miller, Tan et al. 2011; Morbitzer, Elsaesser et al. 2011; Mussolino, Morbitzer et al. 2011; Sander, Cade et al. 2011; Tesson, Usal et al. 2011; Weber, Gruetzner et al. 2011; Zhang, Cong et al. 2011; Li, Piatek et al. 2012; Mahfouz, Li et al. 2012).
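The one-repeat-per-base relationship described above can be sketched in a few lines of code. The example below is a simplified illustration, not a design tool: it uses only the four most common RVDs named in the text, ignores the degeneracy of the code, and the function names and the example target sequence are hypothetical.

```python
# The four most common RVDs and the base each one recognizes, as listed in the text.
RVD_TO_BASE = {"HD": "C", "NI": "A", "NG": "T", "NN": "G"}
BASE_TO_RVD = {base: rvd for rvd, base in RVD_TO_BASE.items()}


def design_rvd_array(target):
    """Return one RVD per base of the target, in target order."""
    target = target.upper()
    unknown = sorted(set(target) - set(BASE_TO_RVD))
    if unknown:
        raise ValueError(f"unsupported bases in target: {unknown}")
    return [BASE_TO_RVD[base] for base in target]


def predicted_target(rvds):
    """Read an RVD array back into the DNA sequence it is expected to bind."""
    return "".join(RVD_TO_BASE[rvd] for rvd in rvds)


example_target = "TACGT"  # hypothetical target, for illustration only
rvds = design_rvd_array(example_target)
print(rvds)                    # ['NG', 'NI', 'HD', 'NN', 'NG']
print(predicted_target(rvds))  # TACGT
```

Each entry of the returned list stands for one repeat of the engineered array, so the length of the array equals the length of the target site.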
However, the Transcription Activator Like Effectors described so far in the literature (AvrXa7, Hax, PthXo1, ...) are highly similar to the protein AvrBs3, and all originate from the Xanthomonas genus or the closely related Ralstonia genus.
One of the drawbacks of the Transcription Activator Like Effectors from Xanthomonas lies in the fact that they mostly consist of highly repetitive motifs that are nearly identical to each other. The high identity of these repeats is prone to cause genetic recombination or instability when the repeats are assembled to form engineered nucleic acid binding domains.
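To give a sense of how close to identical such repeats are, the sketch below computes the percent identity between two 34-residue repeats that differ only at positions 12 and 13. The template sequence is an illustrative placeholder modeled on commonly reported TALE repeats and is not taken from this document; the function names are likewise hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def with_rvd(repeat, rvd):
    """Return a copy of the repeat with residues 12 and 13 (1-based) set to rvd."""
    return repeat[:11] + rvd + repeat[13:]


# Illustrative 34-residue repeat (assumption: placeholder, not from the source).
TEMPLATE = "LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG"

repeat_c = with_rvd(TEMPLATE, "HD")  # repeat carrying the RVD associated with C
repeat_t = with_rvd(TEMPLATE, "NG")  # repeat carrying the RVD associated with T

identity = percent_identity(repeat_c, repeat_t)
print(round(identity, 1))  # 94.1
```

Two repeats of this kind share 32 of 34 residues, which is why their coding sequences offer almost no unique sites for primers or restriction enzymes.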
A first level of difficulty occurs at the polynucleotide level when cloning the repeat sequences, because restriction sites and PCR primers are basically the same for each repeat. Under these conditions, it becomes difficult to perform routine lab procedures to check that the repeats have been cloned properly, in the correct number and in the right order. This is nevertheless essential to achieve proper expression of a DNA binding protein that is expected to show specificity for a desired nucleic acid sequence.
A second level of difficulty occurs when the polynucleotide sequences are included in vectors for heterologous expression, in particular when using viral vectors. As recently reported by Holkers et al. (2012), it appears that DNA tandem repeat motifs from the TALE scaffold are generally incompatible with lentiviral vector systems due to internal sequence recombination. This particularly limits the current use of TALE proteins in primary cells, which are generally not permissive towards classical gene transfer technologies.
Lower efficiencies of TALE-derived proteins have also been reported in certain cell types, for instance in mice, or in relation to epigenetic modifications, so that alternative or complementary solutions to improve TALE-derived proteins are still actively sought.
Unexpectedly, the present inventors have identified putative proteins from the bacterial endosymbiont Burkholderia rhizoxinica, and others from a marine organism, that display highly polymorphic modules having specific DNA binding activity while having very different sequences (less than 40% identity) in comparison with TALE repeats. These proteins also have completely different N- and C-terminal domains. The modules found in these proteins have higher sequence variability than TALE repeats and can nevertheless be assembled to engineer new base per base specific binding domains (MBBBD) to target nucleic acid sequences in genomes. These modules confer better sequence stability when they are assembled and expressed in living cells as nucleic acid binding domains.