Transcription activator-like effectors (TALEs), a group of bacterial plant pathogen proteins, have recently emerged as new engineerable scaffolds for the production of tailored DNA binding domains with chosen specificities (1, 2). The TALE DNA binding domain is composed of a variable number of 33-35 amino acid repeat modules. These repeat modules are nearly identical to each other except for two variable amino acids located at positions 12 and 13 (i.e. Repeat Variable Di-residues, RVD). The nature of residues 12 and 13 determines the base preference of each individual repeat module. Moscou M. J. and Bogdanove A. J., and Boch et al., described the following code: HD for recognizing C; NG for recognizing T; NI for recognizing A; NN for recognizing G or A; NS for recognizing A or C or G or T; HG for recognizing T; IG for recognizing T; NK for recognizing G; HA for recognizing C; ND for recognizing C; HI for recognizing C; HN for recognizing G; NA for recognizing G; SN for recognizing G or A; and YG for recognizing T (International PCT Application WO 2011/072246 and 3, 4). This remarkably simple cipher, consisting of a one-repeat-to-one-base-pair code, allowed the prediction of TAL effector binding sites and, more importantly, the construction of custom TAL effector repeat domains that can be tailored to bind a DNA sequence of interest. This unprecedented feature opened up exciting perspectives for developing new molecular tools for targeted genome applications, and within the past two years TALE-derived proteins have been fused to transcription activator/repressor or nuclease domains and successfully used to specifically regulate transcription of chosen genes (5) or to perform targeted gene modifications and insertions (6-9).
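By way of illustration only, the one-repeat-to-one-base-pair code listed above can be written as a lookup table. The following Python sketch is not part of the cited work; the names `RVD_CODE`, `recognized_bases` and `matches` are hypothetical, and the sketch treats the code as strictly qualitative, ignoring relative binding strengths and any other sequence constraints on the binding site.

```python
# RVD-to-base code exactly as listed in the text above.
# Each value is the set of bases the RVD is described as recognizing.
RVD_CODE = {
    "HD": "C", "NG": "T", "NI": "A", "NN": "GA", "NS": "ACGT",
    "HG": "T", "IG": "T", "NK": "G", "HA": "C", "ND": "C",
    "HI": "C", "HN": "G", "NA": "G", "SN": "GA", "YG": "T",
}


def recognized_bases(rvds):
    """Return, for each repeat in order, the set of bases its RVD recognizes."""
    return [set(RVD_CODE[rvd]) for rvd in rvds]


def matches(rvds, dna):
    """Check whether a DNA sequence is compatible with an RVD array.

    One repeat is paired with one base, in order; the lengths must agree.
    """
    dna = dna.upper()
    if len(rvds) != len(dna):
        return False
    return all(base in RVD_CODE[rvd] for rvd, base in zip(rvds, dna))
```

For example, under these assumptions the hypothetical array HD-NG-NI-NN is compatible with both CTAG and CTAA, since NN is listed as recognizing G or A, but not with CTAC.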
Critical to the efficiency of engineered TALE-derived proteins is their ability to access and efficiently bind their chromosomal target sites. Numerous factors may hinder binding, including DNA packaging into chromatin, the position of nucleosomal proteins with respect to the target site, and chemical DNA modifications such as methylation. In higher eukaryotes, DNA methylation is involved in the regulation of gene expression and predominantly occurs at the C5 position of cytosine found in the dinucleotide sequence CpG (10), and also in CpA, CpT and CpC (11). The presence of such an additional methyl moiety may hinder recognition of the modified cytosine by the RVD HD, which is commonly used to target cytosine. This feature may represent an important epigenetic drawback for genome engineering applications using TALE-derived proteins.
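As a rough illustration of where this drawback could arise, the following Python sketch lists the cytosines of a target sequence that lie in a CpG context, i.e. the positions at which a repeat carrying the RVD HD might face a 5-methylcytosine. The function name `cpg_cytosines` is hypothetical. The sketch assumes a single strand read 5' to 3', ignores the CpA, CpT and CpC contexts mentioned above, and cannot tell whether a given cytosine is actually methylated in a cell, which would have to be determined experimentally.

```python
def cpg_cytosines(target):
    """Return the 0-based indices of cytosines immediately followed by G.

    Only the strand given is scanned; a terminal C whose following base
    lies outside the target sequence is not reported.
    """
    seq = target.upper()
    return [
        i
        for i in range(len(seq) - 1)
        if seq[i] == "C" and seq[i + 1] == "G"
    ]
```

For the hypothetical target TCGACCGT, this reports positions 1 and 5; the cytosine at position 4 is followed by C and is therefore not in a CpG context.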
There remains a need to design new RVDs, repeat sequences and TALE-derived proteins comprising such RVDs that overcome chemical DNA modifications and that efficiently detect, target and process nucleic acids comprising these chemical modifications.
Unexpectedly, the inventors have found, in the course of their intensive laboratory research, that shorter TAL repeats including a gap at amino acid positions 12 and/or 13 (which could be regarded as forming “incomplete RVDs”) can better accommodate chemically modified nucleic acid bases, in particular methylated bases. Based on this finding, they have synthesized TALEs that can efficiently target methylated nucleic acid sequences, and more generally sequences containing chemically modified bases, as a way to overcome the above limitations of current TALE-derived proteins.