Transcription activator-like effectors (TALEs) are proteins originating from bacterial plant pathogens of the Xanthomonas genus that are able to target DNA sequences specifically, in a base-dependent manner. The binding domains of these proteins are composed of an array of highly similar tandem repeats of 33 to 35 amino acids, which differ essentially at residues 12 and 13 (repeat variable di-residues, or RVDs). In nature, TALE proteins selectively target promoter DNA sequences during infection of plants by Xanthomonas. The study of these RVDs in relation to the natural promoter DNA sequences recognized by AvrBs3, a representative protein of this family, has revealed a specific correlation between the RVDs found within the TAL effector DNA binding domain and the bases present in the target nucleic acid sequences. As a result, a code has been established between amino acids and nucleic acid bases, so that it is now possible, by following said code, to engineer TAL effector DNA binding domains by assembling selected RVDs to target specific DNA sequences (Moscou and Bogdanove, 2009; Boch et al., 2009; Scholze et al., 2009). The remarkably high specificity of TALE repeats and the apparent absence of context-dependence effects among repeats in an array allow modular assembly of TALE DNA binding domains able to recognize almost any nucleic acid sequence of interest.
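By way of illustration only, the one-RVD-per-base code referred to above can be sketched as a simple lookup. The table below uses the commonly cited standard pairings (NI for A, HD for C, NN for G, NG for T); these pairings are an assumption here, since this passage refers to the code without reproducing it, and the function name is hypothetical.

```python
# Commonly cited standard RVD code (assumed; not listed in this passage).
STANDARD_CODE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def rvds_for_target(target: str) -> list[str]:
    """Return one RVD per base of the target, read 5' to 3'."""
    try:
        return [STANDARD_CODE[base] for base in target.upper()]
    except KeyError as err:
        # Ambiguity codes such as N are rejected rather than guessed.
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None


print("-".join(rvds_for_target("TCAG")))
```

Such a lookup treats every position independently, which corresponds to the assumption of no context dependence among repeats.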
The recent determination of the high-resolution structure of TAL effectors bound to DNA confirmed that each single base of the same strand in the DNA target is contacted by a single repeat motif and that the specificity results from the two polymorphic amino acids at positions 12 and 13. In addition to the central core mediating sequence-specific DNA interaction, TALE proteins comprise an N-terminal translocation domain, responsible for the requirement of a first thymine base (T0) in the targeted sequence, and a C-terminal domain that contains nuclear localization signals (NLS) and a transcriptional activation domain (AD). It should be noted that the last repeated motif is composed only of the first 20 conserved amino acids (terminal half repeat).
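The T0 requirement and the terminal half repeat can be expressed as simple bookkeeping when scanning a sequence for candidate targets. The following is a minimal sketch, assuming that the half repeat still carries an RVD (so that an array of n.5 repeats reads n + 1 bases after T0) and scanning the given strand only; the function name and parameters are hypothetical.

```python
def candidate_sites(sequence: str, n_full_repeats: int) -> list[tuple[int, str]]:
    """List (start, target) pairs whose preceding base (T0) is a thymine.

    Assumption: n full repeats plus the terminal half repeat read
    n + 1 bases, because the half repeat still contains an RVD.
    Start indices are 0-based positions in the given strand.
    """
    seq = sequence.upper()
    length = n_full_repeats + 1
    sites = []
    # Start at 1 so that a T0 position (start - 1) always exists.
    for start in range(1, len(seq) - length + 1):
        if seq[start - 1] == "T":
            sites.append((start, seq[start:start + length]))
    return sites


print(candidate_sites("GTACGTTAC", 2))
```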
Although natural TAL effectors are composed of arrays of repeated motifs that vary greatly in number (from 5.5 to 33.5), it has been shown that at least 10.5 repeats are required for maximal transcriptional activation and that a larger number of repeated motifs does not directly correlate with stronger activity.
The remarkably simple one-to-one repeat/base association found in TALE proteins has been used to create and engineer arrays that were subsequently fused to various catalytic heads, such as transcription factors and non-specific nuclease domains (TALE-nucleases). TALE-nucleases use the non-specific nuclease domain of the restriction enzyme FokI. Since FokI is activated upon dimerization, TALE-nucleases have to function in pairs, the double-strand break (DSB) occurring within the spacer sequence separating the two opposing targets. Taking advantage of the two conserved pathways used by nearly all organisms to repair DSBs, non-homologous end joining (NHEJ) and homologous recombination (HR), one can introduce small insertions or deletions (indels) at a desired location in the genome, leading to gene disruption (through NHEJ), or introduce or replace an entire gene of interest (through HR). The modularity of TALE-nucleases has been confirmed to a certain extent by the assembly of designed molecules and the resulting detection of alterations at endogenous genes in various organisms such as yeast, plants, nematodes and mammalian cells.
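The paired arrangement described above can be pictured with a short sketch that assembles the top-strand sequence of a locus bound by two TALE-nucleases. It assumes that the two targets lie on opposite strands, that each is preceded by its own T0, and that the spacer is supplied by the caller (no spacer length is specified here); all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def pair_locus(left: str, spacer: str, right: str) -> str:
    """Top-strand sequence of a locus bound by a TALE-nuclease pair.

    `left` and `right` are written 5'->3' on the strand read by their
    own array and exclude T0. The right half-site sits on the bottom
    strand, so on the top strand it appears reverse-complemented and
    its T0 appears as a trailing A. The spacer is lower-cased only to
    make it stand out in the output.
    """
    return "T" + left.upper() + spacer.lower() + reverse_complement(right) + "A"


print(pair_locus("ACG", "GGGG", "ACG"))
```

In this layout the DSB is expected within the lower-case spacer, between the two half-sites.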
Nevertheless, up to now, researchers have mainly published successful uses of TALE-nucleases without reporting how frequently a TALE-nuclease fails to work. The design of these arrays still relies on the published code (Moscou and Bogdanove 2009, Boch et al. 2009) represented in FIG. 1, which in fact provides different RVDs for different nucleic acid bases and vice versa. In practice, it is observed that in a number of cases, engineered TALE proteins do not work or do not have the expected level of specificity or activity towards their nucleic acid target sequence. Under these conditions, it remains difficult to predict the level of specificity of an engineered TALE binding domain until it is assayed. For TALE-nucleases, cleavage assays are generally performed according to the so-called SSA protocol in yeast cells as described in WO 2004/067736, which requires transformation of yeast with both a plasmid encoding the engineered TALE-nuclease and the nucleic acid target sequence to be cleaved in order to measure cleavage activity. Performing such assays is notably time-consuming and costly.
On the other hand, because of the sequence similarity of the tandem repeats, it is also time-consuming and expensive to assemble them when constructing expression plasmids encoding TALE binding domains.
Thus, there remains a need for methods improving the design of TALE domains that would ideally involve a smaller set of repeat domains and have a predicted specificity.
It is particularly important to predict the targeting specificity of TALE proteins when creating TALE-nucleases or transcription activators, because the latter are used to modify cell lines, which may be used in cell therapy, bioproduction, or plant or animal transgenesis. In such applications, it is crucial to control or model off-targeting in order to reduce potential cytotoxicity or side effects.
In order to define rules for optimizing the activity and/or specificity and/or flexibility of target/TALE-nuclease pairs, the inventors have performed an extensive study of the activity, specificity and context dependence of four different RVDs (NN, NG, NI and HD) at the first 7 RVD/base positions in the context of TALE-nucleases of various lengths. This study, which is detailed in the experimental part of the present disclosure, systematically tested all possible combinations of the four RVDs with respect to the four bases A, T, C and G within triplets.
Accordingly, a collection containing the 64 possible combinations of three RVDs, at positions 1/2/3, 3/4/5 or 5/6/7, was screened for activity on a collection of 64 targets containing all combinations of A, T, C and G bases at positions 1/2/3, 3/4/5 or 5/6/7. Heat maps of the mutants versus the targets allow visualization of the frequency and intensity of cutting of the mutants on their respective targets (diagonal), but also of the frequency and intensity of their off-site activity. In addition, the same study was performed on the first 3 RVD/base positions (1/2/3) in the context of a TALE-nuclease of 18.5 repeats in total.
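The combinatorial layout of such a screen can be enumerated as follows. This is a sketch only: it assumes that mutants varied at a given position window are assayed against targets varied at the same window, and it uses the commonly cited standard pairings (assumed, not listed in this passage) solely to mark which cells fall on the diagonal. No activity values are generated.

```python
from itertools import product

RVDS = ("NN", "NG", "NI", "HD")
BASES = ("A", "T", "C", "G")
WINDOWS = ((1, 2, 3), (3, 4, 5), (5, 6, 7))  # RVD/base positions varied

# Assumed standard pairing, used only to locate the "diagonal".
EXPECTED_BASE = {"NN": "G", "NG": "T", "NI": "A", "HD": "C"}

RVD_TRIPLETS = list(product(RVDS, repeat=3))                    # 64 mutants
BASE_TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]  # 64 targets


def expected_target(rvd_triplet):
    """Base triplet predicted for an RVD triplet by the assumed code."""
    return "".join(EXPECTED_BASE[rvd] for rvd in rvd_triplet)


def screen_layout():
    """Yield (window, rvd_triplet, base_triplet, on_diagonal) per cell."""
    for window in WINDOWS:
        for rvds in RVD_TRIPLETS:
            for bases in BASE_TRIPLETS:
                yield window, rvds, bases, bases == expected_target(rvds)


cells = list(screen_layout())
print(len(RVD_TRIPLETS), len(BASE_TRIPLETS), len(cells))
print(sum(on_diagonal for *_, on_diagonal in cells))
```

Each window thus corresponds to one 64 by 64 grid, in which off-diagonal cells are the candidates for off-site activity.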
To the knowledge of the inventors, this study is the first to take a systematic approach using synthetic DNA targets containing all combinations of A, T, C and G bases. The previous studies that led to the establishment of the basic RVD code were based on statistical analysis of naturally occurring RVDs with respect to natural DNA targets (i.e. based on natural diversity).
The data from these experiments provide information on the context dependency of an RVD/base pair relative to its position in an array, but also on the context dependency of the off-target activity of a particular RVD on all targets, relative to its position in the array.
The collected data have made it possible to establish methods and procedures for designing repeat sequences with improved or modulated specificity (e.g. avoiding RVDs that recognize AA and CA at positions 1 and 2 in order to optimize activity), which are the subject-matter of the present invention.
Contrary to the teaching of the prior art, the data obtained by the inventors show that the RVD NI can be used to target T or G, and HD or NG to target G. Accordingly, many “non-standard” RVD triplets (by reference to the standard code established by Bogdanove) could be introduced into the repeat sequences, resulting in an equivalent or improved specificity with respect to a given target sequence. These alternative RVD sequences allow the design of active TALE-nucleases starting with only subsets of the 64 tri-RVDs or 16 di-RVDs (NN, NG, NI and HD combinations). They are also of interest, for instance, for designing strategies to target DNA sequences homologous to a given target DNA sequence without the latter being itself targetable.
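To illustrate how such alternative pairings enlarge the design space, the sketch below enumerates the RVD arrays available for a target when the additional pairings stated above (NI for T or G, HD or NG for G) are combined with the commonly cited standard pairings, which are an assumption here. It deliberately ignores position and context effects, which the disclosed data show to matter; the names are hypothetical.

```python
from itertools import product

# Standard pairings (assumed) plus the extra pairings reported above.
RVDS_BY_BASE = {
    "A": ["NI"],
    "C": ["HD"],
    "G": ["NN", "NI", "HD", "NG"],
    "T": ["NG", "NI"],
}


def alternative_arrays(target, allowed=None):
    """Enumerate RVD arrays for a target.

    `allowed` optionally restricts the design to a subset of RVDs; an
    empty list is returned when some base cannot be covered.
    """
    options = []
    for base in target.upper():
        choices = [r for r in RVDS_BY_BASE[base]
                   if allowed is None or r in allowed]
        options.append(choices)
    return [list(combo) for combo in product(*options)]


print(len(alternative_arrays("GT")))
print(alternative_arrays("AGT", allowed={"NI"}))
```

Restricting `allowed` mimics designing with a reduced set of repeat modules; whether a given alternative array is actually active would still have to be assayed.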
Data from these experiments also provide information on the context dependency of an RVD/base pair relative to its position in an array, as well as on the context dependency of the off-target activity of a particular RVD on all targets, relative to its position in the array.
In a general aspect, the present invention relates to a method allowing the design of repeat arrays with modular activities. The invention allows increasing or reducing the activity on certain targets (activity); increasing the specificity of a repeat array for one target compared to all other possible targets, thereby reducing off-target events (specificity); and decreasing the specificity so that one repeat array targets more than one target or only a certain set of desired targets (flexibility).