The subject of the present invention is a process for the synthesis of polynucleotides that makes it possible to introduce random sequences along more or less extended tracts of the molecule, in such a way that the randomness refers to units of three adjacent nucleotides, and that each of said units is assembled so as to match a limited number of codons, predefined in number and sequence, in order to eliminate the effects of the degeneracy of the genetic code.
The potential applications of a polynucleotide synthesis process with the above features are considerable. Indeed, in recent years applications requiring its use have taken on ever-increasing importance in many fields of scientific research. This is the case, for instance, of site-specific mutagenesis, carried out on a gene coding for a known protein at presumed key positions in order to verify their actual role in the structure or function of the molecule. Another example is provided by libraries containing “boxes” of random sequence synthetic oligonucleotides, which are constructed in order to select molecules capable of carrying out new biological functions.
In all these cases it is of the utmost importance that the randomness of the sequence be controlled, so that only the desired codons are inserted and the effects of the genetic code degeneracy are eliminated. Of equal importance is, obviously, that said polynucleotide synthesis be carried out with a simple, cost-effective and efficient process.
It is useful to define the following terms:
Support: the term support refers to a solid phase material to which monomers are bound in order to carry out a chemical synthesis; said support is usually composed of resin or porous glass grains, but can also be made of any other material known to the person skilled in the art. The term is meant to include one or more monomers coupled to the support for the further reactions of polynucleotide synthesis.
Conjugate or condense: these terms refer to the chemical reactions carried out in order to bind a monomer to a second monomer or to a solid support. These reactions are known to the person skilled in the art and are usually carried out in an automated DNA synthesizer, following the instructions provided by the manufacturer.
Monomers or mononucleotides: the terms monomer or mononucleotide refer to the individual nucleotides used in the chemical synthesis of oligonucleotides. Monomers that can be used include both the ribo- and deoxyribo- forms of each of the five standard nucleotides, derived from the bases adenine (A or dA, respectively), guanine (G or dG), cytosine (C or dC), thymine (T) and uracil (U). Base derivatives or precursors such as inosine are also included among monomers, as are chemically modified nucleotides, for instance those with a reversible blocking group at any position on the purine or pyrimidine bases, on the ribose or deoxyribose, or on the hydroxyl or phosphate groups of the monomer. Such blocking groups include e.g. dimethoxytrityl, benzoyl, isobutyryl, beta-cyanoethyl and diisopropylamino groups, and are used to protect hydroxyl groups, phosphates and exocyclic amines. However, other blocking agents known to the person skilled in the art may be adopted.
Dimers or dinucleotides: the terms dimers or dinucleotides refer to molecular units derived from the condensation of two monomers or mononucleotides as specified above.
Synthesis monomeric units: this term indicates the units used as essential elements in the synthesis process. In the process that is the subject of the present invention they can consist of monomers or dimers; in other processes known in the art they can also consist of trinucleotide units.
Codon or triplet: the term codon or triplet refers to a sequence of three adjacent deoxyribonucleotide monomers that specifies one of the 20 natural amino acids used in polypeptide biosynthesis. The term also includes nonsense codons, i.e. codons that do not encode any amino acid.
Randomized codon or triplet: these terms refer to the case where the same sequence position corresponds to more than one codon in a set of polynucleotides. The number of different codons can vary from 2 to 64 for each specific position.
Anticodon: the term anticodon refers to a sequence of three adjacent ribonucleotide monomers that specifies a corresponding codon according to the known rules of purine and pyrimidine base pairing.
Randomized polynucleotides or oligonucleotides: this term refers to a set of oligonucleotides having randomized codons at one or more positions. For example, if the randomized oligonucleotides are six nucleotides in length (i.e. two codons), and both the first and the second position of the sequence are randomized so as to code for all of the twenty amino acids, then the population of randomized oligonucleotides shall comprise an oligonucleotide set with every possible combination of the twenty triplets in the first and second position. In this case, therefore, the number of possible codon combinations is 400. Analogously, if 15-nucleotide-long randomized oligonucleotides are synthesized in such a way as to be randomized at every position, then all the triplets coding for each of the twenty amino acids will be found at every position. In this case, the randomized oligonucleotide population shall contain 20^5 different possible oligonucleotide species.
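The counting in the definition above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `population_size` and the constant `CODONS_PER_POSITION` are introduced here for the example and assume one codon per amino acid at each randomized position.

```python
# Illustrative sketch: size of a randomized oligonucleotide population when
# every randomized position carries exactly one codon for each of the twenty
# amino acids (names below are hypothetical, chosen for this example).
CODONS_PER_POSITION = 20


def population_size(n_randomized_codons: int) -> int:
    """Number of distinct oligonucleotide species in the population."""
    return CODONS_PER_POSITION ** n_randomized_codons


two_codons = population_size(2)    # six-nucleotide oligonucleotides
five_codons = population_size(5)   # fifteen-nucleotide oligonucleotides
print(two_codons, five_codons)
```

The two values correspond to the 400 combinations and the 20^5 species mentioned in the definition.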
When not clearly defined, other terms used in the present description are meant to be known to those skilled in the field to whom the invention is addressed.
For some terms pertaining to molecular biology techniques, cf. the Sambrook et al. manual (Sambrook et al, 1989). Other terms referring to substances of a chemical nature that are not clearly defined are meant to be known to those skilled in the field of the invention, and in any case their definitions can be found in manuals such as Gait, M. J. et al, 1984.
In general, applications that use synthetic oligonucleotides are of two kinds: those requiring oligonucleotides of known sequence, and those requiring oligonucleotides with an at least partly degenerate or random sequence.
As for the first group of applications, the usual synthesis methods are based on the principle of building the polynucleotide by condensing mononucleotides one at a time, starting from the first at the 3′-terminus, and choosing the mononucleotide for each reaction cycle so as to synthesize a polynucleotide with a desired and unambiguous sequence.
As for the second group of applications, the synthesis follows the same procedure, but at the positions along the sequence where variability is to be inserted the synthetic cycle is carried out using mixtures of two or more different monomers. In every such cycle, mixtures of oligonucleotides differing in the monomer added to the 5′-terminus are thus created. For instance, if 4 different mononucleotides are employed as monomers in a cycle, a mixture containing 4 different polynucleotides, differing among themselves only in the last nucleotide inserted, is obtained. If a synthetic cycle of the same kind is repeated, a mixture of 16 polynucleotides that differ in the last two inserted nucleotides is obtained, and so on.
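The growth of the mixture with each mixed-monomer cycle can be sketched as follows. This is an illustration under the assumption that all four monomers couple in every cycle; `MONOMERS` and `mixture_after` are names introduced for the example.

```python
from itertools import product

# Illustrative sketch: the four deoxyribonucleotide monomers used as a mixture.
MONOMERS = "ACGT"


def mixture_after(cycles: int) -> set:
    """All distinct sequences of the randomized tract after `cycles`
    condensation cycles, each carried out with the full monomer mixture."""
    return {"".join(p) for p in product(MONOMERS, repeat=cycles)}


one_cycle = mixture_after(1)
two_cycles = mixture_after(2)
print(len(one_cycle), len(two_cycles))
```

Each additional mixed cycle multiplies the number of species by the number of monomers in the mixture.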
In general, applications using synthetic polynucleotides provide for a direct or indirect insertion of said polynucleotides into genetic material that will be translated into polypeptides in a given living organism (in vitro translation is seldom used). As is known, the genetic code by which DNA is translated is partly degenerate, i.e. since the 64 possible codons formed by groups of three nucleotides code for only 20 amino acids (plus three terminating or stop signals), more than one codon codes for a single amino acid.
Oligonucleotides having an at least partly random sequence as described above (where a random sequence polynucleotide means a more or less complex mixture of polynucleotides having different sequences) code for random sequence peptides (i.e. for a mixture of peptides, each peptide being coded by one or more polynucleotides).
In fact, the degeneracy of the genetic code has three important consequences for random sequence oligonucleotides that are to be used to derive random polypeptides:
a) any mixture of oligonucleotides having an at least partly random sequence codes for a much simpler polypeptide mixture. For instance, a mixture of oligonucleotides wherein 6 positions are randomly filled by one of the four natural nucleotides is made of 4096 different molecules (4^6 if single nucleotides are considered, or 64^2 if codons are considered), but precisely by virtue of the code degeneracy these code for only 400 different polypeptides (i.e. 20^2).
This phenomenon would be irrelevant by itself, provided that the different polynucleotides that code for the polypeptide had the same physical and chemical features, but different sequences can confer different properties concerning for example solubility, stability and static charge in different conditions, adsorption with filtering means and so on.
b) in the mixture of polypeptides originated by random sequence polynucleotides translation, there will be a percentage of truncated sequence peptides. As a matter of fact, during the random incorporation of the codons also those indicating a stop signal are necessarily inserted, and truncated sequence polypeptides formation is therefore unavoidable.
In the preceding example, out of 4096 oligonucleotides, 375 (9%) will code for polypeptides truncated at the first or second position (i.e. 3 possible termination codons at the first position for each of the 64 possible codons at the second position, and 3 possible stop codons at the second position for each of the 61 possible sense codons at the first). Therefore, together with the 400 possible polypeptides, 21 truncated polypeptides will be found (one at the first position and 20 at the second position). This phenomenon acquires particular significance when libraries of polynucleotides having a fairly long random sequence are created. For instance, in a 27-nucleotide library (coding for nonapeptide libraries, as described in many applications) as much as 35% of the polynucleotides contain a stop codon (i.e. [64^9 - 61^9]/64^9). Longer sequences will contain a higher percentage of molecules coding for premature termination of the polypeptide chain.
c) The different efficiency with which the different codons coding for the same amino acid are translated in different organisms results in polypeptide mixtures whose complexity differs from that of the starting polynucleotide mixtures. Although the genetic code is universal in nature, the efficiency with which different codons coding for the same amino acid are translated does vary among living organisms. For instance, in E. coli serine is coded 18 times more often by codon UCU than by codon UCA. It follows that two different polynucleotides present at equimolar concentration in the initial mixture will be translated with different efficiency, and the resulting polypeptide mixture will contain a different molar ratio of the two molecular species. In order to maximize the efficiency of the selected cellular system, it is therefore of the utmost relevance that the coding sequences contain the very codons that are primarily used by the cellular system itself.
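The figures quoted under points a) and b) can be reproduced by enumerating all hexanucleotides and translating them with the standard genetic code. The following is a minimal sketch; the helper `translate` and the variable names are introduced for the example, and translation is assumed to stop at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    peptide = ""
    for i in range(0, len(seq), 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        peptide += aa
    return peptide


# All oligonucleotides with 6 fully randomized positions (two codons).
hexamers = ["".join(p) for p in product(BASES, repeat=6)]
with_stop = [s for s in hexamers
             if "*" in (CODON_TABLE[s[:3]], CODON_TABLE[s[3:]])]
peptides = {translate(s) for s in hexamers}
full_length = {p for p in peptides if len(p) == 2}
truncated = peptides - full_length   # empty product plus 20 single residues

# Fraction of 27-nucleotide (nine-codon) sequences with at least one stop.
stop_fraction_27nt = 1 - (61 / 64) ** 9

print(len(hexamers), len(with_stop), len(full_length), len(truncated))
print(round(100 * stop_fraction_27nt))
```

The enumeration gives 4096 oligonucleotides, 375 of which contain a stop codon, 400 full-length dipeptides and 21 truncated products, and about 35% stop-containing sequences for the nine-codon case.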
All three of these factors exert a strong influence on the efficiency of systems that utilize random sequence polynucleotides, both in applications that provide for the randomization in just one position, and in applications whose randomization refers to longer sequences. This influence however, is directly proportional to the length and complexity of the random sequence adopted.
This fact interferes especially with the preparation of completely homogeneous mixtures (i.e. those containing the same concentration of every possible molecular species) of random sequence polynucleotides, intended for the preparation of equally complex and homogeneous polypeptide mixtures. Indeed, every effort in this direction is partly thwarted when the polynucleotides are translated into polypeptide molecules, precisely because of the combination of those three factors, and this inevitably affects a considerable range of applications.
Such is the case for example of the efficiency of expression libraries created with such a homogeneous mixture of polynucleotides.
In connection with all these problems, synthesis processes have been developed over time with the aim of overcoming them and of improving the efficiency of the various systems that utilize random sequence polypeptides.
A first solution (perhaps the most obvious from a theoretical point of view) is a polynucleotide synthesis that uses preformed trinucleotides (corresponding to codons) as monomeric units, instead of the individual mononucleotides (Virkenas, B. et al, 1994; Lyttle, M. U. et al, 1995; Ono, A. et al, 1994). Thus the 20 trimers corresponding to the desired codons can first be synthesized, and the polynucleotide synthesis is carried out only later, by condensing at each synthesis cycle monomeric units made of trimers instead of monomers. This solution is apparently simple and effective, but actually requires a complex, expensive and inefficient process, for the following reasons:
1. Although the initial trinucleotide synthesis is easily achieved by condensing three blocked nucleotides in accordance with the regular polynucleotide synthesis process (therefore by a relatively simple and efficient process), there are a number of problems inherent in the phase of detaching the newly formed trinucleotide from the synthesis matrix.
In the normal processes this operation takes place together with the cleavage of all the groups protecting the various bases. In this case, however, the bonds between the nucleotides and the lateral protective groups must remain intact in view of the subsequent use in polynucleotide synthesis, so attempts were made to cleave the 3′-5′ bond with the support matrix without affecting the bonds with the lateral protective groups.
Hence the need to use unusual protective groups and to reckon with production yields that vary from one codon to another, are hardly reproducible and are in any case low.
Completely analogous difficulties arise when the synthesis is carried out in solution rather than on resin. In this case as well, the individual trinucleotides need to be selectively unblocked before use, exclusively at the 3′ position (in order to make them reactive), while all other functions must remain blocked.
2. In the normal synthesis of random sequence polynucleotides, based on the use of mononucleotides, a mixture composed of at least two nucleotides is used in each synthesis cycle. In the most complex case, all 4 possible nucleotides are used; but even if each of them has a reactivity slightly different from the others, since there are only 4 components it is not difficult to find the optimal molar ratios that favour the equimolar incorporation of each nucleotide into the growing polynucleotide chain.
Of much greater importance are the difficulties one finds when as many as 20 different trinucleotides have to be incorporated in an equimolar quantity. Firstly, the fact that among all possible trinucleotides there exists a difference in the relative chemical reactivity, markedly greater than that among the four simple mononucleotides, has to be reckoned with.
Moreover, while nucleotides are easily available in pure form and with a controlled and reproducible reactivity, trinucleotides, for the aforementioned difficulties, will be available in solutions whose qualitative and quantitative content is not easily verifiable. Lastly, it will be obviously difficult to find the right molar ratios of the 20 components forming the synthesis mixture, sufficient to grant an equimolar incorporation. Of course all these difficulties are minimized by the adoption of less complex mixtures.
A second approach, much simpler from the point of view of chemical synthesis, is based on the fact that when several codons code for a single amino acid, the first two codon bases are often constant and the codons differ only in the third base.
The difference among the codons represented in the polynucleotide can therefore be reduced if, during the synthesis of each trinucleotide unit, a mixture of guanine- and thymine- (or uracil-) derived nucleotides is used in the first cycle (which gives the 3′-terminal nucleotide, i.e. the third in the codon), while mixtures of the four mononucleotides are used in the two subsequent condensation cycles. Thus, polynucleotides are synthesized that contain not the 64 possible codons but only the 32 degenerate codons of the kind NNK, where N is any one of the four nucleosides and K is guanosine or thymidine. It follows that of the 20 coded amino acids, 12 are coded by only one codon, 5 are coded by two possible codons and 3 are coded by three possible codons. Finally, only one codon out of 32 codes for a stop signal.
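The composition of the NNK codon set described above can be verified by enumeration against the standard genetic code. The sketch below is illustrative only; variable names such as `nnk_codons` and `by_multiplicity` are introduced for the example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

# NNK: any base at positions 1 and 2, G or T at position 3.
nnk_codons = [n1 + n2 + k for n1 in BASES for n2 in BASES for k in "GT"]

codons_per_aa = Counter(CODON_TABLE[c] for c in nnk_codons)
stop_codons = codons_per_aa.pop("*", 0)

# How many amino acids are encoded by 1, 2 or 3 NNK codons.
by_multiplicity = Counter(codons_per_aa.values())

print(len(nnk_codons), len(codons_per_aa), stop_codons)
print(sorted(by_multiplicity.items()))
```

The enumeration confirms 32 codons covering all 20 amino acids, distributed as 12, 5 and 3 amino acids with one, two and three codons respectively, plus a single stop codon (TAG).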
Compared to the usual synthesis methods, this method has the remarkable advantage of requiring no change whatsoever, but it solves the problems discussed above only partially. Specifically, it does not solve the problem of the introduction of stop codons and of the resulting formation of truncated polypeptides (Huang, W. and Santi, D. V., 1994).
Another method described in the art is based on the principle of subdividing the synthesis support into as many synthesis containers (usually columns) as there are different codons to be inserted at a predetermined position in the oligonucleotide. A single codon is then synthesized on each support, and the various supports are then mixed in order to obtain a randomized polynucleotide mixture (U.S. Pat. No. 5,523,388). For instance, if four codons coding for four amino acids have to be inserted at a predetermined position, the synthesis resin is subdivided into four portions, the first codon is synthesized on the first one, the second codon on the second one, and so on. Once the synthesis has ended the four supports are mixed, thus obtaining a support resin that bears a conjugated polynucleotide whose 5′-terminal codon is randomized over the four codons.
This method has the advantage of allowing an exact selection of the codons to be inserted at a predetermined position. Its main limitation derives from the need to redivide the synthesis resin into as many portions as there are desired codons. Synthesis is thus relatively simple if the number of codons is small, but extremely complex if it is high, when up to twenty different synthesis supports must be prepared for every position intended for randomization. Since it is necessary to work with relatively small amounts of resin in order to contain production costs, it becomes extremely cumbersome to subdivide the resin into 10 or more different portions, which are difficult to handle in the complicated chemical reaction and washing operations needed in every synthesis cycle. Moreover, it must be noted that the synthesis scale cannot be increased beyond a few micromoles (about 10-15 micromoles) without running into considerable losses in the efficiency of the coupling reactions.
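The split-and-mix principle of this prior-art method, and the reason its handling effort grows with the number of codons, can be sketched as follows. The four codons used are arbitrary placeholders chosen for illustration (they are not taken from the cited patent), and the function `split_and_mix` is a hypothetical helper.

```python
# Illustrative sketch of one-codon-per-portion synthesis followed by pooling.
# The codon list is an arbitrary example, not data from the cited method.
EXAMPLE_CODONS = ["GCT", "AAA", "TGG", "GAT"]


def split_and_mix(resin: set, codons: list) -> set:
    """Randomize one position: split the resin into len(codons) portions,
    synthesize one codon on each portion, then pool the portions.
    Chains are written 5'->3'; synthesis proceeds 3'->5', so each new
    codon is added at the 5' end of the existing chain."""
    pooled = set()
    for codon in codons:                      # one resin portion per codon
        pooled |= {codon + chain for chain in resin}
    return pooled


resin = {""}                                  # bare support, no chain yet
for _ in range(2):                            # two randomized positions
    resin = split_and_mix(resin, EXAMPLE_CODONS)

portions_per_position = len(EXAMPLE_CODONS)
print(len(resin), portions_per_position)
```

With all twenty amino acids at a position, `portions_per_position` would rise to twenty, which is the handling burden discussed in the text.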
The present invention aims at overcoming the aforementioned difficulties by a process that at the same time ensures remarkable simplicity and cost-effectiveness in the synthesis. The invention is based on the observation that every trinucleotide composing a codon can be considered as consisting of a mononucleotide and of a dinucleotide that either follows or precedes it in the sequence.
The distinctive features of this approach can be seen by a simple comparison of the codons shown in the usual way (Table I) with the same codons shown so as to point out the mononucleotide-dinucleotide (Table II) and dinucleotide-mononucleotide (Table III) combinations.
Specifically, in Table II each codon is shown as resulting from the combination of the first nucleotide plus one dinucleotide (hereinafter also referred to as B+D, where B means the single nucleotide and D the dinucleotide), while Table III (also derived from Table I) represents codons as derived from dinucleotides, this time corresponding to the first and second codon bases, plus a single nucleotide corresponding to the third base (hereinafter also referred to as D+B, in accordance with the terminology adopted above).
A thorough examination of both these alternative representations of the genetic code enabled the inventor to observe that, in comparison to other approaches, the minimum number of monomeric units (consisting of dinucleotides) needed to code for all of the amino acids can be considerably reduced. As a matter of fact, according to the D+B representation it amounts to 13 dinucleotides (highlighted by hatching in Table III), a very low number that drops even lower, to 7 (also highlighted by hatching in Table II), if the B+D representation of the code is followed. The B+D combination must therefore be considered the most favourable one.
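The two minimum numbers quoted above can be checked by an exhaustive search over sets of dinucleotides. The sketch below is a simplified illustration: it only asks that every amino acid have at least one codon whose dinucleotide part belongs to the chosen set, and it does not reproduce the specific hatched selections of Tables II and III. The function `min_dimers` is a hypothetical helper.

```python
from itertools import combinations, product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]


def min_dimers(dimer_of) -> int:
    """Smallest number of dinucleotides such that every amino acid has at
    least one codon whose dinucleotide part (given by dimer_of) is chosen."""
    options = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            options.setdefault(aa, set()).add(dimer_of(codon))
    for size in range(1, len(DIMERS) + 1):
        for chosen in combinations(DIMERS, size):
            chosen_set = set(chosen)
            if all(opts & chosen_set for opts in options.values()):
                return size
    return len(DIMERS)


d_plus_b = min_dimers(lambda codon: codon[:2])   # dimer = bases 1 and 2
b_plus_d = min_dimers(lambda codon: codon[1:])   # dimer = bases 2 and 3
print(d_plus_b, b_plus_d)
```

Under this simplified criterion the search returns 13 for the D+B representation and 7 for the B+D representation, in agreement with the figures stated in the text.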
Furthermore, other combinations can be obtained from Tables II and III that, though overall less favourable than the B+D combination in terms of the number of dinucleotides needed, nevertheless have the advantage of allowing the introduction into the sequence of codons that are favoured in gene expression in different organisms. On the basis of present knowledge of the differential use of the various codons in E. coli, yeasts and eukaryotic cells, and always keeping to a minimum the number of dimers needed to form each synthesis mixture (for the detailed description of the invention see infra), it is possible to derive from Table II Tables IV, V and VI respectively, wherein the usage frequencies of the single codons are shown and the most convenient selections are highlighted by hatching.
The chemical synthesis process is organized according to the selected combination. In accordance with the features of the selected approach, the process proposed as preferred is the one based on the nucleotide-dinucleotide combination shown in Table II (i.e. the B+D one), described hereinafter.
The process provides for the preparation of 4 identical synthesis columns, containing the resin commonly used for this purpose, marked with the names of the four nucleotides, i.e. T (or U where a polyribonucleotide is to be synthesized), C, A, G. A mixture of suitably selected dinucleotides is then condensed on the resin inside an automated synthesizer. In the first column (T) the mixture consists of the dinucleotides that are correspondingly hatched in Table II (TT; CT; AT; GT; GG). In the second column (C) the mixture consists of the dinucleotides that are correspondingly hatched in Table II (TT; CT; AT; AA; GT). In the third column (A) the mixture consists of the dinucleotides that are correspondingly hatched in Table II (TT; TG; CT; AT; AA). In the fourth column (G) the mixture consists of the dinucleotides that are correspondingly hatched in Table II (TT; CT; AT; AA; GT). This synthesis cycle is followed by a second cycle, in which a single nucleotide (specifically the one indicated by the symbol of the column, i.e. T in the first, C in the second, A in the third and G in the fourth) is added to each column. At the end of the second cycle, all twenty of the preselected codons will have been inserted in the resin of the four columns, but each column will contain only the codons hatched for it in Table II. In order to further randomize the sequence, the columns are now opened, the synthesis resin is recovered and the four resins are carefully mixed.
The mixed resin is redistributed into four columns, the columns are reconnected to the synthesis apparatus, and the two synthesis cycles are repeated as described above. In practice, in every double synthesis cycle three new units are added to the growing polynucleotide chain so as to form only the preselected codons, but in a totally random way, i.e. regardless of the selected codons.
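The double synthesis cycle described above can be simulated to check that the four dinucleotide mixtures yield exactly one codon for each of the twenty amino acids and no stop codon. This is a minimal sketch: the dinucleotide lists are those given in the text for the four columns, while the function `double_cycle` and the representation of the resin as a set of 5'->3' strings are assumptions made for the illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

# Dinucleotide mixture condensed in each column (from the description).
COLUMN_DIMERS = {
    "T": ["TT", "CT", "AT", "GT", "GG"],
    "C": ["TT", "CT", "AT", "AA", "GT"],
    "A": ["TT", "TG", "CT", "AT", "AA"],
    "G": ["TT", "CT", "AT", "AA", "GT"],
}


def double_cycle(resin: set) -> set:
    """One double cycle: the resin is distributed over the four columns,
    the column's dinucleotide mixture is condensed first, then the column's
    own mononucleotide, and finally the four resins are pooled.
    Chains are written 5'->3'; growth is at the 5' end."""
    pooled = set()
    for base, dimers in COLUMN_DIMERS.items():
        for chain in resin:
            for dimer in dimers:
                pooled.add(base + dimer + chain)
    return pooled


one_round = double_cycle({""})          # codons after the first double cycle
two_rounds = double_cycle(one_round)    # two randomized codons
amino_acids = {CODON_TABLE[c] for c in one_round}
distinct_dimers = {d for dimers in COLUMN_DIMERS.values() for d in dimers}

print(len(one_round), len(amino_acids), len(distinct_dimers), len(two_rounds))
```

The simulation gives 20 codons coding for 20 different amino acids with no stop codon, obtained from only 7 distinct dinucleotides, and 400 codon pairs after two double cycles.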
This synthetic method has remarkable advantages over those described in the state of the art, which are summarized hereinafter.
Dinucleotide synthesis is carried out by methods that are well described in the literature, and therefore using low-cost, commercially available reagents, in solution and with product yields of 85-90% (Kumar, G. 1984).
In most cases the difference in reactivity among different dinucleotides is expected to be smaller than the reactivity differences typical of trinucleotides. The main consequence is that homogeneous incorporation into the growing polynucleotide chain of all the molecular species present in the synthesis mixture is easier to obtain. Reagent purity is a determining factor for this aspect of the reaction.
The total number of dinucleotides required to cover all possible combinations is extremely low. Actually, it varies from a minimum of 7 to a maximum of 20, and in the more usual cases, as for those described herein, 11 dimers are sufficient.
The selection of dinucleotides to be used can be done so as to minimize the number of molecular species forming the synthesis mixture. An example is that shown in Table II, where dimers were selected so that each synthesis mixture contains only 5 dinucleotides. This makes the search for suitable reaction conditions and for relative molar concentrations of the reagents much easier, in order to optimize the homogeneous incorporation of all components.
A synthesis carried out according to this approach makes it possible to incorporate complete codons into the growing chain. In fact, a careful selection of dinucleotides and mononucleotides makes it possible to direct the synthesis so as to leave out undesired codons such as stop codons. A combination specifically excluding only stop codons is for instance the one shown in Table II, but it is also possible to modify the combination so as to leave out, at one or more positions of the final sequence, any undesired codon.
If for example some amino acids were to be left out of a certain position of the polypeptide chain, for instance the acidic ones (glutamic acid, Glu or E, and aspartic acid, Asp or D), it would suffice to leave dimers AT and AA out of the mixture applied to column G at the synthesis cycle corresponding to the desired position. According to the same principle, however, countless other combinations are possible.
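The exclusion of the acidic amino acids described above can be illustrated with the column mixtures given in the description. The sketch below is illustrative only; `restricted` and `encoded` are names introduced for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

# Dinucleotide mixture condensed in each column (from the description).
COLUMN_DIMERS = {
    "T": ["TT", "CT", "AT", "GT", "GG"],
    "C": ["TT", "CT", "AT", "AA", "GT"],
    "A": ["TT", "TG", "CT", "AT", "AA"],
    "G": ["TT", "CT", "AT", "AA", "GT"],
}

# Leave dimers AT and AA out of the mixture for column G at this position.
restricted = dict(COLUMN_DIMERS)
restricted["G"] = [d for d in COLUMN_DIMERS["G"] if d not in ("AT", "AA")]

codons = [base + dimer for base, dimers in restricted.items()
          for dimer in dimers]
encoded = {CODON_TABLE[c] for c in codons}

print(len(codons), sorted(encoded))
```

Omitting AT and AA from column G removes codons GAT (Asp) and GAA (Glu), leaving 18 codons for the remaining 18 amino acids at that position.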
For each amino acid, the possibility of selecting a suitable codon among the many possible ones allows the insertion of only those codons that are preferentially used in the protein synthesis of the selected micro-organism. Therefore, by leaving out of the random sequence oligonucleotide mixture those codons that are translated in the system with a lower efficiency, it is possible to maximize gene expression, thus obtaining a better correspondence between the homogeneity of the oligonucleotide mixture and that of the resulting oligopeptide mixture.
All these considerations highlight the remarkable advantages deriving from such an approach. As a matter of fact, they do not pertain exclusively to the process based on a B+D combination; on the contrary, they are valid for any kind of process deriving from the general approach and therefore inferable from the aforementioned one.
In fact, for instance on the basis of the combination shown in Table III, it is possible to infer a synthesis process differing from the preceding one only in that the order of the two synthetic cycles must be inverted: first the single mononucleotides are condensed on the synthesis resins, and the dinucleotide mixtures only in the second round.
This second process, like the others deriving from further possible combinations, although within the scope intended by the inventor, shall not be described further, because they can essentially be inferred from the first process described above.