This invention relates to new techniques for the sequencing of nucleic acids based upon a general approach in which labelled adaptor molecules are employed. The invention facilitates the large scale analysis of populations of nucleic acids, for example populations of sequences as produced in the Human Genome Project (HGP). Its applicability is, of course, not limited to HGP or its like.
Conventional analysis of nucleic acid sequences has hitherto depended largely on the base specific fragmentation of the original nucleic acid sample into two or more parts differing in size by one or more bases. Sequencing is effected by separation of the resultant fragments followed by their analysis.
In relatively low throughput sequence analysis of RNA, base specific fragmentation has been effected by ribonucleases with base specific activities, followed by thin layer chromatographic separation of the products. Higher throughput sequence analysis, especially of DNA, generates the fragments to be analyzed by base specific chemical cleavage (Maxam, A. M. and Gilbert, W. Proc. Natl. Acad. Sci. 74 p560 (1977)) or by terminating, in a base specific manner, synthesis catalysed by a suitable nucleic acid polymerase (Sanger, F., Nicklen, S. and Coulson, A. R., Proc. Natl. Acad. Sci. 74 p5463 (1977)). Separation of the resultant fragments is achieved by denaturing gel electrophoresis through ultra thin slabs or capillaries containing a suitable polymer such as polyacrylamide. This can resolve of the order of a thousand bases per suitably prepared sample at a resolution of one base, and can handle tens of samples simultaneously. Detection (Smith, L., M. and Youvan, D., C. Biotechnology 7 p576-580 (1989)) (Yang, M., M., and Youvan, D., C., Biotechnology 7 p576-580 (1989)) has been direct or indirect through radioactive, chemiluminescent or fluorescent labelling or by stable isotopes (Human Genome 1991-1992 Program Report p18 and p22, U.S. Department of Energy (1992)).
There is a great deal of interest in achieving greater rates of sequencing at reduced cost. It will then be feasible to analyze completely the genomes of organisms, in particular those of higher eucaryotes which are commonly over 3,000,000,000 bases in size per haploid genome. Furthermore, methods which are suitable for such analysis will also make it possible to perform high resolution linkage analysis on many individuals in a population. This will be important for identifying the phenotypes, especially common diseases, associated with genes, and to trace gene flow in humans. Analyzing the expressed sequences in a population of cDNAs or mRNAs would also become possible. It would also be possible repeatedly to sequence the same region or multiple regions from many different individuals for the purposes of comparisons related to for example diagnosis.
Very high throughput methods of sequence analysis are therefore being investigated (desirably one or more orders of magnitude greater than achievable with current, conventional, commercially available sequencing apparatus, such as the ABI 373 DNA Sequencing System, which can read not more than 1000 bases a day from 72 samples). Scanning tunnelling electron microscopy can directly visualise the bases in individual molecules. Lasers might also be usable to sort individual molecules, which can then be analyzed by degrading them from one end, a base at a time (Harding, J. D. and Keller, R. A. Trends in Biotech 10 p55-58 (1992)).
However, there is a further problem when it is desired to conduct sequence analysis at a rate adequate for analyzing whole genomes or adequate for comparing many selected sequences from many individuals (for example, when using family studies to identify the locus of an inherited trait), namely many samples need to be simultaneously analyzed. This is currently being approached through sequencing by hybridisation.
There are two formats for sequencing analysis by hybridisation. One format (Drmanac, R., et al Genomics 4 p114-128 (1989) and Stretzoska, Z., et al Proc. Natl. Acad. Sci USA 88 p10,089-10,093 (1991)) immobilises many samples (perhaps numbering hundreds of thousands) separately on a large array. The array is probed in turn by each of many different labelled oligonucleotides of known sequence. Identification of the samples which have hybridised to each of the probes indicates those which have sequences complementary to the probe. Use of multiple probes covering all possible sequences allows the complete sequences of the samples to be assembled. This method is, however, limited by the requirement for oligonucleotides of at least 5 bases to achieve specific hybridisation, which in turn dictates that large numbers of probes (4^n, where n is the length of the oligonucleotide) are required to cover all possible sequence combinations.
The alternative format (Fodor, S. et al Nature 364 P555-556 (1993), Kharpko, K., R., et al DNA Sequence 1 p375-388 (1991), Southern, E., M., Maskos, U., and Elder, J., K. Genomics 13 p1008-1017 (1992)) requires many thousands of different oligonucleotides, each with different known sequence covering together all possibilities, to be immobilised on a suitable array. Probing the array with a labelled nucleic acid sample whose sequence is to be analyzed identifies the oligonucleotides which share homology with the sample. This is usually achieved through synthesis of the oligonucleotides in situ with masking, for example by a lithograph, of those not requiring the specific base being added at any given time. The sample is labelled and hybridised to the array. The positions of hybridisation indicate where sequence homologies are shared between the sample and the detected oligonucleotides. Therefore the sequences of the sample can be deduced from those of the detected oligonucleotides.
In either format for sequencing by hybridisation, it is difficult in practice to synthesise oligonucleotides of adequate length. When oligonucleotides are immobilised and probed with sample, in practice only short oligonucleotides can be synthesised on arrays of necessarily limited size.
Alternatively, and as mentioned above, when oligonucleotides are synthesised independently to probe an array of samples, the number required to cover all sequence possibilities is 4^n, where n is the length of each oligonucleotide. It is logistically challenging both to produce and to use the number required to detect all possible sequences accurately. For example, the number required to make all possible 5-mers is 1024.
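The growth in probe number with oligonucleotide length can be illustrated with a minimal sketch. The function names below are hypothetical and chosen only for illustration; the sketch simply enumerates every sequence of length n over the four bases.

```python
from itertools import product

BASES = "ACGT"


def probe_count(n: int) -> int:
    """Number of distinct oligonucleotides of length n over four bases (4 to the power n)."""
    return len(BASES) ** n


def all_oligos(n: int):
    """Yield every possible n-mer as a string (illustrative helper)."""
    for combo in product(BASES, repeat=n):
        yield "".join(combo)


if __name__ == "__main__":
    # Print the probe count for a few oligonucleotide lengths.
    for n in (5, 8, 10):
        print(n, probe_count(n))
```

For n = 5 the count is 1024, in agreement with the figure given above; each additional base multiplies the number of probes required by four.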
The length of the oligonucleotides determines their fidelity of hybridisation, and also the ease with which full sequence can be assembled from the component oligonucleotide sequences. In each case longer oligonucleotides are better. Greater fidelity of hybridisation is achieved the longer the oligonucleotides used, since more stringent washing can be performed when the oligonucleotides are as long as possible. When full length sequence is being assembled from overlapping component sequences, the longer the component sequences the fewer possible “solutions” there are likely to be.
A further problem associated with the sequencing by hybridisation format where probe oligonucleotides are immobilised is that as the size of the target increases the proportion of any given region within that target decreases. This reduces signal to noise, and therefore has the effect of limiting the size of target which can be analyzed.
Hybridisation used alone is in general not a good means of analyzing sequences because not all oligonucleotides hybridise with equal efficiency or specificity under a given range of conditions. There are therefore associated interpretational and/or practical difficulties.
The possibility of enzymatic sequencing in situ on arrays of immobilised samples has also been reported (Rosenthal, A. and Brenner, S. 1993 Meeting on Genome Mapping and Sequencing page 222 Cold Spring Harbor Laboratory Press (1993)). Each base is labelled differently and added to the samples such that extension is terminated at a given base. The number and type of added bases is recorded for each sample. The block to extension is removed so that the exercise can be repeated for the next base to be so tested. Cycles of testing each base in turn produce complete sequence for each sample. This method suffers from two difficulties: the number of members in a homopolymeric sequence is hard to distinguish, and different molecules within a given sample become out of phase with each other with respect to the position of the bases being analyzed.
WO 94/01582 discloses a process for the categorization of nucleic acid sequences using a population of adaptor molecules which include predetermined nucleotide bases, categorization of the nucleic acid sequences being achieved by linkage between them and the adaptor molecules and selection of the resulting linked sequences in a base-specific manner. The adaptors in question are preferably short double stranded oligonucleotides which have an extending strand to allow base-specific ligation to nucleic acid sequences which have been produced by cleavage involving a nuclease in which the recognition and cleavage sites are displaced from each other.
A method of sequentially determining the order of bases, one or more bases at a time, on many samples simultaneously would be attractive if available because it could be automated, would require few reagents and might allow of the order of tens of bases to be determined, which would facilitate assembly of full length sequence. Each sequence of 17 bases, for example, excepting the repetitive elements which comprise a low complexity special case, is likely to be unique in the human genome. Producing overlapping sequences of 17 or more bases from the human genome would therefore facilitate assembly of the unique human sequences. In order for such a process to be successful it is necessary to determine the order of bases on all samples one or more at a time without allowing molecules within any of the samples to become out of phase during the process. This is achieved for the first time by the present invention.
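The statement that a 17 base sequence is likely to be unique can be checked with a rough calculation. The sketch below assumes, purely for illustration, a genome of 3,000,000,000 bases (the haploid size quoted earlier) composed of random, equally frequent bases; real genomes contain repeats and biased composition, so this is only an order-of-magnitude guide. The function names are hypothetical.

```python
GENOME_SIZE = 3_000_000_000  # haploid genome size quoted earlier in the text


def expected_random_occurrences(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of chance occurrences of one given k-mer in a random genome."""
    return genome_size / 4 ** k


def shortest_unique_length(genome_size: int = GENOME_SIZE) -> int:
    """Smallest k for which the number of possible k-mers exceeds the genome size."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k


if __name__ == "__main__":
    print(shortest_unique_length())
    print(expected_random_occurrences(17))
```

Under these simplifying assumptions a given 17 base sequence is expected to occur by chance fewer than once in the genome, which is consistent with the figure of 17 bases used above.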
The present invention is based upon the use of specific adaptors including oligonucleotide sequence comprising one or more predetermined bases. In some embodiments of the invention, use is also made of restriction endonucleases having a recognition site displaced from the cleavage site. All embodiments depend, however, upon the use of the aforementioned adaptors.
Thus, the present invention provides a method of sequencing a nucleic acid, comprising either sequentially removing bases from the sequence of the nucleic acid a predetermined number at a time, with the product remaining from each step of predetermined base removal being ligated to a labelled adaptor specific for said bases and including oligonucleotide sequence, or hybridising a primer to the nucleic acid to be sequenced and sequentially extending said primer a predetermined number of bases at a time, said added base(s) being complementary to base(s) in the nucleic acid being sequenced, and each of said base addition steps being achieved by the use of a labelled adaptor specific for said bases and including oligonucleotide sequence containing said predetermined base(s); in either case, the label of said labelled adaptor being specific for its respective predetermined base(s).
The predetermined base removal embodiments are best suited to double stranded nucleic acids, and the technique can use nucleases as described herein. Of course, any other appropriate method of base specific cleavage can be used if desired.
Thus, a further aspect of the invention is a method of sequencing a population of double stranded nucleic acids, comprising:
(a) ligating to said nucleic acids adaptors which include double stranded oligonucleotide sequence which incorporates a predetermined nuclease recognition sequence for a nuclease whose recognition site is displaced from its cleavage site, said displacement being such as to create, as a result of said ligation, cleavage sites in the resulting ligation products which, upon cleavage thereat, result in removal of a base or bases from one strand of said nucleic acids;
(b) cleaving ligation products from (a) with said nuclease to produce double stranded products of unequal strand length;
(c) subjecting said products from (b) to ligation with a population of adaptors which include double stranded oligonucleotide sequence having extending single strands wherein said population of adaptors includes molecules having in their extending single strands at least a predetermined subset of all possible permutations of a base or bases constituting a predetermined number of bases, and wherein each permutation is provided with a respective unique and detectable label, each adaptor in said population having a nuclease recognition sequence for a nuclease whose recognition site is displaced from its cleavage site, said displacement being such as to create, as a result of the ligation of this step (c), cleavage sites in the resulting ligation products which, upon cleavage thereat, result in removal of a base or bases from one strand of said products from (b);
(d) separating the ligation products from (c);
(e) cleaving the separated ligation products from (c) with the nuclease of (c) to produce a population of fragments carrying the recognition site of the nuclease of (c);
(f) either analyzing the labels carried by ligation products separated in (d), or analyzing the labels carried by fragments from (e); and
(g) repeating steps (c) to (f) as often as necessary to determine the desired sequence, but with the final repeat optionally omitting step (e).
Preferably, in (c) above, all possible base permutations would have a unique label, but it is sufficient to label a subset of the permutations as long as analysis is not required to proceed at a rate greater than that determined by the proportion of the permutations which are labelled. For example, a 4 base extension has 256 permutations. If 16 “colours” were available as labels, all of the permutations of possible bases at 2 of the 4 bases in the extension could be labelled and detected independently. Of course, only two “bases worth” of information would then be determined.
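The example of 16 labels applied to a 4 base extension can be sketched as follows. This is only an illustration under stated assumptions: the choice of which two positions are monitored, the integer colour indices, and the function name `colour_map` are all hypothetical and are not taken from the specification.

```python
from itertools import product

BASES = "ACGT"


def colour_map(monitored_positions, extension_length=4):
    """Map every possible extension to a colour index.

    The colour depends only on the bases at the monitored positions, so
    extensions that differ elsewhere share a colour (assumed scheme).
    """
    keys = ["".join(p) for p in product(BASES, repeat=len(monitored_positions))]
    colour_of_key = {key: index for index, key in enumerate(keys)}
    mapping = {}
    for ext in product(BASES, repeat=extension_length):
        key = "".join(ext[i] for i in monitored_positions)
        mapping["".join(ext)] = colour_of_key[key]
    return mapping


if __name__ == "__main__":
    mapping = colour_map(monitored_positions=(0, 1))
    # Number of extensions, and number of distinct colours needed.
    print(len(mapping), len(set(mapping.values())))
```

With two monitored positions the 256 possible extensions fall into 16 colour classes, so each detection event reports two bases and leaves the other two undetermined.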
As will be clear from the description hereinafter, the “predetermined number of bases” referred to in (c) above is the base or bases which are being monitored for sequencing purposes. The number of such bases can be one or more.
Although the above process is defined by reference to nucleases and nuclease cleavage and recognition sites, other means of achieving the same effect of stepwise base removal are expressly envisaged by the invention and not excluded. Obviously, the use of particular restriction endonucleases (see below) is very convenient and preferred but is not absolutely essential.
Preferably, the above process is preceded by treatment of the population of nucleic acids with the nuclease(s) to be used later in the process.
Other aspects of the invention are the use of the nuclease having a recognition site displaced from its cleavage site in the sequencing of nucleic acid, and a kit for sequencing nucleic acid which comprises at least one nuclease having its recognition site displaced from its cleavage site and/or a population of double stranded oligonucleotides in which the strands are of unequal length with one or more predetermined bases in the extending strand and with the double stranded portion including a recognition site for a nuclease having its recognition site displaced from its cleavage site.
Preferably, there is a recognition site for more than one nuclease, so that a choice can be exercised as to which nuclease is to be used for base specific removal. This would be an advantage, for example, when there is already a site for one nuclease in the sample being sequenced, but not for the other. It is practical to fit more than one recognition site in the oligonucleotides or adaptors provided the sites do not overlap. Alternatively, a plurality of sites can be accommodated if the sequences of the recognition sites are partially the same, in a way which allows partial overlap without either recognition site being altered. The same “types” of cut ends must also be generated by the enzymes. For example, the recognition site for a nuclease which produces a 3′ overhang would preclude the simultaneous use of a recognition site for a nuclease which produces a 5′ overhang.
The predetermined base addition embodiments of the invention are best suited to sequencing a single stranded nucleic acid provided with at least some known sequence. Accordingly, another aspect of the present invention is a process for sequencing single stranded nucleic acid having or being provided with at least some known sequence, comprising:
(a) annealing an oligonucleotide primer to said known sequence immediately adjacent to the unknown sequence to be determined in said nucleic acid;
(b) subjecting the end of said oligonucleotide immediately adjacent to the unknown sequence to ligation with a population of labelled adaptors having oligonucleotide sequence including all possible permutations of a predetermined number of bases positioned at the end thereof which is so-ligated, the adaptors of said population being employed simultaneously, in preselected groups, or one by one, as desired;
(c) detecting the specific adaptor from said population which was ligated in (b);
(d) removing all of said specific ligated adaptor except for said one or more predetermined bases thereby to extend the double stranded region of the resulting product; and
(e) repeating steps (b) to (d) to the necessary extent to determine the unknown sequence, but with the final repeat optionally omitting step (d).
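Steps (a) to (e) above can be illustrated with a deliberately simplified simulation. The sketch assumes perfectly specific ligation, ignores strand polarity, and represents the unknown template as a string read in the direction of primer extension; the function name and parameters are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_addition(unknown_template: str, bases_per_cycle: int = 1):
    """Simulate cyclic primer extension by labelled adaptors.

    Returns the bases added to the primer and the template sequence inferred
    from them. Assumes only the perfectly complementary adaptor ligates.
    """
    added = []
    position = 0
    while position + bases_per_cycle <= len(unknown_template):
        window = unknown_template[position:position + bases_per_cycle]
        # Step (b): the adaptor whose end bases complement the template ligates.
        adaptor_bases = "".join(COMPLEMENT[base] for base in window)
        # Step (c): its label identifies the predetermined bases it carries.
        detected = adaptor_bases
        # Step (d): the adaptor is removed except for the predetermined bases,
        # which extend the double stranded region.
        added.append(detected)
        position += bases_per_cycle
    primer_extension = "".join(added)
    inferred_template = "".join(COMPLEMENT[base] for base in primer_extension)
    return primer_extension, inferred_template


if __name__ == "__main__":
    print(sequence_by_addition("GATTACA"))
```

In this toy model each cycle advances every template by exactly the predetermined number of bases, which is why the molecules in a sample remain in phase with one another.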
Since all processes in accordance with the present invention require the use of labelled adaptor molecules which are preferably, but not essentially, entirely constituted by an oligonucleotide, it is important to note that the nature of the label in question is not significant to the invention. Any workable means of detecting with specificity particular adaptors, whether in ligated condition or not, and hence the particular predetermined bases they carry, is adequate for the purposes of the present invention. Useable labels include those known to the skilled person, for example, radioactive isotopes, stable isotopes, homologous or similar sequences, dyes, fluorescent compounds, enzymes, biotin, carbohydrates. The term “label” is to be broadly construed to cover an entity which can be detected by any means without undue interference with the sequencing process.
This invention will now be further described in detail with reference to the various categories of embodiment discussed above.
Turning first to the aspect of the invention which is constituted by the predetermined base removal process, it will be noted that this process takes advantage of a certain category of restriction endonucleases selectively to degrade all samples simultaneously by a predetermined number of bases from one end, and to record the bases at each modified end either just before or just after degradation. Cyclical repetition of the process generates lengthy sequence information of the order of tens of bases from the sample ends.
Nucleases which can be employed in this process include restriction endonucleases the cleavage sites of which are asymmetrically spaced across the two strands of a double stranded substrate, and the specificity of which is not affected by the nature of the bases adjacent to a cleavage site. Type II restriction endonucleases of these types together cover a wide range of specificities, are readily available, and are highly specific and efficient in their action (Review: Roberts, R. J. Nucl. Acids Res. 18, 1990, p2331-2365).
Thus, the predetermined base removal process makes use of base specific cleavage towards the end of samples to be analysed. Of course, it is possible (and this is generally likely to be the case) that the samples being analysed will include sequences having a nuclease cleavage site internally. Such samples must be pre-prepared such that the base specific cleavage employed does not occur internally as well as at the desired end. One means of achieving this is to pretreat sample with the appropriate nuclease or nucleases such that the resulting fragments cannot thereafter be cleaved by such nuclease(s). In effect, sequence analysis is then confined to the ends of the resulting fragments. If desired, a known pattern of pre-cleavage involving selected nucleases can be employed before the performance of the present process, using not only nuclease enzymes subsequently to be employed in the process but other nucleases in addition.
Additionally, nucleic acid samples to be sequenced can be prepared so that they can be simultaneously treated by the process and analysed without interference between individual nucleic acids. One means of achieving this is to have each nucleic acid in a separate reaction vessel. The invention, however, readily lends itself to preferred simultaneous processing and analysis of many samples in the same reaction vessel, with nucleic acids distinguishable in that vessel by the use of independent immobilisation.
Preferably ligation reactions used in the processes of the present invention are catalyzed by DNA ligase, which enzyme is, of course, readily available and easy to use.
The general scheme of the predetermined base removal method is illustrated in the attached FIG. 1. In the scheme shown in FIG. 1, for purposes of illustration a single restriction endonuclease is employed, namely Bsa I. However, the predetermined base removal aspects of the present invention are not limited to the use of a single predetermined nuclease. If desired, a predetermined pattern of use of different nucleases can be employed at different stages during the sequencing operation.
In the scheme shown in the attached FIG. 1, fragments to be analysed are first created by Bsa I, which is also utilized for the stepwise base specific analysis of the ends. This avoids the possibility of the enzyme cutting internally during analysis until such time as the sequence is “used up” as a result of stepwise degradation.
In a large nucleic acid, the fragments can be classified into three types depending on whether, and at how many ends, they retain the Bsa I recognition site. One type will have the Bsa I recognition sequence at neither of its ends, one type will have it at one of its ends, and one type will have it at both of its ends. On average they will be in the proportion 1:2:1, respectively. In this case analysis is confined to those fragments which completely lack the Bsa I recognition sequence. There are many ways that one skilled in the art can select for the required fragments and instances of these can be found in the Examples hereinafter. Additionally, there are ways that one skilled in the art can select for the removal of the Bsa I recognition sequence from ends where such sequence does occur. One such method would be to ligate to Bsa I cut DNA, in the presence of active Bsa I, adaptors with a Bsa I recognition sequence whose use will result in removal of bases from the nucleic acid sample being sequenced. Once an adaptor has ligated there are two possible outcomes at each cleavage which follows. Either the Bsa I site in the fragment is used, in which case part of the adaptor is cleaved off, or the Bsa I site of the adaptor is utilised, in which case bases are removed from the sample. Cycles of addition and cleavage will ensue. Eventually, by chance, the Bsa I site of the sample will be removed and further cleavages will be from the sample. Suitable titration will determine the level of treatment required to give a population sufficiently depleted in Bsa I sites, but not overly reduced in average size by digestion from the adaptor. This is, in fact, a general way of exposing internal sequences to the sequencing process. Other such methods are known (for example treatment with DNase I in the presence of manganese (Mn2+), treatment with Bal 31 or random shearing (Sambrook, J., Fritsch, E. G. and Maniatis, T. ed (1989). “Molecular Cloning”. Cold Spring Harbor Laboratory Press, New York)).
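The 1:2:1 proportion of fragment types stated above follows from a simple binomial argument. The sketch below assumes, as an illustrative simplification not spelled out in the text, that each fragment end independently retains the recognition sequence with probability one half; the function name and parameter are hypothetical.

```python
from fractions import Fraction


def fragment_class_proportions(p_retain=Fraction(1, 2)):
    """Proportions of fragments with the recognition site at neither end,
    at exactly one end, and at both ends, assuming independent ends."""
    neither = (1 - p_retain) ** 2
    one_end = 2 * p_retain * (1 - p_retain)
    both_ends = p_retain ** 2
    return neither, one_end, both_ends


if __name__ == "__main__":
    print(fragment_class_proportions())
```

Under this assumption one quarter of the fragments lack the recognition sequence at both ends, and it is this quarter to which analysis is confined in the scheme described.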
Importantly, in the scheme shown in FIG. 1, two general types of adaptor molecule are utilized.
The first type of adaptor molecule, shown in FIG. 1 as an oligonucleotide as such, contains base sequence which includes the recognition site for Bsa I e.g., nucleotides 1-10 of 5-8 and its complement. The location of the Bsa I recognition site within the adaptor is such that upon ligation with blunt ended nucleic acid sequences of interest and subsequent cleavage by Bsa I, a selected number of bases will be removed from the end of the nucleic acid being analyzed, thus exposing complementary bases for analysis. This requires that the number of bases in the adaptor between the recognition site and the point of cleavage is fewer, by the number of bases to be removed from the nucleic acid being sequenced (the predetermined number of bases), than the maximum cutting distance of the enzyme Bsa I from its recognition site.
Of critical importance for the continuing cyclical nature of the process is that whichever endonuclease is employed, it should not cut to leave a blunt end. The overhang or extending strand which remains can be either 3′ or 5′ depending upon the nature of the cleavage which is produced.
In FIG. 1 it can be seen from the first stage that the adaptor molecules used have a recognition site for Bsa I which is situated four bases from the oligonucleotide sequence end which is to be ligated to the nucleic acid to be sequenced. Since Bsa I cuts five bases away from its recognition site to leave a four base 5′ overhang, upon cleavage one predetermined base is therefore effectively removed from the nucleic acid being sequenced. Of course, if required, more than one base may be removed, with the number of the bases at the end of the adaptor molecules being reduced by the number of additional bases (the number of additional predetermined bases) that it is required to remove. Thus, in the case of Bsa I a maximum of five bases can be removed. As will be seen below, later detection steps can, however, only analyze the bases in the overhanging strand and it is therefore appropriate not to leave less than one base beyond the recognition site. The number of new bases exposable for analysis in subsequent cycles is equal to the shortest distance between the recognition site and the cleavage site. In the case of Bsa I this is one, but it is more than one in the cases of other enzymes, e.g. Fok I where it is nine.
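The relationship between the position of the recognition site in the adaptor and the number of bases removed can be written out as a small calculation. The sketch uses only the Bsa I distances given above (cuts one and five bases from the recognition site on the two strands); the function and parameter names are hypothetical.

```python
def bases_removed(max_cut_distance: int, spacer_bases: int) -> int:
    """Bases removed from the sample when the recognition site sits
    `spacer_bases` from the ligated end of the adaptor."""
    if not 0 <= spacer_bases <= max_cut_distance:
        raise ValueError("spacer must lie between 0 and the maximum cut distance")
    return max_cut_distance - spacer_bases


def overhang_length(near_cut_distance: int, far_cut_distance: int) -> int:
    """Length of the single stranded overhang left by a staggered cut."""
    return far_cut_distance - near_cut_distance


if __name__ == "__main__":
    # Bsa I as described in the text: cuts at 1 and 5 bases from its site.
    print(bases_removed(max_cut_distance=5, spacer_bases=4))
    print(overhang_length(near_cut_distance=1, far_cut_distance=5))
```

With a four base spacer, as in FIG. 1, one base is removed per cleavage; with no spacer the maximum of five bases would be removed, and in either case the overhang left by Bsa I is four bases long.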
In FIG. 1, the nucleic acids to be sequenced in the population which is being examined (two for the purposes of illustration) have been independently immobilised to solid phase, exposing the non-immobilised ends to sequence analysis. The first stage in the overall process is that the thus-exposed ends are ligated to the adaptor molecules and residual adaptor molecules washed away. Bsa I is then added, and this effectively removes both the ligated adaptor and the preselected number of bases (as shown in the Figure, one base is removed). Enzyme and cleaved adaptor are then washed away.
In the next stage, a different population of adaptor molecules is employed. These adaptors are of the second of the two types mentioned above. These adaptor molecules have an extending strand in that portion of the molecule which is an oligonucleotide sequence, with the extending strand of each adaptor having a known and different base specificity. A population of adaptor molecules is employed that, in effect, is capable of reporting all possible combinations of permutations of predetermined base specificities. Moreover, each adaptor has both a detectable label which is specific for the particular base or bases which are predetermined in each adaptor and a nuclease recognition site as described above.
Preferably, the entire population of the second type of adaptors is then ligated to the cleavage product resulting from the previous stage of the process under conditions such that only adaptors where the extending strand exhibits actual complementarity for the extending overhang in the cleavage products will ligate. Such conditions, for example (but not essentially), could utilise 1 pmole of cleavage product, 200 pmoles of adaptors, and 0.25 units of T4 DNA ligase, at a temperature of 16° C. for 4 to 16 hours in a 50 µl reaction volume also containing 20 mM Tris-HCl (pH 7.5 at 24° C.), 50 mM sodium chloride, 10 mM magnesium chloride, 1 mM adenosine triphosphate and 1 mM dithiothreitol. The conditions of time, temperature and ionic strength may be varied by one skilled in the art to achieve the required rates of ligation and specificity.
Of course, in the alternative each adaptor molecule at this stage could be ligated in turn with each nucleic acid sample being examined to determine whether it ligates or not. However, it is preferred that the population of adaptor molecules employed comprise molecules each having a different base specificity with a corresponding specific label. In this way, the adaptor molecules can be ligated simultaneously and, after washing away unused (unligated) adaptors, those adaptor molecules which have actually ligated can be determined and distinguished.
For the purposes of illustration in FIG. 1, the uppermost nucleic acid shown becomes green by ligating to the base C-specific adaptor molecule, while the lowermost nucleic acid shown (SEQ ID NOS: 36-37) becomes red by ligating to the base A-specific adaptor molecule.
Essentially, there are two ready options for analysis. In the first option, detection of the specific adaptor molecules which have successfully ligated with nucleic acid sequences can be performed whilst these molecules remain ligated. This is shown as Analysis Option 1 in FIG. 1. Such an option is preferred when many samples are being analyzed in the same reaction vessel, and the process can be both sensitive and inexpensive. Thus, nucleic acid samples could be immobilised each to separate one to five micron diameter beads which are generally commercially available. Over one million beads could then comfortably be analyzed using standard fluorescence microscopy coupled with image analysis. Reaction volumes would be very small, with consequent reduction in reagent costs. An alternative analysis option, Analysis Option 2 as shown in FIG. 1, exists once the products of ligation including the labelled adaptors are subjected to further action of the restriction enzyme, Bsa I. Because the adaptor molecules which are labelled also carry the recognition site for Bsa I, cleavage is again possible. As before, the recognition site is deliberately positioned in the oligonucleotide portion of such adaptor molecules such that one or more predetermined bases are removed from the end of each nucleic acid sequence being analyzed to leave an extending strand. In FIG. 1, Bsa I removes the adaptor molecules together with the end base from each nucleic acid. The number of bases removed at this stage of the process can obviously depend upon the positioning of the enzyme recognition sequence in much the same way as described above in relation to the first stages in the process.
In any event, as a result of the immobilisation of the original sequences to be determined, after the second Bsa I cleavage in the overall process a population of specific adaptors is released which can be analyzed for their particular labels in Analysis Option 2. Analysis of the labels produced by this process obviously gives base specific information derived from the nucleic acid sequences being analyzed.
In Analysis Option 2, adaptor molecules may, if desired, be detected by the use of robotically controlled sampling and off-line detection. Robotic liquid handling is becoming commonplace in molecular biology applications (Uhlen, M., et al Trends in Biotech 10 p52-55 (1992)).
As can be seen from FIG. 1, the first cycle of ligation and analysis is now complete. After the first stage, each cycle of the process thus comprises ligation of labelled adaptors, followed by either: (a) detection of the particular label followed by removal of the adaptors plus a predetermined number of end bases from the nucleic acid sequence; or (b) removal of the adaptors plus one or more predetermined end bases from the nucleic acid followed by label detection.
To continue the process, a new cycle must be started. A new cycle of ligation of adaptor molecules is therefore performed as described before to determine which bases are now present at the degraded nucleic acid sequence ends. In FIG. 1, in the second cycle, the uppermost nucleic acid turns cyan through ligation of a base G-specific adaptor, and the lowermost nucleic acid turns blue through ligation of a base T-specific adaptor.
The process is repeated with cycles of ligation of labelled adaptors, washing, detection of labels, and removal of adaptors to expose the next base or bases, until the desired number of bases has been analyzed at the ends of the nucleic acids being examined or the entirety of the sequences has been determined.
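The repeated cycle described above can be sketched in Python as a toy model. This is a minimal illustration only, assuming one base is removed per cycle, error-free ligation and detection, and a fragment written as a string starting at its free end. The label names and the function `run_cycles` are hypothetical and do not appear in the specification.

```python
# Hypothetical label names; the process only requires four distinguishable labels.
LABEL_FOR_BASE = {"A": "label_1", "C": "label_2", "G": "label_3", "T": "label_4"}
BASE_FOR_LABEL = {label: base for base, label in LABEL_FOR_BASE.items()}


def run_cycles(fragment, max_cycles=None):
    """Simulate cycles of ligation, detection and cleavage on one fragment.

    fragment: bases of the immobilised nucleic acid, listed from the free end.
    Returns (called_sequence, remaining_fragment).
    """
    remaining = fragment
    observed_labels = []
    while remaining and (max_cycles is None or len(observed_labels) < max_cycles):
        exposed = remaining[0]
        # Ligation and detection: only the adaptor specific for the exposed
        # base ligates, so its label is the one observed in this cycle.
        observed_labels.append(LABEL_FOR_BASE[exposed])
        # Cleavage: the adaptor is removed together with one end base.
        remaining = remaining[1:]
    called = "".join(BASE_FOR_LABEL[label] for label in observed_labels)
    return called, remaining
```

For example, `run_cycles("GATTACA", max_cycles=3)` reports the first three bases and leaves the remainder of the fragment immobilised for further cycles.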
At the very last stage, when the last base or bases is/are being determined it is, of course, optional and dependent upon other features of the process whether or not a final cleavage step is employed. Using Analysis Option 1, no final cleavage step is necessary.
It will be appreciated that the structure of the adaptor molecules, which comprise oligonucleotide sequence, is important to the sequencing process just described. In practice, the only limitation on the number of different adaptor molecules that can be employed is the number of distinguishably different labels that are available for determination of adaptor specificity at subsequent stages in the process. Availability of a large number of adaptor molecules which are individually specifically labelled has the advantage that more than one base at a time can be analyzed per cycle. Thus, by way of example, removing two bases at a time would require the use of 16 different adaptor molecules each having a different and distinguishable label. When 16 different labels are available, it is possible simultaneously to analyze all the possible products. In general terms, the number of adaptor molecules required is 4^n (four raised to the power n), where n is the number of bases to be analyzed per nucleic acid per cycle.
It is also possible to analyze each base in the sequences more than once. This can be achieved by using more adaptor molecules than there are bases removed per cycle. For example, if during each cycle 16 different distinguishably labelled adaptors are used, each adaptor recognizing a unique combination of two different bases, then on the cycle that a given base is first exposed at the end of the nucleic acid being degraded and sequenced it is detected as a result of the specificity of the base at the extreme 5′ end of the complementary bases in the labelled adaptor (see FIG. 1). However, one cycle later the same base will be detected by the penultimate base in the adaptor molecule.
The precise structure of the (second type; see above) adaptor molecules used in the above process is not critical, except that an oligonucleotide portion must obviously be included which has appropriate sequence to provide a nuclease recognition site and one or more predetermined bases, and the adaptors must carry predetermined base-specific labels.
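The relationship between bases analyzed per cycle and the number of adaptors needed can be checked by enumeration. The following Python sketch lists every base combination that a set of adaptors would have to cover; the function name `adaptor_specificities` is an illustrative assumption.

```python
from itertools import product

BASES = "ACGT"


def adaptor_specificities(n):
    """Return every n-base combination, one per distinguishable adaptor."""
    return ["".join(combo) for combo in product(BASES, repeat=n)]


# The count grows as four raised to the power n.
for n in (1, 2, 3):
    print(n, len(adaptor_specificities(n)))
```

With n = 2 the enumeration yields the 16 combinations referred to in the example of removing two bases at a time.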
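The redundancy described in this paragraph can be illustrated with a short Python sketch. It assumes two-base adaptors, removal of one base per cycle, and a fragment written as a string from its free end, so that successive cycles read overlapping pairs of bases. The function names are hypothetical.

```python
def dinucleotide_reads(fragment):
    """Two-base read obtained in each cycle when one base is removed per cycle."""
    return [fragment[i:i + 2] for i in range(len(fragment) - 1)]


def consensus_from_reads(reads):
    """Rebuild the sequence and list the cycles whose read disagrees with the
    previous cycle about the base that the two reads share."""
    if not reads:
        return "", []
    called = reads[0][0]
    conflicts = []
    for cycle, read in enumerate(reads):
        # Every interior base is seen twice: as the second base of one read
        # and as the first base of the next read.
        if cycle > 0 and read[0] != reads[cycle - 1][1]:
            conflicts.append(cycle)
        called += read[1]
    return called, conflicts
```

A non-empty conflict list flags a cycle in which the two independent determinations of a base disagree, which is the practical benefit of analyzing each base more than once.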
It is not essential that bases in adaptor molecules that are used to detect exposed bases in the nucleic acid sequence being degraded be at the extreme ends of the extending strands in the adaptors, merely that they are contained within the extending strand. The precise position of such base or bases merely determines when, in the overall process, they will be read.
Most preferably, adaptors in the invention are short double-stranded oligonucleotides which can be joined to the ends of cleavage products. They will have been chemically synthesised so that their sequence can be predetermined and so that large concentrations can be easily produced. They may also be chemically modified in a way which allows them to be easily purified during the process. Ideally their 5′ ends will be unphosphorylated so that once joined to degraded nucleic acid fragments, the adaptored end of the latter will no longer be able to participate in further ligation reactions. The risk of inappropriate ligation involving adaptors is thus avoided.
Occasionally in the processes of the present invention which operate by sequential predetermined base removal, instances could arise where a new cleavage site for the restriction endonuclease(s) will be created by ligation of labelled adaptor to degraded nucleic acid sequence. This will be detected when more than one type of adaptor from the range of adaptors used will be able to ligate to the nucleic acid, unless the same bases are exposed by the respective cleavages which are occurring. In the latter case, this eventuality will be detected by the process during the cycle when the sequences diverge.
To eliminate the above mentioned possibility of new cleavage site formation, the use of enzyme recognition sites is avoided which can donate one or more bases in the direction of cleavage to one or more bases and create in the process an additional recognition site like the original but displaced (in the direction of cleavage) from the original. Furthermore, it is desirable to avoid placing, in the part of the adaptor which is between the recognition site and the cleavage site, one or more bases from the side of the recognition site which is away from the cleavage site in the order in which they occur in the recognition site, thus preventing the possibility of the nucleic acid being sequenced donating the necessary bases to create a new recognition site like the original recognition site but displaced from the original in the direction of cleavage. Other similar measures would be effective.
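Whether a particular adaptor design risks creating a new recognition site on ligation can be screened computationally. The Python sketch below is a simplified check, assuming the Bsa I recognition sequence GGTCTC, sequences given as top-strand strings, and a ligation product modelled as the adaptor string followed directly by the fragment string. The function names and this orientation are illustrative assumptions, not part of the specification.

```python
BSA_I_SITE = "GGTCTC"  # Bsa I recognition sequence (non-palindromic)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sites(seq, site=BSA_I_SITE):
    """Count occurrences of the recognition site on either strand."""
    rc_site = reverse_complement(site)
    count = 0
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        if window == site or window == rc_site:
            count += 1
    return count


def creates_new_site(adaptor, fragment):
    """True if joining adaptor and fragment forms a site spanning the junction,
    that is, a site present in neither molecule on its own."""
    joined = adaptor + fragment
    return count_sites(joined) > count_sites(adaptor) + count_sites(fragment)
```

For instance, an adaptor ending in GGTC joined to a fragment beginning TC would be flagged, because the junction completes an additional site displaced from the original.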
Moving on now to predetermined base addition processes of the invention, as has been indicated above the invention includes embodiments in which bases are added one or more at a time to an oligonucleotide primer which is annealed to a known sequence immediately adjacent to unknown nucleic acid sequence to be determined. This is generally illustrated in FIG. 2 (which shows sequencing of SEQ ID NO: 40), and is, of course, suited to single-stranded nucleic acids. After such annealing, the next stage in this particular set of embodiments is exposure of the duplex thus-created to ligation with a population of adaptors carrying one or more predetermined bases at the end of an oligonucleotide sequence. As with other embodiments of the invention, there is an interrelationship between the number of predetermined bases and the number of available labels used to detect the particular predetermined base or bases.
Apart from the oligonucleotide end of the adaptor molecules (which is critical to the extension process at the heart of such base addition embodiments for sequencing nucleic acids), the remainder of the structure of these particular adaptor molecules should ideally be non-specific to facilitate ligation, or need not even be nucleotide sequence provided that the actual nature of the molecule is such as not to interfere with the process of the invention.
As will be recalled, the next stage in the process is detection of the specific label or labels following adaptor ligation. This, of course, identifies the particular base or bases which have been added to the primer and, in turn, identifies the complementary bases in the nucleic acid strand which is being sequenced.
The final step in a cycle of this process is removal of all of the adaptor molecule except for the one or more predetermined bases which have extended the double stranded region of the primer/nucleic acid duplex. As will be appreciated, repeating cycles generates sequence information for the single stranded sequence being determined.
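The base addition cycle can likewise be sketched as a toy model in Python. It assumes error-free ligation, a template given as a string of bases in the order in which the extending primer meets them, and both returned strings written in that same order of extension rather than in any 5′ to 3′ convention. The function name `sequence_by_addition` is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_addition(template_ahead, cycles, bases_per_cycle=1):
    """Simulate primer extension by ligation of labelled adaptors.

    Returns (bases_added_to_primer, inferred_template_bases).
    """
    added = []
    position = 0
    for _ in range(cycles):
        chunk = template_ahead[position:position + bases_per_cycle]
        if len(chunk) < bases_per_cycle:
            break  # not enough template left for another full cycle
        # Only the adaptor whose predetermined bases complement the template
        # ligates; its label reveals which bases were added.
        added.append("".join(COMPLEMENT[base] for base in chunk))
        # Removal of the non-specific part leaves just these bases on the primer.
        position += bases_per_cycle
    extension = "".join(added)
    inferred = "".join(COMPLEMENT[base] for base in extension)
    return extension, inferred
```

Four single-base cycles on a template beginning GATT add CTAA to the primer, from which GATT is inferred for the strand being sequenced.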
At the stage in each cycle when removal of the non-specific part of the adaptor molecules is effected, the means for doing this can be enzymatic or chemical with adaptor molecules designed accordingly. For example, positioning a phosphorothioate linkage or linkages between the base(s) to be added to the duplex (the predetermined bases) and the non-specific part of the adaptors can be utilized (see Example 3) to permit an exonuclease to remove all but the predetermined bases.
The embodiments of the invention permit extremely high throughput, allowing hundreds of thousands of samples to be simultaneously processed. Applications therefore include, for example, analysis of highly complex nucleic acid samples up to whole genomes, or studying many different nucleic acids from many different individuals, for example when performing population or evolutionary studies or when studying complex linkage, especially of disease-associated traits, classifying microorganism types, or when determining total specific transcriptional activity of a cell or tissue. Diagnosis based on small percentages of base differences is also facilitated.
Preferably multiple nucleic acids to be sequenced are simultaneously and independently immobilised. A preferred way is to use adaptors which are oligonucleotides immobilised on beads or on a plate format, in particular glass beads or plates. Glass beads have the advantages that they are available in a wide range of mean diameters allowing optimum size to be selected, that conventional chemistries, especially oligonucleotide syntheses, can be used to attach labels, that once reacted they can be rendered inert, and that their shape can be highly irregular (allowing easy and repeated identification by image analysis). Plates have similar chemical advantages, and offer the advantage that a high density of samples can be arranged on a plate which is then a convenient format for reading in a scanning instrument.
It is generally impractical to subdivide large populations of nucleic acid fragments a sufficient number of times to allow individual fragments to be immobilised on a single type of bead. A mixed population of beads, synthesised such that each bead recognises only one type of fragment, therefore has to be prepared.
The presence of different oligonucleotides of sufficient length on each bead allows each bead to capture a different sequence by hybridisation. Methods well known in the art, if required, can be used to covalently link the captured sequences onto the oligonucleotides. Plates, or other materials in sheet format, can be derived/adapted to bind or covalently attach samples under investigation.
Ligations in the predetermined base addition overall process, as in other aspects of the invention, can be effected using DNA ligase.
In order to synthesise many different oligonucleotides simultaneously on glass beads so that only one type of oligonucleotide is found on a given bead, a cyclical process is used. This is achieved by performing on beads a separate synthesis for each of the first bases required. The products of these syntheses are then mixed together and then divided into four separate synthesis reactions, one for each of the bases to be added. This cycle is repeated for as many positions as it is required to vary on the beads. A given bead can only have one combination of bases in its attached oligonucleotides because it is only ever exposed to one type of base addition per synthesis cycle. The actual order of bases is determined by the actual base additions to which a bead has been exposed. Cycles of this general type have been reported for simultaneously synthesising many different peptides on beads such that each bead has a single peptide (Lam, K., S., et al Nature 354 p82-84 (1991)).
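The mix-and-divide cycle just described can be simulated in a few lines of Python. This is an illustrative sketch only: each bead is represented by the oligonucleotide string it carries, mixing is modelled by a seeded shuffle, and the pool is divided into four equal portions by position. The function name and parameters are assumptions.

```python
import random

BASES = "ACGT"


def split_and_mix(num_beads, positions, seed=0):
    """Return the oligonucleotide carried by each bead after the given number
    of mix-and-divide synthesis cycles."""
    rng = random.Random(seed)
    beads = [""] * num_beads
    for _ in range(positions):
        rng.shuffle(beads)  # mix the products of the previous syntheses
        # Divide into four portions; each portion receives a single base, so a
        # bead is only ever exposed to one type of base addition per cycle.
        beads = [oligo + BASES[i % 4] for i, oligo in enumerate(beads)]
    return beads


beads = split_and_mix(num_beads=1000, positions=3)
print(len(set(beads)), "distinct oligonucleotides on", len(beads), "beads")
```

Each bead ends with exactly one sequence, while the pool as a whole samples the possible combinations of bases at the varied positions.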
To ensure that the oligonucleotide on each bead hybridises to only a single unique nucleic acid sequence, many more permutations of bases on the beads would be used than would be expected to occur in the set of fragments to be sequenced. Few beads would, therefore, detect a sequence in actuality. Thus, for practical purposes there would only be one type of fragment per occupied bead.
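The effect of using many more bead types than fragments can be estimated with a simple calculation. The Python sketch below assumes, purely for illustration, that each fragment's capture sequence is an independent, uniformly random choice among the four-to-the-power-k possible bead types; real fragment populations need not behave this way. The function name is hypothetical.

```python
def shared_bead_probability(num_fragments, k):
    """Probability that a given fragment shares its bead type with at least
    one other fragment, for k varied positions on the beads."""
    bead_types = 4 ** k
    others = num_fragments - 1
    return 1.0 - (1.0 - 1.0 / bead_types) ** others


for k in (4, 8, 12):
    print(k, shared_bead_probability(1000, k))
```

Under these assumptions the probability of two fragments competing for the same bead type falls rapidly as the number of varied positions increases, which is why most occupied beads carry a single type of fragment.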
In relation to the kits of the invention, such kits can, of course, include other items as appropriate or desired, such as DNA ligase or such chemicals as may be required for effectively using oligonucleotide labels. The kits can, of course, also include written instructions.
The invention also includes any of the adaptor molecules described above in connection with the predetermined base addition process, and adaptors as described above for use in the predetermined base removal process of the invention.