This invention relates in general to methods and apparatus for nucleic acid analysis and, in particular, to methods and apparatus for DNA sequencing.
The rate of determining the sequence of the four nucleotides in DNA samples is a major technical obstacle for further advancement of molecular biology, medicine, and biotechnology. Nucleic acid sequencing methods which involve separation of DNA molecules in a gel have been in use since 1978. The only other proven method for sequencing nucleic acids is sequencing by hybridization (SBH).
The array-based approach of SBH does not require single base resolution in separation, degradation, synthesis or imaging of a DNA molecule. In the most commonly discussed variation of this method, using mismatch discriminative hybridization of short oligonucleotides K bases in length, lists of constituent K-mer oligonucleotides may be determined for target DNA. The sequence may be assembled through uniquely overlapping scored oligonucleotides.
In SBH sequence assembly, K−1 oligonucleotides which occur repeatedly in analyzed DNA fragments due to chance or biological reasons may be subject to special consideration. If there is no additional information, relatively small fragments of DNA may be fully assembled inasmuch as every base pair (bp) is read several times. In assembly of relatively longer fragments, ambiguities may arise due to repeated occurrence of a K−1 nucleotide. This problem does not exist if mutated or similar sequences have to be determined, because knowledge of one sequence may be used as a template to correctly assemble a similar one.
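The overlap-based assembly and the K−1 ambiguity described above may be illustrated by the following minimal sketch. It assumes an idealized, error-free list of K-mers and a known starting K-mer; the function names `kmer_spectrum` and `assemble_from_kmers` are hypothetical and are not part of the described method.

```python
def kmer_spectrum(sequence, k):
    """Return the set of all K-mers occurring in `sequence` (idealized scoring)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def assemble_from_kmers(kmers, start):
    """Extend `start` one base at a time through unique K-1 overlaps.

    Returns (assembled_sequence, ambiguous). `ambiguous` is True when the
    extension stopped because more than one unused K-mer shared the same
    K-1 overlap, i.e. a repeated K-1 oligonucleotide was reached.
    Assumes K >= 2 and an error-free K-mer list.
    """
    k = len(start)
    remaining = set(kmers) - {start}
    sequence = start
    while True:
        suffix = sequence[-(k - 1):]
        candidates = [p for p in remaining if p.startswith(suffix)]
        if len(candidates) != 1:
            # No candidate: end of the fragment. Several: branch point.
            return sequence, len(candidates) > 1
        next_kmer = candidates[0]
        sequence += next_kmer[-1]
        remaining.remove(next_kmer)
```

For a fragment in which every K−1 oligonucleotide is unique, the sketch reconstructs the full sequence; when a K−1 oligonucleotide is repeated, it stops at the branch point and reports the ambiguity rather than guessing.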
There are several approaches for sequencing by hybridization. In SBH Format 1, DNA samples are arrayed and labelled probes are hybridized with the samples. Replica membranes with the same sets of sample DNAs may be used for parallel scoring of several probes and/or probes may be multiplexed. Arraying and hybridization of DNA samples on nylon membranes are well developed. Each array may be reused many times. Format 1 is especially efficient for batch processing large numbers of samples.
In SBH Format 2, probes are arrayed and a labelled DNA sample fragment is hybridized to the arrayed probes. In this case, the complete sequence of one fragment may be determined from simultaneous hybridization reactions with the arrayed probes. For sequencing other DNA fragments, the same oligonucleotide array may be reused. The arrays may be produced by spotting or in situ synthesis. Specific hybridization has been demonstrated. In a variant of Format 2, DNA anchors are arrayed and ligation is used to determine oligosequences present at the end of target DNA.
In Format 3, two sets of probes are used. One set may be in the form of arrays and another, labelled set is stored in multiwell plates. In this case, target DNA need not be labelled. Target DNA and one labelled probe are added to the arrayed set of probes. If one attached probe and one labelled probe both hybridize contiguously on the target DNA, they are covalently ligated, producing a sequence twice as long to be scored. The process allows for sequencing long DNA fragments, e.g. a complete bacterial genome, without subcloning the DNA into smaller pieces.
In the present invention, SBH is applied to the efficient identification and sequencing of one or more DNA samples in a short period of time. The procedure has many applications in DNA diagnostics, forensics, and gene mapping. It also may be used to identify mutations responsible for genetic disorders and other traits, to assess biodiversity and to produce many other types of data dependent on DNA sequence.
As mentioned above, Format 1 SBH is appropriate for the simultaneous analysis of a large set of samples. The parallel scoring used for thousands of samples on large arrays may also be applied to one or a few samples in thousands of independent hybridization reactions using small pieces of membranes. The identification of DNA may involve 1-20 probes and the identification of mutations may in some cases involve more than 1000 probes specifically selected or designed for each sample. For identification of the nature of the mutated DNA segments, specific probes may be synthesized or selected for each mutation detected in the first round of hybridizations.
According to the present invention, DNA samples may be prepared in small arrays which may be separated by appropriate spacers, and which may be simultaneously tested with probes selected from a set of oligonucleotides kept in multiwell plates. Small arrays may consist of one or more samples. DNA samples in each small array may consist of mutants or individual samples of a sequence. Consecutive small arrays which form larger arrays may represent either replication of the same array or samples of a different DNA fragment. A universal set of probes consists of sufficient probes to analyze any DNA fragment with prespecified precision, e.g. with respect to the redundancy of reading each bp. These sets may include more probes than are necessary for one specific fragment, but fewer than are necessary for testing thousands of DNA samples of different sequence.
DNA or allele identification and a diagnostic sequencing process may include the steps of:
1) Selection of a subset of probes from a dedicated, representative or universal set to be hybridized with each of a plurality of small arrays;
2) Adding a first probe to each subarray on each of the arrays to be analyzed in parallel;
3) Performing hybridization and scoring of the hybridization results;
4) Stripping off previously used probes and repeating hybridization and scoring with the remaining probes that are to be scored;
5) Processing the obtained results to obtain a final analysis or to determine additional probes to be hybridized;
6) Performing additional hybridizations for certain subarrays; and
7) Processing complete sets of data and obtaining a final analysis.
The present invention solves problems in fast identification and sequencing of a small number of nucleic acid samples of one type (e.g. DNA, RNA) and in parallel analysis of many sample types by using a presynthesized set of probes of manageable size and samples attached to a support in the form of subarrays. Two approaches have been combined to produce an efficient and versatile process for the determination of DNA identity, for DNA diagnostics, and for identification of mutations. For the identification of known sequences, a small set of shorter probes may be used in place of a longer unique probe. In this case, there may be more probes to be scored, but a universal set of probes may be synthesized to cover any type of sequence. For example, a full set of 6-mers or 7-mers comprises only 4,096 or 16,384 probes, respectively.
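The sizes quoted above follow from the 4^K possible oligonucleotides of length K. As a minimal sketch, the following enumerates a universal set and picks out the probes that occur in a known target. The function names and the idealized exact-match selection are assumptions made for illustration only.

```python
from itertools import product


def universal_probe_set(k):
    """All 4**k oligonucleotides of length k over the bases A, C, G and T."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def probes_matching(target, k):
    """Probes of the universal set that occur in a known target sequence.

    Idealized model: a probe is selected only if it has an exact
    full-match site in `target`.
    """
    return [probe for probe in universal_probe_set(k) if probe in target]
```

Under these assumptions, `len(universal_probe_set(6))` is 4096 and `len(universal_probe_set(7))` is 16384, while only a small subset of either set matches any one short target.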
Full sequencing of a DNA fragment may involve two levels. One level is hybridization of a sufficient set of probes that cover every base at least once. For this purpose, a specific set of probes may be synthesized for a standard sample. This hybridization data reveals whether and where mutations (differences) occur in non-standard samples. To determine the identity of the changes, additional specific probes may be hybridized to the sample. In another embodiment, all probes from a universal set may be scored.
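The first level described above, in which probes covering a standard sample reveal where a non-standard sample differs, may be sketched as follows. The sketch assumes perfect mismatch discrimination (a probe is positive only if it has an exact match in the sample) and a single substitution; the function names, probe length, and step size are hypothetical choices for illustration.

```python
def tiling_probes(reference, k, step):
    """Return (start, probe) pairs tiling `reference`.

    Every base is covered at least once when step <= k; a final probe is
    added, if needed, so that the end of the reference is covered.
    """
    last = len(reference) - k
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)
    return [(s, reference[s:s + k]) for s in starts]


def map_change(reference, sample, k, step):
    """Locate a change from the probes that fail to hybridize.

    Returns the half-open interval (start, end) of reference positions
    shared by all negative probes, or None if every probe is positive.
    """
    negatives = [s for s, probe in tiling_probes(reference, k, step)
                 if probe not in sample]
    if not negatives:
        return None
    return max(negatives), min(negatives) + k
```

With a 20-base hypothetical reference, 8-mer probes, and a step of 3, a substitution at position 10 makes three overlapping probes negative, and their common interval narrows the change to positions 9 and 10.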
A universal set of probes allows scoring of a relatively small number of probes per sample in a two-step process without unacceptable expenditure of time. The hybridization process involves successive probings: a first step of computing an optimal subset of probes to be hybridized first and then, on the basis of the obtained results, a second step of determining additional probes to be scored from among those in the existing universal set.
The use of an array of sample arrays avoids consecutive scoring of many oligonucleotides on a single sample or on a small set of samples. This approach allows the scoring of more probes in parallel by manipulation of only one physical object. By combining the use of the subarray format with the universal set of probes and the four-step hybridization process, a DNA sample 1000 bp in length may be sequenced in a relatively short period of time. If the sample is spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 probes may be scored. This number of probes is more than sufficient. In screening for the occurrence of a mutation, approximately 335 probes may be used to cover each base three times. If a mutation is present, several covering probes will be affected. These negative probes may map the mutation with a two-base precision. To solve a single base mutation mapped with this precision, an additional 15 probes may be employed. These probes cover any base combination for the two questionable positions (assuming that deletions and insertions are not involved). These probes may be scored in one cycle on 50 subarrays which contain the given sample. In the implementation of a multiple label color scheme (multiplexing), two to six probes labelled with different fluorescent dyes may be used as a pool, thereby reducing the number of hybridization cycles and shortening the sequencing process.
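The probe counts in the preceding paragraph may be checked with a short calculation. The sketch below assumes one probe (or one pool of differently labelled probes) per subarray per cycle, and it reads the 15 additional probes as the 16 base combinations at two positions minus the known standard combination; both readings are interpretations, and the function names are hypothetical.

```python
import math


def probes_scored(subarrays, cycles, labels_per_pool=1):
    """Probes scored when each subarray receives one pool per cycle."""
    return subarrays * cycles * labels_per_pool


def resolving_probes(questionable_positions):
    """Base combinations at the questionable positions, excluding the
    combination already known from the standard sequence."""
    return 4 ** questionable_positions - 1


def cycles_needed(probes, subarrays, labels_per_pool=1):
    """Hybridization cycles required to score `probes` probes."""
    return math.ceil(probes / (subarrays * labels_per_pool))
```

Under these assumptions, `probes_scored(50, 10)` gives 500 and `resolving_probes(2)` gives 15. Scoring 335 screening probes on 50 subarrays takes 7 cycles with singly labelled probes and 4 cycles with pools of two labels, which illustrates how multiplexing shortens the process.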
In more complicated cases, there may be two close mutations or insertions. They may be handled with more probes. For example, a three-base insertion may be solved with 64 probes. The most complicated cases may be approached by several steps of hybridization, with a new set of probes selected on the basis of the results of previous hybridizations.
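The figure of 64 probes corresponds to the 4^3 possible three-base inserts. A minimal sketch of generating such a probe set is given below; the flanking sequences and the function name are hypothetical, and the sketch assumes the insertion site has already been mapped.

```python
from itertools import product


def insertion_probes(left_flank, right_flank, insert_length):
    """One probe per possible insert of `insert_length` bases, each
    spanning the mapped insertion site between the two flanks."""
    return [left_flank + "".join(insert) + right_flank
            for insert in product("ACGT", repeat=insert_length)]
```

For example, `insertion_probes("ACG", "TTA", 3)` yields 64 distinct 9-mers, only one of which would be a full match for any given three-base insertion.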
If subarrays consist of tens or hundreds of samples of one type, then several of them may be found to contain one or more changes (mutations, insertions, or deletions). For each segment where a mutation occurs, a specific set of probes may be scored. The total number of probes to be scored for a type of sample may be several hundred. The scoring of replica arrays in parallel allows scoring of hundreds of probes in a relatively small number of cycles. In addition, compatible probes may be pooled. Positive hybridizations may be assigned to the probes selected to check particular DNA segments because these segments usually differ in 75% of their constituent bases.
By using a larger set of longer probes, longer targets may be conveniently analyzed. These targets may represent pools of shorter fragments such as pools of exon clones.
The multiple step approach, which minimizes the number of necessary probes, may employ a specific hybridization scoring method to define the presence of heterozygotes (sequence variants) in a genomic segment to be sequenced from a diploid chromosomal set. There are two possibilities: i) the sequence from one chromosome represents a basic type and the sequence from the other represents a new variant; or, ii) both chromosomes contain new, but different variants. In the first case, the scanning step designed to map changes gives a maximal signal difference of two-fold at the heterozygotic position. In the second case, there is no masking; only a more complicated selection of the probes for the subsequent rounds of hybridizations may be required.
Scoring two-fold signal differences required in the first case may be achieved efficiently by comparing corresponding signals with controls containing only the basic sequence type and with the signals from other analyzed samples. This approach allows determination of a relative reduction in the hybridization signal for each particular probe in the given sample. This is significant because hybridization efficiency may vary more than two-fold for a particular probe hybridized with different DNA fragments having its full match target. In addition, heterozygotic sites may affect more than one probe depending on the number of oligonucleotide probes. Decrease of the signal for two to four consecutive probes produces a more significant indication of heterozygotic sites. The leads may be checked by small sets of selected probes among which one or a few probes are expected to give a full match signal which is on average eight-fold stronger than the signals coming from mismatch-containing duplexes.
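The comparison against controls described above may be sketched as a search for runs of consecutive probes whose signal is reduced relative to a control containing only the basic sequence type. The ratio threshold of 0.75 (midway between the expected ratios of 1.0 and 0.5), the minimum run length of two probes, and the function name are assumptions chosen for illustration, not values taken from this description.

```python
def reduced_signal_runs(sample, control, threshold=0.75, min_run=2):
    """Return (first_index, last_index) for each run of at least `min_run`
    consecutive probes whose sample/control signal ratio is below
    `threshold`. Probes with a non-positive control signal are skipped
    as uninformative and end any current run.
    """
    runs = []
    start = None
    count = min(len(sample), len(control))
    for i in range(count):
        low = control[i] > 0 and sample[i] / control[i] < threshold
        if low and start is None:
            start = i
        elif not low and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and count - start >= min_run:
        runs.append((start, count - 1))
    return runs
```

Normalizing each probe to its own control signal is what makes the two-fold reduction detectable, since absolute signals may differ by more than two-fold between probes. An isolated low probe is ignored under these settings, whereas two or more consecutive low probes are reported as a lead to be checked with selected probes.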
Partitioned membranes allow a very flexible organization of experiments to accommodate relatively larger numbers of samples representing a given sequence type, or many different types of samples each represented by a smaller number of samples. A range of 4-256 samples can be handled with particular efficiency. Subarrays within this range of numbers of dots may be designed to match the configuration and size of standard multiwell plates used for storing and labelling oligonucleotides. The size of the subarrays may be adjusted for different numbers of samples, or a few standard subarray sizes may be used. If all samples of one type do not fit in one subarray, additional subarrays or membranes may be used and processed with the same probes. In addition, by adjusting the number of replicas for each subarray, the time for completion of the identification or sequencing process may be varied.