This patent application claims the foreign priority of Canadian Patent Application Number 2,256,128, Dec. 29, 1998.
United States Federal sponsorship was not involved in this work.
Not applicable.
1. Brown, T. A., “DNA Sequencing: The Basics”, Oxford University Press, New York, 1994.
2. Tibbetts, C., Bowling, J., “Method and Apparatus for Automatic Nucleic Acid Sequence Determination”, U.S. Pat. No. 5,365,455, Nov. 15, 1994.
3. Lee, E., Messerschmitt, D., “Digital Communication”, (2nd Ed.), Kluwer, New York, 1994.
4. Proakis, J. G., “Digital Communications”, (3rd Ed.), McGraw-Hill Inc., New York, 1995.
5. Blahut, R. E., “Theory and Practice of Error Control Codes”, Addison-Wesley Publishing Co., Reading, Mass., 1983.
Deoxyribonucleic acid (DNA) encodes genetic information by the specific base type at each point in a sequence of bases. For research and medical purposes it is desirable to recover the sequence x = {x_i, i = 1, ..., N}, where x_i is one of the four bases {adenine (A), cytosine (C), guanine (G), thymine (T)} that encode the genetic information. For some medical tests, it is not necessary to recover the whole sequence but only to identify the base type at certain key locations in the sequence.
In Sanger sequencing [1], the DNA template to be sequenced is chemically processed to encode sequence position by molecular weight and base type by the presence or absence of a fluorescent or radioactive marker. Gel electrophoresis is used to separate the molecules by length; in automated DNA sequencing [2], this translates molecular size into time of passage past a detector. Four time-series y_{n,k}, where n ∈ {A, C, G, T} is the base type and k is the time sample index, are recorded, each corresponding to one of the four possible chemical base types. At a given time, a high-level signal (peak) should appear in only one of the series; this indicates the base type at that point in the sequence. We shall refer to the recorded time-series as the ‘DNA time-series’ for the remainder of this document.
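The peak-per-series reading described above can be illustrated with a minimal Python sketch. It assumes idealized, already-aligned traces and known peak sample indices; the function name `call_bases`, the `traces` layout, and the numeric values are hypothetical and are not taken from the cited references.

```python
BASES = "ACGT"

def call_bases(traces, peak_times):
    """At each given sample index, pick the base whose trace is highest.

    traces: dict mapping each base letter to a list of samples y[n][k]
            (assumed layout, for illustration only).
    peak_times: sample indices k at which a peak is assumed to occur.
    """
    calls = []
    for k in peak_times:
        best = max(BASES, key=lambda n: traces[n][k])
        calls.append(best)
    return "".join(calls)

# Made-up, noise-light traces with one clear peak per position.
traces = {
    "A": [0.1, 0.9, 0.1, 0.0, 0.1, 0.0, 0.1],
    "C": [0.0, 0.1, 0.0, 0.8, 0.1, 0.1, 0.0],
    "G": [0.1, 0.0, 0.1, 0.1, 0.0, 0.7, 0.1],
    "T": [0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0],
}
print(call_bases(traces, peak_times=[1, 3, 5]))  # prints ACG
```

Real traces have overlapping peaks, uneven spacing, and noise, so this per-sample maximum is only a picture of the ideal case.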
The fragment of DNA to be sequenced and the starting position for sequencing are identified through the use of primers [1]. Primers are short strands of DNA that are complementary to the target DNA sequence at the point of interest. Primers bind to the DNA template at that point and permit copying of the DNA using a DNA polymerase. This copying process is used in fragment selection and in sequencing as part of the process that encodes sequence position by molecular weight. In the latter case, the recovered sequence position would be relative to the primer's location with respect to the original DNA template.
In practice, the recovery of the sequence is complicated by undesirable signal features. Errors in DNA sequencing can have dangerous implications for the pharmaceutical and medical communities. To reduce errors, the entire sequencing process is repeated until a consensus sequence can be reached. This process is costly. Thus, there exists a need for a method to reduce the error rate so that the costs and risks of DNA sequencing and testing may be minimized.
In data communications [3], [4], time-series similar to the DNA time-series described above are used to represent sequential information such as the text of a document. A receiving device examines the time-series to recover an estimate of the original text. However, noise and distortion imposed on the time-series during its passage through a transmission medium, such as a radio link or telephone wires, can lead to errors in the recovery of the original information. To reduce the chance of error, the original data may first be passed through a coder that imposes a mathematical code on the data [3], [5]. This introduces redundant information that a decoder added to the receiver uses to identify and correct errors. A large variety of codes has been created [5].
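As a minimal illustration of how redundancy lets a decoder correct errors, the following Python sketch uses a three-fold repetition code with majority-vote decoding. This code is chosen here only because it is the simplest example; it is not claimed to be the code used in this invention, and the function names are hypothetical.

```python
def encode(bits, r=3):
    """Repeat every data bit r times (adds redundancy)."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Recover each data bit by majority vote over its r copies."""
    out = []
    for i in range(0, len(received), r):
        block = received[i:i + r]
        out.append(1 if sum(block) > r // 2 else 0)
    return out

message = [1, 0, 1, 1]
sent = encode(message)

# Simulate a channel error: flip one bit inside the second block.
corrupted = list(sent)
corrupted[4] ^= 1

print(decode(corrupted) == message)  # prints True
```

With r = 3 the decoder corrects any single flipped bit per block, at the cost of sending three times as many symbols.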
With a goal of reducing errors, this invention imposes a code by creating a new family of molecules from the DNA fragment of interest. This new family of molecules consists of fragments offset from the start of the original fragment by using different primers to achieve different offsets.
Standard codes may then be implemented by combining different proportions of the different fragments. This mixture is then used in the usual testing or sequencing process, such as gel electrophoresis, to recover the coded DNA time-series. The sequencer or tester then decodes the time-series by hypothesizing what the time-series should have been for each possible sequence and choosing the sequence that yields the best match to the observed time-series.
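The hypothesize-and-match decoding described above can be sketched in Python under strong simplifying assumptions that are not taken from the source: each base contributes a single unit-height sample at its position, a fragment with offset d shifts the sequence by d positions, the mixture is a weighted sum of the fragments' signals, and the match is measured by squared error. The names `expected_traces`, `mismatch`, and `decode`, and the offsets and weights used, are hypothetical. The search is exhaustive, so it is practical only for very short sequences.

```python
from itertools import product

BASES = "ACGT"

def expected_traces(seq, offsets, weights):
    """Idealized coded signal for one hypothesized sequence.

    Each fragment starts `offset` bases into seq and contributes
    `weight` to the trace of its base at each relative position.
    """
    traces = {n: [0.0] * len(seq) for n in BASES}
    for offset, weight in zip(offsets, weights):
        for pos, base in enumerate(seq[offset:]):
            traces[base][pos] += weight
    return traces

def mismatch(a, b):
    """Squared error between two sets of four traces."""
    return sum((u - v) ** 2 for n in BASES for u, v in zip(a[n], b[n]))

def decode(observed, length, offsets, weights):
    """Try every sequence of the given length; keep the best match."""
    best_seq, best_cost = None, float("inf")
    for candidate in product(BASES, repeat=length):
        cost = mismatch(observed, expected_traces(candidate, offsets, weights))
        if cost < best_cost:
            best_seq, best_cost = "".join(candidate), cost
    return best_seq

offsets, weights = (0, 1), (1.0, 0.5)   # assumed mixture of two fragments
observed = expected_traces("GATTC", offsets, weights)
observed["A"][0] += 0.2                 # small made-up distortion
observed["G"][0] -= 0.2
print(decode(observed, 5, offsets, weights))  # prints GATTC
```

In this toy model the second fragment repeats each base's evidence one position earlier at half height, so a distorted sample at one position can be outweighed by consistent evidence at its neighbor.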