A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Software Appendices A and B comprising six (6) sheets are included herewith.
The present invention relates to the field of computer systems. More specifically, the present invention relates to computer systems for sequencing biological molecules including nucleic acids.
Devices and computer systems for forming and using arrays of materials on a substrate are known. For example, PCT application Ser. Nos. WO92/10588 and 95/11995, incorporated herein by reference for all purposes, describe techniques for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Pat. Nos. 5,445,934 and 5,384,261, and U.S. patent application Ser. No. 08/249,188, each incorporated herein by reference for all purposes.
According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known locations on a chip or substrate. A labeled nucleic acid is then brought into contact with the chip and a scanner generates an image file (also called a cell file) indicating the locations where the labeled nucleic acids are bound to the chip. Based upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as the nucleotide or monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may be used to study and detect mutations relevant to genetic diseases, cancers, infectious diseases, HIV, and other genetic characteristics.
The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small chips. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest (the "target" nucleic acid).
For sequence checking applications, the chip may be tiled for a specific target nucleic acid sequence. For example, the chip may contain probes that are perfectly complementary to the target sequence and probes that differ from the target sequence by a single base mismatch. These probes are tiled on a chip in rows and columns of cells, where each cell includes multiple copies of a particular probe. Additionally, "blank" cells, which do not include any probes, may be present on the chip. As the blank cells contain no probes, labeled targets should not bind specifically to the chip in this area. Thus, a blank cell provides a measure of the background intensity.
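The tiling described above can be illustrated with a minimal Python sketch. It is not taken from the Software Appendices. The function names, the choice of the central probe position as the mismatch site, and the use of a simple mean over blank cells are all assumptions made for illustration.

```python
# Illustrative sketch only: names and design choices here are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tile_probes(target, probe_len):
    """For each window of the target, return the perfectly complementary
    probe together with its three single-base mismatch probes.

    Assumption: the mismatch is placed at the central probe position.
    """
    mid = probe_len // 2
    rows = []
    for start in range(len(target) - probe_len + 1):
        perfect = reverse_complement(target[start:start + probe_len])
        mismatches = [
            perfect[:mid] + base + perfect[mid + 1:]
            for base in "ACGT"
            if base != perfect[mid]
        ]
        rows.append((perfect, mismatches))
    return rows


def background_intensity(blank_cell_intensities):
    """Estimate background as the mean intensity of the blank cells
    (assumption: a plain mean is an adequate summary)."""
    return sum(blank_cell_intensities) / len(blank_cell_intensities)


rows = tile_probes("ACGTA", 3)
background = background_intensity([10, 20, 30])
```

Each row of the result corresponds to one position of the target, so a row could be laid out as one column of four cells on the chip.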
For de novo sequencing applications, the chip may include all the possible probes of a specific length. These probes are synthesized on the chip at known locations, typically with multiple copies of a particular probe in a cell. Blank cells may also be utilized to provide a measure of the background intensity.
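As a rough illustration of "all the possible probes of a specific length", the following Python sketch enumerates every sequence of a given length over the four DNA bases. The function name `all_probes` is hypothetical. A chip carrying all probes of length n would need 4^n probe cells, not counting blank cells.

```python
from itertools import product


def all_probes(length):
    """Enumerate every DNA probe sequence of the given length."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]


probes = all_probes(3)
```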
The present invention provides an improved computer-aided system for sequencing sample nucleic acid sequences from nucleic acid hybridization information. The accuracy of nucleic acid sequencing is increased by analyzing the hybridization strength of related probes, where the related probes are identified according to mismatch information among the probes. The related probes may include single base mismatches or otherwise have identical subsequences. The methods of the present invention allow sequencing under conditions that do not allow identification of all of the probes that are perfectly complementary to part of the target nucleic acid sequence.
According to one aspect of the present invention, a computer system is used to sequence a nucleic acid by a method including the steps of: inputting hybridization intensities for a plurality of nucleic acid probes, the nucleic acid probes hybridizing with the nucleic acid sequence under conditions that do not allow identification of all of the nucleic acid probes that are perfectly complementary to part of the nucleic acid sequence; and sequencing the nucleic acid sequence according to selected nucleic acid probes.
According to another aspect of the present invention, a computer system is used to sequence a nucleic acid by a method including the steps of: inputting hybridization intensities for a plurality of nucleic acid probes; selecting nucleic acid probes with highest numbers of single base mismatch neighbors among the probes, a single base mismatch neighbor being another probe that has the same sequence except for a single base that is different; and sequencing the nucleic acid sequence according to the selected nucleic acid probes.
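The neighbor-based selection in this aspect can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of the Software Appendices: the intensity cutoff `min_intensity`, the `top_k` limit, and the tie-breaking order are hypothetical parameters introduced here.

```python
def is_single_mismatch(a, b):
    """True if two equal-length probes differ at exactly one base."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def neighbor_counts(probes):
    """Count, for each probe, its single base mismatch neighbors in the set."""
    return {p: sum(is_single_mismatch(p, q) for q in probes) for p in probes}


def select_by_neighbors(intensities, min_intensity, top_k):
    """Keep probes at or above an intensity cutoff (assumption), then return
    the top_k probes ranked by number of single base mismatch neighbors.
    Ties are broken by intensity, then alphabetically (assumption)."""
    candidates = [p for p, v in intensities.items() if v >= min_intensity]
    counts = neighbor_counts(candidates)
    ranked = sorted(candidates, key=lambda p: (-counts[p], -intensities[p], p))
    return ranked[:top_k]


intensities = {"ACGT": 90, "ACGA": 80, "ACCT": 70, "TTTT": 95, "GGGG": 10}
selected = select_by_neighbors(intensities, min_intensity=50, top_k=2)
```

In this toy input, `TTTT` has the highest intensity but no single base mismatch neighbors among the candidates, so it ranks below `ACGT`, which has two.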
According to another aspect of the present invention, a computer system is used to sequence a nucleic acid by a method including the steps of: inputting hybridization intensities for a plurality of nucleic acid probes; selecting nucleic acid probes that have fewer than a predetermined number of base mismatches with another probe; and sequencing the nucleic acid sequence according to the selected nucleic acid probes.
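One way to read the selection step in this aspect is as a Hamming-distance filter: a probe is kept if at least one other probe differs from it at fewer than a predetermined number of positions. The Python sketch below assumes that reading, and the names `hamming` and `select_related` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length probes differ."""
    if len(a) != len(b):
        raise ValueError("probes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_related(probes, max_mismatches):
    """Keep each probe that has fewer than max_mismatches base mismatches
    with at least one other probe in the list."""
    selected = []
    for i, p in enumerate(probes):
        others = (q for j, q in enumerate(probes) if j != i)
        if any(hamming(p, q) < max_mismatches for q in others):
            selected.append(p)
    return selected


probes = ["ACGT", "ACGA", "TTTT", "TTAA"]
related = select_related(probes, max_mismatches=2)
```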
According to another aspect of the present invention, a nucleic acid is sequenced by a method including the steps of: contacting a set of oligonucleotide probes of predetermined sequence and length with the nucleic acid under hybridization conditions that do not allow differentiation between (i) those probes of the set which are perfectly complementary to part of the nucleic acid and (ii) those probes that are not perfectly complementary to part of the nucleic acid; selecting a subset of oligonucleotide probes that includes probes that are perfectly complementary to part of the nucleic acid and probes that are not perfectly complementary to part of the nucleic acid; and determining the sequence of the nucleic acid by compiling overlapping sequences of the subset of probes.
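The final step, compiling overlapping sequences, can be illustrated with a minimal Python sketch that chains probes whose sequences overlap by all but one base. It makes several simplifying assumptions: all probes have the same length, exactly one probe starts the chain, and extension stops when the next probe is missing or ambiguous. The function name `assemble` is hypothetical. The result is the compiled probe sequence, and the target would be its complement.

```python
def assemble(probes):
    """Compile equal-length probes that overlap by (length - 1) bases into
    one sequence. Stops at the first missing or ambiguous extension."""
    probes = list(dict.fromkeys(probes))  # drop duplicates, keep order
    k = len(probes[0])
    if k < 2 or any(len(p) != k for p in probes):
        raise ValueError("probes must share one length of at least 2")

    # The starting probe is the one whose prefix is no other probe's suffix.
    suffixes = {p[1:] for p in probes}
    starts = [p for p in probes if p[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting probe")

    by_prefix = {}
    for p in probes:
        by_prefix.setdefault(p[:-1], []).append(p)

    sequence = starts[0]
    used = {starts[0]}
    while True:
        tail = sequence[-(k - 1):]
        candidates = [p for p in by_prefix.get(tail, []) if p not in used]
        if len(candidates) != 1:
            break  # end of chain, or ambiguous branch
        sequence += candidates[0][-1]
        used.add(candidates[0])
    return sequence


compiled = assemble(["GTTG", "ACGT", "TGCA", "CGTT", "TTGC"])
```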
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.