The present invention relates to detecting differences in polymers. More specifically, the present invention relates to detecting polymorphisms in sample nucleic acid sequences by clustering hybridization affinity information.
Devices and computer systems for forming and using arrays of materials on a chip or substrate are known. For example, PCT applications WO92/10588 and 95/11995, both incorporated herein by reference for all purposes, describe techniques for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Pat. Nos. 5,445,934, 5,384,261 and 5,571,639, each incorporated herein by reference for all purposes.
According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known locations on a chip. A labeled nucleic acid is then brought into contact with the chip and a scanner generates an image file indicating the locations where the labeled nucleic acids are bound to the chip. Based upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as the nucleotide or monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may be used to study and detect mutations relevant to genetic diseases, cancers, infectious diseases, HIV, and other genetic characteristics.
The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small chips. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest (the “target” nucleic acid).
For sequence checking applications, the chip may be tiled for a specific target nucleic acid sequence. As an example, the chip may contain probes that are perfectly complementary to the target sequence and probes that differ from the target sequence by a single base mismatch. For de novo sequencing applications, the chip may include all the possible probes of a specific length. The probes are tiled on a chip in rows and columns of cells, where each cell includes multiple copies of a particular probe. Additionally, “blank” cells, which do not include any probes, may be present on the chip. As the blank cells contain no probes, labeled targets should not bind specifically to the chip in this area. Thus, a blank cell provides a measure of the background intensity.
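The tiling and background measurement described above can be illustrated with a minimal Python sketch. It is not the layout used on any actual chip: the function names (`tile_probes`, `subtract_background`), the probe length of 5, the choice of the central probe position as the mismatch position, and the clamping of corrected intensities at zero are all assumptions made for illustration. Probes are written in the same left-to-right order as the target, so strand orientation is ignored.

```python
# Illustrative sketch only; names and parameters are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written in the same order as the input."""
    return "".join(COMPLEMENT[base] for base in seq)


def tile_probes(target, probe_len=5):
    """For each window of the target, build one perfect-match probe and the
    three probes that differ from it by a single base at the central position
    (the central position is an assumption of this sketch)."""
    mid = probe_len // 2
    tiles = []
    for start in range(len(target) - probe_len + 1):
        window = target[start:start + probe_len]
        perfect = complement(window)
        mismatches = [
            perfect[:mid] + base + perfect[mid + 1:]
            for base in "ACGT"
            if base != perfect[mid]
        ]
        tiles.append(
            {"position": start + mid, "perfect": perfect, "mismatches": mismatches}
        )
    return tiles


def subtract_background(intensities, blank_intensities):
    """Estimate background as the mean intensity of the blank cells and
    subtract it from each probe-cell intensity, clamping at zero."""
    background = sum(blank_intensities) / len(blank_intensities)
    return [max(0.0, value - background) for value in intensities]


tiles = tile_probes("ACGTAC", probe_len=5)
corrected = subtract_background([10.0, 3.0], [4.0, 6.0])
```

Each entry of `tiles` corresponds to one interrogated base position of the target and holds the four probes (one perfect match, three mismatches) that would occupy four cells of the chip.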
The interpretation of hybridization data from hybridized chips can encounter several difficulties. Random errors, such as physical defects on the chip, can cause individual probes or spatially related groups of probes to exhibit abnormal hybridization (e.g., by abnormal fluorescence). Systematic errors, such as the formation of secondary structures in the probes or the target, can also cause reproducible, but still misleading, hybridization data.
For many applications, it is desirable to determine if there are differences between and among sample nucleic acid sequences, such as polymorphisms at a base position. It would be desirable to have systems and methods of detecting these differences in a way that is not overly affected by random and systematic errors.
The present invention provides innovative systems and methods for detecting differences in sample polymers, such as nucleic acid sequences. Hybridization affinity information for the sample polymers is clustered so that the differences, if any, between or among the sample polymers can be readily identified. By clustering the hybridization affinity information of the sample polymers, differences in the sample polymers can be accurately detected even in the presence of random and systematic errors. Additionally, polymorphisms can be detected in sample nucleic acids regardless of what basecalling has reported. Several embodiments of the invention are described below.
In one embodiment, the invention provides a method of detecting differences in sample polymers. Multiple sets of hybridization affinity information are input, where each set of hybridization affinity information includes hybridization affinities between a sample polymer and polymer probes. The multiple sets of hybridization affinity information are clustered into multiple clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster. The multiple clusters can then be analyzed to detect if there are differences in the sample polymers. For example, if the clustering does not produce clusters whose members are very similar to one another yet very different from the members of other clusters, this can indicate that the sample polymers are the same. Otherwise, the sample polymers may be different.
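The cluster analysis described in this embodiment can be illustrated with a minimal Python sketch. It assumes that each set of hybridization affinity information is a list of numeric affinities, that similarity is measured by Euclidean distance, and that clusters count as distinct when the closest pair of sets in different clusters is at least `margin` times as far apart as the farthest pair within any one cluster. The distance measure, the `margin` parameter, and all function names are hypothetical choices made for illustration, not details taken from the method itself.

```python
# Illustrative sketch only; distance measure, margin, and names are assumed.
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two sets of hybridization affinities."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_within(clusters):
    """Largest distance between two sets that share a cluster."""
    return max(
        (distance(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )


def min_between(clusters):
    """Smallest distance between two sets that lie in different clusters."""
    return min(
        (
            distance(a, b)
            for c1, c2 in combinations(clusters, 2)
            for a in c1
            for b in c2
        ),
        default=math.inf,
    )


def clusters_are_distinct(clusters, margin=2.0):
    """True when every cluster is tight relative to its separation from the
    other clusters, which may indicate that the sample polymers differ."""
    return min_between(clusters) > margin * max_within(clusters)


# Two groups with clearly different affinity patterns.
group_a = [[0.9, 0.1, 0.1, 0.1], [0.85, 0.12, 0.1, 0.08]]
group_b = [[0.2, 0.1, 0.8, 0.1], [0.25, 0.1, 0.75, 0.12]]
different = clusters_are_distinct([group_a, group_b])

# An arbitrary split of samples that all share one affinity pattern.
split_1 = [[0.9, 0.1], [0.88, 0.11]]
split_2 = [[0.89, 0.1], [0.91, 0.09]]
same = clusters_are_distinct([split_1, split_2])
```

In this sketch `different` evaluates to `True` and `same` to `False`: the first pair of clusters is tight and well separated, while the second is no more separated than its own members are from each other.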
In another embodiment, the invention provides a method of detecting polymorphisms in sample nucleic acid sequences. Multiple sets of hybridization affinity information are input, where each set of hybridization affinity information includes hybridization affinities between a sample nucleic acid sequence and nucleic acid probes. The multiple sets of hybridization affinity information are hierarchically clustered into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster. The multiple clusters can then be analyzed to detect if there are polymorphisms in the sample nucleic acid sequences. The polymorphisms can include mutations, insertions and deletions.
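Hierarchical clustering of the affinity sets can be illustrated with a minimal agglomerative sketch in Python. The text does not specify a linkage rule or distance measure, so average linkage and Euclidean distance are assumed here, and the names `hierarchical_cluster` and `average_linkage` are hypothetical. Starting with each set in its own cluster, the two closest clusters are merged repeatedly until the requested number of clusters remains.

```python
# Illustrative sketch only; linkage rule, distance, and names are assumed.
import math


def euclidean(a, b):
    """Euclidean distance between two sets of hybridization affinities."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters,
    where clusters are lists of indices into `data`."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(data, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until only
    `n_clusters` remain. Returns the final clusters (as index lists) and the
    merge history as (cluster, cluster, distance) tuples."""
    clusters = [[i] for i in range(len(data))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Four hypothetical samples: two with one affinity pattern, two with another.
data = [[0.9, 0.1], [0.88, 0.12], [0.2, 0.8], [0.22, 0.79]]
clusters, merges = hierarchical_cluster(data, n_clusters=2)
```

The merge distances recorded in `merges` play the role of the heights in a dendrogram: a late merge at a much larger distance than the early ones suggests groups of samples with distinct hybridization patterns, which may point to a polymorphism.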
Other features and advantages of the invention will become readily apparent upon review of the following detailed description in association with the accompanying drawings.