The traditional approach to genome sequence analysis requires a primary sequence to be determined by conventional gel-based methods (typically using Applied Biosystems DNA sequencers). In this type of approach, the amount of work increases in proportion to both the length of sequence and the number of organisms tested and becomes impractical for large stretches of DNA or large numbers of organisms. For this reason, relatively few individuals within a species have been sequenced to look for polymorphic variation. Furthermore, only a few exemplary species, such as humans and E. coli, have been subject to large-scale sequencing.
Arrays of probes provide a more efficient means of analyzing variant sequences once a prototypical or reference sequence has been determined. Analysis of the hybridization pattern of probes to a target nucleic acid reveals the position, and optionally the nature, of differences between the target and reference sequence. For example, WO 95/11995 describes arrays comprising four probe sets. Comparison of the hybridization intensities of four corresponding probes, one from each of the four sets, to a target sequence reveals the identity of a corresponding nucleotide in the target sequence aligned with an interrogation position of the probes. The corresponding nucleotide is the complement of the nucleotide occupying the interrogation position of the probe showing the highest intensity.
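The base-calling rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the four intensities at each position are supplied as a mapping from the nucleotide at the probe's interrogation position to the measured intensity. The function names and input format are hypothetical and are not taken from WO 95/11995.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities):
    """Return the target nucleotide implied by four probe intensities.

    `intensities` maps the base at each probe's interrogation position
    to that probe's hybridization intensity (hypothetical input format).
    """
    # The probe with the highest intensity is taken as the best match.
    brightest = max(intensities, key=intensities.get)
    # The target base is the complement of that probe's interrogation base.
    return COMPLEMENT[brightest]


def call_sequence(per_position):
    """Call one base per interrogated position and join the calls."""
    return "".join(call_base(p) for p in per_position)


# Made-up intensities for three positions, for illustration only.
example = [
    {"A": 120.0, "C": 15.0, "G": 10.0, "T": 22.0},
    {"A": 8.0, "C": 30.0, "G": 410.0, "T": 12.0},
    {"A": 5.0, "C": 9.0, "G": 7.0, "T": 95.0},
]
print(call_sequence(example))  # prints "TCA"
```

A practical implementation would likely also check how far the highest intensity exceeds the next highest before making a call. That refinement is an assumption and is omitted here.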
The existence of variation between a target and reference sequence can also be identified by differences in normalized hybridization intensities of probes flanking the variation when the probes are respectively hybridized to target and reference sequences. Relative loss of hybridization intensity is manifested as a “footprint” of probes flanking the point of variation between target and reference sequence (see EP 717,113, incorporated by reference in its entirety for all purposes). Additionally, hybridization intensities for multiple targets from different sources can be classified into groups or clusters suggested by the data, not defined a priori, such that isolates in a given cluster tend to be similar and isolates in different clusters tend to be dissimilar (see WO 97/29212, incorporated by reference in its entirety for all purposes).
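The footprint idea can be sketched as a scan for runs of adjacent probes whose normalized intensity with the target falls well below that with the reference. This is an illustrative sketch only. The ratio threshold, the minimum run length, and the function name are assumptions chosen for the example and are not taken from EP 717,113.

```python
def find_footprints(target, reference, threshold=0.5, min_run=3):
    """Return (start, end) probe index ranges, end exclusive, where the
    target/reference intensity ratio stays below `threshold` for at
    least `min_run` consecutive probes.

    `target` and `reference` are equal-length lists of normalized
    intensities, ordered by probe position (hypothetical input format).
    """
    if len(target) != len(reference):
        raise ValueError("intensity lists must have the same length")

    footprints = []
    run_start = None
    for i, (t, r) in enumerate(zip(target, reference)):
        # Probes with no reference signal are treated as uninformative.
        low = r > 0 and (t / r) < threshold
        if low and run_start is None:
            run_start = i
        elif not low and run_start is not None:
            if i - run_start >= min_run:
                footprints.append((run_start, i))
            run_start = None
    # Close a run that extends to the last probe.
    if run_start is not None and len(target) - run_start >= min_run:
        footprints.append((run_start, len(target)))
    return footprints


# Made-up normalized intensities; probes 2 to 4 lose signal with the target.
reference = [1.0, 0.9, 1.1, 1.0, 1.0, 0.95, 1.05, 1.0]
target = [0.95, 0.9, 0.3, 0.2, 0.25, 0.9, 1.0, 1.0]
print(find_footprints(target, reference))  # prints [(2, 5)]
```

Requiring a run of several probes rather than a single low ratio reflects the description above, in which a variation reduces hybridization across the probes that flank it.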
Array-based resequencing has been used, for example, in the identification of large numbers of human polymorphisms in mitochondrial DNA and ESTs, the identification of drug-induced mutations in HIV, and analysis of mutations in p53 correlated with human cancer.