Nucleic acids (or polynucleotides) are linear polymers composed of covalently linked nucleotides. In turn, nucleotides are small organic compounds composed of phosphoric acid, a carbohydrate, and a purine such as adenine (A) or guanine (G) or a pyrimidine such as cytosine (C), thymine (T), or uracil (U).
Nucleic acids may be single-stranded or double-stranded, where double-stranded nucleic acids are composed of two single-stranded nucleic acids bound to one another through noncovalent base-pairing interactions to form a hybrid. Such binding or hybridization will occur if the sequences of the single-stranded nucleic acids are "complementary" (or nearly complementary), so that for example wherever there is an A in one strand there is a T or a U in the other, and wherever there is a G in one strand there is a C in the other.
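The pairing rule just described can be illustrated with a minimal Python sketch. The names `PAIRS` and `is_complementary` are hypothetical, and the sketch assumes that both strands are written 5' to 3', so that the second strand is reversed before the bases are compared (antiparallel pairing). It tests only exact complementarity, not the "nearly complementary" case.

```python
# Pairing partners named in the text: A pairs with T or U, G pairs with C.
PAIRS = {
    "A": {"T", "U"},
    "T": {"A"},
    "U": {"A"},
    "G": {"C"},
    "C": {"G"},
}


def is_complementary(strand_a, strand_b):
    """Return True if every base of strand_a pairs with the facing base of strand_b.

    Assumption: both strands are written 5'->3', so strand_b is reversed
    before the position-by-position comparison.
    """
    a = strand_a.upper()
    b = strand_b.upper()[::-1]
    if len(a) != len(b):
        return False
    # Unknown characters have no partners and therefore never pair.
    return all(y in PAIRS.get(x, set()) for x, y in zip(a, b))
```

For example, `is_complementary("GATTC", "GAATC")` evaluates to true, whereas a strand compared with itself, as in `is_complementary("GATTC", "GATTC")`, does not.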
Nucleic acids in the form of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) encode genetic information that controls cellular function and heredity in biological systems. DNA encodes information at least in part in the form of genes, where genes are sequences of nucleotides that encode information for constructing a polypeptide. The sequence of nucleotides in a gene may vary due to insertions, deletions, repeats, inversions, translocations, and/or single and multiple nucleotide substitutions, among others. These variations may be termed polymorphisms, and genes that differ by polymorphisms may be termed alleles.
Polymorphisms and other genetic factors appear to contribute to virtually every human disease, conferring susceptibility or resistance and affecting both progression and severity. For example, variations in the apoE gene are associated with Alzheimer's disease, variations in the CCR5 chemokine receptor gene are associated with resistance to HIV infection, variations in the hemoglobin gene are associated with sickle cell anemia, and variations in glycosyltransferase genes are associated with the ABO blood groups. Thus, an understanding of the genetic contribution to disease may greatly impact the diagnosis, treatment, and prevention of disease. Moreover, an understanding of this genetic contribution also may help in identifying and understanding nongenetic (e.g., environmental) influences on disease.
Analysis of DNA sequence variations is becoming increasingly important in identifying the genes involved in both disease and normal biological processes, including development, aging, and reproduction. For example, to understand disease, it is important to understand how genetic variation affects gene function. Response to therapies also can be affected by genetic differences. Thus, information about variations in DNA sequence may assist in the analysis of disease and in the development of diagnostic, therapeutic, and preventative strategies.
Efforts are now underway to sequence the human genome through a combination of public and private effort. However, these efforts will not yield significant information regarding variations in DNA sequence within the human population, because the DNA sequence that is being produced will for most sequence sites come only from a single individual. (An exception is regions where overlapping clones from different chromosomes will be sequenced; however, this overlap will include input only from two individuals and will amount to less than 10% of the complete sequence.) Thus, additional work is needed to discover the number and distribution of variations in human DNA.
As described above, there are several types of variations in DNA sequence, including insertions and deletions, differences in the copy number of repeated sequences, and single base pair differences. Single base pair differences are the most frequent of these variations. They are termed single nucleotide polymorphisms (SNPs) when the variant sequence type has a frequency of at least 1% in the population. SNPs have many properties that make them attractive as the primary analytical reagent for the study of human sequence variation. In addition to their frequency, SNPs are stable, having much lower mutation rates than repeat sequences. More importantly, SNPs will often be the nucleotide sequence variations that are responsible for functional changes of interest.
SNPs are very common in human DNA. Any two random chromosomes differ at about 1 in 1000 bases. However, only about half or fewer of random pairs of chromosomes will differ for any particular polymorphic base (i.e., for any base for which the least common variant has a frequency of at least 1% in the population). Thus, there actually are more polymorphic sites in the human population, viewed in its entirety, than there are sites that differ in any particular pair of chromosomes. Altogether, there may be anywhere from 6 million to 30 million nucleotide positions in the genome at which variation can occur in the human population. Thus, overall, approximately one in every 100 to 500 bases in human DNA may be polymorphic.
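The arithmetic behind these figures can be checked with a short Python sketch. The genome length of about 3 billion bases is an assumption that is not stated above, and the function names are hypothetical. The sketch also assumes a site with exactly two variants and randomly drawn chromosomes, in which case the chance that two chromosomes differ at the site is 2p(1 - p), where p is the frequency of the less common variant.

```python
# Assumed haploid genome length in bases (an assumption, not a figure from the text).
GENOME_SIZE = 3_000_000_000


def bases_per_polymorphic_site(n_sites, genome_size=GENOME_SIZE):
    """Average spacing, in bases, between polymorphic positions."""
    return genome_size / n_sites


def pair_difference_probability(minor_allele_freq):
    """Chance that two random chromosomes differ at a two-variant site.

    With variant frequencies p and 1 - p, the chromosomes differ when one
    carries each variant, which happens with probability 2 * p * (1 - p).
    """
    p = minor_allele_freq
    return 2 * p * (1 - p)
```

Under the assumed genome length, 6 million and 30 million polymorphic positions correspond to one site per 500 and per 100 bases, respectively. The difference probability peaks at one half when both variants are equally common and falls to about 2% at the 1% frequency threshold, which is consistent with the statement that half or fewer of random chromosome pairs differ at any particular polymorphic base.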
Information about SNPs may be used in various ways in genetic analysis. First, SNPs can be used as genetic markers in mapping studies. For example, SNPs can be used for whole-genome scans in pedigree-based linkage analysis of families; for this purpose, a map of about 2000 SNPs has the same analytical power as a map of about 800 microsatellite markers, currently the most frequently used type of marker. Second, when disease genetics is studied in individuals in a population, rather than in families, the haplotype distributions and linkage disequilibria can be used to map genes by association methods. For this purpose, it has been estimated that 30,000 to as many as 300,000 mapped SNPs will be needed. Third, genetic analysis can be used in case-control studies to identify functional SNPs contributing to a particular phenotype. Most SNPs are located outside of coding sequences, because only three to five percent of the human DNA sequence encodes proteins. However, SNPs located within protein-coding sequences ("cSNPs") are of particular interest because they are more likely than a random SNP to have functional significance. It also is likely that some of the SNPs in noncoding DNA will have functional consequences, such as those in sequences that regulate gene expression. Discovery of SNPs that affect biological function should become increasingly important over the next several years, and should be greatly facilitated by the availability of a large collection of SNPs, from which candidates for polymorphisms with functional significance can be identified. Accordingly, SNP discovery is an important objective of SNP research.
SNPs will be particularly important for mapping and discovering the genes associated with common diseases. Many processes and diseases are caused or influenced by complex interactions among multiple genes and environmental factors. These include processes such as development and aging, and diseases such as diabetes, cancer, cardiovascular and pulmonary disease, neurological diseases, autoimmune diseases, psychiatric illnesses, alcoholism, common birth defects, and susceptibility to infectious diseases, teratogens, and environmental agents. Many of the alleles associated with health problems are likely to have a low penetrance, meaning that only a small percentage of individuals carrying the alleles will develop disease. However, because such polymorphisms are likely to be very common in the population, they may make a significant contribution to the health burden of the population. Examples of common polymorphisms associated with an increased risk of disease include the ApoE4 allele and Alzheimer's disease, and the APC I1307K allele and colon cancer.
Most of the successes to date in identifying (a) the genes associated with diseases inherited in a Mendelian fashion, and (b) the genetic contribution to common diseases, e.g., BRCA1 and 2 for breast cancer, MODY 1, 2, and 3 for type 2 diabetes, and HNPCC for colon cancer, have been of genes with relatively rare, highly penetrant variant alleles. These genes are well-suited to discovery by linkage analysis and positional cloning techniques. However, the experimental techniques and strategies useful for finding low-penetrance, high-frequency alleles involved in disease are usually not the same, and not as well developed, as those that have been applied successfully in positional cloning. For example, pedigree analysis of families often does not have sufficient power to identify common, weakly contributing loci. The types of association studies that do have the power to identify such loci efficiently require new approaches, techniques, and scientific resources to make them as robust and powerful as positional cloning. Among the resources needed is a genetic map of much higher density than the existing, microsatellite-based map. Association studies using a dense map should allow the identification of disease alleles even for complex diseases. SNPs are well suited to be the basis of such a map.
Available technologies can be used in SNP analysis. For example, U.S. Pat. No. 5,888,819 to Goelet et al. describes a technique involving first binding a primer to a single-stranded polynucleotide immediately adjacent to a polymorphic site of interest, and extending the primer by a terminating nucleotide such as a labeled ddNTP. Incorporation of the labeled base is then detected, indicating which allele is present in the sample at the polymorphic site. A similar technique is described in U.S. Pat. No. 5,302,509 to Cheeseman. A significant drawback of the single-base extension methods described in Goelet et al. and Cheeseman is that they require labor-intensive affinity or physical separation steps to remove all nonterminating labeled nucleotides prior to detection, so that signal from bound nucleotide can be detected without interference from unbound labeled nucleotides. The complexity of these single-base extension methods renders them impractical for some applications, such as SNP testing procedures that require rapid testing of large numbers of samples. Thus, there is a significant need for simpler methods of detecting single-base variability in polynucleotides, in particular methods that are capable of detecting incorporated labeled nucleotides in the presence of unbound nucleotides, homogeneously, without labor-intensive physical separation steps. Such novel methods and the associated apparatus would be useful, among other places, in the high-throughput, large-scale discovery of SNPs, where "discovery" refers to finding new SNPs. Moreover, such methods and apparatus would be useful for scoring known SNPs in genotyping assays, where "scoring" refers to methods of determining the genotypes of individuals for particular SNPs that already have been discovered.
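The logic of a single-base extension readout, in which the identity of the incorporated terminator reveals the allele, can be sketched in Python as follows. This is an illustrative model only, not the procedure of either cited patent. The function names and example sequences are hypothetical, and the sketch assumes DNA sequences written 5' to 3', a primer that anneals with perfect complementarity, and extension from the primer's 3' end.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def incorporated_terminator(template, primer):
    """Return the terminating base that would extend the primer by one position.

    Both sequences are DNA written 5'->3'. The primer anneals to the template
    region equal to its reverse complement, and the incorporated base is the
    complement of the template base immediately 5' of that region.
    Returns None if the primer does not anneal or there is no base to copy.
    """
    site = template.find(reverse_complement(primer)) - 1
    if site < 0:
        return None
    return COMPLEMENT[template[site]]
```

With the hypothetical primer `"GCTTACGG"`, a template `"TTGAACCGTAAGC"` carrying A at the polymorphic site yields a T terminator, whereas the variant template `"TTGAGCCGTAAGC"` yields a C terminator, so the detected label distinguishes the two alleles.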