The identification and analysis of a particular gene, protein, or single nucleotide polymorphism (SNP) has generally been accomplished by experiments directed specifically towards that gene, protein, or SNP. With the recent advances in the sequencing of the human genome, however, the challenge is to decipher the expression, function, and regulation of thousands of genes that can contain intragenic SNPs, a task that cannot realistically be accomplished by analyzing one gene, protein, or SNP at a time. To address this situation, the data mining methodologies of the present invention have been developed and have proven to be a valuable tool.
Information is accumulating about the normal variation among human genomes. During the course of evolution, spontaneous mutations appear in the genomes of organisms. Variations in genomic DNA sequences have been estimated as being created continuously at a rate of about 100 new single base changes per individual. Kondrashov, 175 J. THEOR. BIOL. 583-94 (1995); Crow, 12 EXP. CLIN. IMMUNOGENET. 121-28 (1995). These changes in the progenitor nucleotide sequences can confer an evolutionary advantage that likely increases the frequency of the mutation, confer an evolutionary disadvantage that likely decreases the frequency of the mutation, or be neutral. In many cases, equilibrium is established between the progenitor and mutant sequences so that both are present in the population. The presence of both forms of the sequence results in genetic variation or polymorphism. Over time, a significant number of mutations can accumulate within a population such that considerable polymorphism can exist between individuals within the population.
Numerous types of polymorphisms are known to exist. There are several sources of sequence variation, such as when DNA sequences are either inserted into or deleted from the genome, for example, by viral insertion. Repeated sequences in the genome, variously termed short tandem repeats (STRs), variable number tandem repeats (VNTRs), short sequence repeats (SSRs), or microsatellites, can also cause sequence variation. These repeats can be dinucleotide, trinucleotide, tetranucleotide, or pentanucleotide repeats. Polymorphism results from variation in the number of repeated sequences found at a particular locus.
Most commonly, sequence differences between individuals involve differences in single nucleotide positions. SNPs account for approximately 90% of human DNA polymorphism. Collins et al., 8 GENOME RES. 1229-31 (1998). SNPs include single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in a population. In addition, the least frequent allele generally must occur at a frequency of 1% or greater. DNA sequence variants with a reasonably high population frequency are observed approximately every 1,000 nucleotides across the genome, with estimates as high as 1 SNP per 350 base pairs. Wang et al., 280 SCIENCE 1077-82 (1998); Harding et al., 60 AM. J. HUMAN GENET. 772-89 (1997); Taillon-Miller et al., 8 GENOME RES. 748-54 (1998); Cargill et al., 22 NAT. GENET. 231-38 (1999); and Semple et al., 16 BIOINFORM. DISC. NOTE 735-38 (2000). The frequency of SNPs varies with the type and location of the change. Among base substitutions, two-thirds are of the C-T and G-A types. This variation in frequency can be related to 5-methylcytosine deamination reactions that occur frequently, particularly at CpG dinucleotides. Regarding location, SNPs occur at a much higher frequency in non-coding regions than in coding regions. Information on over one million variable sequences is already publicly available via the Internet, and more such markers are available from commercial providers of genetic information. Kwok and Gu, 5 MED. TODAY 538-53 (1999).
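The 1% frequency criterion described above can be illustrated with a minimal sketch. The function names and the sample allele lists are hypothetical and are not part of the described methodology; the sketch simply assumes that the alleles observed at one position in a sampled population are available as a list.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Return the frequency of the least frequent allele observed at one site."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # only one allele observed: no variation in this sample
    return min(counts.values()) / sum(counts.values())

def meets_snp_threshold(alleles, threshold=0.01):
    """True when the least frequent allele reaches the frequency threshold (1% by default)."""
    return minor_allele_frequency(alleles) >= threshold

# Hypothetical sample: 200 chromosomes, 2 of which carry the rarer allele.
sample = ["A"] * 198 + ["G"] * 2
print(minor_allele_frequency(sample), meets_snp_threshold(sample))
```

In practice the estimate depends on the size and composition of the sampled population, so a site near the threshold may be classified differently in different samples.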
Several definitions of SNPs exist. See, e.g., Brooks, 235 GENE 177-86 (1999). As used herein, the term “single nucleotide polymorphism” or “SNP” includes all single base variants, thus including nucleotide insertions and deletions in addition to single nucleotide substitutions. There are two types of nucleotide substitutions. A transition is the replacement of one purine by another purine or of one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine, or vice versa.
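The definitions above can be expressed as a short classification sketch. The function name and the use of "-" to mark an absent base are illustrative assumptions, not conventions taken from the present description.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_single_base_variant(ref, alt):
    """Classify a single base variant; '-' (an assumed gap symbol) marks an absent base."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES | {"-"}
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single base variant: {ref!r} -> {alt!r}")
    if ref == "-":
        return "insertion"
    if alt == "-":
        return "deletion"
    # Both bases in the same chemical class: purine<->purine or pyrimidine<->pyrimidine.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

print(classify_single_base_variant("C", "T"))  # pyrimidine replaced by pyrimidine
print(classify_single_base_variant("A", "C"))  # purine replaced by pyrimidine
```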
The inheritance patterns of most common diseases are complex, indicating that the diseases are probably caused by mutations in one or more genes and/or through interactions between genes and environment. Many human DNA sequence variants are known to be associated with particular diseases or to influence an individual's response to a particular drug. See, e.g., Drysdale et al., 12 PROC. NAT. ACAD. SCI. 10483-84 (2000). Because of the high frequency of SNPs within the genome, a SNP is more likely to be linked to a genetic locus of interest than is any other type of genetic marker.
Numerous methods exist for detecting SNPs within a nucleotide sequence. A review of many of these methods can be found in Landegren et al., 8 GENOME RES. 769-76 (1998). For example, a SNP in a genomic sample can be detected by preparing a Reduced Complexity Genome (RCG) from the genomic sample, then analyzing the RCG for the presence or absence of a SNP. See, e.g., WO 00/18960. Multiple SNPs in a population of target polynucleotides can be detected in parallel using, for example, the methods of WO 00/50869. Other SNP detection methods include the methods of U.S. Pat. Nos. 6,297,018 and 6,322,980. Furthermore, SNPs can be detected by restriction fragment length polymorphism (RFLP) analysis. See, e.g., U.S. Pat. Nos. 5,324,631; 5,645,995. RFLP analysis of SNPs, however, is limited to cases where the SNP either creates or destroys a restriction enzyme cleavage site. SNPs can also be detected by direct sequencing of the nucleotide sequence of interest. In addition, numerous assays based on hybridization, and on mismatch discrimination by polymerases and ligases, have been developed to detect SNPs.
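The limitation of RFLP analysis noted above can be checked computationally before an assay is designed. The following is a minimal sketch, not the method of any cited reference: the function names and example sequences are hypothetical, positions are assumed to be 0-based, and only the given strand is scanned, which is sufficient for palindromic recognition sequences such as the EcoRI site GAATTC.

```python
def site_positions(seq, site):
    """Start indices of every (possibly overlapping) occurrence of site in seq."""
    width = len(site)
    return {i for i in range(len(seq) - width + 1) if seq[i:i + width] == site}

def rflp_effect(ref_seq, pos, alt_base, site):
    """Report whether substituting alt_base at 0-based pos creates or destroys a site."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    before = site_positions(ref_seq, site)
    after = site_positions(alt_seq, site)
    created, destroyed = after - before, before - after
    if created and destroyed:
        return "both"
    if created:
        return "creates"
    if destroyed:
        return "destroys"
    return "none"  # such a SNP would not be detectable by RFLP with this enzyme

ECORI = "GAATTC"
print(rflp_effect("AAGAATTCAA", 3, "C", ECORI))  # the substitution disrupts the existing site
```

A result of "none" for every available enzyme indicates that one of the other detection methods, such as direct sequencing, would be needed for that SNP.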
SNPs can be a powerful tool for the detection of individuals whose genetic make-up alters their susceptibility and/or predisposition to certain diseases. Genotyping of such markers therefore can be valuable for characterizing patient populations. DNA sequence variants with no known functional consequences can also be useful in association and linkage analyses. For example, such analyses may reveal information that can then be used to identify individuals at risk for pathological conditions based on the presence of SNPs.
SNPs can be directly or indirectly associated with disease conditions in humans or animals. In a direct association, the alteration in the genetic code caused by the SNP directly results in the disease condition. Sickle cell anemia and cystic fibrosis are examples of direct SNP association with a disease. In an indirect association, the SNP does not directly cause the disease, but may alter the physiological environment such that the individual is more likely to develop the disease than an individual without the SNP. SNPs can also be associated with disease conditions without contributing to the disease either directly or indirectly. In this case, the SNP may be located in close proximity to the defective gene, usually within 5 centimorgans, such that there is a strong association between the presence of the SNP and the disease state.
Disease-associated SNPs can occur in coding and non-coding regions of the genome. When located in a coding region, a SNP can result in the production of a protein that is non-functional or that has decreased functionality. More frequently, SNPs occur in non-coding regions. If a SNP occurs in a regulatory region, it can affect expression of the protein. For example, the presence of a SNP in a promoter region can decrease the expression of a protein. If the protein is involved in protecting the body against development of a pathological condition, this decreased expression can make the individual more susceptible to the condition.
In association studies, the frequencies of variants of individual genetic markers are compared between healthy persons and patient populations, anticipating that an observed difference in frequency can be the direct effect of the sequence difference. Co-inheritance with nearby unknown genetic variants can also produce such a difference. Associated markers with no direct effect on disease are referred to as being in linkage disequilibrium with the disease-related changes. Chapman and Thompson, 42 ADV. GENET. 413-37 (2001). These variants may, therefore, provide a guide to the gene that is directly involved in the disease. If the DNA sequence is derived from individuals in families where the particular disease is known to segregate, then the location of the disease-associated genetic changes among the chromosomes can be pinpointed by genetic linkage analysis, using the same types of genetic markers. This methodology has proven valuable for defining the nature of conditions primarily influenced by a single gene or a limited number of genes. See, e.g., Alizadeh et al., 403 NATURE 503-11 (2000).
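The frequency comparison that underlies an association study can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. This is one common choice of test, used here as an assumption rather than as the test prescribed by the present description; the function name and the counts are hypothetical. The p-value uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)).

```python
import math

def allele_association(case_counts, control_counts):
    """Pearson chi-square (1 d.f., no continuity correction) on allele counts.

    Each argument is (variant_allele_count, reference_allele_count).
    Returns (chi_square_statistic, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 30 of 100 patient alleles vs 10 of 100 control alleles carry the variant.
chi2, p = allele_association((30, 70), (10, 90))
print(f"chi-square = {chi2:.2f}, p = {p:.2g}")
```

A small p-value indicates only that the allele frequencies differ between the two groups; as noted above, the test alone cannot distinguish a direct effect of the variant from linkage disequilibrium with a nearby causal change.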
SNPs are well-suited for identifying genotypes that predispose an individual to develop a disease condition for several reasons. First, SNPs are the most common polymorphisms present in the genome, and are frequently located in or near any locus of interest. Second, because SNPs located in genes can be expected to directly affect protein structure or expression levels, they serve not only as markers but also as candidates for gene therapy treatments to cure or prevent a disease. Third, SNPs show greater genetic stability than repeated sequences and thus are less likely to undergo changes that would complicate diagnosis.
In particular diseases, single genes or small sets of genes have been identified that are typically altered by mutations. The identification of such disease genes and their associated SNPs provides insights into the causes of common diseases and promotes the development of highly specific diagnostic and therapeutic products. Identifying and characterizing candidate genes and SNPs is critical for defining disease pathways, disease stages, drug effect pathways, and drug metabolic pathways. Sequence variation, as it relates to drug response, can aid in predicting the safety, toxicity, and/or efficacy of drugs. Along these lines, correlating SNPs with drug effects, therapy, and clinical outcome can significantly improve productivity and increase the efficiency of the development or improvement of drugs. Besides advancing drug development, SNPs can further facilitate developments in and improvements of methods and products, such as gene and antisense therapies, molecular diagnostics for predicting drug responses, and molecular diagnostics for selecting drug dosing regimens based upon genotype.
The increased efficiency of SNP detection methods makes them especially suitable for high-throughput typing systems, which are necessary to screen large populations. Information about hundreds of observed pathologic alterations is already archived in mutation databases, some of which are available via the Internet. By taking advantage of the sequence information obtained from such databases, the successful application of large-scale biological analyses for annotating thousands of SNPs in genomic and cDNA sequences provides a better understanding of the association of SNPs with a pathological condition. The data mining methodology of the present invention promises new opportunities in genetic research, thus adding value to the existing and forthcoming large-scale projects aiming to discover sequence variations in the human and other genomes. With these tools, the increasing number of publicly or privately available SNPs can be validated and assessed for their intragenic context and redundancy. The data mining methodology of the present invention is useful in the selection process of intragenic SNPs, thus providing a new tool for genotyping in genetic studies, which are effective for establishing the research, diagnostic, and treatment value of SNPs.
Implementing high-throughput SNP genotyping as a tool in genetic research projects requires the availability of databases comprising high quality annotation data on known SNPs. Such resources are especially important when the selection of SNPs assayed in a genotyping facility is based upon SNP database information. Indeed, high quality SNP annotation avoids costly assay development for, and genotyping of, SNPs that later turn out to be invalid or not located at the expected chromosomal region. The methodology of the present invention also can filter out SNPs that map within regions of repeat sequences, thus discarding a number of intragenic SNPs that are annotated by other SNP databases but are typically less relevant for genotyping purposes.
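The repeat-filtering step described above can be sketched as an interval lookup. This is an illustrative sketch only, not the implementation of the present invention: the function names, SNP identifiers, and coordinates are hypothetical, all positions are assumed to lie on a single chromosome, and repeat regions are assumed to be given as half-open (start, end) intervals from whatever repeat annotation is available.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def split_by_repeat(snps, repeats):
    """Split (snp_id, position) pairs into those outside and inside repeat regions."""
    merged = merge_intervals(repeats)
    starts = [start for start, _ in merged]
    kept, discarded = [], []
    for snp_id, pos in snps:
        i = bisect_right(starts, pos) - 1  # last merged interval starting at or before pos
        in_repeat = i >= 0 and pos < merged[i][1]
        (discarded if in_repeat else kept).append(snp_id)
    return kept, discarded

# Hypothetical SNP positions and repeat annotations on one chromosome.
snps = [("snpA", 50), ("snpB", 150), ("snpC", 200), ("snpD", 260)]
repeats = [(100, 200), (180, 250)]
print(split_by_repeat(snps, repeats))
```

Merging the intervals first keeps the binary search correct when the repeat annotations overlap one another.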
Following annotation by the methods of the present invention, the genetic context and redundancy of the SNPs can be efficiently and effectively assessed. The nucleotide sequences searched by the methods of the present invention can be matched with the annotated SNP IDs, and the genomic region of each SNP defined, e.g., as repeat, promoter, coding sequence, and so forth. This data mining methodology can reveal additional, high quality SNPs beyond those annotated by the respective databases. Among its other advantages, the data mining methodology of the present invention can prevent problems that arise when the flanking regions recorded in the databases are short. Thus, this new technology offers a more effective tool in the process of selecting validated intragenic SNPs from databases that, for example, can be used in candidate gene association studies and for linkage analysis.
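Assessing genetic context and redundancy, as described above, can be illustrated with a minimal sketch. It is not the implementation of the present invention: the function names, region labels, coordinates, and SNP identifiers are hypothetical, positions are assumed to lie on one reference sequence, and two database entries are treated as redundant simply when they map to the same position.

```python
from collections import defaultdict

def annotate_region(pos, regions):
    """Return the labels of all half-open (start, end, label) regions covering pos."""
    labels = [label for start, end, label in regions if start <= pos < end]
    return labels or ["unannotated"]

def group_redundant(snps):
    """Group SNP IDs that map to the same position; keep only groups of two or more."""
    by_position = defaultdict(list)
    for snp_id, pos in snps:
        by_position[pos].append(snp_id)
    return {pos: ids for pos, ids in by_position.items() if len(ids) > 1}

# Hypothetical annotation of one gene region and SNP entries from two databases.
regions = [(0, 500, "promoter"), (500, 900, "coding sequence"), (650, 700, "repeat")]
snps = [("db1:1", 660), ("db2:9", 660), ("db1:2", 100)]

for snp_id, pos in snps:
    print(snp_id, pos, annotate_region(pos, regions))
print(group_redundant(snps))
```

A linear scan over the regions is adequate for a single gene; an indexed interval structure would be preferable when annotating SNPs across a whole genome.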