The present invention is in the field of human disease diagnosis and therapy. The present invention specifically provides previously unknown single nucleotide polymorphisms (SNPs) in genes that have been identified as being involved in pathologies associated with human disease. The diseases/pathologies that each gene is known in the art to be associated with are specifically indicated in Table 1. Since these genes are known to be associated with human disease, the presently disclosed naturally occurring polymorphisms (variants) are valuable for association and linkage analysis. Specifically, the identified SNPs are useful for such applications as screening for human disease susceptibility, prevention of human disease, development of diagnostics and therapies for human disease, development of drugs for human disease, and development of individualized drug treatments based on an individual's SNP profile. The SNPs provided by the present invention are also useful for human identification. Methods and reagents for detecting the presence of these polymorphisms are provided.
SNPs
The genomes of all organisms undergo spontaneous mutation in the course of their continuing evolution, generating variant forms of progenitor sequences (Gusella, Ann. Rev. Biochem. 55, 831-854 (1986)). The variant form may confer an evolutionary advantage or disadvantage relative to a progenitor form or may be neutral. In some instances, a variant form confers a lethal disadvantage and is not transmitted to subsequent generations of the organism. In other instances, a variant form confers an evolutionary advantage to the species and is eventually incorporated into the DNA of many or most members of the species and effectively becomes the progenitor form. Additionally, the effect of a variant form may be both beneficial and detrimental, depending on the circumstances. For example, a heterozygous sickle cell mutation confers resistance to malaria, but a homozygous sickle cell mutation is usually lethal. In many instances, both progenitor and variant form(s) survive and co-exist in a species population. The coexistence of multiple forms of a sequence gives rise to polymorphisms, such as SNPs.
The reference allelic form is arbitrarily designated and may be, for example, the most abundant form in a population, or the first allelic form to be identified, and other allelic forms are designated as alternative, variant or polymorphic alleles. The allelic form occurring most frequently in a selected population is sometimes referred to as the “wild type” form.
Approximately 90% of all polymorphisms in the human genome are single nucleotide polymorphisms (SNPs). SNPs are single base pair positions in DNA at which different alleles, or alternative nucleotides, exist in some population. The SNP position, or SNP site, is usually preceded by and followed by highly conserved sequences of the allele (e.g., sequences that vary in less than 1/100 or 1/1000 members of the population). An individual may be homozygous or heterozygous for an allele at each SNP position. As defined by the present invention, the least frequent allele at a SNP position can have any frequency that is less than the frequency of the more frequent allele, including a frequency of less than 1% in a population. A SNP can, in some instances, be referred to as a “cSNP” to denote that the nucleotide sequence containing the SNP is an amino acid coding sequence.
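The notions of zygosity and allele frequency at a SNP position can be illustrated with a minimal Python sketch. The function names and the two-letter genotype strings (e.g., "AG" for a heterozygote) are assumptions made purely for illustration and do not reflect the format of Table 1.

```python
from collections import Counter


def zygosity(genotype):
    """Classify a diploid genotype string such as "AA" or "AG"."""
    if len(genotype) != 2:
        raise ValueError("expected exactly two alleles, e.g. 'AG'")
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


def allele_frequencies(genotypes):
    """Return {allele: frequency} pooled over all diploid genotypes."""
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError("expected exactly two alleles, e.g. 'AG'")
        counts.update(genotype)  # counts each of the two alleles
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def least_frequent_allele(frequencies):
    """Return the allele with the lowest frequency in the sample."""
    return min(frequencies, key=frequencies.get)


if __name__ == "__main__":
    sample = ["AA", "AG", "GG", "AA"]  # hypothetical genotypes
    freqs = allele_frequencies(sample)
    print(freqs, least_frequent_allele(freqs), zygosity("AG"))
```

In this hypothetical sample of four individuals (eight chromosomes), allele A is observed five times and allele G three times, so G is the less frequent allele.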
A SNP may arise due to a substitution of one nucleotide for another at the polymorphic site. Substitutions can be transitions or transversions. A transition is the replacement of one purine nucleotide by another purine nucleotide, or one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine, or vice versa. A SNP may also be a single base insertion/deletion variant (such variants are referred to as “indels”). A substitution that changes a codon coding for one amino acid to a codon coding for a different amino acid is referred to as a non-synonymous codon change, or missense mutation. A synonymous codon change, or silent mutation, is one that does not result in a change of amino acid due to the degeneracy of the genetic code. A nonsense mutation is a type of non-synonymous codon change that results in the formation of a stop codon, thereby leading to premature termination of a polypeptide chain and a defective protein.
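The substitution classes and codon changes defined above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the standard genetic code is assumed, and the "stop-lost" label for a change from a stop codon to an amino acid codon is an added convenience that is not a term used in this description.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def substitution_class(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("ref and alt must be two different DNA bases")
    same_ring_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_ring_class else "transversion"


def codon_change(ref_codon, alt_codon):
    """Classify the effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # silent mutation
    if alt_aa == "*":
        return "nonsense"     # new stop codon
    if ref_aa == "*":
        return "stop-lost"    # hypothetical label, see lead-in
    return "missense"         # different amino acid


if __name__ == "__main__":
    # Sickle cell substitution: GAG (Glu) -> GTG (Val), an A-to-T change.
    print(substitution_class("A", "T"), codon_change("GAG", "GTG"))
```

The example at the bottom uses the well-known sickle cell codon change, which is a transversion producing a missense mutation.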
SNPs, in principle, can be bi-, tri-, or tetra-allelic. However, tri- and tetra-allelic polymorphisms are extremely rare, almost to the point of non-existence (Brookes, Gene 234 (1999) 177-186). For this reason, SNPs are often referred to as “bi-allelic markers”, or “di-allelic markers”.
Causative SNPs are those SNPs that produce alterations in gene expression or in the expression or function of a gene product, and therefore are most predictive of a possible clinical phenotype. One such class includes SNPs falling within regions of genes encoding a polypeptide product, i.e., cSNPs. These SNPs may result in an alteration of the amino acid sequence of the polypeptide product (i.e., non-synonymous codon changes) and give rise to the expression of a defective or other variant protein. Furthermore, in the case of nonsense mutations, a SNP may lead to premature termination of a polypeptide product. Such variant products can result in a pathological condition, e.g., genetic disease. Examples of genetic diseases in which a polymorphism within a coding sequence gives rise to the disease include sickle cell anemia and cystic fibrosis. Causative SNPs do not necessarily have to occur in coding regions; causative SNPs can occur in any region that can ultimately affect the expression and/or activity of the protein encoded by the nucleic acid. Such gene areas include those involved in transcription, such as promoter regions, and those involved in transcript processing, such as intron-exon boundaries, where a SNP may cause defective splicing, or mRNA processing signal sequences such as polyadenylation signal regions. For example, a SNP may inhibit splicing of an intron and result in mRNA containing a premature stop codon, leading to a defective protein. Consequently, SNPs in regulatory regions can have substantial phenotypic impact.
Some SNPs that are not causative SNPs nevertheless are in close association with, and therefore segregate with, a disease-causing sequence. In this situation, the presence of the SNP correlates with the presence of, or susceptibility to, the disease. These SNPs are invaluable for diagnostics and disease susceptibility screening.
Clinical trials have shown that patient response to treatment with pharmaceuticals is often heterogeneous. Thus there is a need for improved approaches to pharmaceutical agent design and therapy. SNPs can be used to help identify patients most suited to therapy with particular pharmaceutical agents (this is often termed “pharmacogenomics”). Pharmacogenomics can also be used in pharmaceutical research to assist the drug selection process (Linder et al. (1997), Clinical Chemistry, 43, 254; Marshall (1997), Nature Biotechnology, 15, 1249; International Patent Application WO 97/40462, Spectra Biomedical; and Schafer et al. (1998), Nature Biotechnology, 16, 3).
Population Genetics
Population genetics is the study of how Mendel's laws and other genetic principles apply to entire populations. Such a study is essential to a proper understanding of the genetic basis of human disease and SNP-based association studies and linkage disequilibrium mapping. Population genetics thus seeks to understand and to predict the effects of such genetic phenomena as segregation, recombination, and mutation; at the same time, population genetics must take into account such ecological and evolutionary factors as population size, patterns of mating, geographic distribution of individuals, migration and natural selection.
Linkage is the coinheritance of two or more nonallelic genes because their loci are in close proximity on the same chromosome, such that after meiosis they remain associated more often than the 50% expected for unlinked genes. During meiosis, physical crossing over takes place: in the production of germ cells there is a physical exchange of maternal and paternal genetic contributions between individual chromatids. This exchange necessarily separates genes in chromosomal regions that were contiguous in each parent and, by mixing them with retained linear order, results in “recombinants”. The process of forming recombinants through meiotic crossing-over is an essential feature in the reassortment of genetic traits and is central to understanding the transmission of genes.
Recombination generally occurs between large segments of DNA. This means that contiguous stretches of DNA and genes are likely to be moved together. Conversely, regions of the DNA that are far apart on a given chromosome are more likely to become separated during the process of crossing-over than regions of the DNA that are close together.
It is possible to use polymorphic molecular markers, such as SNPs, to clarify the recombination events that take place during meiosis. They are used as position markers and regional identifying characters along chromosomes and can also be used to distinguish paternally derived gene regions from maternally derived gene regions.
The pattern of a set of markers along a chromosome is referred to as a “haplotype”. Because recombination generally moves contiguous stretches of DNA together, sets of alleles on the same small chromosomal segment tend to be transmitted as a block through a pedigree. By analyzing the haplotypes in a series of offspring of parents whose haplotypes are known, it is possible to establish which parental segment of which chromosome was transmitted to which child. When not broken up by recombination, haplotypes can be treated for mapping purposes as alleles at a single highly polymorphic locus.
The existence of a preferential occurrence of a disease gene in association with specific alleles of linked markers, such as SNPs, is called “Linkage Disequilibrium” (LD). This sort of disequilibrium generally implies that most of the disease chromosomes carry the same mutation and the markers being tested are quite close to the disease gene. For example, there is considerable linkage disequilibrium across the entire HLA locus. The A3 allele, which is associated with hemochromatosis, is in LD with the B7 and B14 alleles, and as a result B7 and B14 are also highly associated with hemochromatosis. Thus, HLA typing alone can significantly alter the estimate of risk for hemochromatosis, even if other family members are not available for formal linkage analysis. Consequently, by using a combination of several markers surrounding the presumptive location of the gene, a haplotype can be determined for affected and unaffected family members.
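The strength of linkage disequilibrium between two bi-allelic loci is commonly summarized by the coefficients D, D′ and r². These measures are standard in population genetics but are not named in this description, so the following Python sketch is offered only as an illustration; the function name, the haplotype labels (AB, Ab, aB, ab) and the example counts are assumptions.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from two-locus haplotype counts.

    n_AB is the number of chromosomes carrying allele A at the first
    locus and allele B at the second locus, and so on.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype must be observed")

    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A
    p_B = (n_AB + n_aB) / total   # frequency of allele B

    # D: departure of the observed AB frequency from independence.
    d = p_AB - p_A * p_B

    # D': D scaled by its largest possible magnitude given p_A and p_B.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0

    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical counts: marker allele always travels with disease allele.
    print(ld_statistics(50, 0, 0, 50))
    # Hypothetical counts: the two loci are independent.
    print(ld_statistics(25, 25, 25, 25))
```

When the marker allele is always found on the same chromosome as the disease allele, D′ and r² both equal 1; when the loci are independent, D is 0.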
SNP-Based Association Analysis and Linkage Disequilibrium Mapping
SNPs are useful in association studies for identifying particular SNPs, or other polymorphisms, associated with pathological conditions, such as human disease. Association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families (linkage studies). An association study using SNPs involves determining the frequency of the SNP allele in many patients with the disorder of interest, such as human disease, as well as controls of similar age and race. The appropriate selection of patients and controls is critical to the success of SNP association studies. Therefore, a pool of individuals with well-characterized phenotypes is extremely desirable. For example, blood pressure and heart rate can be correlated with SNP patterns in hypertensive individuals in whom these physiological parameters are known in order to find associations between particular SNP genotypes and known phenotypes. Significant associations between particular SNPs or SNP haplotypes and phenotypic characteristics can be determined by standard statistical methods. Association analysis can either be direct or LD based. In direct association analysis, causative SNPs are tested that are candidates for the pathogenic sequence itself.
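As one example of the standard statistical methods referred to above, allele counts in patients and controls can be compared with a chi-square test on a 2x2 table. The sketch below is illustrative only: the function name and the example counts are hypothetical, and the chi-square test is one of several methods that could be used.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi_square, p_value, odds_ratio). Counts are numbers of
    chromosomes carrying the alternative or reference allele.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    row_col_product = (a + b) * (c + d) * (a + c) * (b + d)
    if row_col_product == 0:
        raise ValueError("every row and column must have a nonzero total")

    # Pearson chi-square statistic for a 2x2 table, no continuity correction.
    chi_square = n * (a * d - b * c) ** 2 / row_col_product

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))

    # Odds of carrying the alternative allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return chi_square, p_value, odds_ratio


if __name__ == "__main__":
    # Hypothetical study: 100 patients and 100 controls (200 alleles each).
    chi2, p, odds = allelic_association(60, 140, 40, 160)
    print(f"chi-square={chi2:.3f}  p={p:.4f}  odds ratio={odds:.3f}")
```

A genotype-based or haplotype-based test, or a method that adjusts for covariates such as age, could be substituted where the study design calls for it.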
In LD based SNP association analysis, random SNPs are tested over a large genomic region, possibly the entire genome, in order to find a SNP in LD with the true pathogenic sequence or pathogenic SNP. For this approach, high density SNP maps are required in order for random SNPs to be located close enough to an unknown pathogenic locus to be in linkage disequilibrium with that locus in order to detect an association. SNPs tend to occur with great frequency and are spaced uniformly throughout the genome. The frequency and uniformity of SNPs means that there is a greater probability, compared with other types of polymorphisms such as tandem repeat polymorphisms, that a SNP will be found in close proximity to a genetic locus of interest. SNPs are also mutationally more stable than tandem repeat polymorphisms, such as VNTRs. LD-based association studies are capable of finding a disease susceptibility gene without any a priori assumptions about what or where the gene is.
Currently, however, it is not feasible to do SNP association studies over the entire human genome; therefore, candidate genes associated with human disease are targeted for SNP identification and association analysis. The candidate gene approach uses a priori knowledge of disease pathogenesis to identify genes that are hypothesized to directly influence development of the disease. The candidate gene approach may focus on a gene that is directly targeted by a drug used to treat the disorder. To discover SNPs associated with an increased susceptibility to human disease, candidate genes can be selected from systems physiologically implicated in the disease pathway. SNPs found in these genes are then tested for statistical association with disease in individuals who have the disease compared with appropriate controls. The candidate gene approach has the advantages of drastically reducing the number of candidate SNPs, and the number of individuals, that need to be typed, compared with LD-based association studies of random SNPs over large areas of, or complete, genomes. Furthermore, in the candidate gene approach, no assumptions are made about the extent of LD over any particular area of the genome.
Combined with the use of a high density map of appropriately spaced, sufficiently informative SNP markers, association studies, including linkage disequilibrium-based genome wide association studies, will enable the identification of most genes involved in complex disorders, such as human disease. This will enhance the selection of candidate genes most likely to contain causative SNPs associated with a particular disease. All of the SNPs disclosed by the present invention can be employed as part of genome-wide association studies or as part of candidate gene association studies.
The present invention advances the state of the art and provides commercially useful embodiments by providing previously unidentified SNPs in genes known in the art to be associated with human disease. The diseases/pathologies that each gene is associated with are specifically indicated in Table 1.
The present invention is based on the identification of novel SNPs and previously unknown haplotypes in genes known in the art to be associated with the pathologies of human disease. Such polymorphisms/haplotypes can lead to a variety of pathologies and disorders associated with human disease that are mediated/modulated by a variant gene/protein. The diseases/pathologies that each gene is known in the art to be associated with are specifically indicated in Table 1. Further, such polymorphisms are important reagents in studying the pathologies of human disease.
Based on these identifications, the present invention provides methods of detecting these novel variants as well as reagents needed to accomplish this task. The invention specifically provides novel SNPs in genes known to be involved in human disease, variant proteins encoded by the novel SNP forms of these genes, antibodies to the variant proteins, computer-based and data storage systems containing the novel SNP information, methods of detecting these SNPs in a sample, methods of determining a risk of having or developing a disorder mediated by a variant gene/protein, methods of screening for compounds used to treat disorders mediated by a variant gene/protein, methods of treating disorders mediated by a variant gene/protein, and methods of using the novel SNPs of the present invention for human identification. The present invention also provides genomic nucleotide sequences, transcript sequences, encoded amino acid sequences, and context sequences that contain the SNPs of the present invention.
CL001307CDR
NOTE: Two duplicate copies of the CD-R are submitted herewith, labeled “Copy 1” and “Copy 2”. The material on each of the duplicate CD-Rs is identical. Thus, descriptions or references herein to the CD-R labeled CL001307CDR and the files contained thereon apply to both “Copy 1” and “Copy 2”.
The CD-R labeled CL001307CDR contains the following file:
File TABLE1xe2x80x941307.txt provides Table 1 in text (ASCII) format, which discloses the SNP and associated gene information (including nucleic acid and amino acid sequences) of the present invention as indicated below in the “Detailed Description of Table 1”, including the context sequences (SEQ ID NOS:17,614-207,012) flanking each SNP, and the transcript (SEQ ID NOS:1-5871), protein (SEQ ID NOS:5872-11,742), and genomic sequences (SEQ ID NOS:11,743-17,613) of the human disease-associated genes that contain each SNP. File TABLE1xe2x80x941307.txt is 457,667 KB in size and was created on Sep. 10, 2001.
The material contained on the CD-R labeled CL001307CDR is hereby incorporated by reference pursuant to 37 CFR 1.77(b)(4).
Description of Table 1
Table 1 discloses the SNP and associated gene information of the present invention. For each SNP, Table 1 provides gene information followed by SNP information. The sequence information provided in Table 1 includes the transcript sequences (SEQ ID NOS:1-5871), protein sequences (SEQ ID NOS:5872-11,742), and genomic sequences (SEQ ID NOS:11,743-17,613) for each human disease-associated gene that contains a SNP of the present invention. Also provided are the context sequences (SEQ ID NOS:17,614-207,012) that flank each SNP of the present invention. The context sequences generally provide about 300 bp upstream (5′) and 300 bp downstream (3′) of each SNP, with the SNP about in the middle of the sequence, for a total of about 600 bp of context sequence surrounding each SNP. These sequences (transcript, protein, genomic, and context) may interchangeably be referred to herein as the sequences of Table 1 or the sequences of the Sequence Listing.
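The relationship between a SNP position and its flanking context sequence can be sketched as follows. This Python sketch assumes a 0-based SNP index into a genomic sequence string; the function name and the truncation of the flank at the ends of the sequence are illustrative assumptions and do not describe how the context sequences of Table 1 were actually generated.

```python
def context_sequence(genomic, snp_index, flank=300):
    """Return (upstream, snp_base, downstream) around a 0-based SNP index.

    Up to `flank` bases are taken on each side; fewer are returned when
    the SNP lies close to either end of the genomic sequence.
    """
    if not 0 <= snp_index < len(genomic):
        raise IndexError("snp_index is outside the genomic sequence")
    start = max(0, snp_index - flank)
    end = min(len(genomic), snp_index + flank + 1)
    upstream = genomic[start:snp_index]          # 5' flank
    snp_base = genomic[snp_index]                # polymorphic position
    downstream = genomic[snp_index + 1:end]      # 3' flank
    return upstream, snp_base, downstream


if __name__ == "__main__":
    # Toy sequence with a 3-base flank in place of the 300 bp default.
    print(context_sequence("AAACCCGTTT", 6, flank=3))
```

With the default flank of 300, a SNP far from either end of the sequence yields 300 bases on each side, i.e., about 600 bp of context in total.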
The gene information includes: a gene number, a Celera hCT number and/or a RefSeq NM number (the NM number is a reference number to an annotated human gene that is publicly known and whose role in disease processes is understood to the point of providing commercial uses for the naturally occurring variants herein described; the public gene identified by the NM number may be the same as the gene identified by the hCT number, or may be a homolog or paralog thereof), the art-known gene name, the art-known protein name, Celera genomic axis position and chromosomal position/cytoband of the gene where available, a public reference (e.g., OMIM reference information, which can readily be used by one of ordinary skill in the art to associate the allelic variants of each gene provided herein with medically significant disease conditions and pathologies, thus providing readily apparent commercial utilities for the SNP information of the present invention) to the gene/protein name supporting the medical significance of the gene/protein (diseases/pathologies associated with each gene are specifically provided in Table 1 in the “OMIM Information” section), transcript sequence (corresponding to SEQ ID NOS:1-5871 of the Sequence Listing), protein sequence (corresponding to SEQ ID NOS:5872-11,742 of the Sequence Listing), and genomic sequence (corresponding to SEQ ID NOS:11,743-17,613 of the Sequence Listing) of the assembled genomic region containing the gene. NOTE: the genomic sequences always correspond to Celera genomic sequences; where both a Celera hCT number and an NM number are provided for a gene, the transcript and protein sequences correspond to the Celera sequences identified by the hCT number; where only an NM number is provided for a gene, the transcript and protein sequences correspond to the public sequences identified for the NM number.
The SNP information includes: 300 bp of 5′ and 3′ context sequence (corresponding to SEQ ID NOS:17,614-207,012 of the Sequence Listing; in some instances, the context sequences may be reverse complemented relative to the gene/transcript sequences), Celera CV identification number for internal tracking, identified alleles, populations seen with alleles (“cau”=Caucasian, “his”=Hispanic, “chn”=Chinese, “afr”=African, “jpn”=Japanese, “ind”=Indian, “mex”=Mexican, “ain”=American Indian, “cra”=Celera donor, and “no_pop”=no population information available), SNP type [“MIS-SENSE MUTATION”=SNP causes a change in the encoded amino acid (i.e., a non-synonymous coding SNP); “INTERGENIC/UNKNOWN”=SNP occurs in an intergenic region of the genome; “UNKNOWN”=SNP is located in an uncharacterized genomic region; “SILENT MUTATION”=SNP does not cause a change in the encoded amino acid (i.e., a synonymous coding SNP); “STOP CODON MUTATION”=SNP is located in a stop codon; “NONSENSE MUTATION”=SNP creates a stop codon; “INTRON”=SNP is located in an intron; “UTR 5”=SNP is located in a 5′ UTR of a transcript; “UTR 3”=SNP is located in a 3′ UTR of a transcript; “PUTATIVE UTR 5”=SNP is located in a putative 5′ UTR; “PUTATIVE UTR 3”=SNP is located in a putative 3′ UTR; “DONOR SPLICE SITE”=SNP is located in a donor splice site (5′ intron boundary); “ACCEPTOR SPLICE SITE”=SNP is located in an acceptor splice site (3′ intron boundary); “REPEATS”=SNP is located in a repeat element; “CODING REGION”=generally, the SNP is an insertion/deletion (“indel”) polymorphism that may cause a frameshift that alters the encoded protein downstream of the SNP; “EXON”=SNP is located in an exon; “HUMAN-MOUSE CONSERVED REGION”=SNP is located in a region of the human genome that shares a high degree of sequence similarity with the mouse; “CONSERVED SEGMENT PUTATIVE”=generally, SNP is located in a segment of the genome that is a putative regulatory region conserved between human and mouse; “CORE PROMOTER PREDICTION PUTATIVE”=SNP is located in a predicted core promoter; “TRANSCRIPTION FACTOR SITE PUTATIVE”=SNP is located in a predicted transcription factor binding site; “REGULATORY REGION”=SNP is located in a regulatory region; and “PUTATIVE REGULATORY REGION”=SNP is located in a putative regulatory region], affected protein (including Celera hCP or Genbank GI number, position of the amino acid residue within the protein identified by the hCP or GI number that is encoded by the codon containing the SNP, and alternative amino acids represented by 1-letter amino acid codes that are encoded by the alternative SNP alleles), and source [whether the SNP is found only in Celera data and is novel to the present invention (“Celera”), or at least one SNP allele has been found in a public database as well as in Celera data but the map position of the SNP may not be publicly known (“Celera+”)].
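Because a context sequence may be reverse complemented relative to the gene/transcript sequence, and because populations are given as short codes, a reader working with the SNP information may find helpers such as the following useful. This Python sketch is illustrative only: the function names are hypothetical, the population codes are those listed above, and no assumption is made about the actual file layout of Table 1.

```python
# Population codes as listed in the SNP information description.
POPULATION_CODES = {
    "cau": "Caucasian",
    "his": "Hispanic",
    "chn": "Chinese",
    "afr": "African",
    "jpn": "Japanese",
    "ind": "Indian",
    "mex": "Mexican",
    "ain": "American Indian",
    "cra": "Celera donor",
    "no_pop": "no population information available",
}

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def decode_populations(codes):
    """Map population codes to their descriptions, keeping unknown codes."""
    return [POPULATION_CODES.get(code, code) for code in codes]


if __name__ == "__main__":
    print(reverse_complement("AAGCT"))
    print(decode_populations(["cau", "jpn", "no_pop"]))
```

Characters other than A, C, G and T (for example, ambiguity codes) are passed through unchanged by `reverse_complement` in this sketch.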